We already check for started commands in all callbacks, but we should
also protect against already completed commands. Do this by taking
the checks to common code.
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a userspace client requests a NBD device be disconnected, the
DISCONNECT_REQUESTED flag is set. While this flag is set, the driver
will not inform userspace when a connection is closed.
Unfortunately the flag was never cleared, so once a disconnect was
requested the driver would thereafter never tell userspace about a
closed connection. Thus when connections failed due to timeout, no
attempt to reconnect was made and eventually the device would fail.
Fix by clearing the DISCONNECT_REQUESTED flag (and setting the
DISCONNECTED flag) once all connections are closed.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Kevin Vigor <kvigor@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By completing the request entirely in the driver we can remove the
BLK_EH_HANDLED return value and thus the split responsibility between the
driver and the block layer that has been causing trouble.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By completing the request entirely in the driver we can remove the
BLK_EH_HANDLED return value and thus the split responsibility between the
driver and the block layer that has been causing trouble.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By completing the request entirely in the driver we can remove the
BLK_EH_HANDLED return value and thus the split responsibility between the
driver and the block layer that has been causing trouble.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The BLK_EH_NOT_HANDLED implies nothing happen, but very often that
is not what is happening - instead the driver already completed the
command. Fix the symbolic name to reflect that a little better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the S_<FOO> symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.
see: https://lkml.org/lkml/2016/8/2/1945
Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...>
Miscellanea:
o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJbBramAAoJEPfTWPspceCm2gsP/R5p/Vo8Ml303cdgCOrw+Q+T
W/Wp6hGmNUZ7HjMlfF85rF1bqpgGoj226qIibHvKi7eYwUScDNgJwcrrWnNazbGx
Rnl/+NQ/H/38pBvvDJKjth9n0LY/O1geemhnWzYZmqmZJe3jFNTDpw5pGy4RkTpJ
4DCuXg2n81x+kDt7Nslb0dYONkrc1yUjelHS/sJKlUoPs1MvsICGsWNS+Lw0WMIj
Ls282QNs1Z6Y1gcM18UXsTNJSXQFTBmsbH3CIBukHckP2wjZrMEvgM+uxIORqwID
0DJWBSVNJMxrsyZtkMMkkz2wMjPjSvx3HjD/ULpglJD6nGSwRAiQr6XIxARSEEDL
3prQyhEeVGSioOjt5IR2Llt9QDYVhb1GJqqHYmDZux0SMCVg1Pv7mclrudpXGpzq
v2mSJqnwLofOuTFrSh6afgxClD/yh1UNKByf4Ni78PagAgNS4SjEIz8VDG4m1iNu
UgWMbM+vqpZhgIlSDsmnqFEVyGwCWwUOHGCeshQXIy9xaWbCtmaGQpNh4Tho273O
wH6F/jwCnw1BcUeiwOQi+qcwEf1z6nRTGGb3hZt7u2Tj6r8HZzNHtkRaFXsIrnvA
Vc3Y9+YzRWBVV6wzqPC5K+alvGs5ZXLj7BIxQorEkoMkNCYWiPsiBS/rtc0OYiST
b2rC+5eYzquEZ42aqIIn
=QBaF
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180524' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Two fixes that should go into this release:
- a loop writeback error clearing fix from Jeff
- the sr sense fix from myself"
* tag 'for-linus-20180524' of git://git.kernel.dk/linux-block:
loop: clear wb_err in bd_inode when detaching backing file
sr: pass down correctly sized SCSI sense buffer
For some reason we had discard granularity set to 512 always even when
discards were disabled. Fix this by having the default be 0, and then
if we turn it on set the discard granularity to the blocksize.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound
to a single CPU that is selected at device creation time.
Signed-off-by: Dan Melnic <dmm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a loop block device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. If we then detach the
backing file from it, attach another and fsync it, we'll get back the
writeback error that we had from the previous backing file.
This is a bit of a grey area as POSIX doesn't cover loop devices, but it
is somewhat counterintuitive.
If we detach a backing file from the loopdev while there are still
unreported errors, take it as a sign that we're no longer interested in
the previous file, and clear out the wb_err in the loop blockdev.
Reported-and-Tested-by: Theodore Y. Ts'o <tytso@mit.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to make sure we don't just set the size of the bdev to 0 while
it's being used by a file system. We have the appropriate check in
nbd_bdev_reset, simply use that helper instead.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bd_invalidated is kind of a pain wrt partitions as it really only
triggers the partition rescan if it is set after bd_ops->open() runs, so
setting it when we reset the device isn't useful. We also sporadically
would still have partitions left over in some disconnect cases, so fix
this by always setting bd_invalidated on open if there's no
configuration or if we've had a disconnect action happen, that way the
partition table gets invalidated and rescanned properly.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is what the ioctl based nbd disconnect does as well. Without this
the device will just sit there and wait for the connection to go away
(or IO to occur) before the device gets torn down. Instead clear
everything up on our end so the configuration goes away as quickly as
possible.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we stopped relying on the bdev everywhere I broke updating the
block device size on the fly, which ceph relies on. We can't just do
set_capacity, we also have to do bd_set_size so things like parted will
notice the device size change.
Fixes: 29eaadc ("nbd: stop using the bdev everywhere")
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I messed up changing the size of an NBD device while it was connected by
not actually updating the device or doing the uevent. Fix this by
updating everything if we're connected and we change the size.
cc: stable@vger.kernel.org
Fixes: 639812a ("nbd: don't set the device size until we're connected")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes a use after free bug, we shouldn't be doing disk->queue right
after we do del_gendisk(disk). Save the queue and do the cleanup after
the del_gendisk.
Fixes: c6a4759ea0 ("nbd: add device refcounting")
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Found a bug (with ASAN) where we were passing a bio to bio_copy_data()
with bi_next not NULL, when it should have been - a driver had left
bi_next set to something after calling bio_endio().
Since the normal case is only copying single bios, split out
bio_list_copy_data() to avoid more bugs like this in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Same numerical value (for now at least), but a much better documentation
of intent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ps3disk driver already kmaps all pages when copying from/to the
internal bounce buffer, so it can accept highmem pages just fine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DAC960 just sets the block bounce limit to the dma mask, which means
that the iommu or swiotlb already take care of the bounce buffering,
and the block bouncing can be removed.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mtip32xx just sets the block bounce limit to the dma mask, which means
that the iommu or swiotlb already take care of the bounce buffering,
and the block bouncing can be removed.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Branch to the right label in the error handling path in order to keep it
logical.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
... and store num_bvecs for client code's convenience.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
This commit sets QUEUE_FLAG_NONROT and clears up QUEUE_FLAG_ADD_RANDOM
to mark the ramdisks as non-rotational device.
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJa4LCXAAoJEPfTWPspceCm4DkQAMT9oc4gVTooyRdjFkUwwYQp
oPQr8cNwp1ieB9n85sCVvo8AqP+hcRrp99/sNtMUfdTRcSAumHoILfJF8NcBHWW1
6t609oi73runWbKnMvv+cTNFjwkUT30HHlIZRG5Qtn4wsy6vZAQEiv7J0z7PWays
Lz7Hc1dfu2EbUoE4Phbe/wAqkQmw5HMGCCBltA9YhaApHTudR0i0bLyEN39zNyeF
GifGkqETJ847HD00x25Uko008K0Mg6C+bkBwXBXw/E0mzO+QBG14U1WcLCw7USP2
0KwaO1/F5SiaAUFBbW6TGCAYr/0M8JH9PRqAnGQ0sFlrnsh4gYrYEhnXtzMAbJb5
AfFZxGc12XzTHcrNF+Genx81NpVCCblgEPxkZky8FXUZS6p91X0kNZmtfsaFCl7e
3rc3reORz+9FaM481kY52Acw1J3gZZwdryXS040911yWRvtdS1dk80Q0FnL1qq44
WvoXawk78x59+tcGwyC57dPmTAaFMGHiFQvx4zM5EkoBqTSTtfHkHQMeyFk3Er6U
eSl+2Cp1FWiWZo4sJsmQtmhOshvBeIENyU4H3HaovsWFcOacFyLQSks2FSPquM4G
gUsg6yADbamotdiFpbALM97cEcN4se38WuMNbXsqk3gTBHsCML52m7f8IWpgU3ZW
hWuEU++093bAHADmhG81
=335E
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180425' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"I ended up sitting on this about a week longer than I wanted to, since
we were hashing out details with a timeout change. I've now killed
that patch, so we can flush the existing queue in due time.
This contains:
- Fix for an old regression, where entering the queue can be
disturbed by a signal to the process. This can cause spurious EIO.
Fix from Alan Jenkins.
- cdrom information leak fix from Dan.
- Trivial helper for testing queue FUA from Dave Chinner, part of his
O_DIRECT FUA series.
- Series of swim fixes from Finn that actually makes it work again.
- Loop O_DIRECT corruption fix, which caused data corruption in
production for us. From me.
- BFQ crash fix from me.
- bcache maintainer update. Michael no longer has the time to do it,
Coly has stepped up to serve as the new maintainer.
- blkcg locking fixes from Jiang Biao.
- Revert of a change from this merge window from Ming, that causes an
issue on some hardware.
- Minor clarification doc addition from Linus Walleij"
* tag 'for-linus-20180425' of git://git.kernel.dk/linux-block: (22 commits)
Revert "blk-mq: remove code for dealing with remapping queue"
block: mq: Add some minor doc for core structs
bcache: mark Coly Li as bcache maintainer
MAINTAINERS: Remove me as maintainer of bcache
blkcg: init root blkcg_gq under lock
blkcg: small fix on comment in blkcg_init_queue
blkcg: don't hold blkcg lock when deactivating policy
block: add blk_queue_fua() helper function
cdrom: information leak in cdrom_ioctl_media_changed()
bfq-iosched: ensure to clear bic/bfqq pointers when preparing request
blk-mq: start request gstate with gen 1
block/swim: Select appropriate drive on device open
block/swim: Fix IO error at end of medium
block/swim: Check drive type
block/swim: Rename macros to avoid inconsistent inverted logic
block/swim: Don't log an error message for an invalid ioctl
block/swim: Remove extra put_disk() call from error path
block/swim: Fix array bounds check
m68k/mac: Don't remap SWIM MMIO region
loop: handle short DIO reads
...
The driver supports internal and external FDD units so the floppy_open
function must not hard-code the drive location.
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org # v4.14+
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reading to the end of a 720K disk results in an IO error instead of EOF
because the block layer thinks the disk has 2880 sectors. (Partly this
is a result of inverted logic of the ONEMEG_MEDIA bit that's now fixed.)
Initialize the density and head count in swim_add_floppy() to agree
with the device size passed to set_capacity() during drive probe.
Call set_capacity() again upon device open, after refreshing the density
and head count values.
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org # v4.14+
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The SWIM chip is compatible with GCR-mode Sony 400K/800K drives but
this driver only supports MFM mode. Therefore only Sony FDHD drives
are supported. Skip incompatible drives.
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org # v4.14+
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The Sony drive status bits use active-low logic. The swim_readbit()
function converts that to 'C' logic for readability. Hence, the
sense of the names of the status bit macros should not be inverted.
Mostly they are correct. However, the TWOMEG_DRIVE, MFM_MODE and
TWOMEG_MEDIA macros have inverted sense (like MkLinux). Fix this
inconsistency and make the following patches less confusing.
The same problem affects swim3.c so fix that too.
No functional change.
The FDHD drive status bits are documented in sonydriv.cpp from MAME
and in swimiii.h from MkLinux.
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org # v4.14+
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'eject' shell command may send various different ioctl commands.
This leads to error messages on the console even though the FDEJECT
ioctl succeeds.
~# eject floppy
SWIM floppy_ioctl: unknown cmd 21257
SWIM floppy_ioctl: unknown cmd 1
Don't log an error message for an invalid ioctl, just do as the
swim3 driver does and return -ENOTTY.
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org # v4.14+
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an option to turn off discard and write zeroes offload support to
avoid deprovisioning a fully provisioned image. When enabled, discard
requests will fail with -EOPNOTSUPP, write zeroes requests will fall
back to manually zeroing.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Hitoshi Kamei <hitoshi.kamei.xm@hitachi.com>
In order to take full advantage of merging in ceph_file_to_extents(),
allow object set sized I/Os. If the layout is not "fancy", an object
set consists of just one object.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In some configurations gcc cannot see that rbd_assert(0) leads to an
unreachable code path:
drivers/block/rbd.c: In function 'rbd_img_is_write':
drivers/block/rbd.c:1397:1: error: control reaches end of non-void function [-Werror=return-type]
drivers/block/rbd.c: In function '__rbd_obj_handle_request':
drivers/block/rbd.c:2499:1: error: control reaches end of non-void function [-Werror=return-type]
drivers/block/rbd.c: In function 'rbd_obj_handle_write':
drivers/block/rbd.c:2471:1: error: control reaches end of non-void function [-Werror=return-type]
As the rbd_assert() here shows has no extra information beyond the verbose
BUG(), we can simply use BUG() directly in its place. This is reliably
detected as not returning on any architecture, since it doesn't depend
on the unlikely() comparison that confused gcc.
Fixes: 3da691bf43 ("rbd: new request handling code")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
currently, the rbd_wait_state_locked() will wait forever if we
can't get our state locked. Example:
rbd map --exclusive test1 --> /dev/rbd0
rbd map test1 --> /dev/rbd1
dd if=/dev/zero of=/dev/rbd1 bs=1M count=1 --> IO blocked
To avoid this problem, this patch introduce a timeout design
in rbd_wait_state_locked(). Then rbd_wait_state_locked() will
return error when we reach a timeout.
This patch allow user to set the lock_timeout in rbd mapping.
Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We ran into an issue with loop and btrfs, where btrfs would complain about
checksum errors. It turns out that is because we don't handle short reads
at all, we just zero fill the remainder. Worse than that, we don't handle
the filling properly, which results in loop trying to advance a single
bio by much more than its size, since it doesn't take chaining into
account.
Handle short reads appropriately, by simply retrying at the new correct
offset. End the remainder of the request with EIO, if we get a 0 read.
Fixes: bc07c10a36 ("block: loop: support DIO & AIO")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can always get at the request from the payload, no need to store
a pointer to it.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJa0QliAAoJEPfTWPspceCmsvYQALP46qxy2abVk9388z2CyJJa
/bKaooiZJg7e5rnOMvHFr4xx3H3NZIGSb45u3SydU8BnyBHssGMkKgV3ohbNb/zt
rNV+ib8VBe6D8cD3A04sQbQGkBHm9ABBQKo6DKAQEIYBAenHsHaZU2cY0QM0k3YD
ismiA3mHF7X3gSH92XG7bqYlgvrLNLVKfUCihw6hsInzpVXN+mX5K0zQU1E/hYOd
Qotwev6kYA8SPTgXniiV7QFLKF/nMbXY+Cdd9W1Pr9Yh/mJmkR6xo/nq2bMEcnRH
Me5OB8/KxvLI1LdJwuu5eKZpAPE0KgOsonbC08irSM8qEYpS9Ei/GyHSZx99E6dc
JbqtYFsm5tr2a6wwkt1C79e/n7O80lV6gPvYF00bGHRIZAI7L78O7bXIkAufJQez
ZGJck+tTgqrzjS3iX/e7Vf7XYMpZO8Y9HIG5f6tdS8PtIvYf06MMn8XGY+HuhJvS
1ZrPAHsEsyi/YIrt/bXOAbHdL8VCrfwnWVPa8HDM9MDPDyVkazXzyUt8U8K4Kwte
S6edIrmb5BbVClBVKNYft+bavdZRIAqWGxarYIKphr8dbhNK6p1RVevRHEz56NXP
NMvEeVsWgXnEz4sioXcw4TMC8GnZ5evR/gHkhJ51GEoQgeK3St0wQAem26v6TmSu
cdlCVKYfKyNCVLp6CLza
=SgEX
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180413' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Followup fixes for this merge window. This contains:
- Series from Ming, fixing corner cases in our CPU <-> queue mapping.
This triggered repeated warnings on especially s390, but I also hit
it in cpu hot plug/unplug testing while doing IO on NVMe on x86-64.
- Another fix from Ming, ensuring that we always order budget and
driver tag identically, avoiding a deadlock on QD=1 devices.
- Loop locking regression fix from this merge window, from Omar.
- Another loop locking fix, this time missing an unlock, from Tetsuo
Handa.
- Fix for racing IO submission with device removal from Bart.
- sr reference fix from me, fixing a case where disk change or
getevents can race with device removal.
- Set of nvme fixes by way of Keith, from various contributors"
* tag 'for-linus-20180413' of git://git.kernel.dk/linux-block: (28 commits)
nvme: expand nvmf_check_if_ready checks
nvme: Use admin command effects for admin commands
nvmet: fix space padding in serial number
nvme: check return value of init_srcu_struct function
nvmet: Fix nvmet_execute_write_zeroes sector count
nvme-pci: Separate IO and admin queue IRQ vectors
nvme-pci: Remove unused queue parameter
nvme-pci: Skip queue deletion if there are no queues
nvme: target: fix buffer overflow
nvme: don't send keep-alives to the discovery controller
nvme: unexport nvme_start_keep_alive
nvme-loop: fix kernel oops in case of unhandled command
nvme: enforce 64bit offset for nvme_get_log_ext fn
sr: get/drop reference to device in revalidate and check_events
blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"
blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash
backing: silence compiler warning using __printf
blk-mq: remove code for dealing with remapping queue
blk-mq: reimplement blk_mq_hw_queue_mapped
blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
...
- support for rbd "fancy" striping (myself). The striping feature bit
is now fully implemented, allowing mapping v2 images with non-default
striping patterns. This completes support for --image-format 2.
- CephFS quota support (Luis Henriques and Zheng Yan). This set is
based on the new SnapRealm code in the upcoming v13.y.z ("Mimic")
release. Quota handling will be rejected on older filesystems.
- memory usage improvements in CephFS (Chengguang Xu). Directory
specific bits have been split out of ceph_file_info and some effort
went into improving cap reservation code to avoid OOM crashes.
Also included a bunch of assorted fixes all over the place from
Chengguang and others.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJazOI/AAoJEEp/3jgCEfOLOu0IAKGFkcCo0UdQDGHHJZHn2rAm
CSWMMwyYGAhoWI6Gva0jx1A2omZLFSeq/MC8dWLL/MNAKt8i/qo8bTsTrwCHMR2Q
D0FsvMWIhkWRS1/FcD1uVDhn0a/DFm5Kfy8kzz3v695TDCt+BYWrCqyHTB/wSdRR
VpO3KdpHQ9h3ojNBRgIniOCNPeQP+QzLXy+P0h0oKbP2Y03mwJlsWG4L6zakkkwT
e2I+RVdlOMUDJ7rZxiXESBr6BuLI4oOkPe8roQGmZPy1Xe17xa9M5iWVNuM6RUhO
Z9bS2aLMhbDyeCPqvzgAnsUtFT0PAQjB5NYw2yqisbHs/wrU5kMOOpcLqz/Ls/s=
=v1I9
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The big ticket items are:
- support for rbd "fancy" striping (myself).
The striping feature bit is now fully implemented, allowing mapping
v2 images with non-default striping patterns. This completes
support for --image-format 2.
- CephFS quota support (Luis Henriques and Zheng Yan).
This set is based on the new SnapRealm code in the upcoming v13.y.z
("Mimic") release. Quota handling will be rejected on older
filesystems.
- memory usage improvements in CephFS (Chengguang Xu).
Directory specific bits have been split out of ceph_file_info and
some effort went into improving cap reservation code to avoid OOM
crashes.
Also included a bunch of assorted fixes all over the place from
Chengguang and others"
* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
ceph: quota: report root dir quota usage in statfs
ceph: quota: add counter for snaprealms with quota
ceph: quota: cache inode pointer in ceph_snap_realm
ceph: fix root quota realm check
ceph: don't check quota for snap inode
ceph: quota: update MDS when max_bytes is approaching
ceph: quota: support for ceph.quota.max_bytes
ceph: quota: don't allow cross-quota renames
ceph: quota: support for ceph.quota.max_files
ceph: quota: add initial infrastructure to support cephfs quotas
rbd: remove VLA usage
rbd: fix spelling mistake: "reregisteration" -> "reregistration"
ceph: rename function drop_leases() to a more descriptive name
ceph: fix invalid point dereference for error case in mdsc destroy
ceph: return proper bool type to caller instead of pointer
ceph: optimize memory usage
ceph: optimize mds session register
libceph, ceph: add __init attribution to init funcitons
ceph: filter out used flags when printing unused open flags
ceph: don't wait on writeback when there is no more dirty pages
...
Commit 2d1d4c1e59 made loop_get_status() drop lo_ctx_mutex before
returning, but the loop_get_status_old(), loop_get_status64(), and
loop_get_status_compat() wrappers don't call loop_get_status() if the
passed argument is NULL. The callers expect that the lock is dropped, so
make sure we drop it in that case, too.
Reported-by: syzbot+31e8daa8b3fc129e75f2@syzkaller.appspotmail.com
Fixes: 2d1d4c1e59 ("loop: don't call into filesystem while holding lo_ctl_mutex")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- the v9fs maintainers have been missing for a long time. I've taken
over v9fs patch slinging.
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
mm,oom_reaper: check for MMF_OOM_SKIP before complaining
mm/ksm: fix interaction with THP
mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
headers: untangle kmemleak.h from mm.h
include/linux/mmdebug.h: make VM_WARN* non-rvals
mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
mm: change return type to vm_fault_t
mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
kernel/fork.c: detect early free of a live mm
mm: make counting of list_lru_one::nr_items lockless
mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
block_invalidatepage(): only release page if the full page was invalidated
mm: kernel-doc: add missing parameter descriptions
mm/swap.c: remove @cold parameter description for release_pages()
mm/nommu: remove description of alloc_vm_area
zram: drop max_zpage_size and use zs_huge_class_size()
zsmalloc: introduce zs_huge_class_size()
mm: fix races between swapoff and flush dcache
fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
...
Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class
watermark instead, which makes more sense.
TEST
- I used a 1G zram device, LZO compression back-end, original
data set size was 444MB. Looking at zsmalloc classes stats the
test ended up to be pretty fair.
BASE ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 191482495 199831552 0 199831552 15634 0
zsmalloc classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151 2448 0 0 1240 1240 744 3 0
168 2720 0 0 4200 4200 2800 2 0
190 3072 0 0 10100 10100 7575 3 0
202 3264 0 0 380 380 304 4 0
254 4096 0 0 10620 10620 10620 1 0
Total 7 46 106982 106187 48787 0
PATCHED ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 182579184 194248704 0 194248704 15628 0
zsmalloc classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151 2448 0 0 1240 1240 744 3 0
168 2720 0 0 4200 4200 2800 2 0
190 3072 0 0 10100 10100 7575 3 0
202 3264 0 0 7180 7180 5744 4 0
254 4096 0 0 3820 3820 3820 1 0
Total 8 45 106959 106193 47424 0
As we can see, we reduced the number of objects stored in class-4096,
because a huge number of objects which we previously forcibly stored in
class-4096 now stored in non-huge class-3264. This results in lower
memory consumption:
- zsmalloc now uses 47424 physical pages, which is less than 48787 pages
zsmalloc used before.
- objects that we store in class-3264 share zspages. That's why overall
the number of pages that both class-4096 and class-3264 consumed went
down from 10924 to 9564.
[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
+Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
5J1axMgVH5LAsIEcEQVq
=RVhK
-----END PGP SIGNATURE-----
Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"It's a pretty quiet round this time, which is nice. This contains:
- series from Bart, cleaning up the way we set/test/clear atomic
queue flags.
- series from Bart, fixing races between gendisk and queue
registration and removal.
- set of bcache fixes and improvements from various folks, by way of
Michael Lyle.
- set of lightnvm updates from Matias, most of it being the 1.2 to
2.0 transition.
- removal of unused DIO flags from Nikolay.
- blk-mq/sbitmap memory ordering fixes from Omar.
- divide-by-zero fix for BFQ from Paolo.
- minor documentation patches from Randy.
- timeout fix from Tejun.
- Alpha "can't write a char atomically" fix from Mikulas.
- set of NVMe fixes by way of Keith.
- bsg and bsg-lib improvements from Christoph.
- a few sed-opal fixes from Jonas.
- cdrom check-disk-change deadlock fix from Maurizio.
- various little fixes, comment fixes, etc from various folks"
* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
blk-mq: Directly schedule q->timeout_work when aborting a request
blktrace: fix comment in blktrace_api.h
lightnvm: remove function name in strings
lightnvm: pblk: remove some unnecessary NULL checks
lightnvm: pblk: don't recover unwritten lines
lightnvm: pblk: implement 2.0 support
lightnvm: pblk: implement get log report chunk
lightnvm: pblk: rename ppaf* to addrf*
lightnvm: pblk: check for supported version
lightnvm: implement get log report chunk helpers
lightnvm: make address conversions depend on generic device
lightnvm: add support for 2.0 address format
lightnvm: normalize geometry nomenclature
lightnvm: complete geo structure with maxoc*
lightnvm: add shorten OCSSD version in geo
lightnvm: add minor version to generic geometry
lightnvm: simplify geometry structure
lightnvm: pblk: refactor init/exit sequences
lightnvm: Avoid validation of default op value
lightnvm: centralize permission check for lightnvm ioctl
...
As part of the effort to remove VLAs from the kernel[1], this moves
the literal values into the stack array calculation instead of using a
variable for the sizing. The resulting size can be found from
sizeof(buf).
[1] https://lkml.org/lkml/2018/3/7/621
Signed-off-by: Kyle Spiers <kyle@spiers.me>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Trivial fix to spelling mistake in rdb_warn message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently we request the latest osdmap only if ceph_pg_poolid_by_name()
fails with -ENOENT. This is effective with newly created pools, but we
also want to avoid attempting to map from pools that were recently
deleted and report "pool does not exist" instead. (Such an attempt
eventually fails in the OSD client after map check code kicks in, but
the error message is confusing.)
Request the latest osdmap unconditionally after bumping a ref on an
existing client in rbd_client_find().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the layout is "fancy", we need to be able to rearrange the provided
bio_vecs in stripe unit chunks to make it possible for the messenger to
read/write directly from/to the provided data buffer, without employing
a temporary data buffer for assembling the result.
Higher level bio_vec arrays are generally immutable, so this requires
copying into a private array. Only the bio_vecs themselves are shuffled
around, not the actual data. OWN_BVECS doesn't own any pages.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
rbd_parent_request_create() takes a ref on obj_req for child_img_req.
There is no point in doing that because child_img_req is created on
behalf of obj_req -- obj_req is the initiator and can't be completed
before child_img_req.
Open-code the rest of rbd_parent_request_create() and remove it along
with rbd_parent_request_destroy().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
A whole-object layered discard is implemented as a truncate rather
than a delete: a dummy object is needed to prevent the CoW machinery
from kicking in. However, a truncate on a non-existent object is
a no-op. If the object doesn't exist in HEAD, a discard request is
effectively ignored, which violates our "discard zeroes data" promise
and breaks REQ_OP_WRITE_ZEROES implementation.
A non-exclusive create on an existing object is also a no-op, so the
fix is to do a compound create+truncate instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In preparation for rbd "fancy" striping, replace obj_req->img_offset
with obj_req->img_extents. A single starting offset isn't sufficient
because we want only one OSD request per object and will merge adjacent
object extents in ceph_file_to_extents(). The final object extent may
map into multiple different byte ranges in the image.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
obj_req->object_no -> obj_req->ex.oe_objno
obj_req->offset -> obj_req->ex.oe_off
obj_req->length -> obj_req->ex.oe_len
... and use ex for linking object requests to image requests.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
All object requests are associated with an image request now -- avoid
duplicating the same info in each object request.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Do away with partial request completions and all the associated
complexity. Individual object requests no longer need to be completed
in order -- when the last one becomes ready, we complete the entire
higher level request all at once.
This also wraps up the conversion to a state machine model and
eliminates the recursion described in commit 6d69bb536b ("rbd:
prevent kernel stack blow up on rbd map").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It should be void now. Also, object requests are unlinked only in
image request destructor, which can't run before rbd_img_request_put(),
so no need for _safe.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
No need to pass rbd_dev and op_type to rbd_osd_req_create(): there are
no standalone (!IMG_DATA) object requests anymore and osd_req->r_flags
can be set in rbd_osd_req_format_{read,write}().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The notable changes are:
- instead of explicitly stat'ing the object to see if it exists before
issuing the write, send the write optimistically along with the stat
in a single OSD request
- zero copyup optimization
- all object requests are associated with an image request and have
a valid ->img_request pointer; there are no standalone (!IMG_DATA)
object requests anymore
- code is structured as a state machine (vs a bunch of callbacks with
implicit state)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In preparation for rbd "fancy" striping which requires bio_vec arrays,
wire up BVECS data type and kill off PAGES data type. There is nothing
wrong with using page vectors for copyup requests -- it's just less
iterator boilerplate code to write for the new striping framework.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The initiating object request is the proper owner -- save a bit of
space.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
obj_req->pages is for provided data buffers. stat requests are
internal and should be NODATA.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The reason we clone bios is to be able to give each object request
(and consequently each ceph_osd_data/ceph_msg_data item) its own
pointer to a (list of) bio(s). The messenger then initializes its
cursor with cloned bio's ->bi_iter, so it knows where to start reading
from/writing to. That's all the cloned bios are used for: to determine
each object request's starting position in the provided data buffer.
Introduce ceph_bio_iter to do exactly that -- store position within bio
list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit 21acdf45f4 ("rbd: set max_segments to USHRT_MAX") removed the
limit on max_segments. Remove the limit on max_segment_size as well.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Even after the previous patch to drop lo_ctl_mutex while calling
vfs_getattr(), there are other cases where we can end up sleeping for a
long time while holding lo_ctl_mutex. Let's avoid the uninterruptible
sleep from the ioctls.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hit an issue where a loop device on NFS was stuck in
loop_get_status() doing vfs_getattr() after the NFS server died, which
caused a pile-up of uninterruptible processes waiting on lo_ctl_mutex.
There's no reason to hold this lock while we wait on the filesystem;
let's drop it so that other processes can do their thing. We need to
grab a reference on lo_backing_file while we use it, and we can get rid
of the check on lo_device, which has been unnecessary since commit
a34c0ae9ebd6 ("[PATCH] loop: remove the bio remapping capability") in
the linux-history tree.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It happens often while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available for this driver? Do I have to introduce definitions of these
constants before I can use these constants? To avoid this confusion,
move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that these become available for all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional to avoid that including that header file after
<linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE
redefinition.
Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.
Cc: David S. Miller <davem@davemloft.net>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following commit:
commit aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
replaced __do_lo_send_write(), which used ITER_KVEC iterators, with
lo_write_bvec() which uses ITER_BVEC iterators. In this change, though,
the WRITE flag was lost:
- iov_iter_kvec(&from, ITER_KVEC | WRITE, &kvec, 1, len);
+ iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len);
This flag is necessary for the DAX case because we make decisions based on
whether or not the iterator is a READ or a WRITE in dax_iomap_actor() and
in dax_iomap_rw().
We end up going through this path in configurations where we combine a PMEM
device with 4k sectors, a loopback device and DAX. The consequence of this
missed flag is that what we intend as a write actually turns into a read in
the DAX code, so no data is ever written.
The very simplest test case is to create a loopback device and try and
write a small string to it, then hexdump a few bytes of the device to see
if the write took. Without this patch you read back all zeros, with this
you read back the string you wrote.
For XFS this causes us to fail or panic during the following xfstests:
xfs/074 xfs/078 xfs/216 xfs/217 xfs/250
For ext4 we have a similar issue where writes never happen, but we don't
currently have any xfstests that use loopback and show this issue.
Fix this by restoring the WRITE flag argument to iov_iter_bvec(). This
causes the xfstests to all pass.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: commit aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
when mounting an ISO filesystem sometimes (very rarely)
the system hangs because of a race condition between two tasks.
PID: 6766 TASK: ffff88007b2a6dd0 CPU: 0 COMMAND: "mount"
#0 [ffff880078447ae0] __schedule at ffffffff8168d605
#1 [ffff880078447b48] schedule_preempt_disabled at ffffffff8168ed49
#2 [ffff880078447b58] __mutex_lock_slowpath at ffffffff8168c995
#3 [ffff880078447bb8] mutex_lock at ffffffff8168bdef
#4 [ffff880078447bd0] sr_block_ioctl at ffffffffa00b6818 [sr_mod]
#5 [ffff880078447c10] blkdev_ioctl at ffffffff812fea50
#6 [ffff880078447c70] ioctl_by_bdev at ffffffff8123a8b3
#7 [ffff880078447c90] isofs_fill_super at ffffffffa04fb1e1 [isofs]
#8 [ffff880078447da8] mount_bdev at ffffffff81202570
#9 [ffff880078447e18] isofs_mount at ffffffffa04f9828 [isofs]
#10 [ffff880078447e28] mount_fs at ffffffff81202d09
#11 [ffff880078447e70] vfs_kern_mount at ffffffff8121ea8f
#12 [ffff880078447ea8] do_mount at ffffffff81220fee
#13 [ffff880078447f28] sys_mount at ffffffff812218d6
#14 [ffff880078447f80] system_call_fastpath at ffffffff81698c49
RIP: 00007fd9ea914e9a RSP: 00007ffd5d9bf648 RFLAGS: 00010246
RAX: 00000000000000a5 RBX: ffffffff81698c49 RCX: 0000000000000010
RDX: 00007fd9ec2bc210 RSI: 00007fd9ec2bc290 RDI: 00007fd9ec2bcf30
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000010
R10: 00000000c0ed0001 R11: 0000000000000206 R12: 00007fd9ec2bc040
R13: 00007fd9eb6b2380 R14: 00007fd9ec2bc210 R15: 00007fd9ec2bcf30
ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
This task was trying to mount the cdrom. It allocated and configured a
super_block struct and owned the write-lock for the super_block->s_umount
rwsem. While exclusively owning the s_umount lock, it called
sr_block_ioctl and waited to acquire the global sr_mutex lock.
PID: 6785 TASK: ffff880078720fb0 CPU: 0 COMMAND: "systemd-udevd"
#0 [ffff880078417898] __schedule at ffffffff8168d605
#1 [ffff880078417900] schedule at ffffffff8168dc59
#2 [ffff880078417910] rwsem_down_read_failed at ffffffff8168f605
#3 [ffff880078417980] call_rwsem_down_read_failed at ffffffff81328838
#4 [ffff8800784179d0] down_read at ffffffff8168cde0
#5 [ffff8800784179e8] get_super at ffffffff81201cc7
#6 [ffff880078417a10] __invalidate_device at ffffffff8123a8de
#7 [ffff880078417a40] flush_disk at ffffffff8123a94b
#8 [ffff880078417a88] check_disk_change at ffffffff8123ab50
#9 [ffff880078417ab0] cdrom_open at ffffffffa00a29e1 [cdrom]
#10 [ffff880078417b68] sr_block_open at ffffffffa00b6f9b [sr_mod]
#11 [ffff880078417b98] __blkdev_get at ffffffff8123ba86
#12 [ffff880078417bf0] blkdev_get at ffffffff8123bd65
#13 [ffff880078417c78] blkdev_open at ffffffff8123bf9b
#14 [ffff880078417c90] do_dentry_open at ffffffff811fc7f7
#15 [ffff880078417cd8] vfs_open at ffffffff811fc9cf
#16 [ffff880078417d00] do_last at ffffffff8120d53d
#17 [ffff880078417db0] path_openat at ffffffff8120e6b2
#18 [ffff880078417e48] do_filp_open at ffffffff8121082b
#19 [ffff880078417f18] do_sys_open at ffffffff811fdd33
#20 [ffff880078417f70] sys_open at ffffffff811fde4e
#21 [ffff880078417f80] system_call_fastpath at ffffffff81698c49
RIP: 00007f29438b0c20 RSP: 00007ffc76624b78 RFLAGS: 00010246
RAX: 0000000000000002 RBX: ffffffff81698c49 RCX: 0000000000000000
RDX: 00007f2944a5fa70 RSI: 00000000000a0800 RDI: 00007f2944a5fa70
RBP: 00007f2944a5f540 R8: 0000000000000000 R9: 0000000000000020
R10: 00007f2943614c40 R11: 0000000000000246 R12: ffffffff811fde4e
R13: ffff880078417f78 R14: 000000000000000c R15: 00007f2944a4b010
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
This task tried to open the cdrom device, the sr_block_open function
acquired the global sr_mutex lock. The call to check_disk_change()
then saw an event flag indicating a possible media change and tried
to flush any cached data for the device.
As part of the flush, it tried to acquire the super_block->s_umount
lock associated with the cdrom device.
This was the same super_block as created and locked by the previous task.
The first task acquires the s_umount lock and then the sr_mutex_lock;
the second task acquires the sr_mutex_lock and then the s_umount lock.
This patch fixes the issue by moving check_disk_change() out of
cdrom_open() and let the caller take care of it.
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch has been generated as follows:
for verb in set_unlocked clear_unlocked set clear; do
replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
$(git grep -lw queue_flag_${verb} drivers block/bsg*)
done
Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the blk_queue_flag_*() functions instead of open-coding these.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull a xen_blkfront fix from Konrad:
"It has one simple fix for the multi-queue support not showing up after
a block device was detached/re-attached."
* 'stable/for-jens-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen-blkfront: move negotiate_mq to cover all cases of new VBDs
negotiate_mq should happen in all cases of a new VBD being discovered by
xen-blkfront, whether called through _probe() or a hot-attached new VBD
from dom-0 via xenstore. Otherwise, hot-attached new VBDs are left
configured without multi-queue.
Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
On ARM64, the default page size has been 64K on some distributions, and
we should allow ARM64 people to play null_blk.
This patch fixes the issue by extend page bitmap size for supporting
other non-4KB PAGE_SIZE.
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Kyungchan Koh <kkc6196@fb.com>,
Cc: weiping zhang <zhangweiping@didichuxing.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a typo in pkt_start_recovery.
Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Initialize the request queue lock earlier such that the following
race can no longer occur:
blk_init_queue_node() blkcg_print_blkgs()
blk_alloc_queue_node (1)
q->queue_lock = &q->__queue_lock (2)
blkcg_init_queue(q) (3)
spin_lock_irq(blkg->q->queue_lock) (4)
q->queue_lock = lock (5)
spin_unlock_irq(blkg->q->queue_lock) (6)
(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.
The changes in this patch are as follows:
- Move the .queue_lock initialization from blk_init_queue_node() into
blk_alloc_queue_node().
- Only override the .queue_lock pointer for legacy queues because it
is not useful for blk-mq queues to override this pointer.
- For all all block drivers that initialize .queue_lock explicitly,
change the blk_alloc_queue() call in the driver into a
blk_alloc_queue_node() call and remove the explicit .queue_lock
initialization. Additionally, initialize the spin lock that will
be used as queue lock earlier if necessary.
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>