linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-27 13:22:23 +00:00

Author	SHA1	Message	Date
Linus Torvalds	2e959dd87a	for-5.4/post-2019-09-24 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J8xQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpujgD/94s9GGKN8JShxCpT0YNuWyyFF5gNlaimQU RSGAwnv2YUgEGNSUOPpcaj5FAYhTfYzbqoHlE+jytA2U5KXTOhc5Z85QV+TY4HPs I03xczYuYD/uX0QuF00zU2+6eV3lETELPiBARbfEQdHfm72iwurweHzlh4dfhbxW P7UA/cKixXWF2CH9wg5347Ll93nD24f2pi8BUyLJi/xpdlaRrN11Ii8AzNlRmq52 VRxURuogl98W89F6EV2VhPGFgUEYHY2Ot7II2OqqV+jmjHDQW9y5hximzINOqkxs bQwo5J+WrDSPoqwl8+db2k7QQjAl1XKDAHmCwz+7J/BoOgZj8/M1FMBwzita+5x+ UqxEYe7k+2G3w2zuhBrq03BypU8pwqFep/QI0cCCPaHs4J5QnkVOScEqd6iV/C3T FPvMvqDf7MrElghj4Qa2IZlh/CgqmLG5NUEz8E40cXkdiP+E+eK9ZY2Uwx2XhBrm 7Gl+SpG5DxWqqJeRNVWjFwM4p5L+01NtwDbTjZ1rsf+mCW5cNsy/L9B4UpPz4HxW coAs0y/Ce+ZhCopIXZ4jLDBoTG9yoVg8EcyfaHKD2Zz0mUFxa2xm+LvXKeT49qqx xuodpKD3fiuM7h9Xgv+cDsmn8Rr8gSeXEGV7qzpudmkxbp6IVg/yG5hC/dM921GR EVrRtUIwdw== =aAPP -----END PGP SIGNATURE----- Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Some later additions that weren't quite done for the first pull request, and also a few fixes that have arrived since. This contains: - Kill silly pktcdvd warning on attempting to register a non-scsi passthrough device (me) - Use symbolic constants for the block t10 protection types, and switch to handling it in core rather than in the drivers (Max) - libahci platform missing node put fix (Nishka) - Small series of fixes for BFQ (Paolo) - Fix possible nbd crash (Xiubo)" * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block: block: drop device references in bsg_queue_rq() block: t10-pi: fix -Wswitch warning pktcdvd: remove warning on attempting to register non-passthrough dev ata: libahci_platform: Add of_node_put() before loop exit nbd: fix possible page fault for nbd disk nbd: rename the runtime flags as NBD_RT_ prefixed block, bfq: push up injection only after setting service time block, bfq: increase update frequency of inject limit block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 block, bfq: update inject limit only after injection occurred block: centralize PI remapping logic to the block layer block: use symbolic constants for t10_pi type	2019-09-24 16:31:50 -07:00
Martin Wilck	d46fe2cb2d	block: drop device references in bsg_queue_rq() Make sure that bsg_queue_rq() calls put_device() if an error is encountered after get_device() was successful. Fixes: `cd2f076f1d` ("bsg: convert to use blk-mq") Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-23 11:17:24 -06:00
Max Gurtovoy	be21683e48	block: t10-pi: fix -Wswitch warning Changing the switch() statement to symbolic constants made the compiler (at least clang-9, did not check gcc) notice that there is one enum value that is not handled here: block/t10-pi.c:62:11: error: enumeration value 'T10_PI_TYPE0_PROTECTION' not handled in switch [-Werror,-Wswitch] Add a BUG_ON statement if we ever get to t10_pi_verify function with TYPE0 and replace the switch() statement with if/else clause for the valid types. Fixes: 9b2061b1a262 ("block: use symbolic constants for t10_pi type") Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-23 08:05:19 -06:00
Linus Torvalds	671df18953	dma-mapping updates for 5.4: - add dma-mapping and block layer helpers to take care of IOMMU merging for mmc plus subsequent fixups (Yoshihiro Shimoda) - rework handling of the pgprot bits for remapping (me) - take care of the dma direct infrastructure for swiotlb-xen (me) - improve the dma noncoherent remapping infrastructure (me) - better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me) - cleanup mmaping of coherent DMA allocations (me) - various misc cleanups (Andy Shevchenko, me) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2CSucLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYPfrhAAgXZA/EdFPvkkCoDrmgtf3XkudX9gajeCd9g4NZy6 ZBQElTVvm4S0sQj7IXgALnMumDMbbTibW5SQLX5GwQDe+XXBpZ8ajpAnJAXc8a5T qaFQ4SInr4CgBZf9nZKDkbSBZ1Tu3AQm1c0QI8riRCkrVTuX4L06xpCef4Yh4mgO rwWEjIioYpQiKZMmu98riXh3ZNfFG3mVJRhKt8B6XJbBgnUnjDOPYGgaUwp6CU20 tFBKL2GaaV0vdLJ5wYhIGXT4DJ8tp9T5n3IYGZv1Ux889RaZEHlCrMxzelYeDbCT KhZbhcSECGnddsh73t/UX7/KhytuqnfKa9n+Xo6AWuA47xO4c36quOOcTk9M0vE5 TfGDmewgL6WIv4lzokpRn5EkfDhyL33j8eYJrJ8e0ldcOhSQIFk4ciXnf2stWi6O JrlzzzSid+zXxu48iTfoPdnMr7psTpiMvvRvKfEeMp2FX9Fg6EdMzJYLTEl+COHB 0WwNacZmY3P01+b5EZXEgqKEZevIIdmPKbyM9rPtTjz8BjBwkABHTpN3fWbVBf7/ Ax6OPYyW40xp1fnJuzn89m3pdOxn88FpDdOaeLz892Zd+Qpnro1ayulnFspVtqGM mGbzA9whILvXNRpWBSQrvr2IjqMRjbBxX3BVACl3MMpOChgkpp5iANNfSDjCftSF Zu8= =/wGv -----END PGP SIGNATURE----- Merge tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - add dma-mapping and block layer helpers to take care of IOMMU merging for mmc plus subsequent fixups (Yoshihiro Shimoda) - rework handling of the pgprot bits for remapping (me) - take care of the dma direct infrastructure for swiotlb-xen (me) - improve the dma noncoherent remapping infrastructure (me) - better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me) - cleanup mmaping of coherent DMA allocations (me) - various misc cleanups (Andy Shevchenko, me) * tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits) mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE mmc: queue: Fix bigger segments usage arm64: use asm-generic/dma-mapping.h swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page swiotlb-xen: simplify cache maintainance swiotlb-xen: use the same foreign page check everywhere swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable xen: remove the exports for xen_{create,destroy}_contiguous_region xen/arm: remove xen_dma_ops xen/arm: simplify dma_cache_maint xen/arm: use dev_is_dma_coherent xen/arm: consolidate page-coherent.h xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance arm: remove wrappers for the generic dma remap helpers dma-mapping: introduce a dma_common_find_pages helper dma-mapping: always use VM_DMA_COHERENT for generic DMA remap vmalloc: lift the arm flag for coherent mappings to common code dma-mapping: provide a better default ->get_required_mask dma-mapping: remove the dma_declare_coherent_memory export remoteproc: don't allow modular build ...	2019-09-19 13:27:23 -07:00
Paolo Valente	58494c980f	block, bfq: push up injection only after setting service time If equal to 0, the injection limit for a bfq_queue is pushed to 1 after a first sample of the total service time of the I/O requests of the queue is computed (to allow injection to start). Yet, because of a mistake in the branch that performs this action, the push may happen also in some other case. This commit fixes this issue. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Paolo Valente	17c3d26602	block, bfq: increase update frequency of inject limit The update period of the injection limit has been tentatively set to 100 ms, to reduce fluctuations. This value however proved to cause, occasionally, the limit to be decremented for some bfq_queue only after the queue underwent excessive injection for a lot of time. This commit reduces the period to 10 ms. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Paolo Valente	c1e0a18228	block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 Upon an increment attempt of the injection limit, the latter is constrained not to become higher than twice the maximum number max_rq_in_driver of I/O requests that have happened to be in service in the drive. This high bound allows the injection limit to grow beyond max_rq_in_driver, which may then cause max_rq_in_driver itself to grow. However, since the limit is incremented by only one unit at a time, there is no need for such a high bound, and just max_rq_in_driver+1 is enough. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Paolo Valente	23ed570acc	block, bfq: update inject limit only after injection occurred BFQ updates the injection limit of each bfq_queue as a function of how much the limit inflates the service times experienced by the I/O requests of the queue. So only service times affected by injection must be taken into account. Unfortunately, in the current implementation of this update scheme, the service time of an I/O request rq not affected by injection may happen to be considered in the following case: there is no I/O request in service when rq arrives. This commit fixes this issue by making sure that only service times affected by injection are considered for updating the injection limit. In particular, the service time of an I/O request rq is now considered only if at least one of the following two conditions holds: - the destination bfq_queue for rq underwent injection before rq arrival, and there is still I/O in service in the drive on rq arrival (the service of such unfinished I/O may delay the service of rq); - injection occurs between the arrival and the completion time of rq. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Max Gurtovoy	54d4e6ab91	block: centralize PI remapping logic to the block layer Currently t10_pi_prepare/t10_pi_complete functions are called during the NVMe and SCSi layers command preparetion/completion, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used by block storage protocols. Introduce .prepare_fn and .complete_fn callbacks within the integrity profile that each type can implement according to its needs. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY isn't defined in the config. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Max Gurtovoy	5eaed68dd3	block: use symbolic constants for t10_pi type Replace all hard-coded values with T10_PI_TYPES to make the code more readable. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Linus Torvalds	7ad67ca553	for-5.4/block-2019-09-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/ EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV xsKgmTrQQQ== =Qt+h -----END PGP SIGNATURE----- Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...	2019-09-17 16:57:47 -07:00
Linus Torvalds	7f2444d38f	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...	2019-09-17 12:35:15 -07:00
Hou Tao	9a91b05bba	block: also check RQF_STATS in blk_mq_need_time_stamp() In __blk_mq_end_request() if block stats needs update, we should ensure now is valid instead of 0 even when iostat is disabled. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-15 16:02:10 -06:00
Hou Tao	3d24430694	block: make rq sector size accessible for block stats Currently rq->data_len will be decreased by partial completion or zeroed by completion, so when blk_stat_add() is invoked, data_len will be zero and there will never be samples in poll_cb because blk_mq_poll_stats_bkt() will return -1 if data_len is zero. We could move blk_stat_add() back to __blk_mq_complete_request(), but that would make the effort of trying to call ktime_get_ns() once in vain. Instead we can reuse throtl_size field, and use it for both block stats and block throttle, and adjust the logic in blk_mq_poll_stats_bkt() accordingly. Fixes: `4bc6339a58` ("block: move blk_stat_add() to __blk_mq_end_request()") Tested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-15 16:02:08 -06:00
Pavel Begunkov	89f3b6d62f	bfq: Fix bfq linkage error Since commit `795fe54c2a` ("bfq: Add per-device weight"), bfq uses blkg_conf_prep() and blkg_conf_finish(), which are not exported. So, it causes linkage error if bfq compiled as a module. Fixes: `795fe54c2a` ("bfq: Add per-device weight") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-14 11:33:43 -06:00
Ming Lei	0a67b5a926	block: fix race between switching elevator and removing queues `cecf5d87ff` ("block: split .sysfs_lock into two locks") starts to release & actuire sysfs_lock again during switching elevator. So it isn't enough to prevent switching elevator from happening by simply clearing QUEUE_FLAG_REGISTERED with holding sysfs_lock, because in-progress switch still can move on after re-acquiring the lock, meantime the flag of QUEUE_FLAG_REGISTERED won't get checked. Fixes this issue by checking 'q->elevator' directly & locklessly after q->kobj is removed in blk_unregister_queue(), this way is safe because q->elevator can't be changed at that time. Fixes: `cecf5d87ff` ("block: split .sysfs_lock into two locks") Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-12 07:13:22 -06:00
Stanley Chu	8a15b4d7cd	block: bypass blk_set_runtime_active for uninitialized q->dev Some devices may skip blk_pm_runtime_init() and have null pointer in its request_queue->dev. For example, SCSI devices of UFS Well-Known LUNs. Currently the null pointer is checked by the user of blk_set_runtime_active(), i.e., scsi_dev_type_resume(). It is better to check it by blk_set_runtime_active() itself instead of by its users. Signed-off-by: Stanley Chu <stanley.chu@mediatek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-12 07:11:56 -06:00
Tejun Heo	7c1ee704a1	iocost_monitor: Report debt Report debt and rename del_ms row to delay for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-10 12:31:39 -06:00
Tejun Heo	e1518f63f2	blk-iocost: Don't let merges push vtime into the future Merges have the same problem that forced-bios had which is fixed by the previous patch. The cost of a merge is calculated at the time of issue and force-advances vtime into the future. Until global vtime catches up, how the cgroup's hweight changes in the meantime doesn't matter and it often leads to situations where the cost is calculated at one hweight and paid at a very different one. See the previous patch for more details. Fix it by never advancing vtime into the future for merges. If budget is available, vtime is advanced. Otherwise, the cost is charged as debt. This brings merge cost handling in line with issue cost handling in ioc_rqos_throttle(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-10 12:31:39 -06:00
Tejun Heo	36a524814f	blk-iocost: Account force-charged overage in absolute vtime Currently, when a bio needs to be force-charged and there isn't enough budget, vtime is simply pushed into the future. This means that the cost of the whole bio is scaled using the current hweight and then charged immediately. Until the global vtime advances beyond this future vtime, the cgroup won't be allowed to issue normal IOs. This is incorrect and can lead to, for example, exploding vrate or extended stalls if vrate range is constrained. Consider the following scenario. 1. A cgroup with a very low hweight runs out of budget. 2. A storm of swap-out happens on it. All of them are scaled according to the current low hweight and charged to vtime pushing it to a far future. 3. All other cgroups go idle and now the above cgroup has access to the whole device. However, because vtime is already wound using the past low hweight, what its current hweight is doesn't matter until global vtime catches up to the local vtime. 4. As a result, either vrate gets ramped up extremely or the IOs stall while the underlying device is idle. This is because the hweight the overage is calculated at is different from the hweight that it's being paid at. Fix it by remembering the overage in absoulte vtime and continuously paying with the actual budget according to the current hweight at each period. Note that non-forced bios which wait already remembers the cost in absolute vtime. This brings forced-bio accounting in line. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-10 12:31:39 -06:00
Tejun Heo	e036c4caba	blk-iocost: Fix incorrect operation order during iocg free ioc_pd_free() first cancels the hrtimers and then deactivates the iocg. However, the iocg timer can run inbetween and reschedule the hrtimers which will end up running after the iocg is freed leading to crashes like the following. general protection fault: 0000 [#1] SMP ... RIP: 0010:iocg_kick_delay+0xbe/0x1b0 RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046 RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400 RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8 R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0 R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1 FS: 0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> iocg_delay_timer_fn+0x3d/0x60 __hrtimer_run_queues+0xfe/0x270 hrtimer_interrupt+0xf4/0x210 smp_apic_timer_interrupt+0x5e/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> Fix it by canceling hrtimers after deactivating the iocg. Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-10 12:17:04 -06:00
Fam Zheng	795fe54c2a	bfq: Add per-device weight This adds to BFQ the missing per-device weight interfaces: blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The implementation pretty closely resembles what we had in CFQ and the parsing code is basically reused. Tests ===== Using two cgroups and three block devices, having weights setup as: Cgroup test1 test2 ============================================ default 100 500 sda 500 100 sdb default default sdc 200 200 cgroup v1 runs -------------- sda.test1.out: READ: bw=913MiB/s sda.test2.out: READ: bw=183MiB/s sdb.test1.out: READ: bw=213MiB/s sdb.test2.out: READ: bw=1054MiB/s sdc.test1.out: READ: bw=650MiB/s sdc.test2.out: READ: bw=650MiB/s cgroup v2 runs -------------- sda.test1.out: READ: bw=915MiB/s sda.test2.out: READ: bw=184MiB/s sdb.test1.out: READ: bw=216MiB/s sdb.test2.out: READ: bw=1069MiB/s sdc.test1.out: READ: bw=621MiB/s sdc.test2.out: READ: bw=622MiB/s Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-06 14:33:52 -06:00
Fam Zheng	5ff047e328	bfq: Extract bfq_group_set_weight from bfq_io_set_weight_legacy This function will be useful when we update weight from the soon-coming per-device interface. Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com> Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-06 14:33:50 -06:00
Fam Zheng	e9d3c866bf	bfq: Fix the missing barrier in __bfq_entity_update_weight_prio The comment of bfq_group_set_weight says the reading of prio_changed should happen before the reading of weight, but a memory barrier is missing here. Add it now, to match the smp_wmb() there. Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com> Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-06 14:33:48 -06:00
Jens Axboe	a26142559c	block: fix elevator_get_by_features() The lookup logic is broken - 'e' will never be NULL, even if the list is empty. Maintain lookup hit in a separate variable instead. Fixes: `a0958ba7fc` ("block: Improve default elevator selection") Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-06 07:02:31 -06:00
Damien Le Moal	737eb78e82	block: Delay default elevator initialization When elevator_init_mq() is called from blk_mq_init_allocated_queue(), the only information known about the device is the number of hardware queues as the block device scan by the device driver is not completed yet for most drivers. The device type and elevator required features are not set yet, preventing to correctly select the default elevator most suitable for the device. This currently affects all multi-queue zoned block devices which default to the "none" elevator instead of the required "mq-deadline" elevator. These drives currently include host-managed SMR disks connected to a smartpqi HBA and null_blk block devices with zoned mode enabled. Upcoming NVMe Zoned Namespace devices will also be affected. Fix this by adding the boolean elevator_init argument to blk_mq_init_allocated_queue() to control the execution of elevator_init_mq(). Two cases exist: 1) elevator_init = false is used for calls to blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this case, a call to elevator_init_mq() is added to __device_add_disk(), resulting in the delayed initialization of the queue elevator after the device driver finished probing the device information. This effectively allows elevator_init_mq() access to more information about the device. 2) elevator_init = true preserves the current behavior of initializing the elevator directly from blk_mq_init_allocated_queue(). This case is used for the special request based DM devices where the device gendisk is created before the queue initialization and device information (e.g. queue limits) is already known when the queue initialization is executed. Additionally, to make sure that the elevator initialization is never done while requests are in-flight (there should be none when the device driver calls device_add_disk()), freeze and quiesce the device request queue before calling blk_mq_init_sched() in elevator_init_mq(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-05 19:52:34 -06:00
Damien Le Moal	a0958ba7fc	block: Improve default elevator selection For block devices that do not specify required features, preserve the current default elevator selection (mq-deadline for single queue devices, none for multi-queue devices). However, for devices specifying required features (e.g. zoned block devices ELEVATOR_F_ZBD_SEQ_WRITE feature), select the first available elevator providing the required features. In all cases, default to "none" if no elevator is available or if the initialization of the default elevator fails. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-05 19:52:34 -06:00
Damien Le Moal	68c43f133a	block: Introduce elevator features Introduce the definition of elevator features through the elevator_features flags in the elevator_type structure. Each flag can represent a feature supported by an elevator. The first feature defined by this patch is support for zoned block device sequential write constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented by the mq-deadline elevator using zone write locking. Other possible features are IO priorities, write hints, latency targets or single-LUN dual-actuator disks (for which the elevator could maintain one LBA ordered list per actuator). The required_elevator_features field is also added to the request_queue structure to allow a device driver to specify elevator feature flags that an elevator must support for the correct operation of the device (e.g. device drivers for zoned block devices can have the ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature). The helper function blk_queue_required_elevator_features() is defined for setting this new field. With these two new fields in place, the elevator functions elevator_match() and elevator_find() are modified to allow a user to set only an elevator with a set of features that satisfies the device required features. Elevators not matching the device requirements are not shown in the device sysfs queue/scheduler file to prevent their use. The "none" elevator can always be selected as before. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-05 19:52:33 -06:00
Damien Le Moal	954b4a5ce4	block: Change elevator_init_mq() to always succeed If the default elevator chosen is mq-deadline, elevator_init_mq() may return an error if mq-deadline initialization fails, leading to blk_mq_init_allocated_queue() returning an error, which in turn will cause the block device initialization to fail and the device not being exposed. Instead of taking such extreme measure, handle mq-deadline initialization failures in the same manner as when mq-deadline is not available (no module to load), that is, default to the "none" scheduler. With this change, elevator_init_mq() return type can be changed to void. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-05 19:52:33 -06:00
Damien Le Moal	61db437d1c	block: Cleanup elevator_init_mq() use Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before calling elevator_init_mq() to make sure that the queue supports IO scheduling, use the elevator.c function elv_support_iosched() in elevator_init_mq(). This does not introduce any functional change but ensure that elevator_init_mq() does the right thing based on the queue settings. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-05 19:52:33 -06:00
Marcos Paulo de Souza	85c0a037dc	block: elevator.c: Remove now unused elevator= argument Since the inclusion of blk-mq, elevator argument was not being considered anymore, and it's utility died long with the legacy IO path, now removed too. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Fold with doc removal patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-03 08:02:53 -06:00
Damien Le Moal	cb8acabbe3	block: mq-deadline: Fix queue restart handling Commit `7211aef86f` ("block: mq-deadline: Fix write completion handling") added a call to blk_mq_sched_mark_restart_hctx() in dd_dispatch_request() to make sure that write request dispatching does not stall when all target zones are locked. This fix left a subtle race when a write completion happens during a dispatch execution on another CPU: CPU 0: Dispatch CPU1: write completion dd_dispatch_request() lock(&dd->lock); ... lock(&dd->zone_lock); dd_finish_request() rq = find request lock(&dd->zone_lock); unlock(&dd->zone_lock); zone write unlock unlock(&dd->zone_lock); ... __blk_mq_free_request check restart flag (not set) -> queue not run ... if (!rq && have writes) blk_mq_sched_mark_restart_hctx() unlock(&dd->lock) Since the dispatch context finishes after the write request completion handling, marking the queue as needing a restart is not seen from __blk_mq_free_request() and blk_mq_sched_restart() not executed leading to the dispatch stall under 100% write workloads. Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from dd_dispatch_request() into dd_finish_request() under the zone lock to ensure full mutual exclusion between write request dispatch selection and zone unlock on write request completion. Fixes: `7211aef86f` ("block: mq-deadline: Fix write completion handling") Cc: stable@vger.kernel.org Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-03 07:59:51 -06:00
Yoshihiro Shimoda	45147fb522	block: add a helper function to merge the segments This patch adds a helper function whether a queue can merge the segments by the DMA MAP layer (e.g. via IOMMU). Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Simon Horman <horms+renesas@verge.net.au Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-09-03 08:32:50 +02:00
Tejun Heo	e916ad29d9	blkcg: add missing NULL check in ioc_cpd_alloc() ioc_cpd_alloc() forgot to check NULL return from kzalloc(). Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-30 07:16:19 -06:00
Tejun Heo	3532e72272	blkcg: fix missing free on error path of blk_iocost_init() blk_iocost_init() forgot to free its percpu stat on the error path. Fix it. Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Reported-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-29 09:59:14 -06:00
Tejun Heo	8504dea783	blkcg: add tools/cgroup/iocost_coef_gen.py Add a script which can be used to generate device-specific iocost linear model coefficients. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:17 -06:00
Tejun Heo	6954ff185e	blkcg: add tools/cgroup/iocost_monitor.py Instead of mucking with debugfs and ->pd_stat(), add drgn based monitoring script. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:14 -06:00
Tejun Heo	7caa47151a	blkcg: implement blk-iocost This patchset implements IO cost model based work-conserving proportional controller. While io.latency provides the capability to comprehensively prioritize and protect IOs depending on the cgroups, its protection is binary - the lowest latency target cgroup which is suffering is protected at the cost of all others. In many use cases including stacking multiple workload containers in a single system, it's necessary to distribute IO capacity with better granularity. One challenge of controlling IO resources is the lack of trivially observable cost metric. The most common metrics - bandwidth and iops - can be off by orders of magnitude depending on the device type and IO pattern. However, the cost isn't a complete mystery. Given several key attributes, we can make fairly reliable predictions on how expensive a given stream of IOs would be, at least compared to other IO patterns. The function which determines the cost of a given IO is the IO cost model for the device. This controller distributes IO capacity based on the costs estimated by such model. The more accurate the cost model the better but the controller adapts based on IO completion latency and as long as the relative costs across differents IO patterns are consistent and sensible, it'll adapt to the actual performance of the device. Currently, the only implemented cost model is a simple linear one with a few sets of default parameters for different classes of device. This covers most common devices reasonably well. All the infrastructure to tune and add different cost models is already in place and a later patch will also allow using bpf progs for cost models. Please see the top comment in blk-iocost.c and documentation for more details. v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix for a divide-by-zero bug in current_hweight() triggered by zero inuse_sum. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:12 -06:00
Tejun Heo	6f816b4b74	blk-mq: add optional request->alloc_time_ns There are currently two start time timestamps - start_time_ns and io_start_time_ns. The former marks the request allocation and and the second issue-to-device time. The planned io.weight controller needs to measure the total time bios take to execute after it leaves rq_qos including the time spent waiting for request to become available, which can easily dominate on saturated devices. This patch adds request->alloc_time_ns which records when the request allocation attempt started. As it isn't used for the usual stats, make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are no users and it's active only on queues which need it even when compiled in. v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME gating as suggested by Jens. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:10 -06:00
Tejun Heo	beab17fc2a	blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/ io.weight is gonna be another rq_qos cgroup mechanism. Let's rename RQ_QOS_CGROUP which is being used by io.latency to RQ_QOS_LATENCY in preparation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:08 -06:00
Tejun Heo	9677a3e01f	block/rq_qos: implement rq_qos_ops->queue_depth_changed() wbt already gets queue depth changed notification through wbt_set_queue_depth(). Generalize it into rq_qos_ops->queue_depth_changed() so that other rq_qos policies can easily hook into the events too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:07 -06:00
Tejun Heo	d3e65ffff6	block/rq_qos: add rq_qos_merge() Add a merge hook for rq_qos. This will be used by io.weight. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:05 -06:00
Tejun Heo	015d254cb0	blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() Separate out blkcg_conf_get_disk() so that it can be used by blkcg policy interface file input parsers before the policy is actually enabled. This doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:04 -06:00
Tejun Heo	86a5bba5c2	blkcg: make ->cpd_init_fn() optional For policies which can do enough initialization from ->cpd_alloc_fn(), make ->cpd_init_fn() optional. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:03 -06:00
Tejun Heo	cf09a8ee19	blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn() Instead of @node, pass in @q and @blkcg so that the alloc function has more context. This doesn't cause any behavior change and will be used by io.weight implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-28 21:17:01 -06:00
Ming Lei	cecf5d87ff	block: split .sysfs_lock into two locks The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store path. Meantime, inside block's .show/.store callback, q->sysfs_lock is required. However, when mq & iosched kobjects are removed via blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held too. This way causes AB-BA lock because the kernfs built-in lock of 'kn-count' is required inside kobject_del() too, see the lockdep warning[1]. On the other hand, it isn't necessary to acquire q->sysfs_lock for both blk_mq_unregister_dev() & elv_unregister_queue() because clearing REGISTERED flag prevents storing to 'queue/scheduler' from being happened. Also sysfs write(store) is exclusive, so no necessary to hold the lock for elv_unregister_queue() when it is called in switching elevator path. So split .sysfs_lock into two: one is still named as .sysfs_lock for covering sync .store, the other one is named as .sysfs_dir_lock for covering kobjects and related status change. sysfs itself can handle the race between add/remove kobjects and showing/storing attributes under kobjects. For switching scheduler via storing to 'queue/scheduler', we use the queue flag of QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then we can avoid to hold .sysfs_lock during removing/adding kobjects. [1] lockdep warning ====================================================== WARNING: possible circular locking dependency detected 5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted ------------------------------------------------------ rmmod/777 is trying to acquire lock: 00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72 but task is already holding lock: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&q->sysfs_lock){+.+.}: __lock_acquire+0x95f/0xa2f lock_acquire+0x1b4/0x1e8 __mutex_lock+0x14a/0xa9b blk_mq_hw_sysfs_show+0x63/0xb6 sysfs_kf_seq_show+0x11f/0x196 seq_read+0x2cd/0x5f2 vfs_read+0xc7/0x18c ksys_read+0xc4/0x13e do_syscall_64+0xa7/0x295 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (kn->count#202){++++}: check_prev_add+0x5d2/0xc45 validate_chain+0xed3/0xf94 __lock_acquire+0x95f/0xa2f lock_acquire+0x1b4/0x1e8 __kernfs_remove+0x237/0x40b kernfs_remove_by_name_ns+0x59/0x72 remove_files+0x61/0x96 sysfs_remove_group+0x81/0xa4 sysfs_remove_groups+0x3b/0x44 kobject_del+0x44/0x94 blk_mq_unregister_dev+0x83/0xdd blk_unregister_queue+0xa0/0x10b del_gendisk+0x259/0x3fa null_del_dev+0x8b/0x1c3 [null_blk] null_exit+0x5c/0x95 [null_blk] __se_sys_delete_module+0x204/0x337 do_syscall_64+0xa7/0x295 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->sysfs_lock); lock(kn->count#202); lock(&q->sysfs_lock); lock(kn->count#202); * DEADLOCK * 2 locks held by rmmod/777: #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk] #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b stack backtrace: CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4 Call Trace: dump_stack+0x9a/0xe6 check_noncircular+0x207/0x251 ? print_circular_bug+0x32a/0x32a ? find_usage_backwards+0x84/0xb0 check_prev_add+0x5d2/0xc45 validate_chain+0xed3/0xf94 ? check_prev_add+0xc45/0xc45 ? mark_lock+0x11b/0x804 ? check_usage_forwards+0x1ca/0x1ca __lock_acquire+0x95f/0xa2f lock_acquire+0x1b4/0x1e8 ? kernfs_remove_by_name_ns+0x59/0x72 __kernfs_remove+0x237/0x40b ? kernfs_remove_by_name_ns+0x59/0x72 ? kernfs_next_descendant_post+0x7d/0x7d ? strlen+0x10/0x23 ? strcmp+0x22/0x44 kernfs_remove_by_name_ns+0x59/0x72 remove_files+0x61/0x96 sysfs_remove_group+0x81/0xa4 sysfs_remove_groups+0x3b/0x44 kobject_del+0x44/0x94 blk_mq_unregister_dev+0x83/0xdd blk_unregister_queue+0xa0/0x10b del_gendisk+0x259/0x3fa ? disk_events_poll_msecs_store+0x12b/0x12b ? check_flags+0x1ea/0x204 ? mark_held_locks+0x1f/0x7a null_del_dev+0x8b/0x1c3 [null_blk] null_exit+0x5c/0x95 [null_blk] __se_sys_delete_module+0x204/0x337 ? free_module+0x39f/0x39f ? blkcg_maybe_throttle_current+0x8a/0x718 ? rwlock_bug+0x62/0x62 ? __blkcg_punt_bio_submit+0xd0/0xd0 ? trace_hardirqs_on_thunk+0x1a/0x20 ? mark_held_locks+0x1f/0x7a ? do_syscall_64+0x4c/0x295 do_syscall_64+0xa7/0x295 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fb696cdbe6b Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008 RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828 RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000 R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0 R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0 Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-27 10:40:20 -06:00
Ming Lei	58c898ba37	block: add helper for checking if queue is registered There are 4 users which check if queue is registered, so add one helper to check it. Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-27 10:40:20 -06:00
Ming Lei	c6ba933358	blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues(). For the former caller, the kobject isn't exposed to userspace yet. For the latter caller, hctx sysfs entries and debugfs are un-registered before updating nr_hw_queues. On the other hand, commit `2f8f1336a4` ("blk-mq: always free hctx after request queue is freed") moves freeing hctx into queue's release handler, so there won't be race with queue release path too. So don't hold q->sysfs_lock in blk_mq_map_swqueue(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-27 10:40:20 -06:00
Ming Lei	c48dac137a	block: don't hold q->sysfs_lock in elevator_init_mq The original comment says: q->sysfs_lock must be held to provide mutual exclusion between elevator_switch() and here. Which is simply wrong. elevator_init_mq() is only called from blk_mq_init_allocated_queue, which is always called before the request queue is registered via blk_register_queue(), for dm-rq or normal rq based driver. However, queue's kobject is only exposed and added to sysfs in blk_register_queue(). So there isn't such race between elevator_switch() and elevator_init_mq(). So avoid to hold q->sysfs_lock in elevator_init_mq(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-27 10:40:20 -06:00
Bart Van Assche	9685b22702	block: Remove blk_mq_register_dev() This function has no callers. Hence remove it. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-27 10:40:19 -06:00
Christoph Hellwig	d1916c86cc	block: move same page handling from __bio_add_pc_page to the callers Hiding page refcount manipulation inside a low-level bio helper is somewhat awkward. Instead return the same page information to the callers, where it fits in much better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-22 07:14:39 -06:00
Christoph Hellwig	384209cd5b	block: create a bio_try_merge_pc_page helper Passsthrough bio handling should be the same as normal bio handling, except that we need to take hardware limitations into account. Thus use the common try_merge implementation after checking the hardware limits. This changes behavior in that we now also check segment and dma boundary settings for same page merges, which is a little more work but has no effect as those need to be larger than the page size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-22 07:14:38 -06:00
Christoph Hellwig	320ea869a1	block: improve the gap check in __bio_add_pc_page If we can add more data into an existing segment we do not create a gap per definition, so move the check for a gap after the attempt to merge into the segment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-22 07:14:36 -06:00
Revanth Rajashekar	238bdcdf5d	block: sed-opal: Removed duplicate OPAL_METHOD_LENGTH definition The original commit adding the sed-opal library by mistake added two definitions of OPAL_METHOD_LENGTH, remove one of them. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-20 09:34:49 -06:00
Revanth Rajashekar	89c6cc2cab	block: sed-opal: Remove always false conditional statement In the function 'response_parse', num_entries will never be 0 as slen is checked for 0. Hence, the condition 'if (num_entries == 0)' can never be true. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-20 09:33:21 -06:00
Revanth Rajashekar	5cc23ed75b	block: sed-opal: Add/remove spaces Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-20 09:33:20 -06:00
Junxiao Bi	988721db93	block: remove struct request_queue queue_head The dispatch list is not used any more, as the legacy block IO stack has been removed. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-19 08:55:10 -06:00
Jens Axboe	7b6620d7db	block: remove REQ_NOWAIT_INLINE We had a few issues with this code, and there's still a problem around how we deal with error handling for chained/split bios. For now, just revert the code and we'll try again with a thoroug solution. This reverts commits: `e15c2ffa10` ("block: fix O_DIRECT error handling for bio fragments") `0eb6ddfb86` ("block: Fix __blkdev_direct_IO() for bio fragments") `6a43074e2f` ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO") `893a1c9720` ("blk-mq: allow REQ_NOWAIT to return an error inline") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-15 11:09:16 -06:00
Johannes Weiner	b8e24a9300	block: annotate refault stalls from IO submission psi tracks the time tasks wait for refaulting pages to become uptodate, but it does not track the time spent submitting the IO. The submission part can be significant if backing storage is contended or when cgroup throttling (io.latency) is in effect - a lot of time is spent in submit_bio(). In that case, we underreport memory pressure. Annotate submit_bio() to account submission time as memory stall when the bio is reading userspace workingset pages. Tested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-14 08:50:01 -06:00
zhengbin	73d9c8d4c0	blk-mq: Fix memory leak in blk_mq_init_allocated_queue error handling If blk_mq_init_allocated_queue->elevator_init_mq fails, need to release the previously requested resources. Fixes: `d348499138` ("blk-mq-sched: allow setting of default IO scheduler") Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-12 08:15:36 -06:00
zhengbin	e26cc08265	blk-mq: move cancel of requeue_work to the front of blk_exit_queue blk_exit_queue will free elevator_data, while blk_mq_requeue_work will access it. Move cancel of requeue_work to the front of blk_exit_queue to avoid use-after-free. blk_exit_queue blk_mq_requeue_work __elevator_exit blk_mq_run_hw_queues blk_mq_exit_sched blk_mq_run_hw_queue dd_exit_queue blk_mq_hctx_has_pending kfree(elevator_data) blk_mq_sched_has_work dd_has_work Fixes: `fbc2a15e34` ("blk-mq: move cancel of requeue_work into blk_mq_release") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-12 08:14:11 -06:00
Paolo Valente	fd03177c33	block, bfq: handle NULL return value by bfq_init_rq() As reported in [1], the call bfq_init_rq(rq) may return NULL in case of OOM (in particular, if rq->elv.icq is NULL because memory allocation failed in failed in ioc_create_icq()). This commit handles this circumstance. [1] https://lkml.org/lkml/2019/7/22/824 Cc: Hsin-Yi Wang <hsinyi@google.com> Cc: Nicolas Boichat <drinkcat@chromium.org> Cc: Doug Anderson <dianders@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Hsin-Yi Wang <hsinyi@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-08 07:31:50 -06:00
Paolo Valente	3f758e844a	block, bfq: move update of waker and woken list to queue freeing Since commit `13a857a4c4` ("block, bfq: detect wakers and unconditionally inject their I/O"), every bfq_queue has a pointer to a waker bfq_queue and a list of the bfq_queues it may wake. In this respect, when a bfq_queue, say Q, remains with no I/O source attached to it, Q cannot be woken by any other bfq_queue, and cannot wake any other bfq_queue. Then Q must be removed from the woken list of its possible waker bfq_queue, and all bfq_queues in the woken list of Q must stop having a waker bfq_queue. Q remains with no I/O source in two cases: when the last process associated with Q exits or when such a process gets associated with a different bfq_queue. Unfortunately, commit `13a857a4c4` ("block, bfq: detect wakers and unconditionally inject their I/O") performed the above updates only in the first case. This commit fixes this bug by moving these updates to when Q gets freed. This is a simple and safe way to handle all cases, as both the above events, process exit and re-association, lead to Q being freed soon, and because dangling references would come out only after Q gets freed (if no update were performed). Fixes: `13a857a4c4` ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-08 07:30:52 -06:00
Paolo Valente	08d383a749	block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed Since commit `13a857a4c4` ("block, bfq: detect wakers and unconditionally inject their I/O"), BFQ stores, in a per-device pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O request completed. If some bfq_queue receives new I/O right after the last request of last_completed_rq_bfqq has been completed, then last_completed_rq_bfqq may be a waker bfq_queue. But if the bfq_queue last_completed_rq_bfqq points to is freed, then last_completed_rq_bfqq becomes a dangling reference. This commit resets last_completed_rq_bfqq if the pointed bfq_queue is freed. Fixes: `13a857a4c4` ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-08 07:30:50 -06:00
Hans Holmberg	00ec4f3039	block: stop exporting bio_map_kern Now that there no module users left of bio_map_kern, stop exporting the symbol. Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hans Holmberg <hans@owltronix.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-06 08:20:10 -06:00
Ming Lei	556f36e90d	blk-mq: balance mapping between present CPUs and queues Spread queues among present CPUs first, then building mapping on other non-present CPUs. So we can minimize count of dead queues which are mapped by un-present CPUs only. Then bad IO performance can be avoided by unbalanced mapping between present CPUs and queues. The similar policy has been applied on Managed IRQ affinity. Cc: Yi Zhang <yi.zhang@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:43:12 -06:00
Chaitanya Kulkarni	6e33dbf280	blk-zoned: implement REQ_OP_ZONE_RESET_ALL This implements REQ_OP_ZONE_RESET_ALL as a special case of the block device zone reset operations where we just simply issue bio with the newly introduced req op. We issue this req op when the number of sectors is equal to the device's partition's number of sectors and device has no partitions. We also add support so that blk_op_str() can print the new reset-all zone operation. This patch also adds a generic make request check for newly introduced REQ_OP_ZONE_RESET_ALL req_opf. We simply return error when queue is zoned and reset-all flag is not set for REQ_OP_ZONE_RESET_ALL. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Bart Van Assche	67ed8b7386	block: Fix a comment in blk_cleanup_queue() Change a reference to the legacy block layer into a reference to blk-mq. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: James Smart <james.smart@broadcom.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Bart Van Assche	9cc5169cd4	block: Improve physical block alignment of split bios Consider the following example: * The logical block size is 4 KB. * The physical block size is 8 KB. * max_sectors equals (16 KB >> 9) sectors. * A non-aligned 4 KB and an aligned 64 KB bio are merged into a single non-aligned 68 KB bio. The current behavior is to split such a bio into (16 KB + 16 KB + 16 KB + 16 KB + 4 KB). The start of none of these five bio's is aligned to a physical block boundary. This patch ensures that such a bio is split into four aligned and one non-aligned bio instead of being split into five non-aligned bios. This improves performance because most block devices can handle aligned requests faster than non-aligned requests. Since the physical block size is larger than or equal to the logical block size, this patch preserves the guarantee that the returned value is a multiple of the logical block size. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Bart Van Assche	708b25b344	block: Simplify blk_bio_segment_split() Move the max_sectors check into bvec_split_segs() such that a single call to that function can do all the necessary checks. This patch optimizes the fast path further, namely if a bvec fits in a page. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Bart Van Assche	ff9811b3cf	block: Simplify bvec_split_segs() Simplify this function by by removing two if-tests. Other than requiring that the @sectors pointer is not NULL, this patch does not change the behavior of bvec_split_segs(). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Bart Van Assche	dad7758459	block: Document the bio splitting functions Since what the bio splitting functions do is nontrivial, document these functions. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Bart Van Assche	af2c68fe94	block: Declare several function pointer arguments 'const' Make it clear to the compiler and also to humans that the functions that query request queue properties do not modify any member of the request_queue data structure. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Ming Lei	a87ccce0b5	blk-mq: remove blk_mq_complete_request_sync blk_mq_tagset_wait_completed_request() has been applied for waiting for completed request's fn, so not necessary to use blk_mq_complete_request_sync() any more. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Ming Lei	f9934a80f9	blk-mq: introduce blk_mq_tagset_wait_completed_request() blk-mq may schedule to call queue's complete function on remote CPU via IPI, but doesn't provide any way to synchronize the request's complete fn. The current queue freeze interface can't provide the synchonization because aborted requests stay at blk-mq queues during EH. In some driver's EH(such as NVMe), hardware queue's resource may be freed & re-allocated. If the completed request's complete fn is run finally after the hardware queue's resource is released, kernel crash will be triggered. Prepare for fixing this kind of issue by introducing blk_mq_tagset_wait_completed_request(). Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Ming Lei	aa306ab703	blk-mq: introduce blk_mq_request_completed() NVMe needs this function to decide if one request to be aborted has been completed in normal IO path already. So introduce it. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Thomas Gleixner	9dd8813ed9	hrtimer/treewide: Use hrtimer_sleeper_start_expires() hrtimer_sleepers will gain a scheduling class dependent treatment on PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make that possible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2019-08-01 17:43:16 +02:00
Sebastian Andrzej Siewior	dbc1625fc9	hrtimer: Consolidate hrtimer_init() + hrtimer_init_sleeper() calls hrtimer_init_sleeper() calls require prior initialisation of the hrtimer object which is embedded into the hrtimer_sleeper. Combine the initialization and spare a function call. Fixup all call sites. This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper specific initializations of the embedded hrtimer without modifying any of the call sites. No functional change. [ anna-maria: Minor cleanups ] [ tglx: Adopted to the removal of the task argument of hrtimer_init_sleeper() and trivial polishing. Folded a fix from Stephen Rothwell for the vsoc code ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de	2019-08-01 17:43:15 +02:00
Thomas Gleixner	b744948725	hrtimer: Remove task argument from hrtimer_init_sleeper() All callers hand in 'current' and that's the only task pointer which actually makes sense. Remove the task argument and set current in the function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185752.791885290@linutronix.de	2019-07-30 23:57:51 +02:00
Linus Torvalds	5168afe6ef	for-linus-20190726-2 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07oAsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiDOD/wLARyG0xGhavA7WQFjtCSyASJdfH5wz/zE lCpYWTEuD5fZlrVLg3wyKdJKq2g+nqtHxo8I1z7CvdJmky9hRlHqWmJqCVzS3hzq dkskY8ZlXDzDxvVOs+3QgNrhZrZF+ScnmcSsO2+UHmmdlXZiyWRA5pgzgI9AA77o HbF4izj86jhlKnuKgtNB9RoNbUEnmxBCksi+4+lU1zEM416/l0aXzBO+l3bYW0mn td8/oruXLB3v49dMqo/Xy10/+6PYsvVs+6gvT2sCCr6pMyM5XwWyZzEdRB448Nem 1f+trlxCmLo3f81YsQOkMYD0dRnmNHGTk2BOazkJd7ZytKpjSUMgIkk9ty7R27kR Ct1cfxqfp6IyEwIwVmGpo7HW486wytmuq0WZGsqc2G0Cg23QIRE4/HVQypMUemgk RGCx5CBJLGZHqrHMzTGhU31hY5XPcd5dBd9W/UdloFP3ta3jkd2sFqzzevqtQhfV Mbva3YJCujQObWFJd7+L+LRrW1mLFecnKJZYKetOvDQ48gAfy7OQePZRTmcM0YhH VGbj8dRnXLmF1b+4KBnPni7xBLKN14zdvm7jnFViGjCG9CESHm3Gv2/4ZHqaj9kq gK7Ze9cSCXwk2R235m0/DucCUiFOLcvpcFEgtW+h+0jio7NEN/3NIZ00gepKyuhp S7GZhmFfRw== =iI4A -----END PGP SIGNATURE----- Merge tag 'for-linus-20190726-2' of git://git.kernel.dk/linux-block Pull block DMA segment fix from Jens Axboe: "Here's the virtual boundary segment size fix" * tag 'for-linus-20190726-2' of git://git.kernel.dk/linux-block: block: fix max segment size handling in blk_queue_virt_boundary	2019-07-26 19:20:34 -07:00
Christoph Hellwig	c6c84f78e2	block: fix max segment size handling in blk_queue_virt_boundary We should only set the max segment size to unlimited if we actually have a virt boundary. Otherwise we accidentally clear that limit when called from the SCSI midlayer, which always calls blk_queue_virt_boundary, even if that mask is 0. Fixes: `7ad388d8e4` ("scsi: core: add a host / host template field for the virt boundary") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-26 12:50:57 -06:00
Linus Torvalds	0441281965	for-linus-20190726 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z 0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB 0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4 1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn /Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3 oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071 BphJd2RamQ== =HIzH -----END PGP SIGNATURE----- Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...	2019-07-26 10:32:12 -07:00
Marcos Paulo de Souza	327fe1d42b	block: blk-mq: Remove blk_mq_sched_started_request and started_request blk_mq_sched_completed_request is a function that checks if the elevator related to the request has started_request implemented, but currently, none of the available IO schedulers implement started_request, so remove both. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-23 07:25:09 -06:00
Jens Axboe	893a1c9720	blk-mq: allow REQ_NOWAIT to return an error inline By default, if a caller sets REQ_NOWAIT and we need to block, we'll return -EAGAIN through the bio->bi_end_io() callback. For some use cases, this makes it hard to use. Allow a caller to ask for inline return of errors related to blocking by also setting REQ_NOWAIT_INLINE. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-21 21:46:23 -06:00
Josef Bacik	ac38297f70	rq-qos: use a mb for got_token Oleg noticed that our checking of data.got_token is unsafe in the cleanup case, and should really use a memory barrier. Use a wmb on the write side, and a rmb() on the read side. We don't need one in the main loop since we're saved by set_current_state(). Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-18 10:20:14 -06:00
Josef Bacik	d14a9b389a	rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule In case we get a spurious wakeup we need to make sure to re-set ourselves to TASK_UNINTERRUPTIBLE so we don't busy wait. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-18 10:20:13 -06:00
Josef Bacik	64e7ea875e	rq-qos: don't reset has_sleepers on spurious wakeups If we raced with somebody else getting an inflight counter we could fail to get an inflight counter with no sleepers on the list, and thus need to go to sleep. In this case has_sleepers should be true because we are now relying on the waker to get our inflight counter for us. And in the case of spurious wakeups we'd still want this to be the case. So set has_sleepers to true if we went to sleep to make sure we're woken up the proper way. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-18 10:20:13 -06:00
Josef Bacik	545fbd0775	rq-qos: fix missed wake-ups in rq_qos_throttle We saw a hang in production with WBT where there was only one waiter in the throttle path and no outstanding IO. This is because of the has_sleepers optimization that is used to make sure we don't steal an inflight counter for new submitters when there are people already on the list. We can race with our check to see if the waitqueue has any waiters (this is done locklessly) and the time we actually add ourselves to the waitqueue. If this happens we'll go to sleep and never be woken up because nobody is doing IO to wake us up. Fix this by checking if the waitqueue has a single sleeper on the list after we add ourselves, that way we have an uptodate view of the list. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-18 10:20:13 -06:00
Paolo Valente	b5e02b484d	block, bfq: check also in-flight I/O in dispatch plugging Consider a sync bfq_queue Q that remains empty while in service, and suppose that, when this happens, there is a fair amount of already in-flight I/O not belonging to Q. In such a situation, I/O dispatching may need to be plugged (until new I/O arrives for Q), for the following reason. The drive may decide to serve in-flight non-Q's I/O requests before Q's ones, thereby delaying the arrival of new I/O requests for Q (recall that Q is sync). If I/O-dispatching is not plugged, then, while Q remains empty, a basically uncontrolled amount of I/O from other queues may be dispatched too, possibly causing the service of Q's I/O to be delayed even longer in the drive. This problem gets more and more serious as the speed and the queue depth of the drive grow, because, as these two quantities grow, the probability to find no queue busy but many requests in flight grows too. If Q has the same weight and priority as the other queues, then the above delay is unlikely to cause any issue, because all queues tend to undergo the same treatment. So, since not plugging I/O dispatching is convenient for throughput, it is better not to plug. Things change in case Q has a higher weight or priority than some other queue, because Q's service guarantees may simply be violated. For this reason, commit `1de0c4cd9e` ("block, bfq: reduce idling only in symmetric scenarios") does plug I/O in such an asymmetric scenario. Plugging minimizes the delay induced by already in-flight I/O, and enables Q to recover the bandwidth it may lose because of this delay. Yet the above commit does not cover the case of weight-raised queues, for efficiency concerns. For weight-raised queues, I/O-dispatch plugging is activated simply if not all bfq_queues are weight-raised. But this check does not handle the case of in-flight requests, because a bfq_queue may become non busy before all its in-flight requests are completed. This commit performs I/O-dispatch plugging for weight-raised queues if there are some in-flight requests. As a practical example of the resulting recover of control, under write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5 seconds after this fix, against 15 seconds before the fix (as a reference, gnome-terminal takes about 35 seconds to start with any of the other I/O schedulers). Fixes: `1de0c4cd9e` ("block, bfq: reduce idling only in symmetric scenarios") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-18 07:22:15 -06:00
Linus Torvalds	c309b6f242	docs conversion for v5.3-rc1 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k 4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2 7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg 0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9 Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF 7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4 nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j 8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc= =smxY -----END PGP SIGNATURE----- Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull rst conversion of docs from Mauro Carvalho Chehab: "As agreed with Jon, I'm sending this big series directly to you, c/c him, as this series required a special care, in order to avoid conflicts with other trees" * tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits) docs: kbuild: fix build with pdf and fix some minor issues docs: block: fix pdf output docs: arm: fix a breakage with pdf output docs: don't use nested tables docs: gpio: add sysfs interface to the admin-guide docs: locking: add it to the main index docs: add some directories to the main documentation index docs: add SPDX tags to new index files docs: add a memory-devices subdir to driver-api docs: phy: place documentation under driver-api docs: serial: move it to the driver-api docs: driver-api: add remaining converted dirs to it docs: driver-api: add xilinx driver API documentation docs: driver-api: add a series of orphaned documents docs: admin-guide: add a series of orphaned documents docs: cgroup-v1: add it to the admin-guide book docs: aoe: add it to the driver-api book docs: add some documentation dirs to the driver-api book docs: driver-model: move it to the driver-api book docs: lp855x-driver.rst: add it to the driver-api book ...	2019-07-16 12:21:41 -07:00
Akinobu Mita	1624b0b200	block: fix sysfs module parameters directory path in comment The runtime configurable module parameter files are located under /sys/module/MODULENAME/parameters, not /sys/module/MODULENAME. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-16 10:16:33 -06:00
Tejun Heo	07b0fdecb2	blkcg: allow blkcg_policy->pd_stat() to print non-debug info too Currently, ->pd_stat() is called only when moduleparam blkcg_debug_stats is set which prevents it from printing non-debug policy-specific statistics. Let's move debug testing down so that ->pd_stat() can print non-debug stat too. This patch doesn't cause any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-16 10:06:39 -06:00
Mauro Carvalho Chehab	4f4cfa6c56	docs: admin-guide: add a series of orphaned documents There are lots of documents that belong to the admin-guide but are on random places (most under Documentation root dir). Move them to the admin guide. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>	2019-07-15 11:03:02 -03:00
Mauro Carvalho Chehab	da82c92f11	docs: cgroup-v1: add it to the admin-guide book Those files belong to the admin guide, so add them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	2019-07-15 11:03:02 -03:00
Mauro Carvalho Chehab	898bd37a92	docs: block: convert to ReST Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	2019-07-15 09:20:27 -03:00
Damien Le Moal	26202928fa	block: Limit zone array allocation size Limit the size of the struct blk_zone array used in blk_revalidate_disk_zones() to avoid memory allocation failures leading to disk revalidation failure. Also further reduce the likelyhood of such failures by using kvcalloc() (that is vmalloc()) instead of allocating contiguous pages with alloc_pages(). Fixes: `515ce60613` ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: `e76239a374` ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-11 20:04:40 -06:00
Damien Le Moal	bd976e5272	block: Kill gfp_t argument of blkdev_report_zones() Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In preparation of using vmalloc() for large report buffer and zone array allocations used by this function, remove its "gfp_t gfp_mask" argument and rely on the caller context to use memalloc_noio_save/restore() where necessary (block layer zone revalidation and dm-zoned I/O error path). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-11 20:04:37 -06:00
Damien Le Moal	b4c5875d36	block: Allow mapping of vmalloc-ed buffers To allow the SCSI subsystem scsi_execute_req() function to issue requests using large buffers that are better allocated with vmalloc() rather than kmalloc(), modify bio_map_kern() to allow passing a buffer allocated with vmalloc(). To do so, detect vmalloc-ed buffers using is_vmalloc_addr(). For vmalloc-ed buffers, flush the buffer using flush_kernel_vmap_range(), use vmalloc_to_page() instead of virt_to_page() to obtain the pages of the buffer, and invalidate the buffer addresses with invalidate_kernel_vmap_range() on completion of read BIOs. This last point is executed using the function bio_invalidate_vmalloc_pages() which is defined only if the architecture defines ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the architecture actually needs the invalidation done. Fixes: `515ce60613` ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: `e76239a374` ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-11 20:04:36 -06:00
Wenwen Wang	e7bf90e5af	block/bio-integrity: fix a memory leak bug In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to hold integrity metadata. Later on, the buffer will be attached to the bio structure through bio_integrity_add_page(), which returns the number of bytes of integrity metadata attached. Due to unexpected situations, bio_integrity_add_page() may return 0. As a result, bio_integrity_prep() needs to be terminated with 'false' returned to indicate this error. However, the allocated kernel buffer is not freed on this execution path, leading to a memory leak. To fix this issue, free the allocated buffer before returning from bio_integrity_prep(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-11 20:01:21 -06:00
Damien Le Moal	b49773e7bc	block: Disable write plugging for zoned block devices Simultaneously writing to a sequential zone of a zoned block device from multiple contexts requires mutual exclusion for BIO issuing to ensure that writes happen sequentially. However, even for a well behaved user correctly implementing such synchronization, BIO plugging may interfere and result in BIOs from the different contextx to be reordered if plugging is done outside of the mutual exclusion section, e.g. the plug was started by a function higher in the call chain than the function issuing BIOs. Context A Context B \| blk_start_plug() \| ... \| seq_write_zone() \| mutex_lock(zone) \| bio-0->bi_iter.bi_sector = zone->wp \| zone->wp += bio_sectors(bio-0) \| submit_bio(bio-0) \| bio-1->bi_iter.bi_sector = zone->wp \| zone->wp += bio_sectors(bio-1) \| submit_bio(bio-1) \| mutex_unlock(zone) \| return \| -----------------------> \| seq_write_zone() \| mutex_lock(zone) \| bio-2->bi_iter.bi_sector = zone->wp \| zone->wp += bio_sectors(bio-2) \| submit_bio(bio-2) \| mutex_unlock(zone) \| <------------------------- \| \| blk_finish_plug() In the above example, despite the mutex synchronization ensuring the correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being issued after BIO 2 of context B, when the plug is released with blk_finish_plug(). While this problem can be addressed using the blk_flush_plug_list() function (in the above example, the call must be inserted before the zone mutex lock is released), a simple generic solution in the block layer avoid this additional code in all zoned block device user code. The simple generic solution implemented with this patch is to introduce the internal helper function blk_mq_plug() to access the current context plug on BIO submission. This helper returns the current plug only if the target device is not a zoned block device or if the BIO to be plugged is not a write operation. Otherwise, the caller context plug is ignored and NULL returned, resulting is all writes to zoned block device to never be plugged. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 14:18:01 -06:00
Konstantin Khlebnikov	3a10f999ff	blk-throttle: fix zero wait time for iops throttled group After commit `991f61fe7e` ("Blk-throttle: reduce tail io latency when iops limit is enforced") wait time could be zero even if group is throttled and cannot issue requests right now. As a result throtl_select_dispatch() turns into busy-loop under irq-safe queue spinlock. Fix is simple: always round up target time to the next throttle slice. Fixes: `991f61fe7e` ("Blk-throttle: reduce tail io latency when iops limit is enforced") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 09:00:57 -06:00
Damien Le Moal	113ab72ed4	block: Fix potential overflow in blk_report_zones() For large values of the number of zones reported and/or large zone sizes, the sector increment calculated with blk_queue_zone_sectors(q) * n in blk_report_zones() loop can overflow the unsigned int type used for the calculation as both "n" and blk_queue_zone_sectors() value are unsigned int. E.g. for a device with 256 MB zones (524288 sectors), overflow happens with 8192 or more zones reported. Changing the return type of blk_queue_zone_sectors() to sector_t, fixes this problem and avoids overflow problem for all other callers of this helper too. The same change is also applied to the bdev_zone_sectors() helper. Fixes: `e76239a374` ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 09:00:57 -06:00
Tejun Heo	d3f77dfdc7	blkcg: implement REQ_CGROUP_PUNT When a shared kthread needs to issue a bio for a cgroup, doing so synchronously can lead to priority inversions as the kthread can be trapped waiting for that cgroup. This patch implements REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing to a dedicated per-blkcg work item to avoid such priority inversions. This will be used to fix priority inversions in btrfs compression and should be generally useful as we grow filesystem support for comprehensive IO control. Cc: Chris Mason <clm@fb.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 09:00:57 -06:00
Tejun Heo	9b0eb69b75	cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages btrfs is going to use css_put() and wbc helpers to improve cgroup writeback support. Add dummy css_get() definition and export wbc helpers to prepare for module and !CONFIG_CGROUP builds. Reported-by: kbuild test robot <lkp@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 09:00:57 -06:00
Josef Bacik	fd112c7465	blk-cgroup: turn on psi memstall stuff With the psi stuff in place we can use the memstall flag to indicate pressure that happens from throttling. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 09:00:57 -06:00
Josef Bacik	b554db147f	block: init flush rq ref count to 1 We discovered a problem in newer kernels where a disconnect of a NBD device while the flush request was pending would result in a hang. This is because the blk mq timeout handler does if (!refcount_inc_not_zero(&rq->ref)) return true; to determine if it's ok to run the timeout handler for the request. Flush_rq's don't have a ref count set, so we'd skip running the timeout handler for this request and it would just sit there in limbo forever. Fix this by always setting the refcount of any request going through blk_init_rq() to 1. I tested this with a nbd-server that dropped flush requests to verify that it hung, and then tested with this patch to verify I got the timeout as expected and the error handling kicked in. Thanks, Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-10 09:00:57 -06:00
Linus Torvalds	3b99107f0e	for-5.3/block-20190708 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0jrIMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptlFD/9CNsBX+Aap2lO6wKNr6QISwNAK76GMzEay s4LSY2kGkXvzv8i89mCuY+8UVNI8WH2/22WnU+8CBAJOjWyFQMsIwH/mrq0oZWRD J6STJE8rTr6Fc2MvJUWryp/xdBh3+eDIsAdIZVHVAkIzqYPBnpIAwEIeIw8t0xsm v9ngpQ3WD6ep8tOj9pnG1DGKFg1CmukZCC/Y4CQV1vZtmm2I935zUwNV/TB+Egfx G8JSC0cSV02LMK88HCnA6MnC/XSUC0qgfXbnmP+TpKlgjVX+P/fuB3oIYcZEu2Rk 3YBpIkhsQytKYbF42KRLsmBH72u6oB9G+tNZTgB1STUDrZqdtD9xwX1rjDlY0ZzP EUDnk48jl/cxbs+VZrHoE2TcNonLiymV7Kb92juHXdIYmKFQStprGcQUbMaTkMfB 6BYrYLifWx0leu1JJ1i7qhNmug94BYCSCxcRmH0p6kPazPcY9LXNmDWMfMuBPZT7 z79VLZnHF2wNXJyT1cBluwRYYJRT4osWZ3XUaBWFKDgf1qyvXJfrN/4zmgkEIyW7 ivXC+KLlGkhntDlWo2pLKbbyOIKY1HmU6aROaI11k5Zyh0ixKB7tHKavK39l+NOo YB41+4l6VEpQEyxyRk8tO0sbHpKaKB+evVIK3tTwbY+Q0qTExErxjfWUtOgRWhjx iXJssPRo4w== =VSYT -----END PGP SIGNATURE----- Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "This is the main block updates for 5.3. Nothing earth shattering or major in here, just fixes, additions, and improvements all over the map. This contains: - Series of documentation fixes (Bart) - Optimization of the blk-mq ctx get/put (Bart) - null_blk removal race condition fix (Bob) - req/bio_op() cleanups (Chaitanya) - Series cleaning up the segment accounting, and request/bio mapping (Christoph) - Series cleaning up the page getting/putting for bios (Christoph) - block cgroup cleanups and moving it to where it is used (Christoph) - block cgroup fixes (Tejun) - Series of fixes and improvements to bcache, most notably a write deadlock fix (Coly) - blk-iolatency STS_AGAIN and accounting fixes (Dennis) - Series of improvements and fixes to BFQ (Douglas, Paolo) - debugfs_create() return value check removal for drbd (Greg) - Use struct_size(), where appropriate (Gustavo) - Two lighnvm fixes (Heiner, Geert) - MD fixes, including a read balance and corruption fix (Guoqing, Marcos, Xiao, Yufen) - block opal shadow mbr additions (Jonas, Revanth) - sbitmap compare-and-exhange improvemnts (Pavel) - Fix for potential bio->bi_size overflow (Ming) - NVMe pull requests: - improved PCIe suspent support (Keith Busch) - error injection support for the admin queue (Akinobu Mita) - Fibre Channel discovery improvements (James Smart) - tracing improvements including nvmetc tracing support (Minwoo Im) - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya Kulkarni)" - Various little fixes and improvements to drivers and core" * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits) blk-iolatency: fix STS_AGAIN handling block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES blk-mq: simplify blk_mq_make_request() blk-mq: remove blk_mq_put_ctx() sbitmap: Replace cmpxchg with xchg block: fix .bi_size overflow block: sed-opal: check size of shadow mbr block: sed-opal: ioctl for writing to shadow mbr block: sed-opal: add ioctl for done-mark of shadow mbr block: never take page references for ITER_BVEC direct-io: use bio_release_pages in dio_bio_complete block_dev: use bio_release_pages in bio_unmap_user block_dev: use bio_release_pages in blkdev_bio_end_io iomap: use bio_release_pages in iomap_dio_bio_end_io block: use bio_release_pages in bio_map_user_iov block: use bio_release_pages in bio_unmap_user block: optionally mark pages dirty in bio_release_pages block: move the BIO_NO_PAGE_REF check into bio_release_pages block: skd_main.c: Remove call to memset after dma_alloc_coherent block: mtip32xx: Remove call to memset after dma_alloc_coherent ...	2019-07-09 10:45:06 -07:00
Linus Torvalds	92c1d65221	Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Documentation updates and the addition of cgroup_parse_float() which will be used by new controllers including blk-iocost" * 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: convert docs to ReST and rename to *.rst cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS cgroup: add cgroup_parse_float()	2019-07-08 21:35:12 -07:00
Greg Kroah-Hartman	7e41c3c9b6	blk-mq: fix up placement of debugfs directory of queue files When the blk-mq debugfs file creation logic was "cleaned up" it was cleaned up too much, causing the queue file to not be created in the correct location. Turns out the check for the directory being present is needed as if that has not happened yet, the files should not be created, and the function will be called later on in the initialization code so that the files can be created in the correct location. Fixes: `6cfc0081b0` ("blk-mq: no need to check return value of debugfs_create functions") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: linux-block@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-06 10:07:38 -06:00
Dennis Zhou	c9b3007fec	blk-iolatency: fix STS_AGAIN handling The iolatency controller is based on rq_qos. It increments on rq_qos_throttle() and decrements on either rq_qos_cleanup() or rq_qos_done_bio(). `a3fb01ba5a` fixes the double accounting issue where blk_mq_make_request() may call both rq_qos_cleanup() and rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the double decrement. The above works upstream as the only way we can get STS_AGAIN is from blk_mq_get_request() failing. The STS_AGAIN handling isn't a real problem as bio_endio() skipping only happens on reserved tag allocation failures which can only be caused by driver bugs and already triggers WARN. However, the fix creates a not so great dependency on how STS_AGAIN can be propagated. Internally, we (Facebook) carry a patch that kills read ahead if a cgroup is io congested or a fatal signal is pending. This combined with chained bios progagate their bi_status to the parent is not already set can can cause the parent bio to not clean up properly even though it was successful. This consequently leaks the inflight counter and can hang all IOs under that blkg. To nip the adverse interaction early, this removes the rq_qos_cleanup() callback in iolatency in favor of cleaning up always on the rq_qos_done_bio() path. Fixes: `a3fb01ba5a` ("blk-iolatency: only account submitted bios") Debugged-by: Tejun Heo <tj@kernel.org> Debugged-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-05 15:14:00 -06:00
Christoph Hellwig	d665e12aa7	block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES Fix a regression introduced when removing bi_phys_segments for Write Zeroes requests, which need to have a segment count of zero, as they don't have a payload. Fixes: `14ccb66b3f` ("block: remove the bi_phys_segments field in struct bio") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-03 07:20:40 -06:00
Bart Van Assche	970d168de6	blk-mq: simplify blk_mq_make_request() Move the blk_mq_bio_to_request() call in front of the if-statement. Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-02 21:03:38 -06:00
Bart Van Assche	c05f42206f	blk-mq: remove blk_mq_put_ctx() No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx(). Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-02 21:03:27 -06:00
Ming Lei	79d08f89bb	block: fix .bi_size overflow 'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1 bytes. Before `07173c3ec2` ("block: enable multipage bvecs"), one bio can include very limited pages, and usually at most 256, so the fs bio size won't be bigger than 1M bytes most of times. Since we support multi-page bvec, in theory one fs bio really can be added > 1M pages, especially in case of hugepage, or big writeback with too many dirty pages. Then there is chance in which .bi_size is overflowed. Fixes this issue by using bio_full() to check if the added segment may overflow .bi_size. Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: `07173c3ec2` ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-01 08:18:54 -06:00
Jens Axboe	5be1f9d82f	Linux 5.2-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH VTS5zps= =huBY -----END PGP SIGNATURE----- Merge tag 'v5.2-rc6' into for-5.3/block Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak fix. Otherwise we end up having conflicts with future patches between for-5.3/block and master that touch this area. In particular, it makes the bio_full() fix hard to backport to stable. * tag 'v5.2-rc6': (482 commits) Linux 5.2-rc6 Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock" Bluetooth: Fix regression with minimum encryption key size alignment tcp: refine memory limit test in tcp_fragment() x86/vdso: Prevent segfaults due to hoisted vclock reads SUNRPC: Fix a credential refcount leak Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE" net :sunrpc :clnt :Fix xps refcount imbalance on the error path NFS4: Only set creation opendata if O_CREAT ARM: 8867/1: vdso: pass --be8 to linker if necessary KVM: nVMX: reorganize initial steps of vmx_set_nested_state KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries habanalabs: use u64_to_user_ptr() for reading user pointers nfsd: replace Jeff by Chuck as nfsd co-maintainer inet: clear num_timeout reqsk_alloc() PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier ...	2019-07-01 08:16:08 -06:00
Jonas Rabenstein	ff91064ea3	block: sed-opal: check size of shadow mbr Check whether the shadow mbr does fit in the provided space on the target. Also a proper firmware should handle this case and return an error we may prevent problems or even damage with crappy firmwares. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 10:34:08 -06:00
Jonas Rabenstein	a9b25b4cf2	block: sed-opal: ioctl for writing to shadow mbr Allow modification of the shadow mbr. If the shadow mbr is not marked as done, this data will be presented read only as the device content. Only after marking the shadow mbr as done and unlocking a locking range the actual content is accessible. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 10:33:57 -06:00
Jonas Rabenstein	c988844341	block: sed-opal: add ioctl for done-mark of shadow mbr Enable users to mark the shadow mbr as done without completely deactivating the shadow mbr feature. This may be useful on reboots, when the power to the disk is not disconnected in between and the shadow mbr stores the required boot files. Of course, this saves also the (few) commands required to enable the feature if it is already enabled and one only wants to mark the shadow mbr as done. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 10:31:33 -06:00
Christoph Hellwig	b620743077	block: never take page references for ITER_BVEC If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:47:32 -06:00
Christoph Hellwig	506e079847	block: use bio_release_pages in bio_map_user_iov Use bio_release_pages instead of open coding it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:47:31 -06:00
Christoph Hellwig	163cc2d3cd	block: use bio_release_pages in bio_unmap_user Use bio_release_pages instead of open coding it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:47:31 -06:00
Christoph Hellwig	d241a95f35	block: optionally mark pages dirty in bio_release_pages A lot of callers of bio_release_pages also want to mark the released pages as dirty. Add a mark_dirty parameter to avoid a second relatively expensive bio_for_each_segment_all loop. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:47:31 -06:00
Christoph Hellwig	b2d0d99135	block: move the BIO_NO_PAGE_REF check into bio_release_pages Move the BIO_NO_PAGE_REF check into bio_release_pages instead of duplicating it in both callers. Also make the function available outside of bio.c so that we can reuse it in other direct I/O implementations. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:47:31 -06:00
Revanth Rajashekar	15ddffcb34	block: sed-opal: "Never True" conditions 'who' an unsigned variable in stucture opal_session_info can never be lesser than zero. Hence, the condition "who < OPAL_ADMIN1" can never be true. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:40:31 -06:00
Revanth Rajashekar	5e4c7cf60e	block: sed-opal: PSID reverttper capability PSID is a 32 character password printed on the drive label, to prove its physical access. This PSID reverttper function is very useful to regain the control over the drive when it is locked and the user can no longer access it because of some failures. However, all the data on the drive is completely erased. This method is advisable only when the user is exhausted of all other recovery methods. PSID capabilities are described in: https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_PSID_v1.00_r1.00.pdf Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-29 09:40:30 -06:00
Douglas Anderson	dbc3117d4c	block, bfq: NULL out the bic when it's no longer valid In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-28 07:44:19 -06:00
Damien Le Moal	a5b47a40be	block: Remove unused code bio_flush_dcache_pages() is unused. Remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-27 07:34:25 -06:00
Douglas Anderson	2b50f230f7	block, bfq: Init saved_wr_start_at_switch_to_srt in unlikely case Some debug code suggested by Paolo was tripping when I did reboot stress tests. Specifically in bfq_bfqq_resume_state() "bic->saved_wr_start_at_switch_to_srt" was later than the current value of "jiffies". A bit of debugging showed that "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more debugging showed that was because we had run through the "unlikely" case in the bfq_bfqq_save_state() function. Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to something sane. NOTE: this fixes no known real-world errors. Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-26 16:25:35 -06:00
Paolo Valente	e6feaf215f	block, bfq: fix operator in BFQQ_TOTALLY_SEEKY By mistake, there is a '&' instead of a '==' in the definition of the macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with the correct one. Fixes: `7074f076ff` ("block, bfq: do not tag totally seeky queues as soft rt") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 11:38:08 -06:00
Paolo Valente	3726112ec7	block, bfq: re-schedule empty queues if they deserve I/O plugging Consider, on one side, a bfq_queue Q that remains empty while in service, and, on the other side, the pending I/O of bfq_queues that, according to their timestamps, have to be served after Q. If an uncontrolled amount of I/O from the latter bfq_queues were dispatched while Q is waiting for its new I/O to arrive, then Q's bandwidth guarantees would be violated. To prevent this, I/O dispatch is plugged until Q receives new I/O (except for a properly controlled amount of injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging, for the following reason. Preemption is performed in two steps. First, Q is expired and re-scheduled. Second, the new bfq_queue to serve is chosen. The first step is needed by the second, as the second can be performed only after Q's timestamps have been properly updated (done in the expiration step), and Q has been re-queued for service. This dependency is a consequence of the way how BFQ's scheduling algorithm is currently implemented. But Q is not re-scheduled at all in the first step, because Q is empty. As a consequence, an uncontrolled amount of I/O may be dispatched until Q becomes non empty again. This breaks Q's service guarantees. This commit addresses this issue by re-scheduling Q even if it is empty. This in turn breaks the assumption that all scheduled queues are non empty. Then a few extra checks are now needed. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:35 -06:00
Paolo Valente	96a291c38c	block, bfq: preempt lower-weight or lower-priority queues BFQ enqueues the I/O coming from each process into a separate bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be served for at most timeout_sync milliseconds (default: 125 ms). This service scheme is prone to the following inaccuracy. While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may receive I/O, and, according to BFQ's scheduling policy, may become the right bfq_queue to serve, in place of the currently in-service bfq_queue. In this respect, postponing the service of Q2 to after the service of Q1 finishes may delay the completion of Q2's I/O, compared with an ideal service in which all non-empty bfq_queues are served in parallel, and every non-empty bfq_queue is served at a rate proportional to the bfq_queue's weight. This additional delay is equal at most to the time Q1 may unjustly remain in service before switching to Q2. If Q1 and Q2 have the same weight, then this time is most likely negligible compared with the completion time to be guaranteed to Q2's I/O. In addition, first, one of the reasons why BFQ may want to serve Q1 for a while is that this boosts throughput and, second, serving Q1 longer reduces BFQ's overhead. As a conclusion, it is usually better not to preempt Q1 if both Q1 and Q2 have the same weight. In contrast, as Q2's weight or priority becomes higher and higher compared with that of Q1, the above delay becomes larger and larger, compared with the I/O completion times that have to be guaranteed to Q2 according to Q2's weight. So reducing this delay may be more important than avoiding the costs of preempting Q1. Accordingly, this commit preempts Q1 if Q2 has a higher weight or a higher priority than Q1. Preemption causes Q1 to be re-scheduled, and triggers a new choice of the next bfq_queue to serve. If Q2 really is the next bfq_queue to serve, then Q2 will be set in service immediately. This change reduces the component of the I/O latency caused by the above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5 SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for a process doing sporadic random reads while another process is doing continuous sequential reads. Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:35 -06:00
Paolo Valente	13a857a4c4	block, bfq: detect wakers and unconditionally inject their I/O A bfq_queue Q may happen to be synchronized with another bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to receive new I/O. We call Q2 "waker queue". If I/O plugging is being performed for Q, and Q is not receiving any more I/O because of the above synchronization, then, thanks to BFQ's injection mechanism, the waker queue is likely to get served before the I/O-plugging timeout fires. Unfortunately, this fact may not be sufficient to guarantee a high throughput during the I/O plugging, because the inject limit for Q may be too low to guarantee a lot of injected I/O. In addition, the duration of the plugging, i.e., the time before Q finally receives new I/O, may not be minimized, because the waker queue may happen to be served only after other queues. To address these issues, this commit introduces the explicit detection of the waker queue, and the unconditional injection of a pending I/O request of the waker queue on each invocation of bfq_dispatch_request(). One may be concerned that this systematic injection of I/O from the waker queue delays the service of Q's I/O. Fortunately, it doesn't. On the contrary, next Q's I/O is brought forward dramatically, for it is not blocked for milliseconds. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:34 -06:00
Paolo Valente	a3f9bce369	block, bfq: bring forward seek&think time update Until the base value for request service times gets finally computed for a bfq_queue, the inject limit for that queue does depend on the think-time state (short\|long) of the queue. A timely update of the think time then guarantees a quicker activation or deactivation of the injection. Fortunately, the think time of a bfq_queue is updated in the same code path as the inject limit; but after the inject limit. This commits moves the update of the think time before the update of the inject limit. For coherence, it moves the update of the seek time too. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:34 -06:00
Paolo Valente	24792ad01c	block, bfq: update base request service times when possible I/O injection gets reduced if it increases the request service times of the victim queue beyond a certain threshold. The threshold, in its turn, is computed as a function of the base service time enjoyed by the queue when it undergoes no injection. As a consequence, for injection to work properly, the above base value has to be accurate. In this respect, such a value may vary over time. For example, it varies if the size or the spatial locality of the I/O requests in the queue change. It is then important to update this value whenever possible. This commit performs this update. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:34 -06:00
Paolo Valente	db599f9ed9	block, bfq: fix rq_in_driver check in bfq_update_inject_limit One of the cases where the parameters for injection may be updated is when there are no more in-flight I/O requests. The number of in-flight requests is stored in the field bfqd->rq_in_driver of the descriptor bfqd of the device. So, the controlled condition is bfqd->rq_in_driver == 0. Unfortunately, this is wrong because, the instruction that checks this condition is in the code path that handles the completion of a request, and, in particular, the instruction is executed before bfqd->rq_in_driver is decremented in such a code path. This commit fixes this issue by just replacing 0 with 1 in the comparison. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:34 -06:00
Paolo Valente	766d61412e	block, bfq: reset inject limit when think-time state changes Until the base value of the request service times gets finally computed for a bfq_queue, the inject limit does depend on the think-time state (short\|long). The limit must be 0 or 1 if the think time is deemed, respectively, as short or long. However, such a check and possible limit update is performed only periodically, once per second. So, to make the injection mechanism much more reactive, this commit performs the update also every time the think-time state changes. In addition, in the following special case, this commit lets the inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's think time is short: bfqq's I/O is synchronized with that of some other queue, i.e., bfqq may receive new I/O only after the I/O of the other queue is completed. Keeping the inject limit to 1 allows the blocking I/O to be served while bfqq is in service. And this is very convenient both for bfqq and for the total throughput, as explained in detail in the comments in bfq_update_has_short_ttime(). Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-25 09:07:34 -06:00
Chaitanya Kulkarni	b0e5168a77	block: update print_req_error() Improve the print_req_error with additional request fields which are helpful for debugging. Use newly introduced blk_op_str() to print the REQ_OP_XXX in the string format. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 13:03:51 -06:00
Chaitanya Kulkarni	874c893bf0	block: use blk_op_str() in blk-mq-debugfs.c Now that we've a helper function blk_op_str() to convert the REQ_OP_XXX to string XXX, adjust the code to use that. Get rid of the duplicate array op_name which is now present in the blk-core.c which we renamed it to "blk_op_name" and open coding in the blk-mq-debugfs.c. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 13:03:51 -06:00
Chaitanya Kulkarni	e47bc4eda9	block: add centralize REQ_OP_XXX to string helper In order to centralize the REQ_OP_XXX to string conversion which can be used in the block layer and different places in the kernel like f2fs, this patch adds a new helper function along with an array similar to the one present in the blk-mq-debugfs.c. We keep this helper functionality centralize under blk-core.c instead of blk-mq-debugfs.c since blk-core.c is configured using CONFIG_BLOCK and it will not be dependent on blk-mq-debugfs.c which is configured using CONFIG_BLK_DEBUG_FS. Next patch adjusts the code in the blk-mq-debugfs.c with newly introduced helper. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 13:03:51 -06:00
Christoph Hellwig	178cc590e5	block: improve print_req_error Print the calling function instead of print_req_error as a prefix, and print the operation and op_flags separately instead of the whole field. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 13:03:51 -06:00
Christoph Hellwig	8060c47ba8	block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUG This option is entirely bfq specific, give it an appropinquate name. Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all the functionality already does so anyway. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:32:35 -06:00
Christoph Hellwig	d6258980da	bfq-iosched: move bfq_stat_recursive_sum into the only caller This function was moved from core block code and is way to generic. Fold it into the only caller and simplify it based on the actually passed arguments. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:32:34 -06:00
Christoph Hellwig	c0ce79dca5	blk-cgroup: move struct blkg_stat to bfq This structure and assorted infrastructure is only used by the bfq I/O scheduler. Move it there instead of bloating the common code. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:32:34 -06:00
Christoph Hellwig	7af6fd9112	blk-cgroup: introduce a new struct blkg_rwstat_sample When sampling the blkcg counts we don't need atomics or per-cpu variables. Introduce a new structure just containing plain u64 counters. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:32:34 -06:00
Christoph Hellwig	5d0b6e48cb	blk-cgroup: pass blkg_rwstat structures by reference Returning a structure generates rather bad code, so switch to passing by reference. Also don't require the structure to be zeroed and add to the 0-initialized counters, but actually set the counters to the calculated value. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:32:34 -06:00
Christoph Hellwig	239eeb0857	blk-cgroup: factor out a helper to read rwstat counter Trying to break up the crazy statements to something readable. Also switch to an unsigned counter as it can't ever turn negative. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:32:34 -06:00
Christoph Hellwig	1aa0a133fb	block: mark blk_rq_bio_prep as inline This function just has a few trivial assignments, has two callers with one of them being in the fastpath. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:29:22 -06:00
Christoph Hellwig	d627065d88	block: untangle the end of blk_bio_segment_split Now that we don't need to assign the front/back segment sizes, we can duplicating the segs assignment for the split vs no-split case and remove a whole chunk of boilerplate code. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:29:22 -06:00
Christoph Hellwig	e9cd19c0c1	block: simplify blk_recalc_rq_segments Return the segement and let the callers assign them, which makes the code a littler more obvious. Also pass the request instead of q plus bio chain, allowing for the use of rq_for_each_bvec. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:29:22 -06:00
Christoph Hellwig	14ccb66b3f	block: remove the bi_phys_segments field in struct bio We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-20 10:29:22 -06:00

1 2 3 4 5 ...

4713 Commits