Commit Graph

3763 Commits

Author SHA1 Message Date
Ming Lei
7d4901a90d blk-mq: avoid to map CPU into stale hw queue
blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
previous map isn't cleared yet, and may point to one stale hw queue
index.

This patch fixes the following issue by clearing the mapping table before
setting it up in blk_mq_pci_map_queues().

This patches fixes this following issue reported by Zhang Yi:

[  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
[  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
[  101.216346] PGD 0 P4D 0
[  101.219171] Oops: 0000 [#1] SMP
[  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
[  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
[  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
[  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
[  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
[  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
[  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
[  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
[  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
[  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
[  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
[  101.387714] Call Trace:
[  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
[  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
[  101.400848]  ? pick_next_task_fair+0x290/0x5f0
[  101.405807]  ? __switch_to+0x1f5/0x430
[  101.409988]  ? put_prev_entity+0x2f/0xd0
[  101.414365]  process_one_work+0x141/0x340
[  101.418836]  worker_thread+0x47/0x3e0
[  101.422921]  kthread+0xf5/0x130
[  101.426424]  ? rescuer_thread+0x380/0x380
[  101.430896]  ? kthread_associate_blkcg+0x90/0x90
[  101.436048]  ret_from_fork+0x1f/0x30
[  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
[  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
[  101.468205] CR2: 0000000000000098
[  101.471907] ---[ end trace 5fe710f98228a3ca ]---
[  101.482489] Kernel panic - not syncing: Fatal exception
[  101.488505] Kernel Offset: disabled
[  101.497752] ---[ end Kernel panic - not syncing: Fatal exception

Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:25:36 -07:00
Ming Lei
24f5a90f0d blk-mq: quiesce queue during switching io sched and updating nr_requests
Dispatch may still be in-progress after queue is frozen, so we have to
quiesce queue before switching IO scheduler and updating nr_requests.

Also when switching io schedulers, blk_mq_run_hw_queue() may still be
called somewhere(such as from nvme_reset_work()), and io scheduler's
per-hctx data may not be setup yet, so cause oops even inside
blk_mq_hctx_has_pending(), such as it can be run just between:

        ret = e->ops.mq.init_sched(q, e);
AND
        ret = e->ops.mq.init_hctx(hctx, i)

inside blk_mq_init_sched().

This reverts commit 7a148c2fcff8330(block: don't call blk_mq_quiesce_queue()
after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending
won't be called if queue is quiesced.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 7a148c2fcff83309(block: don't call blk_mq_quiesce_queue() after queue is frozen)
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:25:36 -07:00
Ming Lei
c2856ae2f3 blk-mq: quiesce queue before freeing queue
After queue is frozen, dispatch still may happen, for example:

1) requests are submitted from several contexts
2) requests from all these contexts are inserted to queue, but may dispatch
to LLD in one of these paths, but other paths sill need to move on even all
these requests are completed(that means blk_mq_freeze_queue_wait() returns
at that time)
3) dispatch after queue freezing still moves on and causes use-after-free,
because request queue is freed

This patch quiesces queue after it is frozen, and makes sure all
in-progress dispatch are completed.

This patch fixes the following kernel crash when running heavy IOs vs.
deleting device:

[   36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[   36.720318] IP: kyber_has_work+0x14/0x40
[   36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
[   36.721584] Oops: 0000 [#1] PREEMPT SMP
[   36.722105] Dumping ftrace buffer:
[   36.722570]    (ftrace buffer empty)
[   36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
[   36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
[   36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
[   36.736688] RIP: 0010:kyber_has_work+0x14/0x40
[   36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
[   36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
[   36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
[   36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
[   36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
[   36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
[   36.744189] FS:  00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
[   36.746617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
[   36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   36.752796] Call Trace:
[   36.753992]  blk_mq_do_dispatch_sched+0x7f/0xe0
[   36.755110]  blk_mq_sched_dispatch_requests+0x119/0x190
[   36.756179]  __blk_mq_run_hw_queue+0x83/0x90
[   36.757144]  __blk_mq_delay_run_hw_queue+0xaf/0x110
[   36.758046]  blk_mq_run_hw_queue+0x24/0x70
[   36.758845]  blk_mq_flush_plug_list+0x1e7/0x270
[   36.759676]  blk_flush_plug_list+0xd6/0x240
[   36.760463]  blk_finish_plug+0x27/0x40
[   36.761195]  do_io_submit+0x19b/0x780
[   36.761921]  ? entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.762788]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.763639] RIP: 0033:0x7f9b9699f697
[   36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
[   36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
[   36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
[   36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
[   36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
[   36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
[   36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
[   36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
[   36.776012] CR2: 0000000000000008
[   36.776690] ---[ end trace 4045cbce364ff2a4 ]---
[   36.777527] Kernel panic - not syncing: Fatal exception
[   36.778526] Dumping ftrace buffer:
[   36.779313]    (ftrace buffer empty)
[   36.780081] Kernel Offset: disabled
[   36.780877] ---[ end Kernel panic - not syncing: Fatal exception

Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:25:36 -07:00
Jens Axboe
ca11f209a4 mq-deadline: make it clear that __dd_dispatch_request() works on all hw queues
Don't pass in the hardware queue to __dd_dispatch_request(), since it
leads the reader to believe that we are returning a request for that
specific hardware queue. That's not how mq-deadline works, the state
for determining which request to serve next is shared across all
hardware queues for a device.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:23:11 -07:00
Ming Lei
cf8c0c6a38 block: blk-merge: remove unnecessary check
In this case, 'sectors' can't be zero at all, so remove the check
and let the bio be split.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
a2d37968d7 block: blk-merge: try to make front segments in full size
When merging one bvec into segment, if the bvec is too big
to merge, current policy is to move the whole bvec into another
new segment.

This patchset changes the policy into trying to maximize size of
front segments, that means in above situation, part of bvec
is merged into current segment, and the remainder is put
into next segment.

This patch prepares for support multipage bvec because
it can be quite common to see this case and we should try
to make front segments in full size.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
6a501bf080 blk-merge: compute bio->bi_seg_front_size efficiently
It is enough to check and compute bio->bi_seg_front_size just
after the 1st segment is found, but current code checks that
for each bvec, which is inefficient.

This patch follows the way in  __blk_recalc_rq_segments()
for computing bio->bi_seg_front_size, and it is more efficient
and code becomes more readable too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
25d8be77e1 block: move bio_alloc_pages() to bcache
bcache is the only user of bio_alloc_pages(), so move this function into
bcache, and avoid it being misused in the future.

Also rename it to bch_bio_allo_pages() since it is bcache only.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
3c892a098b block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq
Firstly this patch introduces BVEC_ITER_ALL_INIT for iterating one bio
from start to end.

As we need to support multipage bvecs, don't access bio->bi_io_vec
in copy_to_high_bio_irq(), and just use the standard iterator for that.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
7891f05cbf block: bounce: avoid direct access to bvec table
We will support multipage bvecs in the future, so change to iterator way
for getting bv_page of bvec from original bio.

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Paolo Valente
9b25bd0368 block, bfq: remove batches of confusing ifdefs
Commit a33801e8b4 ("block, bfq: move debug blkio stats behind
CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
one reported in [1], plus a similar one in another function. This
commit removes both batches, in the way suggested in [1].

[1] https://www.spinics.net/lists/linux-block/msg20043.html

Fixes: a33801e8b4 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:32:59 -07:00
Paolo Valente
a34b024448 block, bfq: consider also past I/O in soft real-time detection
BFQ privileges the I/O of soft real-time applications, such as video
players, to guarantee to these application a high bandwidth and a low
latency. In this respect, it is not easy to correctly detect when an
application is soft real-time. A particularly nasty false positive is
that of an I/O-bound application that occasionally happens to meet all
requirements to be deemed as soft real-time. After being detected as
soft real-time, such an application monopolizes the device. Fortunately,
BFQ will realize soon that the application is actually not soft
real-time and suspend every privilege. Yet, the application may happen
again to be wrongly detected as soft real-time, and so on.

As highlighted by our tests, this problem causes BFQ to occasionally
fail to guarantee a high responsiveness, in the presence of heavy
background I/O workloads. The reason is that the background workload
happens to be detected as soft real-time, more or less frequently,
during the execution of the interactive task under test. To give an
idea, because of this problem, Libreoffice Writer occasionally takes 8
seconds, instead of 3, to start up, if there are sequential reads and
writes in the background, on a Kingston SSDNow V300.

This commit addresses this issue by leveraging the following facts.

The reason why some applications are detected as soft real-time despite
all BFQ checks to avoid false positives, is simply that, during high
CPU or storage-device load, I/O-bound applications may happen to do
I/O slowly enough to meet all soft real-time requirements, and pass
all BFQ extra checks. Yet, this happens only for limited time periods:
slow-speed time intervals are usually interspersed between other time
intervals during which these applications do I/O at a very high speed.
To exploit these facts, this commit introduces a little change, in the
detection of soft real-time behavior, to systematically consider also
the recent past: the higher the speed was in the recent past, the
later next I/O should arrive for the application to be considered as
soft real-time. At the beginning of a slow-speed interval, the minimum
arrival time allowed for the next I/O usually happens to still be so
high, to fall *after* the end of the slow-speed period itself. As a
consequence, the application does not risk to be deemed as soft
real-time during the slow-speed interval. Then, during the next
high-speed interval, the application cannot, evidently, be deemed as
soft real-time (exactly because of its speed), and so on.

This extra filtering proved to be rather effective: in the above test,
the frequency of false positives became so low that the start-up time
was 3 seconds in all iterations (apart from occasional outliers,
caused by page-cache-management issues, which are out of the scope of
this commit, and cannot be solved by an I/O scheduler).

Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:31:19 -07:00
Angelo Ruocco
4403e4e467 block, bfq: remove superfluous check in queue-merging setup
When two or more processes do I/O in a way that the their requests are
sequential in respect to one another, BFQ merges the bfq_queues associated
with the processes. This way the overall I/O pattern becomes sequential,
and thus there is a boost in througput.
These cooperating processes usually start or restart to do I/O shortly
after each other. So, in order to avoid merging non-cooperating processes,
BFQ ensures that none of these queues has been in weight raising for too
long.

In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged
only shortly after being created", BFQ checks whether any queue (and not
only weight-raised ones) is doing I/O continuously from too long to be
merged.

This new additional check makes the first one useless: a queue doing
I/O from long enough, if being weight-raised, is also a queue in
weight raising for too long to be merged. Accordingly, this commit
removes the first check.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:26:11 -07:00
Paolo Valente
7b8fa3b900 block, bfq: let a queue be merged only shortly after starting I/O
In BFQ and CFQ, two processes are said to be cooperating if they do
I/O in such a way that the union of their I/O requests yields a
sequential I/O pattern. To get such a sequential I/O pattern out of
the non-sequential pattern of each cooperating process, BFQ and CFQ
merge the queues associated with these processes. In more detail,
cooperating processes, and thus their associated queues, usually
start, or restart, to do I/O shortly after each other. This is the
case, e.g., for the I/O threads of KVM/QEMU and of the dump
utility. Basing on this assumption, this commit allows a bfq_queue to
be merged only during a short time interval (100ms) after it starts,
or re-starts, to do I/O.  This filtering provides two important
benefits.

First, it greatly reduces the probability that two non-cooperating
processes have their queues merged by mistake, if they just happen to
do I/O close to each other for a short time interval. These spurious
merges cause loss of service guarantees. A low-weight bfq_queue may
unjustly get more than its expected share of the throughput: if such a
low-weight queue is merged with a high-weight queue, then the I/O for
the low-weight queue is served as if the queue had a high weight. This
may damage other high-weight queues unexpectedly.  For instance,
because of this issue, lxterminal occasionally took 7.5 seconds to
start, instead of 6.5 seconds, when some sequential readers and
writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
reason is that the bfq_queues associated with some of the readers or
the writers were merged with the high-weight queues of some processes
that had to do some urgent but little I/O. The readers then exploited
the inherited high weight for all or most of their I/O, during the
start-up of terminal. The filtering introduced by this commit
eliminated any outlier caused by spurious queue merges in our start-up
time tests.

This filtering also provides a little boost of the throughput
sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
once a bfq_queue cannot be merged any longer, this commit makes BFQ
stop updating the data needed to handle merging for the queue.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:26:09 -07:00
Angelo Ruocco
1be6e8a964 block, bfq: check low_latency flag in bfq_bfqq_save_state()
A just-created bfq_queue will certainly be deemed as interactive on
the arrival of its first I/O request, if the low_latency flag is
set. Yet, if the queue is merged with another queue on the arrival of
its first I/O request, it will not have the chance to be flagged as
interactive. Nevertheless, if the queue is then split soon enough, it
has to be flagged as interactive after the split.

To handle this early-merge scenario correctly, BFQ saves the state of
the queue, on the merge, as if the latter had already been deemed
interactive. So, if the queue is split soon, it will get
weight-raised, because the previous state of the queue is resumed on
the split.

Unfortunately, in the act of saving the state of the newly-created
queue, BFQ doesn't check whether the low_latency flag is set, and this
causes early-merged queues to be then weight-raised, on queue splits,
even if low_latency is off. This commit addresses this problem by
adding the missing check.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:26:08 -07:00
Paolo Valente
05e9028356 block, bfq: add missing rq_pos_tree update on rq removal
If two processes do I/O close to each other, then BFQ merges the
bfq_queues associated with these processes, to get a more sequential
I/O, and thus a higher throughput.  In this respect, to detect whether
two processes are doing I/O close to each other, BFQ keeps a list of
the head-of-line I/O requests of all active bfq_queues.  The list is
ordered by initial sectors, and implemented through a red-black tree
(rq_pos_tree).

Unfortunately, the update of the rq_pos_tree was incomplete, because
the tree was not updated on the removal of the head-of-line I/O
request of a bfq_queue, in case the queue did not remain empty. This
commit adds the missing update.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:26:06 -07:00
Paolo Valente
f0ba5ea2fe block, bfq: increase threshold to deem I/O as random
If two processes do I/O close to each other, i.e., are cooperating
processes in BFQ (and CFQ'S) nomenclature, then BFQ merges their
associated bfq_queues, so as to get sequential I/O from the union of
the I/O requests of the processes, and thus reach a higher
throughput. A merged queue is then split if its I/O stops being
sequential. In this respect, BFQ deems the I/O of a bfq_queue as
(mostly) sequential only if less than 4 I/O requests are random, out
of the last 32 requests inserted into the queue.

Unfortunately, extensive testing (with the interleaved_io benchmark of
the S suite [1], and with real applications spawning cooperating
processes) has clearly shown that, with such a low threshold, only a
rather low I/O throughput may be reached when several cooperating
processes do I/O. In particular, the outcome of each test run was
bimodal: if queue merging occurred and was stable during the test,
then the throughput was close to the peak rate of the storage device,
otherwise the throughput was arbitrarily low (usually around 1/10 of
the peak rate with a rotational device). The probability to get the
unlucky outcomes grew with the number of cooperating processes: it was
already significant with 5 processes, and close to one with 7 or more
processes.

The cause of the low throughput in the unlucky runs was that the
merged queues containing the I/O of these cooperating processes were
soon split, because they contained more random I/O requests than those
tolerated by the 4/32 threshold, but
- that I/O would have however allowed the storage device to reach
  peak throughput or almost peak throughput;
- in contrast, the I/O of these processes, if served individually
  (from separate queues) yielded a rather low throughput.

So we repeated our tests with increasing values of the threshold,
until we found the minimum value (19) for which we obtained maximum
throughput, reliably, with at least up to 9 cooperating
processes. Then we checked that the use of that higher threshold value
did not cause any regression for any other benchmark in the suite [1].
This commit raises the threshold to such a higher value.

[1] https://github.com/Algodev-github/S

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:23:57 -07:00
Damien Le Moal
8dc8146f9c deadline-iosched: Introduce zone locking support
Introduce zone write locking to avoid write request reordering with
zoned block devices. This is achieved using a finer selection of the
next request to dispatch:
1) Any non-write request is always allowed to proceed.
2) Any write to a conventional zone is always allowed to proceed.
3) For a write to a sequential zone, the zone lock is first checked.
   a) If the zone is not locked, the write is allowed to proceed after
      its target zone is locked.
   b) If the zone is locked, the write request is skipped and the next
      request in the dispatch queue tested (back to step 1).

For a write request that has locked its target zone, the zone is
unlocked either when the request completes and the method
deadline_request_completed() is called, or when the request is requeued
using the method deadline_add_request().

Requests targeting a locked zone are always left in the scheduler queue
to preserve the initial write order. If no write request can be
dispatched, allow reads to be dispatched even if the write batch is not
done.

If the device used is not a zoned block device, or if zoned block device
support is disabled, this patch does not modify deadline behavior.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:22:17 -07:00
Damien Le Moal
c117bac701 deadline-iosched: Introduce dispatch helpers
Avoid directly referencing the next_rq and fifo_list arrays using the
helper functions deadline_next_request() and deadline_fifo_request() to
facilitate changes in the dispatch request selection in
deadline_dispatch_requests() for zoned block devices.

While at it, also remove the unnecessary forward declaration of the
function deadline_move_request().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:22:17 -07:00
Damien Le Moal
5700f69178 mq-deadline: Introduce zone locking support
Introduce zone write locking to avoid write request reordering with
zoned block devices. This is achieved using a finer selection of the
next request to dispatch:
1) Any non-write request is always allowed to proceed.
2) Any write to a conventional zone is always allowed to proceed.
3) For a write to a sequential zone, the zone lock is first checked.
   a) If the zone is not locked, the write is allowed to proceed after
      its target zone is locked.
   b) If the zone is locked, the write request is skipped and the next
      request in the dispatch queue tested (back to step 1).

For a write request that has locked its target zone, the zone is
unlocked either when the request completes with a call to the method
deadline_request_completed() or when the request is requeued using
dd_insert_request().

Requests targeting a locked zone are always left in the scheduler queue
to preserve the lba ordering for write requests. If no write request
can be dispatched, allow reads to be dispatched even if the write batch
is not done.

If the device used is not a zoned block device, or if zoned block device
support is disabled, this patch does not modify mq-deadline behavior.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:22:17 -07:00
Damien Le Moal
bf09ce56f0 mq-deadline: Introduce dispatch helpers
Avoid directly referencing the next_rq and fifo_list arrays using the
helper functions deadline_next_request() and deadline_fifo_request() to
facilitate changes in the dispatch request selection in
__dd_dispatch_request() for zoned block devices.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:22:17 -07:00
Christoph Hellwig
6cc77e9cb0 block: introduce zoned block devices zone write locking
Components relying only on the request_queue structure for accessing
block devices (e.g. I/O schedulers) have a limited knowledged of the
device characteristics. In particular, the device capacity cannot be
easily discovered, which for a zoned block device also result in the
inability to easily know the number of zones of the device (the zone
size is indicated by the chunk_sectors field of the queue limits).

Introduce the nr_zones field to the request_queue structure to simplify
access to this information. Also, add the bitmap seq_zone_bitmap which
indicates which zones of the device are sequential zones (write
preferred or write required) and the bitmap seq_zones_wlock which
indicates if a zone is write locked, that is, if a write request
targeting a zone was dispatched to the device. These fields are
initialized by the low level block device driver (sd.c for ZBC/ZAC
disks). They are not initialized by stacking drivers (device mappers)
handling zoned block devices (e.g. dm-linear).

Using this, I/O schedulers can introduce zone write locking to control
request dispatching to a zoned block device and avoid write request
reordering by limiting to at most a single write request per zone
outside of the scheduler at any time.

Based on previous patches from Damien Le Moal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Fixed comments and identation in blkdev.h
* Changed helper functions
* Fixed this commit message
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:22:17 -07:00
Ming Lei
454be724f6 block: drain queue before waiting for q_usage_counter becoming zero
Now we track legacy requests with .q_usage_counter in commit 055f6e18e0
("block: Make q_usage_counter also track legacy requests"), but that
commit never runs and drains legacy queue before waiting for this counter
becoming zero, then IO hang is caused in the test of pulling disk during IO.

This patch fixes the issue by draining requests before waiting for
q_usage_counter becoming zero, both Mauricio and chenxiang reported this
issue, and observed that it can be fixed by this patch.

Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
Fixes: 055f6e18e08f("block: Make q_usage_counter also track legacy requests")
Cc: Wen Xiong <wenxiong@us.ibm.com>
Tested-by: "chenxiang (M)" <chenxiang66@hisilicon.com>
Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:09:48 -07:00
Liu Bo
913a9500b9 blk-mq: remove confusing comment of blk_mq_sched_dispatch_requests
Commit de14829740
("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
changes the function to return bool type, and then commit 1f460b63d4
("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE")
changes it back to void, but the comment remains.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 08:36:33 -07:00
Jens Axboe
4e5dff41be blk-mq: improve heavily contended tag case
Even with a number of waitqueues, we can get into a situation where we
are heavily contended on the waitqueue lock. I got a report on spc1
where we're spending seconds doing this. Arguably the use case is nasty,
I reproduce it with one device and 1000 threads banging on the device.
But that doesn't mean we shouldn't be handling it better.

What ends up happening is that a thread will fail to get a tag, add
itself to the waitqueue, and subsequently get woken up when a tag is
freed - only to find itself going back to sleep on the waitqueue.

Instead of waking all threads, use an exclusive wait and wake up our
sbitmap batch count instead. This seems to work well for me (massive
improvement for this use case), and it survives basic testing. But I
haven't fully verified it yet.

An additional improvement is running the queue and checking for a new
tag BEFORE needing to add ourselves to the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-22 11:09:37 -07:00
Shaohua Li
111be88398 block-throttle: avoid double charge
If a bio is throttled and split after throttling, the bio could be
resubmited and enters the throttling again. This will cause part of the
bio to be charged multiple times. If the cgroup has an IO limit, the
double charge will significantly harm the performance. The bio split
becomes quite common after arbitrary bio size change.

To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
If the bio is cloned/split, we copy the flag to new bio too to avoid a
double charge. However, cloned bio could be directed to a new disk,
keeping the flag be a problem. The observation is we always set new disk
for the bio in this case, so we can clear the flag in bio_set_dev().

This issue exists for a long time, arbitrary bio size change just makes
it worse, so this should go into stable at least since v4.2.

V1-> V2: Not add extra field in bio based on discussion with Tejun

Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-20 11:10:17 -07:00
Jens Axboe
0abc2a1038 block: fix blk_rq_append_bio
Commit caa4b02476e3(blk-map: call blk_queue_bounce from blk_rq_append_bio)
moves blk_queue_bounce() into blk_rq_append_bio(), but don't consider
the fact that the bounced bio becomes invisible to caller since the
parameter type is 'struct bio *'. Make it a pointer to a pointer to
a bio, so the caller sees the right bio also after a bounce.

Fixes: caa4b02476 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Michele Ballabio <barra_cuda@katamail.com>
(handling failure of blk_rq_append_bio(), only call bio_get() after
blk_rq_append_bio() returns OK)
Tested-by: Michele Ballabio <barra_cuda@katamail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-18 13:55:43 -07:00
Ming Lei
14cb0dc647 block: don't let passthrough IO go into .make_request_fn()
Commit a8821f3f3("block: Improvements to bounce-buffer handling") tries
to make sure that the bio to .make_request_fn won't exceed BIO_MAX_PAGES,
but ignores that passthrough I/O can use blk_queue_bounce() too.
Especially, passthrough IO may not be sector-aligned, and the check
of 'sectors < bio_sectors(*bio_orig)' inside __blk_queue_bounce() may
become true even though the max bvec number doesn't exceed BIO_MAX_PAGES,
then cause the bio splitted, and the original passthrough bio is submited
to generic_make_request().

This patch fixes this issue by checking if the bio is passthrough IO,
and use bio_kmalloc() to allocate the cloned passthrough bio.

Cc: NeilBrown <neilb@suse.com>
Fixes: a8821f3f3("block: Improvements to bounce-buffer handling")
Tested-by: Michele Ballabio <barra_cuda@katamail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-18 13:55:43 -07:00
Omar Sandoval
fcf38cdf33 kyber: fix another domain token wait queue hang
Commit 8cf4666020 ("kyber: fix hang on domain token wait queue") fixed
a hang caused by leaving wait entries on the domain token wait queue
after the __sbitmap_queue_get() retry succeeded, making that wait entry
a "dud" which won't in turn wake more entries up. However, we can also
get a dud entry if kyber_get_domain_token() fails once but is then
called again and succeeds. This can happen if the hardware queue is
rerun for some other reason, or, more likely, kyber_dispatch_request()
tries the same domain twice.

The fix is to remove our entry from the wait queue whenever we
successfully get a token. The only complication is that we might be on
one of many wait queues in the struct sbitmap_queue, but that's easily
fixed by remembering which wait queue we were put on.

While we're here, only initialize the wait queue entry once instead of
on every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
since this is always called from process context with irqs enabled.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-06 12:33:07 -07:00
Linus Torvalds
75f64f68af Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A selection of fixes/changes that should make it into this series.
  This contains:

   - NVMe, two merges, containing:
        - pci-e, rdma, and fc fixes
        - Device quirks

   - Fix for a badblocks leak in null_blk

   - bcache fix from Rui Hua for a race condition regression where
     -EINTR was returned to upper layers that didn't expect it.

   - Regression fix for blktrace for a bug introduced in this series.

   - blktrace cleanup for cgroup id.

   - bdi registration error handling.

   - Small series with cleanups for blk-wbt.

   - Various little fixes for typos and the like.

  Nothing earth shattering, most important are the NVMe and bcache fixes"

* 'for-linus' of git://git.kernel.dk/linux-block: (34 commits)
  nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()
  nvme-rdma: fix memory leak during queue allocation
  blktrace: fix trace mutex deadlock
  nvme-rdma: Use mr pool
  nvme-rdma: Check remotely invalidated rkey matches our expected rkey
  nvme-rdma: wait for local invalidation before completing a request
  nvme-rdma: don't complete requests before a send work request has completed
  nvme-rdma: don't suppress send completions
  bcache: check return value of register_shrinker
  bcache: recover data from backing when data is clean
  bcache: Fix building error on MIPS
  bcache: add a comment in journal bucket reading
  nvme-fc: don't use bit masks for set/test_bit() numbers
  blk-wbt: fix comments typo
  blk-wbt: move wbt_clear_stat to common place in wbt_done
  blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
  blk-wbt: remove duplicated setting in wbt_init
  nvme-pci: add quirk for delay before CHK RDY for WDC SN200
  block: remove useless assignment in bio_split
  null_blk: fix dev->badblocks leak
  ...
2017-12-01 08:05:45 -05:00
Al Viro
1771e70a2e block: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:00 -05:00
weiping zhang
3dfbdc44d6 blk-wbt: fix comments typo
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-23 22:00:20 -07:00
weiping zhang
62d772fa9d blk-wbt: move wbt_clear_stat to common place in wbt_done
wbt_done call wbt_clear_stat no matter current stat was tracked
or not, move it to common place.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-23 22:00:18 -07:00
weiping zhang
f680474345 blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
wbt_init doesn't set q->rq_wb to NULL, if wbt_init return 0,
so check return value is enough, remove NULL checking.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-23 22:00:17 -07:00
weiping zhang
612ea091fc blk-wbt: remove duplicated setting in wbt_init
rwb->wc and rwb->queue_depth were overwritten by wbt_set_write_cache and
wbt_set_queue_depth, remove the default setting.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-23 22:00:15 -07:00
Mikulas Patocka
f341a4d384 block: remove useless assignment in bio_split
Remove useless assignment to the variable "split" because the variable is
unconditionally assigned later.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-22 11:26:05 -07:00
Kees Cook
e99e88a9d2 treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
 _E->_timer@_stl.function = _callback;
|
 _E->_timer@_stl.function = &_callback;
|
 _E->_timer@_stl.function = (_cast_func)_callback;
|
 _E->_timer@_stl.function = (_cast_func)&_callback;
|
 _E._timer@_stl.function = _callback;
|
 _E._timer@_stl.function = &_callback;
|
 _E._timer@_stl.function = (_cast_func)_callback;
|
 _E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:07 -08:00
Kees Cook
bca237a52c block/laptop_mode: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: linux-block@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:46:44 -08:00
Randy Dunlap
7fb526212f block: genhd.c: fix message typo
Fix typo in error message.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-19 11:02:19 -07:00
weiping zhang
3a92168bc8 block: add WARN_ON if bdi register fail
device_add_disk need do more safety error handle, so this patch just
add WARN_ON.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>

Adapted for current series by me.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-19 11:02:13 -07:00
Linus Torvalds
16382e17c0 Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:

 - bio_{map,copy}_user_iov() series; those are cleanups - fixes from the
   same pile went into mainline (and stable) in late September.

 - fs/iomap.c iov_iter-related fixes

 - new primitive - iov_iter_for_each_range(), which applies a function
   to kernel-mapped segments of an iov_iter.

   Usable for kvec and bvec ones, the latter does kmap()/kunmap() around
   the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the
   latter is not hard to fix if the need ever appears, the former is by
   design.

   Another related primitive will have to wait for the next cycle - it
   passes page + offset + size instead of pointer + size, and that one
   will be usable for everything _except_ kvec. Unfortunately, that one
   didn't get exposure in -next yet, so...

 - a bit more lustre iov_iter work, including a use case for
   iov_iter_for_each_range() (checksum calculation)

 - vhost/scsi leak fix in failure exit

 - misc cleanups and detritectomy...

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
  iomap_dio_actor(): fix iov_iter bugs
  switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range()
  lustre: switch struct ksock_conn to iov_iter
  vhost/scsi: switch to iov_iter_get_pages()
  fix a page leak in vhost_scsi_iov_to_sgl() error recovery
  new primitive: iov_iter_for_each_range()
  lnet_return_rx_credits_locked: don't abuse list_entry
  xen: don't open-code iov_iter_kvec()
  orangefs: remove detritus from struct orangefs_kiocb_s
  kill iov_shorten()
  bio_alloc_map_data(): do bmd->iter setup right there
  bio_copy_user_iov(): saner bio size calculation
  bio_map_user_iov(): get rid of copying iov_iter
  bio_copy_from_iter(): get rid of copying iov_iter
  move more stuff down into bio_copy_user_iov()
  blk_rq_map_user_iov(): move iov_iter_advance() down
  bio_map_user_iov(): get rid of the iov_for_each()
  bio_map_user_iov(): move alignment check into the main loop
  don't rely upon subsequent bio_add_pc_page() calls failing
  ... and with iov_iter_get_pages_alloc() it becomes even simpler
  ...
2017-11-17 12:08:18 -08:00
Linus Torvalds
06ede5f608 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull more block layer updates from Jens Axboe:
 "A followup pull request, with some parts that either needed a bit more
  testing before going in, merge sync, or just later arriving fixes.
  This contains:

   - Timer related updates from Kees. These were purposefully delayed
     since I didn't want to pull in a later v4.14-rc tag to my block
     tree.

   - ide-cd prep sense buffer fix from Bart. Also delayed, as not to
     clash with the late fix we put into 4.14-rc.

   - Small BFQ updates series from Luca and Paolo.

   - Single nvmet fix from James, fixing a non-functional case there.

   - Bio fast clone fix from Michael, which made bcache return the wrong
     data for some cases.

   - Legacy IO path regression hang fix from Ming"

* 'for-linus' of git://git.kernel.dk/linux-block:
  bio: ensure __bio_clone_fast copies bi_partno
  nvmet_fc: fix better length checking
  block: wake up all tasks blocked in get_request()
  block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
  block, bfq: update blkio stats outside the scheduler lock
  block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
  doc, block, bfq: update max IOPS sustainable with BFQ
  ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
  md: Convert timers to use timer_setup()
  block: swim3: Convert timers to use timer_setup()
  block/aoe: Convert timers to use timer_setup()
  amifloppy: Convert timers to use timer_setup()
  block/floppy: Convert callback to pass timer_list
2017-11-17 10:56:56 -08:00
Michael Lyle
62530ed8b1 bio: ensure __bio_clone_fast copies bi_partno
A new field was introduced in 74d46992e0, bi_partno, instead of using
bdev->bd_contains and encoding the partition information in the bi_bdev
field.  __bio_clone_fast was changed to copy the disk information, but
not the partition information.  At minimum, this regressed bcache and
caused data corruption.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Reported-by: Pavel Goran <via-bcache@pvgoran.name>
Reported-by: Campbell Steven <casteven@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-17 08:29:34 -07:00
Ming Lei
34d9715ac1 block: wake up all tasks blocked in get_request()
Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
blk_freeze_queue() and wait for q->q_usage_counter becoming zero. But
if there are tasks blocked in get_request(), q->q_usage_counter can
never become zero. So we have to wake up all these tasks in
blk_set_queue_dying() first.

Fixes: 3ef28e83ab ("block: generic request_queue reference counting")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-15 21:51:03 -07:00
Johannes Thumshirn
d904bfa79f block/blk-mq.c: use kmalloc_array_node()
Now that we have a NUMA-aware version of kmalloc_array() we can use it
instead of kmalloc_node() without an overflow check in the size
calculation.

Link: http://lkml.kernel.org/r/20170927082038.3782-3-jthumshirn@suse.de
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Mike Marciniszyn <infinipath@intel.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:02 -08:00
Luca Miccio
a33801e8b4 block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
BFQ currently creates, and updates, its own instance of the whole
set of blkio statistics that cfq creates. Yet, from the comments
of Tejun Heo in [1], it turned out that most of these statistics
are meant/useful only for debugging. This commit makes BFQ create
the latter, debugging statistics only if the option
CONFIG_DEBUG_BLK_CGROUP is set.

By doing so, this commit also enables BFQ to enjoy a high perfomance
boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
BFQ has to update far fewer statistics, and, in particular, not the
heaviest to update.  To give an idea of the benefits, if
CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
with 8 threads doing random I/O in parallel on null_blk (configured
with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
(+30%). We have measured similar or even much higher boosts with other
CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
been obtained and can be reproduced very easily with the script in [1].

[1] https://www.spinics.net/lists/linux-block/msg18943.html

Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-14 20:13:33 -07:00
Paolo Valente
24bfd19bb7 block, bfq: update blkio stats outside the scheduler lock
bfq invokes various blkg_*stats_* functions to update the statistics
contained in the special files blkio.bfq.* in the blkio controller
groups, i.e., the I/O accounting related to the proportional-share
policy provided by bfq. The execution of these functions takes a
considerable percentage, about 40%, of the total per-request execution
time of bfq (i.e., of the sum of the execution time of all the bfq
functions that have to be executed to process an I/O request from its
creation to its destruction).  This reduces the request-processing
rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
the bfq functions that invoke blkg_*stats_* functions cannot be
executed in parallel with the rest of the code of bfq, because both
are executed under the same same per-device scheduler lock.

To reduce this slowdown, this commit moves, wherever possible, the
invocation of these functions (more precisely, of the bfq functions
that invoke blkg_*stats_* functions) outside the critical sections
protected by the scheduler lock.

With this change, and with all blkio.bfq.* statistics enabled, the
throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
i7-4850HQ, in case of 8 threads doing random I/O in parallel on
null_blk, with the latter configured with 0 latency. We obtained the
same or higher throughput boosts, up to +30%, with other processors
(some figures are reported in the documentation). For our tests, we
used the script [1], with which our results can be easily reproduced.

NOTE. This commit still protects the invocation of blkg_*stats_*
functions with the request_queue lock, because the group these
functions are invoked on may otherwise disappear before or while these
functions are executed.  Fortunately, tests without even this lock
show, by difference, that the serialization caused by this lock has a
little impact (at most ~5% of throughput reduction).

[1] https://github.com/Algodev-github/IOSpeed

Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-14 20:13:33 -07:00
Luca Miccio
614822f81f block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be
invoked, respectively, when an I/O request enters and when an I/O
request exits the scheduler. Unfortunately, bfq does not fully comply
with this scheme, because it does not invoke these functions for
requests that are inserted into or extracted from its priority
dispatch list. This commit fixes this mistake.

Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-14 20:13:33 -07:00
Linus Torvalds
47f521ba18 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD update from Shaohua Li:
 "This update mostly includes bug fixes:

   - md-cluster now supports raid10 from Guoqing

   - raid5 PPL fixes from Artur

   - badblock regression fix from Bo

   - suspend hang related fixes from Neil

   - raid5 reshape fixes from Neil

   - raid1 freeze deadlock fix from Nate

   - memleak fixes from Zdenek

   - bitmap related fixes from Me and Tao

   - other fixes and cleanups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits)
  md: free unused memory after bitmap resize
  md: release allocated bitset sync_set
  md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb
  md: be cautious about using ->curr_resync_completed for ->recovery_offset
  badblocks: fix wrong return value in badblocks_set if badblocks are disabled
  md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
  md-cluster: update document for raid10
  md: remove redundant variable q
  raid1: remove obsolete code in raid1_write_request
  md-cluster: Use a small window for raid10 resync
  md-cluster: Suspend writes in RAID10 if within range
  md-cluster/raid10: set "do_balance = 0" if area is resyncing
  md: use lockdep_assert_held
  raid1: prevent freeze_array/wait_all_barriers deadlock
  md: use TASK_IDLE instead of blocking signals
  md: remove special meaning of ->quiesce(.., 2)
  md: allow metadata update while suspending.
  md: use mddev_suspend/resume instead of ->quiesce()
  md: move suspend_hi/lo handling into core md code
  md: don't call bitmap_create() while array is quiesced.
  ...
2017-11-14 16:07:26 -08:00
Linus Torvalds
e2c5923c34 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the whole on blk/scsi-mq queue
     quescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tupple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14 15:32:19 -08:00
Jens Axboe
ff821d2714 blk-mq: fixup some comment typos and lengths
Various typos and/or spelling errors in comments. Fixes a few > 80 char
lines as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 22:05:12 -07:00
Jens Axboe
f906a6a0f4 blk-mq: improve tag waiting setup for non-shared tags
If we run out of driver tags, we currently treat shared and non-shared
tags the same - both cases hook into the tag waitqueue. This is a bit
more costly than it needs to be on unshared tags, since we have to both
grab the hctx lock, and the waitqueue lock (and disable interrupts).
For the non-shared case, we can simply mark the queue as needing a
restart.

Split blk_mq_dispatch_wait_add() to account for both cases, and
rename it to blk_mq_mark_tag_wait() to better reflect what it
does now.

Without this patch, shared and non-shared performance is about the same
with 4 fio thread hammering on a single null_blk device (~410K, at 75%
sys). With the patch, the shared case is the same, but the non-shared
tags case runs at 431K at 71% sys.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:55:57 -07:00
Jens Axboe
79f720a751 blk-mq: only run the hardware queue if IO is pending
Currently we are inconsistent in when we decide to run the queue. Using
blk_mq_run_hw_queues() we check if the hctx has pending IO before
running it, but we don't do that from the individual queue run function,
blk_mq_run_hw_queue(). This results in a lot of extra and pointless
queue runs, potentially, on flush requests and (much worse) on tag
starvation situations. This is observable just looking at top output,
with lots of kworkers active. For the !async runs, it just adds to the
CPU overhead of blk-mq.

Move the has-pending check into the run function instead of having
callers do it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:55:57 -07:00
Colin Ian King
f0fba398fe block: avoid null pointer dereference on null disk
It is possible that the pointer disk can be null and hence
we can get a null pointer deference when accessing disk->flags.
Add a null pointer check to avoid the dereference.

Detected by CoverityScan, CID#1461133 ("Explicit null dereferenced")

Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:55:57 -07:00
Hannes Reinecke
17eac09963 block: create 'slaves' and 'holders' entries for hidden gendisks
When creating nvme multipath devices we should populate the 'slaves' and
'holders' directorys properly to aid userspace topology detection.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[hch: split from a larger patch]
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche
9a95e4ef70 block, nvme: Introduce blk_mq_req_flags_t
Several block layer and NVMe core functions accept a combination
of BLK_MQ_REQ_* flags through the 'flags' argument but there is
no verification at compile time whether the right type of block
layer flags is passed. Make it possible for sparse to verify this.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: linux-nvme@lists.infradead.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche
3a0a529971 block, scsi: Make SCSI quiesce and resume work reliably
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why while
the hibernation image is being written or restored that SCSI devices
are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
to be submitted by deferring allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after each resume the
fio job was still running:

for ((i=0; i<10; i++)); do
  (
    cd /sys/block/md0/md &&
    while true; do
      [ "$(<sync_action)" = "idle" ] && echo check > sync_action
      sleep 1
    done
  ) &
  pids=($!)
  for d in /sys/class/block/sd*[a-z]; do
    bdev=${d#/sys/class/block/}
    hcil=$(readlink "$d/device")
    hcil=${hcil#../../../}
    echo 4 > "$d/queue/nr_requests"
    echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
    fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
      --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16       \
      --iodepth_batch=1 --thread --loops=$((2**31)) &
    pids+=($!)
  done
  sleep 1
  echo "$(date) Hibernating ..." >>hibernate-test-log.txt
  systemctl hibernate
  sleep 10
  kill "${pids[@]}"
  echo idle > /sys/block/md0/md/sync_action
  wait
  echo "$(date) Done." >>hibernate-test-log.txt
done

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche
c9254f2ddb block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
This flag will be used in the next patch to let the block layer
core know whether or not a SCSI request queue has been quiesced.
A quiesced SCSI queue namely only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche
1b6d65a0bf block: Introduce BLK_MQ_REQ_PREEMPT
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche
6a15674d1e block: Introduce blk_get_request_flags()
A side effect of this patch is that the GFP mask that is passed to
several allocation functions in the legacy block layer is changed
from GFP_KERNEL into __GFP_DIRECT_RECLAIM.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Ming Lei
055f6e18e0 block: Make q_usage_counter also track legacy requests
This patch makes it possible to pause request allocation for
the legacy block layer by calling blk_mq_freeze_queue() and
blk_mq_unfreeze_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[ bvanassche: Combined two patches into one, edited a comment and made sure
  REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Jens Axboe
eb619fdb2d blk-mq: fix issue with shared tag queue re-running
This patch attempts to make the case of hctx re-running on driver tag
failure more robust. Without this patch, it's pretty easy to trigger a
stall condition with shared tags. An example is using null_blk like
this:

modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1

which sets up 4 devices, sharing the same tag set with a depth of 1.
Running a fio job ala:

[global]
bs=4k
rw=randread
norandommap
direct=1
ioengine=libaio
iodepth=4

[nullb0]
filename=/dev/nullb0
[nullb1]
filename=/dev/nullb1
[nullb2]
filename=/dev/nullb2
[nullb3]
filename=/dev/nullb3

will inevitably end with one or more threads being stuck waiting for a
scheduler tag. That IO is then stuck forever, until someone else
triggers a run of the queue.

Ensure that we always re-run the hardware queue, if the driver tag we
were waiting for got freed before we added our leftover request entries
back on the dispatch list.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche
aba7afc567 blk-mq: Avoid that request queue removal can trigger list corruption
Avoid that removal of a request queue sporadically triggers the
following warning:

list_del corruption. next->prev should be ffff8807d649b970, but was 6b6b6b6b6b6b6b6b
WARNING: CPU: 3 PID: 342 at lib/list_debug.c:56 __list_del_entry_valid+0x92/0xa0
Call Trace:
 process_one_work+0x11b/0x660
 worker_thread+0x3d/0x3b0
 kthread+0x129/0x140
 ret_from_fork+0x27/0x40

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Ming Lei
0c6af1ccd5 blk-mq: put driver tag if dispatch budget can't be got
We have to put the driver tag if dispatch budget can't be got, otherwise
it might cause IO deadlock, especially in case that size of tags is very
small.

Fixes: de1482974080(blk-mq: introduce .get_budget and .put_budget in blk_mq_ops)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig
f00c4d80ff block: pass full fmode_t to blk_verify_command
Use the obvious calling convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig
d004a5e7d4 block: remove __bio_kmap_atomic
This helper doesn't buy us much over calling kmap_atomic directly.
In fact in the only caller it does a bit of useless work as the
caller already has the bvec at hand, and said caller would even
buggy for a multi-segment bio due to the use of this helper.

So just remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Jens Axboe
05b7941394 Revert "blk-mq: don't handle TAG_SHARED in restart"
This reverts commit 358a3a6bcc.

We have cases that aren't covered 100% in the drivers, so for now
we have to retain the shared tag restart loops.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Ingo Molnar
8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Ming Lei
923218f616 blk-mq: don't allocate driver tag upfront for flush rq
The idea behind it is simple:

1) for none scheduler, driver tag has to be borrowed for flush rq,
   otherwise we may run out of tag, and that causes an IO hang. And
   get/put driver tag is actually noop for none, so reordering tags
   isn't necessary at all.

2) for a real I/O scheduler, we need not allocate a driver tag upfront
   for flush rq. It works just fine to follow the same approach as
   normal requests: allocate driver tag for each rq just before calling
   ->queue_rq().

One driver visible change is that the driver tag isn't shared in the
flush request sequence. That won't be a problem, since we always do that
in legacy path.

Then flush rq need not be treated specially wrt. get/put driver tag.
This cleans up the code - for instance, reorder_tags_to_front() can be
removed, and we needn't worry about request ordering in dispatch list
for avoiding I/O deadlock.

Also we have to put the driver tag before requeueing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:40:13 -06:00
Ming Lei
244c65a3cc blk-mq: move blk_mq_put_driver_tag*() into blk-mq.h
We need this helper to put the driver tag for flush rq, since we will
not share tag in the flush request sequence in the following patch
in case that I/O scheduler is applied.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:39:57 -06:00
Ming Lei
a6a252e649 blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ
In case of IO scheduler we always pre-allocate one driver tag before
calling blk_insert_flush(), and flush request will be marked as
RQF_FLUSH_SEQ once it is in flush machinery.

So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle
the request, otherwise the flush request is dispatched to ->dispatch
list directly.

This is a preparation patch for not preallocating a driver tag for flush
requests, and for not treating flush requests as a special case. This is
similar to what the legacy path does.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:50 -06:00
Ming Lei
598906f814 blk-flush: use blk_mq_request_bypass_insert()
In the following patch, we will use RQF_FLUSH_SEQ to decide:

1) if the flag isn't set, the flush rq need to be inserted via
blk_insert_flush()

2) otherwise, the flush rq need to be dispatched directly since
it is in flush machinery now.

So we use blk_mq_request_bypass_insert() for requests of bypassing
flush machinery, just like the legacy path did.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
Ming Lei
b0850297c7 block: pass 'run_queue' to blk_mq_request_bypass_insert
Block flush need this function without running the queue, so add a
parameter controlling whether we run it or not.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
Ming Lei
9c71c83c85 blk-flush: don't run queue for requests bypassing flush
blk_insert_flush() should only insert request since run queue always
follows it.

In case of bypassing flush, we don't need to run queue because every
blk_insert_flush() follows one run queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
Jianchao Wang
6d6f167ce7 blk-mq: put the driver tag of nxt rq before first one is requeued
When freeing the driver tag of the next rq with an I/O scheduler
configured, we get the first entry of the list. However, this can
race with requeue of a request, and we end up getting the wrong request
from the head of the list. Free the driver tag of next rq before the
failed one is requeued in the failure branch of queue_rq callback.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
weiping zhang
e840107322 blkcg: add sanity check for blkcg policy operations
blkcg policy should keep cpd/pd's alloc_fn and free_fn in pairs,
otherwise policy would register fail.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:31:15 -06:00
Ming Lei
88022d7201 blk-mq: don't handle failure in .get_budget
It is enough to just check if we can get the budget via .get_budget().
And we don't need to deal with device state change in .get_budget().

For SCSI, one issue to be fixed is that we have to call
scsi_mq_uninit_cmd() to free allocated ressources if SCSI device fails
to handle the request. And it isn't enough to simply call
blk_mq_end_request() to do that if this request is marked as
RQF_DONTPREP.

Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:31:08 -06:00
Christoph Hellwig
e4f36b249b block: fix peeking requests during PM
We need to look for an active PM request until the next softbarrier
instead of looking for the first non-PM request.  Otherwise any cause
of request reordering might starve the PM request(s).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 08:17:06 -06:00
Bart Van Assche
21e768b442 blk-mq: Make blk_mq_get_request() error path less confusing
blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
ignores its argument, this patch does not change any functionality.

References: commit 1ad43c0078 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:33:56 -06:00
Liu Bo
39b4954c0a badblocks: fix wrong return value in badblocks_set if badblocks are disabled
MD's rdev_set_badblocks() expects that badblocks_set() returns 1 if
badblocks are disabled, otherwise, rdev_set_badblocks() will record
superblock changes and return success in that case and md will fail to
report an IO error which it should.

This bug has existed since badblocks were introduced in commit
9e0e252a04 ("badblocks: Add core badblock management code").

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-03 11:29:50 -07:00
weiping zhang
c2e82a2348 blk-mq: fix nr_requests wrong value when modify it from sysfs
if blk-mq use "none" io scheduler, nr_request get a wrong value when
input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
the smaller one min(nr, set->queue_depth), and then q->nr_request get a
wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/scheduler
echo 1000000 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
1000000

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:21:06 -06:00
Christoph Hellwig
ea435e1b93 block: add a poll_fn callback to struct request_queue
That we we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
8ddcd65325 block: introduce GENHD_FL_HIDDEN
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
517bf3c306 block: don't look at the struct device dev_t in disk_devt
The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them.  To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
ef71de8b15 block: add a blk_steal_bios helper
This helpers allows to bounce steal the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
f421e1d9ad block: provide a direct_make_request helper
This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion preventions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Jens Axboe
3e2cb3ad47 Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-4.15/block
Pull NVMe changes from Christoph:

"Below are the currently queue nvme updates for Linux 4.15.  There are
a few more things that could make it for this merge window, but I'd
like to get things into linux-next, especially for the unlikely case
that Linus decided to cut -rc8.

Highlights:
 - support for SGLs in the PCIe driver (Chaitanya Kulkarni)
 - disable I/O schedulers for the admin queue (Israel Rukshin)
 - various Fibre Channel fixes and enhancements (James Smart)
 - various refactoring for better code sharing between transports
   (Sagi Grimberg and me)

as well as lots of little bits from various contributors."
2017-11-03 10:28:51 -06:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Ming Lei
1f460b63d4 blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE
SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

Especailly any request won't be dequeued in this case, we needn't to
worry about IO hang caused by restart vs. dispatch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:34 -06:00
Ming Lei
358a3a6bcc blk-mq: don't handle TAG_SHARED in restart
Now restart is used in the following cases, and TAG_SHARED is for
SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if resource in target/host level isn't satisfied, this SCSI device
will be added in shost->starved_list, and the whole queue will be rerun
(via SCSI's built-in RESTART) in scsi_end_request() after any request
initiated from this host/targe is completed. Forget to mention, host level
resource can't be an issue for blk-mq at all.

- the same is true if resource in the queue level isn't satisfied.

- if there isn't outstanding request on this queue, then SCSI's RESTART
can't work(blk-mq's can't work too), and the queue will be run after
SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
RESTART when this request is finished

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't onprogressing request on this queue, the queue
will be run after SCSI_QUEUE_DELAY

- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver
allocation.

In one word, SCSI's built-in RESTART is enough to cover the queue
rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.

In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
running I/O on these 8 luns concurrently.

Aslo Roman Pen reported the current RESTART is very expensive especialy
when there are lots of LUNs attached in one host, such as in his
test, RESTART causes half of IOPS be cut.

Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
Fixes: 6d8c6c0f97 ("blk-mq: Restart a single queue if tag sets are shared")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:33 -06:00
Ming Lei
b347689ffb blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun.  Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
de14829740 blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
63ba8e31c3 block: kyber: check if there are requests in ctx in kyber_has_work()
There may be request in sw queue, and not fetched to domain queue
yet, so check it in kyber_has_work().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
caf8eb0d60 blk-mq-sched: move actual dispatching into one helper
So that it becomes easy to support to dispatch from sw queue in the
following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
5e3d02bbaf blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch
When the hw queue is busy, we shouldn't take requests from the scheduler
queue any more, otherwise it is difficult to do IO merge.

This patch fixes the awful IO performance on some SCSI devices(lpfc,
qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
hw queue is busy.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Bart Van Assche
4e9b6f2082 block: Fix a race between blk_cleanup_queue() and timeout handling
Make sure that if the timeout timer fires after a queue has been
marked "dying" that the affected requests are finished.

Reported-by: chenxiang (M) <chenxiang66@hisilicon.com>
Fixes: commit 287922eb0b ("block: defer timeouts to a workqueue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: chenxiang (M) <chenxiang66@hisilicon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-30 13:28:10 -06:00
Byungchul Park
e319e1fbd9 block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
Darrick posted the following warning and Dave Chinner analyzed it:

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G        W
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
>  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
>  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&ret.event){+.+.}:
>        lock_acquire+0xab/0x200
>        wait_for_completion_io+0x4e/0x1a0
>        submit_bio_wait+0x7f/0xb0
>        blkdev_issue_zeroout+0x71/0xa0
>        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
>        xfs_bmapi_write+0x374/0x11f0 [xfs]
>        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
>        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
>        iomap_apply+0x43/0xe0
>        dax_iomap_rw+0x89/0xf0
>        xfs_file_dax_write+0xcc/0x220 [xfs]
>        xfs_file_write_iter+0xf0/0x130 [xfs]
>        __vfs_write+0xd9/0x150
>        vfs_write+0xc8/0x1c0
>        SyS_write+0x45/0xa0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #1 (&xfs_nondir_ilock_class){++++}:
>        lock_acquire+0xab/0x200
>        down_write_nested+0x4a/0xb0
>        xfs_ilock+0x263/0x330 [xfs]
>        xfs_setattr_size+0x152/0x370 [xfs]
>        xfs_vn_setattr+0x6b/0x90 [xfs]
>        notify_change+0x27d/0x3f0
>        do_truncate+0x5b/0x90
>        path_openat+0x237/0xa90
>        do_filp_open+0x8a/0xf0
>        do_sys_open+0x11c/0x1f0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
>        up_write+0x1c/0x40
>        xfs_iunlock+0x1d0/0x310 [xfs]
>        xfs_file_fallocate+0x8a/0x310 [xfs]
>        loop_queue_work+0xb7/0x8d0
>        kthread_worker_fn+0xb9/0x1f0
>
> Chain exists of:
>   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
>  Possible unsafe locking scenario by crosslock:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&xfs_nondir_ilock_class);
>   lock((complete)&ret.event);
>                                lock(&(&ip->i_mmaplock)->mr_lock);
>                                unlock((complete)&ret.event);
>
>                *** DEADLOCK ***

The warning is a false positive, caused by the fact that all
wait_for_completion()s in submit_bio_wait() are waiting with the same
lock class.

However, some bios have nothing to do with others, for example in the case
of loop devices, there's no direct connection between the bios of an upper
device and the bios of a lower device(=loop device).

The safest way to assign different lock classes to different devices is
to do it for each gendisk. In other words, this patch assigns a
lockdep_map per gendisk and uses it when initializing completion in
submit_bio_wait().

Analyzed-by: Dave Chinner <david@fromorbit.com>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: amir73il@gmail.com
Cc: axboe@kernel.dk
Cc: david@fromorbit.com
Cc: hch@infradead.org
Cc: idryomov@gmail.com
Cc: johan@kernel.org
Cc: johannes.berg@intel.com
Cc: kernel-team@lge.com
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-26 07:54:17 +02:00
Jens Axboe
4d740bc9f0 mq-deadline: add 'deadline' as a name alias
The scheduler framework now supports looking up the appropriate
scheduler with the {name,mq} tupple. We can register mq-deadline
with the alias of 'deadline', so that switching to 'deadline'
will do the right thing based on the type of driver attached to
it.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:36:55 -06:00
Jens Axboe
8ac0d9a81e elevator: allow name aliases
Since we now lookup elevator types with the appropriate multiqueue
capability, allow schedulers to register with an alias alongside
the real name. This is in preparation for allowing 'mq-deadline'
to register an alias of 'deadline' as well.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:36:50 -06:00
Jens Axboe
2527d99789 elevator: lookup mq vs non-mq elevators
If an IO scheduler is selected via elevator= and it doesn't match
the driver in question wrt blk-mq support, then we fail to boot.

The elevator= parameter is deprecated and only supported for
non-mq devices. Augment the elevator lookup API so that we
pass in if we're looking for an mq capable scheduler or not,
so that we only ever return a valid type for the queue in
question.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:36:45 -06:00
Ilya Dryomov
d5ce4c31d6 block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
is set.  Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

  $ fallocate -zn -l 1k /dev/sdg
  fallocate: fallocate failed: Remote I/O error
  $ fallocate -zn -l 1k /dev/sdg  # OK
  $ fallocate -zn -l 1k /dev/sdg  # OK

The following calls succeed because sd_done() sets ->no_write_same in
response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
__blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
specified.  For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
sd_done() has just set ->no_write_same thus indicating lack of offload
support.

Fixes: c20cfc27a4 ("block: stop using blkdev_issue_write_same for zeroing")
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:28:23 -06:00
Ilya Dryomov
425a4dba79 block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:28:22 -06:00
Ilya Dryomov
bb749b31c2 block: move CAP_SYS_ADMIN check in blkdev_roset()
Check for CAP_SYS_ADMIN before calling into the driver, similar to
blkdev_flushbuf().  This is safer and can spare a check in the driver.

(Currently BLKROSET is overridden by md and rbd, rbd is missing the
check.  md has the check, but it covers a lot more than BLKROSET.)

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:25:00 -06:00
Christoph Hellwig
65e53aab6d block: Use DECLARE_COMPLETION_ONSTACK() in submit_bio_wait()
Simplify the code by getting rid of the submit_bio_ret structure.

(This also helps address a lockdep false positive.)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: amir73il@gmail.com
Cc: axboe@kernel.dk
Cc: darrick.wong@oracle.com
Cc: david@fromorbit.com
Cc: hch@infradead.org
Cc: idryomov@gmail.com
Cc: johan@kernel.org
Cc: johannes.berg@intel.com
Cc: kernel-team@lge.com
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1508921765-15396-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 12:18:59 +02:00
Mark Rutland
6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
Dmitry Monakhov
351499a172 block: Invalidate cache on discard v2
It is reasonable drop page cache on discard, otherwise that pages may
be written by writeback second later, so thin provision devices will
not be happy. This seems to be a  security leak in case of secure discard case.

Also add check for queue_discard flag on early stage.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-24 18:44:57 -06:00
Sagi Grimberg
dab7487bdf block: remove blk_mq_reinit_tagset
No callers left.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-18 19:27:49 +02:00
Sagi Grimberg
149e10f8ff block: introduce blk_mq_tagset_iter
Iterator helper to apply a function on all the
tags in a given tagset. export it as it will be used
outside the block layer later on.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-18 19:27:48 +02:00
Omar Sandoval
8cf4666020 kyber: fix hang on domain token wait queue
When we're getting a domain token, if we fail to get a token on our
first attempt, we put the current hardware queue on a wait queue and
then try again just in case a token was freed after our initial attempt
but before we got on the wait queue. If this second attempt succeeds, we
currently leave the hardware queue on the wait queue. Usually this is
okay; we'll just run the hardware queue one extra time when another
token is freed. However, if the hardware queue doesn't have any other
requests waiting, then when it it gets the extra wakeup, it won't have
anything to free and therefore won't wake up any other hardware queues.
If tokens are limited, then we won't make forward progress and the
device will hang.

Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-17 16:18:11 -06:00
Randy Dunlap
519c8e9ffd block: fix Sphinx kernel-doc warning
Sphinx treats symbols that end with '_' as a kind of special
documentation indicator, so fix that by adding an ending '*'
to it.

../block/bio.c:404: ERROR: Unknown target name: "gfp".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16 13:00:12 -06:00
Al Viro
0e5b935d43 bio_alloc_map_data(): do bmd->iter setup right there
just need to copy it iter instead of iter->nr_segs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:43 -04:00
Al Viro
d16d44ebb0 bio_copy_user_iov(): saner bio size calculation
it's a bounce buffer; we don't *care* how badly is the real
source/destination fragmented, all that matters is the total
size.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:42 -04:00
Al Viro
0a0f151364 bio_map_user_iov(): get rid of copying iov_iter
we do want *iter advanced

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:42 -04:00
Al Viro
98a09d6106 bio_copy_from_iter(): get rid of copying iov_iter
we want the one passed to it advanced, anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:41 -04:00
Al Viro
2884d0be87 move more stuff down into bio_copy_user_iov()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:40 -04:00
Al Viro
e81cef5d30 blk_rq_map_user_iov(): move iov_iter_advance() down
... into bio_{map,copy}_user_iov()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:40 -04:00
Al Viro
b282cc7669 bio_map_user_iov(): get rid of the iov_for_each()
Use iov_iter_npages()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:39 -04:00
Al Viro
98f0bc9905 bio_map_user_iov(): move alignment check into the main loop
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:39 -04:00
Al Viro
e2e115d18b don't rely upon subsequent bio_add_pc_page() calls failing
... they might actually succeed in some cases (when we are at the
queue-imposed segments limit, the next page is not mergable with
the last one we'd got in, but the first page covered by the next
iovec *is* mergable).  Make sure that once it's failed, we are
done with that bio.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:38 -04:00
Al Viro
629e42bcc3 ... and with iov_iter_get_pages_alloc() it becomes even simpler
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:38 -04:00
Al Viro
076098e51b bio_map_user_iov(): switch to iov_iter_get_pages()/iov_iter_advance()
... and to hell with iov_for_each() nonsense

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 17:23:37 -04:00
Al Viro
1cfd0ddd82 bio_copy_user_iov(): don't ignore ->iov_offset
Since "block: support large requests in blk_rq_map_user_iov" we
started to call it with partially drained iter; that works fine
on the write side, but reads create a copy of iter for completion
time.  And that needs to take the possibility of ->iov_iter != 0
into account...

Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10 23:55:14 -04:00
Al Viro
2b04e8f6bb more bio_map_user_iov() leak fixes
we need to take care of failure exit as well - pages already
in bio should be dropped by analogue of bio_unmap_pages(),
since their refcounts had been bumped only once per reference
in bio.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10 23:54:57 -04:00
Vitaly Mayatskikh
95d78c28b5 fix unbalanced page refcounting in bio_map_user_iov
bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.

Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10 23:54:51 -04:00
Shaohua Li
85acb3ba2f block: set request_list for request
Legacy queue sets request's request_list, mq doesn't. This makes mq does
the same thing, so we can find cgroup of a request. Note, we really
only use blkg field of request_list, it's pointless to allocate mempool
for request_list in mq case.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:48:16 -06:00
Shaohua Li
eca8b53a67 blk-stat: delete useless code
Fix two issues:
- the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except
  sum it to global stat. We can do the calculation there. The flush just
  wastes cpu time.
- some fields are signed int/s64. I don't see the point.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:48:14 -06:00
Jiufei Xue
53cfdc10a9 blk-throttle: fix null pointer dereference while throttling writeback IOs
A null pointer dereference can occur when blkcg is removed manually
with writeback IOs inflight. This is caused by the following case:

Writeback kworker submit the bio and set bio->bi_cg_private to tg
in blk_throtl_assoc_bio.
Then we remove the block cgroup manually, the blkg and tg would be
freed if there is no request inflight.
When the submitted bio come back, blk_throtl_bio_endio() fetch the tg
which was already freed.

Fix this by increasing the refcount of blkg in funcion
blk_throtl_assoc_bio() so that the blkg will not be freed until the
bio_endio called.

Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:09:34 -06:00
weiping zhang
58a9edce0a blkcg: check pol->cpd_free_fn before free cpd
check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when free cpd.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 09:04:47 -06:00
Paolo Valente
99fead8d38 block, bfq: fix unbalanced decrements of burst size
The commit "block, bfq: decrease burst size when queues in burst
exit" introduced the decrement of burst_size on the removal of a
bfq_queue from the burst list. Unfortunately, this decrement can
happen to be performed even when burst size is already equal to 0,
because of unbalanced decrements. A description follows of the cause
of these unbalanced decrements, namely a wrong assumption, and of the
way how this wrong assumption leads to unbalanced decrements.

The wrong assumption is that a bfq_queue can exit only if the process
associated with the bfq_queue has exited. This is false, because a
bfq_queue, say Q, may exit also as a consequence of a merge with
another bfq_queue. In this case, Q exits because the I/O of its
associated process has been redirected to another bfq_queue.

The decrement unbalance occurs because Q may then be re-created after
a split, and added back to the current burst list, *without*
incrementing burst_size. burst_size is not incremented because Q is
not a new bfq_queue added to the burst list, but a bfq_queue only
temporarily removed from the list, and, before the commit "bfq-sq,
bfq-mq: decrease burst size when queues in burst exit", burst_size was
not decremented when Q was removed.

This commit addresses this issue by just checking whether the exiting
bfq_queue is a merged bfq_queue, and, in that case, not decrementing
burst_size. Unfortunately, this still leaves room for unbalanced
decrements, in the following rarer case: on a split, the bfq_queue
happens to be inserted into a different burst list than that it was
removed from when merged. If this happens, the number of elements in
the new burst list becomes higher than burst_size (by one). When the
bfq_queue then exits, it is of course not in a merged state any
longer, thus burst_size is decremented, which results in an unbalanced
decrement.  To handle this sporadic, unlucky case in a simple way,
this commit also checks that burst_size is larger than 0 before
decrementing it.

Finally, this commit removes an useless, extra check: the check that
the bfq_queue is sync, performed before checking whether the bfq_queue
is in the burst list. This extra check is redundant, because only sync
bfq_queues can be inserted into the burst list.

Fixes: 7cb04004fa ("block, bfq: decrease burst size when queues in burst exit")
Reported-by: Philip Müller <philm@manjaro.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Philip Müller <philm@manjaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09 09:54:58 -06:00
Luca Miccio
b5dc5d4d1f block,bfq: Disable writeback throttling
Similarly to CFQ, BFQ has its write-throttling heuristics, and it
is better not to combine them with further write-throttling
heuristics of a different nature.
So this commit disables write-back throttling for a device if BFQ
is used as I/O scheduler for that device.

Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09 08:29:21 -06:00
Tim Hansen
4078def82f block/bio: Remove null checks before mempool_destroy in bioset_free
This patch removes redundant checks for null values on bio_pool and
bvec_pool.

Found using make coccicheck M=block/ on linux-net tree on the
next-20170929 tag.

Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06 13:03:14 -06:00
Tim Hansen
4b14a5c5d5 block: remove unnecessary NULL checks in bioset_integrity_free()
mempool_destroy() already checks for a NULL value being passed in, this
eliminates duplicate checks.

This was caught by running make coccicheck M=block/ on linus' tree on
commit 77ede3a014 (current head as of this
patch).

Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06 13:03:12 -06:00
Christoph Hellwig
5fdee2127f block: remove QUEUE_FLAG_STACKABLE
We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-05 15:22:59 -06:00
Jens Axboe
fc13457f74 blk-mq: document the need to have STARTED and COMPLETED share a byte
For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 11:22:24 -06:00
Peter Zijlstra
a7af0af321 blk-mq: attempt to fix atomic flag memory ordering
Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

Also down-grade the barrier to smp_wmb(), this is cheaper for
PowerPC/ARM and doesn't cost anything extra on x86.

And it documents the STARTING vs COMPLETE ordering. Although I've not
been entirely successful in reverse engineering the blk-mq state
machine so there might still be more funnies around timeout vs
requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Fixes: 538b753418 ("blk-mq: request deadline must be visible before marking rq as started")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 11:20:11 -06:00
Benjamin Block
eab40cf336 bsg-lib: fix use-after-free under memory-pressure
When under memory-pressure it is possible that the mempool which backs
the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count
emergency buffers - in case it can't get a regular allocation. These
buffers are preallocated and once they are also used, they are
re-supplied with old finished requests from the same request_queue (see
mempool_free()).

The bug is, when re-supplying the emergency pool, the old requests are
not again ran through the callback mempool_t->alloc(), and thus also not
through the callback bsg_init_rq(). Thus we skip initialization, and
while the sense-buffer still should be good, scsi_request->cmd might
have become to be an invalid pointer in the meantime. When the request
is initialized in bsg.c, and the user's CDB is larger than BLK_MAX_CDB,
bsg will replace it with a custom allocated buffer, which is freed when
the user's command is finished, thus it dangles afterwards. When next a
command is sent by the user that has a smaller/similar CDB as
BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by
scsi_request->__cmd, will not make a custom allocation, and write into
undefined memory.

Fix this by splitting bsg_init_rq() into two functions:
 - bsg_init_rq() is changed to only do the allocation of the
   sense-buffer, which is used to back the bsg job's reply buffer. This
   pointer should never change during the lifetime of a scsi_request, so
   it doesn't need re-initialization.
 - bsg_initialize_rq() is a new function that makes use of
   'struct request_queue's initialize_rq_fn callback (which was
   introduced in v4.12). This is always called before the request is
   given out via blk_get_request(). This function does the remaining
   initialization that was previously done in bsg_init_rq(), and will
   also do it when the request is taken from the emergency-pool of the
   backing mempool.

Fixes: 50b4d48552 ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
Cc: <stable@vger.kernel.org> # 4.11+
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 08:35:04 -06:00
Omar Sandoval
70e62f4bac blk-mq-debugfs: fix device sched directory for default scheduler
In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
directories if a default scheduler was already configured by
blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
the same for the device-wide sched directory. Fix it.

Fixes: d332ce0918 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 15:58:55 -06:00
Joseph Qi
4f02fb7617 blk-throttle: fix possible io stall when upgrade to max
There is a case which will lead to io stall. The case is described as
follows.
/test1
  |-subtest1
/test2
  |-subtest2
And subtest1 and subtest2 each has 32 queued bios already.

Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
bios as follows:
1) tg=subtest1, do nothing;
2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
left, no need to schedule next dispatch;
3) tg=subtest2, do nothing;
4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
left, no need to schedule next dispatch;
5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
to /; note that test1 and test2 each still has 16 queued bios left;
6) tg=/, try to schedule next dispatch, but since disptime is now
(update in tg_update_disptime, wait=0), pending timer is not scheduled
in fact;
7) In throtl_upgrade_state it totally dispatches 32 queued bios and with
32 left. test1 and test2 each has 16 queued bios;
8) throtl_pending_timer_fn sees the left over bios, but could do
nothing, because throtl_select_dispatch returns 0, and test1/test2 has
no pending tg.

The blktrace shows the following:
8,32   0        0     2.539007641     0  m   N throtl upgrade to max
8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16

So force schedule dispatch if there are pending children.

Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 15:41:55 -06:00
Christoph Hellwig
9c9883744d block: move __elv_next_request to blk-core.c
No need to have this helper inline in a header.  Also drop the __ prefix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:43:04 -06:00
Paolo Valente
7cb04004fa block, bfq: decrease burst size when queues in burst exit
If many queues belonging to the same group happen to be created
shortly after each other, then the concurrent processes associated
with these queues have typically a common goal, and they get it done
as soon as possible if not hampered by device idling.  Examples are
processes spawned by git grep, or by systemd during boot. As for
device idling, this mechanism is currently necessary for weight
raising to succeed in its goal: privileging I/O.  In view of these
facts, BFQ does not provide the above queues with either weight
raising or device idling.

On the other hand, a burst of queue creations may be caused also by
the start-up of a complex application. In this case, these queues need
usually to be served one after the other, and as quickly as possible,
to maximise responsiveness. Therefore, in this case the best strategy
is to weight-raise all the queues created during the burst, i.e., the
exact opposite of the strategy for the above case.

To distinguish between the two cases, BFQ uses an empirical burst-size
threshold, found through extensive tests and monitoring of daily
usage. Only large bursts, i.e., burst with a size above this
threshold, are considered as generated by a high number of parallel
processes. In this respect, upstart-based boot proved to be rather
hard to detect as generating a large burst of queue creations, because
with upstart most of the queues created in a burst exit *before* the
next queues in the same burst are created. To address this issue, I
changed the burst-detection mechanism so as to not decrease the size
of the current burst even if one of the queues in the burst is
eliminated.

Unfortunately, this missing decrease causes false positives on very
fast systems: on the start-up of a complex application, such as
libreoffice writer, so many queues are created, served and exited
shortly after each other, that a large burst of queue creations is
wrongly detected as occurring. These false positives just disappear if
the size of a burst is decreased when one of the queues in the burst
exits. This commit restores the missing burst-size decrease, relying
of the fact that upstart is apparently unlikely to be used on systems
running this and future versions of the kernel.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:58 -06:00
Paolo Valente
894df937e0 block, bfq: let early-merged queues be weight-raised on split too
A just-created bfq_queue, say Q, may happen to be merged with another
bfq_queue on the very first invocation of the function
__bfq_insert_request. In such a case, even if Q would clearly deserve
interactive weight raising (as it has just been created), the function
bfq_add_request does not make it to be invoked for Q, and thus to
activate weight raising for Q. As a consequence, when the state of Q
is saved for a possible future restore, after a split of Q from the
other bfq_queue(s), such a state happens to be (unjustly)
non-weight-raised. Then the bfq_queue will not enjoy any weight
raising on the split, even if should still be in an interactive
weight-raising period when the split occurs.

This commit solves this problem as follows, for a just-created
bfq_queue that is being early-merged: it stores directly, in the saved
state of the bfq_queue, the weight-raising state that would have been
assigned to the bfq_queue if not early-merged.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:57 -06:00
Paolo Valente
3e2bdd6dff block, bfq: check and switch back to interactive wr also on queue split
As already explained in the message of commit "block, bfq: fix
wrong init of saved start time for weight raising", if a soft
real-time weight-raising period happens to be nested in a larger
interactive weight-raising period, then BFQ restores the interactive
weight raising at the end of the soft real-time weight raising. In
particular, BFQ checks whether the latter has ended only on request
dispatches.

Unfortunately, the above scheme fails to restore interactive weight
raising in the following corner case: if a bfq_queue, say Q,
1) Is merged with another bfq_queue while it is in a nested soft
real-time weight-raising period. The weight-raising state of Q is
then saved, and not considered any longer until a split occurs.
2) Is split from the other bfq_queue(s) at a time instant when its
soft real-time weight raising is already finished.
On the split, while resuming the previous, soft real-time
weight-raised state of the bfq_queue Q, BFQ checks whether the
current soft real-time weight-raising period is actually over. If so,
BFQ switches weight raising off for Q, *without* checking whether the
soft real-time period was actually nested in a non-yet-finished
interactive weight-raising period.

This commit addresses this issue by adding the above missing check in
bfq_queue splits, and restoring interactive weight raising if needed.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:56 -06:00
Paolo Valente
4baa8bb13f block, bfq: fix wrong init of saved start time for weight raising
This commit fixes a bug that causes bfq to fail to guarantee a high
responsiveness on some drives, if there is heavy random read+write I/O
in the background. More precisely, such a failure allowed this bug to
be found [1], but the bug may well cause other yet unreported
anomalies.

BFQ raises the weight of the bfq_queues associated with soft real-time
applications, to privilege the I/O, and thus reduce latency, for these
applications. This mechanism is named soft-real-time weight raising in
BFQ. A soft real-time period may happen to be nested into an
interactive weight raising period, i.e., it may happen that, when a
bfq_queue switches to a soft real-time weight-raised state, the
bfq_queue is already being weight-raised because deemed interactive
too. In this case, BFQ saves in a special variable
wr_start_at_switch_to_srt, the time instant when the interactive
weight-raising period started for the bfq_queue, i.e., the time
instant when BFQ started to deem the bfq_queue interactive. This value
is then used to check whether the interactive weight-raising period
would still be in progress when the soft real-time weight-raising
period ends.  If so, interactive weight raising is restored for the
bfq_queue. This restore is useful, in particular, because it prevents
bfq_queues from losing their interactive weight raising prematurely,
as a consequence of spurious, short-lived soft real-time
weight-raising periods caused by wrong detections as soft real-time.

If, instead, a bfq_queue switches to soft-real-time weight raising
while it *is not* already in an interactive weight-raising period,
then the variable wr_start_at_switch_to_srt has no meaning during the
following soft real-time weight-raising period. Unfortunately the
handling of this case is wrong in BFQ: not only the variable is not
flagged somehow as meaningless, but it is also set to the time when
the switch to soft real-time weight-raising occurs. This may cause an
interactive weight-raising period to be considered mistakenly as still
in progress, and thus a spurious interactive weight-raising period to
start for the bfq_queue, at the end of the soft-real-time
weight-raising period. In particular the spurious interactive
weight-raising period will be considered as still in progress, if the
soft-real-time weight-raising period does not last very long. The
bfq_queue will then be wrongly privileged and, if I/O bound, will
unjustly steal bandwidth to truly interactive or soft real-time
bfq_queues, harming responsiveness and low latency.

This commit fixes this issue by just setting wr_start_at_switch_to_srt
to minus infinity (farthest past time instant according to jiffies
macros): when the soft-real-time weight-raising period ends, certainly
no interactive weight-raising period will be considered as still in
progress.

[1] Background I/O Type: Random - Background I/O mix: Reads and writes
- Application to start: LibreOffice Writer in
http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:54 -06:00
Jens Axboe
7beb2f845b blk-mq: wire up completion notifier for laptop mode
For some reason, the laptop mode IO completion notified was never wired
up for blk-mq. Ensure that we trigger the callback appropriately, to arm
the laptop mode flush timer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:38:17 -06:00
Jens Axboe
5385fa47d8 blk-mq-tag: kill unused tag enums
We don't have any notion of a tagging cache anymore, and haven't
for a long time. Kill off the unused enums.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-01 01:26:21 -06:00
weiping zhang
547248736a blk-mq: remove unused function hctx_allow_merges
since 9bddeb2a5b "blk-mq: make per-sw-queue bio merge as default .bio_merge"
there is no caller for this function.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-30 02:17:37 -06:00
Shaohua Li
af551fb3be blkcg: delete unused APIs
Nobody uses the APIs right now.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-26 07:41:22 -06:00
Shaohua Li
f5c156c4c2 block: fix a crash caused by wrong API
part_stat_show takes a part device not a disk, so we should use
part_to_disk.

Fixes: d62e26b3ffd2("block: pass in queue to inflight accounting")
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Waiman Long
5acb3cc2c2 blktrace: Fix potential deadlock between delete & sysfs ops
The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Fix typo in patch subject line, and prune a comment detailing how
the code used to work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Christoph Hellwig
f507b54dcc bsg-lib: don't free job in bsg_prepare_job
The job structure is allocated as part of the request, so we should not
free it in the error path of bsg_prepare_job.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00