Currently no-idle queues in cfq are not serviced fairly:
even if they can only dispatch a small number of requests at a time,
they have to compete with idling queues to be serviced, experiencing
large latencies.
Note, instead, that no-idle queues are the ones that would
benefit most from low latency; in fact, they are any of:
* processes with large think times (e.g. interactive ones like file
managers)
* seeky processes (e.g. programs faulting in their code at startup)
* or processes marked as no-idle from upper levels, to improve the
latencies of those requests.
This patch improves fairness and latency for those queues by:
* separating sync idle, sync no-idle and async queues into separate
service_trees, for each priority
* servicing all no-idle queues together
* idling when the last no-idle queue has been serviced, to anticipate
more no-idle work
* computing the timeslices allotted to the idle and no-idle
service_trees proportionally to the number of processes in each set
(sketched below)
Servicing all no-idle queues together should provide a performance boost
for NCQ-capable drives, without compromising fairness.
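A minimal sketch of the proportional timeslice split, using hypothetical
names (wl_type, service_tree.count, target_latency) rather than the exact
identifiers from the patch:

    /* Hypothetical sketch: split a latency target between the sync idle,
     * sync no-idle and async service trees in proportion to how many
     * busy queues each tree currently holds. */
    enum wl_type { SYNC_IDLE_WL, SYNC_NOIDLE_WL, ASYNC_WL, WL_COUNT };

    struct service_tree {
        unsigned int count;     /* busy queues queued on this tree */
    };

    static unsigned int tree_slice(const struct service_tree trees[WL_COUNT],
                                   enum wl_type type,
                                   unsigned int target_latency)
    {
        unsigned int total = 0;
        int i;

        for (i = 0; i < WL_COUNT; i++)
            total += trees[i].count;
        if (!total)
            return 0;
        /* each tree's slice is proportional to its share of processes */
        return target_latency * trees[type].count / total;
    }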
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq can disable idling for queues in various circumstances.
When workloads of different priorities are competing, if the higher
priority queue has idling disabled, lower priority queues may steal
its disk share. For example, in a scenario with an RT process
performing seeky reads vs a BE process performing sequential reads,
on NCQ-enabled hardware with low_latency unset,
the RT process will dispatch only its few pending requests once per
full service slice granted to the BE process.
The patch solves this issue by always idling on the last queue of a
given priority class higher than idle. If the same process, or one
that can preempt it (i.e. one at the same or higher priority), submits a
new request within the idle window, the lower priority queue won't
dispatch, saving the disk bandwidth for higher priority ones.
Note: this doesn't touch the non_rotational + NCQ case (no hardware
to test if this is a benefit in that case).
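A rough sketch of the rule, with hypothetical parameter names; the actual
patch works on CFQ's internal per-class bookkeeping:

    #include <stdbool.h>

    enum io_class { CLASS_RT, CLASS_BE, CLASS_IDLE };

    /* Idle even for a "no idle" queue when it is the last busy queue of a
     * class above idle, so that a lower class cannot grab its disk share. */
    static bool should_idle(bool queue_wants_idle, enum io_class ioprio_class,
                            int busy_queues_in_class)
    {
        if (queue_wants_idle)
            return true;
        return ioprio_class != CLASS_IDLE && busy_queues_in_class == 1;
    }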
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We use different service trees for different priority classes.
This allows us to simplify the service tree insertion code, which no
longer has to consider priority while walking the tree.
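A minimal sketch of the idea, assuming a hypothetical per-class array of
trees:

    enum io_class { CLASS_RT, CLASS_BE, CLASS_IDLE, NR_CLASSES };

    struct rb_tree;                 /* stand-in for the actual rb-tree type */
    static struct rb_tree *service_trees[NR_CLASSES];

    /* Pick the tree by class up front; the insertion walk can then order
     * entries purely by their key, without comparing priorities. */
    static struct rb_tree *service_tree_for(enum io_class ioprio_class)
    {
        return service_trees[ioprio_class];
    }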
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We embed a pointer to the service tree in each queue, to handle multiple
service trees easily.
Service trees are enriched with a counter.
cfq_add_rq_rb is invoked after putting the rq in the fifo, to ensure
that all fields in rq are properly initialized.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When the number of processes performing I/O concurrently increases,
a fixed time slice per process will cause large latencies.
This patch, when low_latency mode is enabled, scales the time slice
assigned to each process according to a 300ms target latency.
In order to keep fairness among processes:
* The number of active processes is computed using a special form of
running average that quickly follows sudden increases (to keep latency low)
and decreases slowly (to preserve fairness in spite of rapid drops in this
value).
To safeguard sequential bandwidth, we impose a minimum time slice
(computed using 2*cfq_slice_idle as base, adjusted according to priority
and async-ness).
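A simplified sketch of the scaling, using hypothetical names (base_slice,
target_latency, busy_estimate); the real code derives the base slice from
priority and sync/async as before:

    /* Hypothetical sketch: shrink the per-process slice when many processes
     * are active, but never below a floor derived from the idle window. */
    static unsigned int scaled_slice(unsigned int base_slice,     /* e.g. 100 ms */
                                     unsigned int target_latency, /* 300 ms */
                                     unsigned int busy_estimate,  /* running avg */
                                     unsigned int slice_idle)     /* e.g. 8 ms */
    {
        unsigned int slice = base_slice;
        unsigned int min_slice = 2 * slice_idle;

        if (busy_estimate && base_slice * busy_estimate > target_latency)
            slice = target_latency / busy_estimate;
        if (slice < min_slice)
            slice = min_slice;
        return slice;
    }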
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the active queue doesn't have enough requests and the idle window opens,
cfq will not dispatch sufficient requests to the hardware. In this situation,
the current code zeroes hw_tag. But this happens because cfq doesn't dispatch
enough requests, not because the hardware queue doesn't work. Don't zero
hw_tag in that case.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_queues are merged if they are issuing requests within the mean seek
distance of one another. This patch detects when the cooperation stops and
breaks the queues back up.
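A rough sketch of the kind of closeness test involved, with hypothetical
names; the real check also bounds the distance and keeps per-queue seek
statistics:

    #include <stdbool.h>

    typedef unsigned long long sector_t;

    /* Two queues are considered cooperating when the next request of one
     * lands within the other's mean seek distance of its last position. */
    static bool queues_close(sector_t last_pos, sector_t next_pos,
                             sector_t mean_seek_distance)
    {
        sector_t dist = last_pos > next_pos ? last_pos - next_pos
                                            : next_pos - last_pos;

        return dist <= mean_seek_distance;
    }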
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The flag used to indicate that a cfqq was allowed to jump ahead in the
scheduling order due to submitting a request close to the queue that
just executed. Since closely cooperating queues are now merged, the flag
holds little meaning. Change it to indicate that multiple queues were
merged. This will later be used to allow the breaking up of merged queues
when they are no longer cooperating.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When cooperating cfq_queues are detected currently, they are allowed to
skip ahead in the scheduling order. It is much more efficient to
automatically share the cfq_queue data structure between cooperating processes.
Performance of the read-test2 benchmark (which is written to emulate the
dump(8) utility) went from 12MB/s to 90MB/s on my SATA disk. NFS servers
with multiple nfsd threads also saw performance increases.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
async cfq_queues are already shared between processes within the same
priority, and forthcoming patches will change the mapping of cic to sync
cfq_queue from 1:1 to 1:N. So, calculate the seekiness of a process
based on the cfq_queue instead of the cfq_io_context.
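A simplified sketch of keeping the seek statistics on the queue itself,
with hypothetical field names; CFQ's real bookkeeping uses a weighted,
fixed-point history of samples:

    typedef unsigned long long sector_t;

    struct queue_seek_stats {
        sector_t last_end_pos;  /* end sector of the previous request */
        sector_t seek_mean;     /* smoothed seek distance */
    };

    /* Update the smoothed seek distance per cfq_queue, so seekiness is a
     * property of the queue rather than of the cfq_io_context. */
    static void update_seek_mean(struct queue_seek_stats *s, sector_t new_pos)
    {
        sector_t dist = new_pos > s->last_end_pos
                            ? new_pos - s->last_end_pos
                            : s->last_end_pos - new_pos;

        /* weighted running average: 7 parts history, 1 part new sample */
        s->seek_mean = (7 * s->seek_mean + dist) / 8;
        s->last_end_pos = new_pos;
    }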
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the average think time is larger than the remaining time slice
for any given queue, don't allow it to idle. A successful idle also
means that we need to dispatch and complete a request, so if we don't
even have time left for the idle process, we would overrun the slice
in any case.
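A minimal sketch of the check, with hypothetical names:

    #include <stdbool.h>

    /* Don't arm the idle timer when the queue's average think time already
     * exceeds what is left of its slice. */
    static bool may_idle(unsigned long think_time_avg,
                         unsigned long slice_remaining)
    {
        return think_time_avg <= slice_remaining;
    }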
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Saves 16 bytes of text, woohoo. But the more important point is
that it makes the code more readable when returning bool for 0/1
cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ enables idle only for processes that think less than the allowed
idle time. Since idle time is lower for seeky queues, we should use the
correct value in the comparison.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We should subtract the slice residual from the rb tree key, since
a negative residual count indicates that the cfqq overran its slice
the last time. Hence we want to add the overrun time, to position
it a bit further away in the service tree.
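A tiny arithmetic sketch of the effect, with hypothetical names:

    /* key = base_key - slice_residual.  A queue that overran its slice has
     * a negative residual, so subtracting it adds the overrun time and
     * pushes the queue further back in the service tree. */
    static long rb_key(long base_key, long slice_residual)
    {
        return base_key - slice_residual;
    }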
Reported-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit a9327cac44 added separate read and write statistics for
in_flight requests, and exported the number of read and write requests
in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time was higher than normal.
So this was reverted by commit 0f78ab9899.
The problem was in part_round_stats_single(); I missed the following:
        if (now == part->stamp)
                return;

-       if (part->in_flight) {
+       if (part_in_flight(part)) {
                __part_stat_add(cpu, part, time_in_queue,
                                part_in_flight(part) * (now - part->stamp));
                __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
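For context, a hedged and simplified sketch of what the in-flight sum looks
like once read and write counters are kept separately (names here are
illustrative, not the exact genhd definitions):

    /* With in_flight split into read/write counters, the time-in-queue
     * accounting must look at both, hence part_in_flight() in the check. */
    struct part_stats {
        unsigned int in_flight[2];      /* [0] = read, [1] = write */
    };

    static unsigned int part_in_flight_total(const struct part_stats *p)
    {
        return p->in_flight[0] + p->in_flight[1];
    }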
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It was briefly introduced to allow CFQ to do delayed scheduling,
but we ended up removing that feature again. So let's kill the
function and its export, and just switch CFQ back to normal work
scheduling, since it now passes in a '0' delay from all call
sites.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The RR service tree is indexed by a key that is relative to current jiffies.
This can cause problems on jiffies wraparound.
The patch fixes it by using a time_before comparison, and by changing
the add_front path to use a relative number, too.
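A small sketch of why the wrap-safe comparison matters; the kernel's
time_before() boils down to a signed subtraction like this (simplified,
assuming unsigned long jiffies values):

    #include <stdbool.h>

    /* Wrap-safe "a is before b": the signed difference stays meaningful
     * even when the jiffies counter wraps, unlike a plain 'a < b'. */
    static bool jiffies_before(unsigned long a, unsigned long b)
    {
        return (long)(a - b) < 0;
    }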
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq uses rq->start_time as the fifo indicator, but that field may
get modified prior to cfq doing its fifo list adjustment when
a request gets merged with another request. This can cause the
fifo list to become unordered.
Reported-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We cannot delay for the first dispatch of the async queue if it
hasn't dispatched at all, since that could present a local user
DoS attack vector using an app that just did slow timed sync reads
while filling memory.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Not all users of the topology information want to use libblkid. Provide
the topology information through bdev ioctls.
Also clarify sector size comments for existing BLK ioctls.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Don't think that's necessarily a perfect description of what this
option fiddles with, but it's probably better than 'desktop'.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This slowly ramps up the async queue depth based on the time
passed since the sync IO, and doesn't allow async at all until
a sync slice period has passed.
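A rough sketch of the ramp-up, with hypothetical names; the real code ties
the allowed depth to CFQ's slice and idle parameters:

    /* Allow no async dispatch until a full sync slice has elapsed since the
     * last sync I/O, then grow the allowed async depth with elapsed time. */
    static unsigned int async_depth(unsigned long now,
                                    unsigned long last_sync_io,
                                    unsigned long sync_slice,   /* e.g. 100 ms */
                                    unsigned int max_depth)
    {
        unsigned long periods;

        if (!sync_slice)
            return max_depth;
        periods = (now - last_sync_io) / sync_slice;
        if (periods > max_depth)
            return max_depth;
        return (unsigned int)periods;
    }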
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Do not allow more than max_dispatch requests from an async queue if some
sync request has finished recently. This is in the hope that sync activity
is still going on in the system and we might receive a sync request soon,
most likely from a sync queue that just finished a request and on which we
did not enable idling (see the sketch below).
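A hedged sketch of the throttle, with hypothetical names:

    #include <stdbool.h>

    /* Cap async dispatch while sync completions are recent, in the hope
     * that more sync work (which we would rather serve first) arrives. */
    static bool may_dispatch_async(unsigned int async_in_flight,
                                   unsigned int max_dispatch,
                                   unsigned long now,
                                   unsigned long last_sync_completion,
                                   unsigned long sync_window)
    {
        bool sync_recent = now - last_sync_completion < sync_window;

        if (sync_recent && async_in_flight >= max_dispatch)
            return false;
        return true;
    }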
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
AS is mostly a subset of CFQ, so there's little point in still
providing this separate IO scheduler. Hopefully at some point we
can get down to one single IO scheduler again; at least this brings
us closer by having only one intelligent IO scheduler.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This is basically identical to what Vivek Goyal posted, but combined
into one and labelled 'desktop' instead of 'fairness'. The goal
is to continue to improve on the latency side of things as it relates
to interactiveness, keeping the questionable bits under this sysfs
tunable so it would be easy for throughput-only people to turn off.
Apart from adding the interactive sysfs knob, it also adds the
behavioural change of allowing slice idling even if the hardware
does tagged command queuing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since 2.6.31 now has request-based device-mapper, it's useful to have
a tracepoint for request-remapping as well as bio-remapping.
This patch adds a tracepoint for request-remapping, trace_block_rq_remap().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently we set the bio size to the byte equivalent of the blocks to
be trimmed when submitting the initial DISCARD ioctl. That means it
is subject to the max_hw_sectors limitation of the HBA, which is
much lower than the size of a DISCARD request we can support.
Add a separate max_discard_sectors tunable to limit the size for discard
requests.
We limit the maximum discard request size in bytes to 32 bits, as that is
the limit of bio->bi_size. This could be much larger if we had a way to pass
that information through the block layer.
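For illustration, a hedged example of a driver using the new limit;
blk_queue_max_discard_sectors() is the helper this patch adds, and the
value here is only an example:

    #include <linux/blkdev.h>

    /* Example only: advertise that discard requests may span up to 8 MiB.
     * A single discard bio is still bounded by the 32-bit bio->bi_size. */
    static void example_setup_discard(struct request_queue *q)
    {
        blk_queue_max_discard_sectors(q, 16384);  /* 16384 * 512 B = 8 MiB */
    }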
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
prepare_discard_fn() was being called in a place where memory allocation
was effectively impossible. This makes it inappropriate for all but
the most trivial translations of Linux's DISCARD operation to the block
command set. Additionally, adding a payload there makes the ownership
of the bio's backing memory unclear, as it is now allocated by the device
driver and not by the submitter as usual.
It is replaced with QUEUE_FLAG_DISCARD, which indicates whether the
queue supports discard operations. blkdev_issue_discard now allocates a
one-page, sector-length payload, which is the right thing for the common
ATA and SCSI implementations.
The mtd implementation of prepare_discard_fn() is replaced with simply
checking for the request being a discard.
Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx>,
which did the prepare_discard_fn conversion but not yet the different
payload allocation.
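For illustration, a hedged sketch of how a driver advertises discard
support with the new flag instead of registering a prepare_discard_fn
(helper names as in the block layer of that era):

    #include <linux/blkdev.h>

    /* Example only: mark the queue as supporting discards;
     * blkdev_issue_discard() refuses queues without this flag. */
    static void example_enable_discard(struct request_queue *q)
    {
        queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
    }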
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Stacking devices do not have an inherent max_hw_sectors limit. Set the
default to INT_MAX so we are bounded only by the capabilities of the
underlying storage.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The topology changes unintentionally caused SAFE_MAX_SECTORS to be set
for stacking devices. Set the default limit to BLK_DEF_MAX_SECTORS and
provide SAFE_MAX_SECTORS in blk_queue_make_request() for legacy hw
drivers that depend on the old behavior.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This allows subsystems to provide devtmpfs with non-default permissions
for the device node. Instead of the default mode of 0600, null, zero,
random, urandom, full, tty, ptmx now have a mode of 0666, which allows
non-privileged processes to access standard device nodes in case no
other userspace process applies the expected permissions.
This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complaint.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
debugfs: Modify default debugfs directory for debugging pktcdvd.
debugfs: Modified default dir of debugfs for debugging UHCI.
debugfs: Change debugfs directory of IWMC3200
debugfs: Change debuhgfs directory of trace-events-sample.h
debugfs: Fix mount directory of debugfs by default in events.txt
hpilo: add poll f_op
hpilo: add interrupt handler
hpilo: staging for interrupt handling
driver core: platform_device_add_data(): use kmemdup()
Driver core: Add support for compatibility classes
uio: add generic driver for PCI 2.3 devices
driver-core: move dma-coherent.c from kernel to driver/base
mem_class: fix bug
mem_class: use minor as index instead of searching the array
driver model: constify attribute groups
UIO: remove 'default n' from Kconfig
Driver core: Add accessor for device platform data
Driver core: move dev_get/set_drvdata to drivers/base/dd.c
Driver core: add new device to bus's list before probing
Let attribute group vectors be declared "const". We'd
like to let most attribute metadata live in read-only
sections... this is a start.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict in kernel/sched.c, as per Tejun Heo.