We should use the GFP flags that the caller specified instead of picking
our own. All the callers specify GFP_KERNEL so this doesn't make a
difference to how the kernel runs, it's just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup changes from Tejun Heo:
"Out of the 8 commits, one fixes a long-standing locking issue around
tasklist walking and others are cleanups."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
cgroup: remove cgroup_subsys argument from callbacks
cgroup: remove extra calls to find_existing_css_set
cgroup: replace tasklist_lock with rcu_read_lock
cgroup: simplify double-check locking in cgroup_attach_proc
cgroup: move struct cgroup_pidlist out from the header file
cgroup: remove cgroup_attach_task_current_cg()
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
After the previous patch to cfq, there's no ioc_get_changed() user
left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all
related stuff.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq caches the associated cfqq's for a given cic. The cache needs to
be flushed if the cic's ioprio or blkcg has changed. It is currently
done by requiring the changing action to set the respective
ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(),
which involves iterating through all the affected icqs.
All cfq wants to know is whether ioprio and/or blkcg have changed
since the last flush and can be easily achieved by just remembering
the current ioprio and blkcg ID in cic.
This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to
use the remembered value instead, and updates the cfq_set_request()
path such that, instead of using ioc_get_changed(), the current values
are compared against the remembered ones and the appropriate flush
action is triggered when they differ. The condition tests are moved
inside both _changed functions, which are now named
check_ioprio_changed() and check_blkcg_changed().
ioprio.h::task_ioprio*() can't be used anymore and is replaced with an
open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio().
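For illustration only, the remember-and-compare approach boils down to
something like the following self-contained userspace sketch (the struct
and function names here are made up, not the actual cfq code):

#include <stdio.h>

/* Hypothetical per-context cache mirroring the idea of cic->{ioprio,blkcg_id}. */
struct ctx_cache {
    int ioprio;                /* last ioprio this cache was built for */
    unsigned long blkcg_id;    /* last blkcg identity seen */
};

/* Compare current values against the remembered ones; "flush" on change. */
static void check_changed(struct ctx_cache *c, int cur_ioprio,
                          unsigned long cur_blkcg_id)
{
    if (c->ioprio != cur_ioprio) {
        printf("ioprio changed %d -> %d, flush cached queues\n",
               c->ioprio, cur_ioprio);
        c->ioprio = cur_ioprio;
    }
    if (c->blkcg_id != cur_blkcg_id) {
        printf("blkcg changed %lu -> %lu, flush cached queues\n",
               c->blkcg_id, cur_blkcg_id);
        c->blkcg_id = cur_blkcg_id;
    }
}

int main(void)
{
    struct ctx_cache c = { .ioprio = 0, .blkcg_id = 1 };

    check_changed(&c, 4, 1);    /* ioprio change triggers a flush */
    check_changed(&c, 4, 2);    /* blkcg change triggers a flush */
    check_changed(&c, 4, 2);    /* no change, nothing to do */
    return 0;
}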
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that io_cq is managed by block core and guaranteed to exist for
any in-flight request, it is easier and carries more information to
pass around cfq_io_cq than io_context.
This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and
cfq_get_queue() to take @cic instead of @ioc. This change removes a
duplicate cfq_cic_lookup() from cfq_find_alloc_queue().
This change enables the use of cic-cached ioprio in the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add 64bit unique id to blkcg. This will be used by policies which
want blkcg identity test to tell whether the associated blkcg has
changed.
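A minimal userspace sketch of the id-based identity test (illustrative
only; the struct and counter names are invented, not the blkcg code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hand out a 64bit id at creation so users can later answer "has the
 * associated group changed?" with a plain compare. */
static _Atomic uint64_t id_seq;

struct group {
    uint64_t id;    /* effectively never reused within a 64bit space */
};

static void group_init(struct group *g)
{
    g->id = atomic_fetch_add(&id_seq, 1) + 1;
}

int main(void)
{
    struct group a, b;
    uint64_t remembered;

    group_init(&a);
    group_init(&b);

    remembered = a.id;    /* cache the identity */
    printf("same group: %d\n", remembered == a.id);
    printf("different group: %d\n", remembered == b.id);
    return 0;
}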
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With recent plug merge updates, all non-percpu stat updates happen
under queue_lock making stats_lock unnecessary to synchronize stat
updates. The only synchronization necessary is stat reading, which
can be done using u64_stats_sync instead.
This patch removes blkio_group->stats_lock and adds
blkio_group_stats->syncp for reader synchronization.
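The reader-side retry idea behind u64_stats_sync can be sketched in
userspace roughly as follows (a simplified analog, not the kernel
implementation; the writer is assumed to already be serialized, as it is
by queue_lock in this patch):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct stats {
    _Atomic unsigned int seq;
    uint64_t bytes;
    uint64_t ios;
};

static void stats_add(struct stats *s, uint64_t bytes)    /* writer side */
{
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
    s->bytes += bytes;
    s->ios++;
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
}

static void stats_read(struct stats *s, uint64_t *bytes, uint64_t *ios)
{
    unsigned int start;

    do {    /* retry if we raced with an update instead of taking a lock */
        start = atomic_load_explicit(&s->seq, memory_order_acquire);
        *bytes = s->bytes;
        *ios = s->ios;
    } while (start & 1 ||
             start != atomic_load_explicit(&s->seq, memory_order_acquire));
}

int main(void)
{
    struct stats s = { 0 };
    uint64_t b, n;

    stats_add(&s, 4096);
    stats_read(&s, &b, &n);
    printf("%llu bytes over %llu ios\n",
           (unsigned long long)b, (unsigned long long)n);
    return 0;
}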
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Restructure blkio_get_stat() to prepare for removal of stats_lock.
* Define BLKIO_STAT_ARR_NR explicitly to denote which stats have
subtypes instead of using BLKIO_STAT_QUEUED.
* Separate out stat acquisition and printing. After this, there are
only two users of blkio_fill_stat(). Just open code it.
* The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1. There's no
need to subtract one. Use MAX_KEY_LEN consistently.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkiocg_reset_stats() implements stat reset for blkio.reset_stats
cgroupfs file. This feature is very unconventional and something
which shouldn't have been merged. It's only useful when there's only
one user or tool looking at the stats. As soon as multiple users
and/or tools are involved, it becomes useless as resetting disrupts
other usages. There are very good reasons why all other stats expect
readers to read values at the start and end of a period and subtract
to determine delta over the period.
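For illustration, the conventional delta-over-a-period reading that makes
a reset knob unnecessary (a trivial userspace sketch, names made up):

#include <stdint.h>
#include <stdio.h>

/* Each consumer snapshots the counter at the start and end of its own
 * period and subtracts; multiple readers never disturb each other. */
struct counter {
    uint64_t value;    /* monotonically increasing */
};

static uint64_t delta_over_period(const struct counter *c,
                                  uint64_t snapshot_at_start)
{
    return c->value - snapshot_at_start;
}

int main(void)
{
    struct counter ios = { .value = 1000 };
    uint64_t start = ios.value;    /* reader A snapshots */

    ios.value += 250;              /* activity during the period */
    printf("reader A saw %llu ios this period\n",
           (unsigned long long)delta_over_period(&ios, start));
    return 0;
}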
The implementation is rather complex - some fields shouldn't be
cleared, so it saves those fields, resets the whole thing and then
restores them, for some reason. Reset of percpu stats is also racy.
The comment points to 64bit store atomicity as the reason, but even
without that, stores of zero can simply race with other CPUs doing RMW
and get clobbered.
Simplify reset by
* Clearing selectively instead of resetting and restoring.
* Grouping debug stat fields to be reset and using memset() over them.
* Not caring about stats_lock.
* Using memset() to reset percpu stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With recent plug merge updates, merged stat updates are no longer done
for plug merges and now only happen while holding queue_lock. As
stats_lock is scheduled to be removed, there's no reason to keep the
merged stats percpu. Don't use percpu for merged stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current per cpu stat allocation assumes the GFP_KERNEL allocation flag,
but in the IO path there are times when we want GFP_NOIO semantics. As
there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.
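The deferral idea can be sketched in userspace with a plain thread
standing in for the worker (illustrative only; single group, no list
locking, all names invented):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* The hot path only queues the group; the worker allocates the stats
 * later and attaches them, where sleeping and retrying are fine. */
struct group_like {
    int *stats;    /* NULL until the worker fills it in */
};

static struct group_like pending_group;

static void *stat_alloc_worker(void *arg)
{
    (void)arg;
    /* safe to sleep / retry here, unlike in the IO submission path */
    pending_group.stats = calloc(16, sizeof(int));
    return NULL;
}

int main(void)
{
    pthread_t worker;

    /* "IO path": create the group but skip the allocation */
    pending_group.stats = NULL;

    pthread_create(&worker, NULL, stat_alloc_worker, NULL);
    pthread_join(&worker, NULL);

    printf("stats allocated: %s\n", pending_group.stats ? "yes" : "no");
    free(pending_group.stats);
    return 0;
}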
v2 -> Tejun suggested the following changes. Changed the patch accordingly.
- move alloc_node location in structure
- reduce the size of names of some of the fields
- Reduce the scope of locking of alloc_list_lock
- Simplified stat_alloc_fn() by allocating stats for all
policies in one go and then assigning these to a group.
v3 -> Andrew suggested to put some comments in the code. Also raised
concerns about trying to allocate infinitely in case of allocation
failure. I have changed the logic to sleep for 10ms before retrying.
That should take care of non-preemptible UP kernels.
v4 -> Tejun had more suggestions.
- drop list_for_each_entry_all()
- instead of msleep() use queue_delayed_work()
- Some cleanups related to more compact coding.
v5 -> Tejun suggested more cleanups leading to more compact code.
tj: - Relocated pcpu_stats into blkio_stat_alloc_fn().
- Minor comment update.
- This also fixes a suspicious RCU usage warning caused by invoking
cgroup_path() from blkg_alloc() without holding the RCU read lock.
Now that blkg_alloc() doesn't require sleepable context, RCU
read lock from blkg_lookup_create() is maintained throughout
blkg_alloc().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but lets get this out the door.
This fixes various important bugs for 3.3 final, along with a few more
trivial ones. Please pull!"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix ioc leak in put_io_context
block, sx8: fix pointer math issue getting fw version
Block: use a freezable workqueue for disk-event polling
drivers/block/DAC960: fix -Wuninitialized warning
drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
block: fix __blkdev_get and add_disk race condition
block: Fix setting bio flags in drivers (sd_dif/floppy)
block: Fix NULL pointer dereference in sd_revalidate_disk
block: exit_io_context() should call elevator_exit_icq_fn()
block: simplify ioc_release_fn()
block: replace icq->changed with icq->flags
Make blk-throttle call bio_associate_current() on bios being delayed
such that they get issued to block layer with the original io_context.
This allows stacking blk-throttle and cfq-iosched propio policies.
bios will always be issued with the correct ioc and blkcg whether it
gets delayed by blk-throttle or not.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement bio_blkio_cgroup() which returns the blkcg associated with
the bio if one exists, or %current's blkcg otherwise, and use it in
blk-throttle and
cfq-iosched propio. This makes both cgroup policies honor task
association for the bio instead of always assuming %current.
As nobody is using bio_set_task() yet, this doesn't introduce any
behavior change.
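A userspace sketch of the described fallback order (illustrative
stand-in types and names, not the kernel definitions):

#include <stdio.h>

/* Honor the blkcg recorded on the bio if there is one, otherwise fall
 * back to the blkcg of the issuing task. */
struct blkcg { const char *name; };
struct bio { struct blkcg *bi_blkcg; };
struct task { struct blkcg *blkcg; };

static struct blkcg *bio_blkcg_of(struct bio *bio, struct task *current_task)
{
    return bio->bi_blkcg ? bio->bi_blkcg : current_task->blkcg;
}

int main(void)
{
    struct blkcg root = { "root" }, grp = { "grp1" };
    struct task worker = { .blkcg = &root };
    struct bio plain = { NULL }, tagged = { &grp };

    printf("plain bio -> %s\n", bio_blkcg_of(&plain, &worker)->name);
    printf("tagged bio -> %s\n", bio_blkcg_of(&tagged, &worker)->name);
    return 0;
}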
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current. Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.
For example, all bios delayed by blk-throttle end up being issued by a
delayed work item, get assigned the io_context of whichever worker task
happens to serve the work item, and are dumped into the default block
cgroup. This is doubly confusing as bios which aren't delayed end up
in the correct cgroup, and it makes using blk-throttle and cfq propio
together impossible.
Any code which punts IO issuing to another task is affected, and this
is getting more and more common (e.g. btrfs). As both io_context and
cgroup are firmly tied to the task, including the userland visible APIs
to manipulate them, it makes a lot of sense to match up tasks to bios.
This patch implements bio_associate_current() which associates the
specified bio with %current. The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio. bio
release puts the associated ioc and blkcg.
It grabs and remembers ioc and blkcg instead of the task itself
because the task may already be dead by the time the bio is issued,
which would make ioc and blkcg inaccessible, and those are all the
block layer cares about.
elevator_set_req_fn() is updated such that the bio that elvdata is
being allocated for is available to the elevator.
This doesn't update block cgroup policies yet. Further patches will
implement the support.
-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
rq_ioc() to fix build breakage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separates out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.
This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.
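The split can be illustrated with a small userspace sketch (the two
counters and the print statements are illustrative stand-ins, not the
kernel code):

#include <stdatomic.h>
#include <stdio.h>

/* refcount keeps the object alive, active_ref tracks "still issuing IO". */
struct ioc_like {
    _Atomic int refcount;     /* lifetime of the structure */
    _Atomic int active_ref;   /* users that may still issue IO */
};

static void get_active(struct ioc_like *ioc)
{
    atomic_fetch_add(&ioc->active_ref, 1);
    atomic_fetch_add(&ioc->refcount, 1);
}

static void put_active(struct ioc_like *ioc)
{
    if (atomic_fetch_sub(&ioc->active_ref, 1) == 1)
        printf("last active user gone: no more IO from this context\n");
    if (atomic_fetch_sub(&ioc->refcount, 1) == 1)
        printf("last reference gone: free the structure\n");
}

int main(void)
{
    struct ioc_like ioc = { 1, 1 };    /* owning task holds both */

    get_active(&ioc);    /* e.g. a bio associated with this context */
    put_active(&ioc);    /* bio completes */
    put_active(&ioc);    /* owning task exits */
    return 0;
}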
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the following interface updates to prepare for future ioc related
changes.
* create_io_context() returning ioc only works for %current because it
doesn't increment ref on the ioc. Drop @task parameter from it and
always assume %current.
* Make create_io_context_slowpath() return 0 or -errno and rename it
to create_task_io_context().
* Make ioc_create_icq() take @ioc as parameter instead of assuming
that of %current. The caller, get_request(), is updated to create
ioc explicitly and then pass it into ioc_create_icq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
get_request() is structured a bit unusually in that the failure path is
inlined in the usual flow, with goto labels above and inside it.
Relocate the error path to the end of the function.
This is to prepare for icq handling changes in get_request() and
doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that blkg additions / removals are always done under both q and
blkcg locks, the only places RCU locking is necessary are
blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops
unnecessary RCU locking, replacing it with plain blkcg locking as
necessary.
* blkiocg_pre_destroy() already performs proper locking and doesn't
need RCU. Dropped.
* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
lock. This isn't a hot path.
* Now unnecessary synchronize_rcu() from queue exit paths removed.
This makes q->nr_blkgs unnecessary. Dropped.
* RCU annotation on blkg->q removed.
-v2: Vivek pointed out that blkg_lookup_create() still needs to be
called under rcu_read_lock(). Updated.
-v3: After the update, stats_lock locking in blkio_read_blkg_stats()
shouldn't be using _irq variant as it otherwise ends up enabling
irq while blkcg->lock is locked. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkgs are chained from both blkcgs and request_queues and thus
subjected to two locks - blkcg->lock and q->queue_lock. As both blkcg
and q can go away anytime, locking during removal is tricky. It's
currently solved by wrapping removal inside RCU, which makes the
synchronization complex. There are three locks to worry about - the
outer RCU, q lock and blkcg lock, and it leads to nasty subtle
complications like conditional synchronize_rcu() on queue exit paths.
For all other paths, blkcg lock is naturally nested inside q lock and
the only exception is blkcg removal path, which is a very cold path
and can be implemented as clumsy but conceptually-simple reverse
double lock dancing.
This patch updates blkg removal path such that blkgs are removed while
holding both q and blkcg locks, which is trivial for request queue
exit path - blkg_destroy_all(). The blkcg removal path,
blkiocg_pre_destroy(), implements reverse double lock dancing
essentially identical to ioc_release_fn().
This simplifies blkg locking - no half-dead blkgs to worry about. Now
unnecessary RCU annotations will be removed by the next patch.
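A userspace sketch of the reverse double lock dancing pattern using
pthread mutexes (illustrative only; the lock names merely hint at the
kernel locks involved):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* We already hold lock A but the locking order requires B before A, so
 * trylock B and, on failure, drop A and start over. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;  /* e.g. blkcg->lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;  /* e.g. q->queue_lock */

static void destroy_under_both_locks(void)
{
    for (;;) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0)
            break;                        /* got both, proceed */
        pthread_mutex_unlock(&lock_a);    /* back out and retry */
        sched_yield();
    }
    printf("holding both locks, safe to unlink\n");
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

int main(void)
{
    destroy_under_both_locks();
    return 0;
}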
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg is per cgroup-queue-policy combination. This is
unnatural and leads to various convolutions in partially used
duplicate fields in blkg, config / stat access, and general management
of blkgs.
This patch makes blkg's per cgroup-queue and lets them serve all
policies. blkgs are now created and destroyed by blkcg core proper.
This will allow further consolidation of common management logic into
blkcg core and API with better defined semantics and layering.
As a transitional step to untangle blkg management, elvswitch and
policy [de]registration, all blkgs except the root blkg are being shot
down during elvswitch and bypass. This patch adds blkg_root_update()
to update root blkg in place on policy change. This is hacky and racy
but should be good enough as interim step until we get locking
simplified and switch over to proper in-place update for all blkgs.
-v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
comment wasn't updated according to the function change. Fixed.
Both pointed out by Vivek.
-v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
all policies. This freed root pd during elvswitch before the
last queue finished exiting and led to oops. Directly invoke
update_root_blkg_pd() only on BLKIO_POLICY_PROP from
cfq_exit_queue(). This also is closer to what will be done with
proper in-place blkg update. Reported by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the previous patch to move blkg list heads and counters to
request_queue and blkg, the logic to manage them in both policies is
almost identical and can be moved to blkcg core.
This patch moves blkg link logic into blkg_lookup_create(), implements
common blkg unlink code in blkg_destroy(), and updates
blkg_destroy_all() so that it's policy specific and can skip the root
group. The updated blkg_destroy_all() is now used to both clear queue
for bypassing and elv switching, and release all blkgs on q exit.
This patch introduces a race window where policy [de]registration may
race against queue blkg clearing. This can only be a problem on cfq
unload and shouldn't be a real problem in practice (and we have many
other places where this race already exists). Future patches will
remove these unlikely races.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, specific policy implementations are responsible for
maintaining list and number of blkgs. This duplicates code
unnecessarily, and hinders factoring common code and providing blkcg
API with better defined semantics.
After this patch, request_queue hosts list heads and counters and blkg
has list nodes for both policies. This patch only relocates the
necessary fields and the next patch will actually move management code
into blkcg core.
Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
have 2 elements. This is to avoid include dependency and will be
removed by the next patch.
This patch doesn't introduce any behavior change.
-v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
as pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg is scheduled to be unified for all policies and thus there won't
be one-to-one mapping from blkg to policy. Update stat related
functions to take explicit @pol or @plid arguments and not use
blkg->plid.
This is painful for now but most of the specific stat interface functions
will be replaced with a handful of generic helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prepare for unifying blkgs for different policies, make blkg->pd an
array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats,
and ->stats_cpu into blkg_policy_data.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkcg policy implementations manage blkg refcnt duplicating
mostly identical code in both policies. This patch moves refcnt to
blkg and let blkcg core handle refcnt and freeing of blkgs.
* cfq blkgs now also get freed via RCU.
* cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If
necessary, we can add blkio_exit_group_fn() to resurrect this.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg's are embedded in each blkcg policy's private data
structure and thus allocated and freed by policies. This leads
to duplicate codes in policies, hinders implementing common part in
blkcg core with strong semantics, and forces duplicate blkg's for the
same cgroup-q association.
This patch introduces struct blkg_policy_data which is a separate data
structure chained from blkg. Each policy specifies the amount of
private data it needs in its blkio_policy_type->pdata_size and blkcg core
takes care of allocating them along with blkg which can be accessed
using blkg_to_pdata(). blkg can be determined from pdata using
pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated
to blkio_init_group_fn().
For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with
blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in
the reverse direction are added.
Except that policy specific data now lives in a separate data
structure from blkg, this patch doesn't introduce any functional
difference.
This will be used to unify blkg's for different policies.
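For illustration, the core-allocated blkg plus policy-sized private area
and the two conversion helpers can be sketched like this (a simplified
userspace analog; the layout and names are assumptions, not the kernel
definitions):

#include <stdio.h>
#include <stdlib.h>

/* The core allocates the generic part plus however many bytes the
 * policy asked for (pdata_size); helpers convert between the views. */
struct blkg_like {
    int id;
    /* policy private data follows immediately after */
};

struct policy_pdata {
    unsigned int weight;
};

static void *blkg_to_pdata(struct blkg_like *blkg)
{
    return blkg + 1;                       /* pdata sits right past blkg */
}

static struct blkg_like *pdata_to_blkg(void *pdata)
{
    return (struct blkg_like *)pdata - 1;
}

static struct blkg_like *blkg_alloc(size_t pdata_size)
{
    return calloc(1, sizeof(struct blkg_like) + pdata_size);
}

int main(void)
{
    struct blkg_like *blkg = blkg_alloc(sizeof(struct policy_pdata));
    struct policy_pdata *pd = blkg_to_pdata(blkg);

    pd->weight = 500;
    printf("round trip ok: %d\n", pdata_to_blkg(pd) == blkg);
    free(blkg);
    return 0;
}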
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep track of all request_queues which have blkcg initialized and turn
on bypass and invoke blkcg_clear_queue() on all before making changes
to blkcg policies.
This is to prepare for moving blkg management into blkcg core. Note
that this uses more brute force than necessary. Finer grained shoot
down will be implemented later and given that policy [un]registration
almost never happens on running systems (blk-throtl can't be built as
a module and cfq usually is the builtin default iosched), this
shouldn't be a problem for the time being.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently block core calls directly into blk-throttle for init, drain
and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps
the blk-throttle functions. This is to give more control and
visibility to blkcg core layer for proper layering. Further patches
will add logic common to blkcg policies to the functions.
While at it, collapse blk_throtl_release() into blk_throtl_exit().
There's no reason to keep them separate.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg points to the associated blkcg via its css_id. This
unnecessarily complicates dereferencing blkcg. Let blkg hold a
reference to the associated blkcg and point directly to it and disable
css_id on blkio_subsys.
This change requires splitting blkiocg_destroy() into
blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be
destroyed and all the blkcg references held by them dropped during
cgroup removal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-cgroup printing code currently assumes that there is a device/disk
associated with every queue in the system, but modules like floppy can
instantiate request queues without registering a disk, which can lead
to an oops.
Skip the queue/blkg which don't have dev/disk associated with them.
-tj: Factored out backing_dev_info check into blkg_dev_name().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg->dev is dev_t recording the device number of the block device for
the associated request_queue. It is used to identify the associated
block device when printing out configuration or stats.
This is redundant to begin with. A blkg is an association between a
cgroup and a request_queue and it of course is possible to reach
request_queue from blkg and synchronization conventions are in place
for safe q dereferencing, so this shouldn't be necessary from the
beginning. Furthermore, it's initialized by sscanf()ing the device
name of backing_dev_info. The mind boggles.
Anyways, if blkg is visible under rcu lock, we *know* that the
associated request_queue hasn't gone away yet and its bdi is
registered and alive - blkg can't be created for request_queue which
hasn't been fully initialized and it can't go away before blkg is
removed.
Let stat and conf read functions get device name from
blkg->q->backing_dev_info.dev and pass it down to printing functions
and remove blkg->dev.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that blkcg configuration lives in blkg's, blkio_policy_node is no
longer necessary. Kill it.
blkio_policy_parse_and_set() now fails if invoked for missing device
and functions to print out configurations are updated to print from
blkg's.
cftype_blkg_same_policy() is dropped along with other policy functions
for consistency. Its one line is open coded in the only user -
blkio_read_blkg_stats().
-v2: Update to reflect the retry-on-bypass logic change of the
previous patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg is very peculiar in that it allows setting and remembering
configurations for non-existent devices by maintaining separate data
structures for configuration.
This behavior is completely out of the usual norms and outright
confusing; furthermore, it uses dev_t number to match the
configuration to devices, which is unpredictable to begin with and
becomes completely unusable if EXT_DEVT is fully used.
It is wholly unnecessary - we already have a fully functional userland
mechanism to program devices being hotplugged which has full access to
device identification, connection topology and filesystem information.
Add a new struct blkio_group_conf, which contains all blkcg
configurations, to blkio_group and let blkio_group - which can be
created iff the associated device exists and is removed when the
device goes away - carry all configurations.
Note that, after this patch, all newly created blkg's will always have
the default configuration (unlimited for throttling and blkcg's weight
for propio).
This patch makes blkio_policy_node meaningless but doesn't remove it.
The next patch will.
-v2: Updated to retry after short sleep if blkg lookup/creation failed
due to the queue being temporarily bypassed as indicated by
-EBUSY return. Pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently both blk-throttle and cfq-iosched implement their own
blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This
patch factors out the common code into blkg_lookup_create(), which
returns ERR_PTR value so that transitional failures due to queue
bypass can be distinguished from other failures.
* New blkio_policy_ops methods blkio_alloc_group_fn() and
blkio_link_group_fn() added. Both are transitional and will be
removed once the blkg management code is fully moved into
blk-cgroup.c.
* blkio_alloc_group_fn() allocates policy-specific blkg which is
usually a larger data structure with blkg as the first entry and
initializes it. Note that initialization of blkg proper, including
percpu stats, is the responsibility of blk-cgroup proper.
Note that default config (weight, bps...) initialization is done
from this method; otherwise, we end up violating locking order
between blkcg and q locks via blkcg_get_CONF() functions.
* blkio_link_group_fn() is called under queue_lock and responsible for
linking the blkg to the queue. blkcg side is handled by blk-cgroup
proper.
* The common blkg creation function is named blkg_lookup_create() and
blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
Also, throtl / cfq related functions are similarly [re]named for
consistency.
This simplifies blkcg policy implementations and enables further
cleanup.
-v2: Vivek noticed that blkg_lookup_create() incorrectly tested
blk_queue_dead() instead of blk_queue_bypass(), leading to a user of
the function creating a new blkg on a bypassing queue.
This is a bug introduced while relocating bypass patches before
this one. Fixed.
-v3: ERR_PTR patch folded into this one. @for_root added to
blkg_lookup_create() to allow creating root group on a bypassed
queue during elevator switch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For root blkg, blk_throtl_init() was using throtl_alloc_tg()
explicitly and cfq_init_queue() was manually initializing embedded
cfqd->root_group, adding unnecessarily different code paths to blkg
handling.
Make both use the usual blkio_group get functions - throtl_get_tg()
and cfq_get_cfqg() - for the root blkio_group too. Note that
blk_throtl_init() callsite is pushed downwards in
blk_alloc_queue_node() so that @q is sufficiently initialized for
throtl_get_tg().
This simplifies root blkg handling noticeably for cfq and will allow
further modularization of blkcg API.
-v2: Vivek pointed out that using cfq_get_cfqg() won't work if
CONFIG_CFQ_GROUP_IOSCHED is disabled. Fix it by factoring out
initialization of base part of cfqg into cfq_init_cfqg_base() and
alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block cgroup policies are maintained in a linked list and,
theoretically, multiple policies sharing the same policy ID are
allowed.
This patch temporarily restricts one policy per plid and adds
blkio_policy[] array which indexes registered policy types by plid.
Both the restriction and blkio_policy[] array are transitional and
will be removed once API cleanup is complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkio_group is an association between a block cgroup and a queue for a
given policy. Using opaque void * for association makes things
confusing and hinders factoring of common code. Use request_queue *
and, if necessary, policy id instead.
This will help block cgroup API cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In both blkg get functions - throtl_get_tg() and cfq_get_cfqg(),
instead of obtaining blkcg of %current explicitly, let the caller
specify the blkcg to use as parameter and make both functions hold on
to the blkcg.
This is part of block cgroup interface cleanup and will help making
blkcg API more modular.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rcu_read_lock() in throtl_get_tg() and cfq_get_cfqg() holds onto
@blkcg while looking up blkg. For API cleanup, the next patch will
make the caller responsible for determining the @blkcg to look up blkg
from and let them specify it as a parameter. Move rcu read locking out to
the callers to prepare for the change.
-v2: Originally this patch was described as a fix for RCU read locking
bug around @blkg, which Vivek pointed out to be incorrect. It
was from misunderstanding the role of rcu locking as protecting
@blkg not @blkcg. Patch description updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator switch may involve changes to blkcg policies. Implement
shoot down of blkio_groups.
Combined with the previous bypass updates, the end goal is updating
blkcg core such that it can ensure that blkcg's being affected become
quiescent and don't have any per-blkg data hanging around before
commencing any policy updates. Until queues are made aware of the
policies that applies to them, as an interim step, all per-policy blkg
data will be shot down.
* blk-throtl doesn't need this change as it can't be disabled for a
live queue; however, update it anyway as the scheduled blkg
unification requires this behavior change. This means that
blk-throtl configuration will be unnecessarily lost over elevator
switch. This oddity will be removed after blkcg learns to associate
individual policies with request_queues.
* blk-throtl doesn't shoot down root_tg. This is to ease transition.
Unified blkg will always have a persistent root group and not shooting
down root_tg for now eases transition to that point by avoiding
having to update td->root_tg, and is safe as blk-throtl can never be
disabled.
-v2: Vivek pointed out that the group list is not guaranteed to be empty
on return from the clear function if it raced against cgroup removal and
lost. Fix it by waiting a bit and retrying. This kludge will
soon be removed once locking is updated such that blkg is never
in limbo state between blkcg and request_queue locks.
blk-throtl no longer shoots down root_tg to avoid breaking
td->root_tg.
Also, nest queue_lock inside blkio_list_lock, not the other way
around, to avoid introducing a possible deadlock via the blkcg lock.
-v3: blkcg_clear_queue() repositioned and renamed to
blkg_destroy_all() to increase consistency with later changes.
cfq_clear_queue() updated to check q->elevator before
dereferencing it to avoid NULL dereference on not fully
initialized queues (used by later change).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extend queue bypassing such that dying queue is always bypassing and
blk-throttle is drained on bypass. With blkcg policies updated to
test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
no bio or request is held by or going through blkcg policies on a
bypassing queue.
This will be used to implement blkg cleanup on elevator switches and
policy changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename and extend elv_quiesce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
state.
This will be further extended and used for blkio_group management.
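The nesting semantics can be sketched as a simple depth counter
(illustrative userspace analog; the real functions also drain the queue
under the queue lock):

#include <stdio.h>

/* Bypass stays in effect until every start has been paired with an end. */
struct queue_like {
    int bypass_depth;
};

static void bypass_start(struct queue_like *q)
{
    q->bypass_depth++;
}

static void bypass_end(struct queue_like *q)
{
    q->bypass_depth--;
}

static int queue_bypassing(const struct queue_like *q)
{
    return q->bypass_depth > 0;
}

int main(void)
{
    struct queue_like q = { 0 };

    bypass_start(&q);    /* e.g. elevator switch begins */
    bypass_start(&q);    /* nested user, e.g. policy change */
    bypass_end(&q);
    printf("still bypassing: %d\n", queue_bypassing(&q));    /* 1 */
    bypass_end(&q);
    printf("still bypassing: %d\n", queue_bypassing(&q));    /* 0 */
    return 0;
}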
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_ops->elevator_init_fn() has a weird return value. It returns
a void * which the caller should assign to q->elevator->elevator_data,
and a %NULL return denotes init failure.
Update such that it returns integer 0/-errno and sets elevator_data
directly as necessary.
This makes the interface more conventional and eases further cleanup.
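The calling-convention change can be illustrated with a small userspace
sketch (all names here are made up; the init function returns 0/-errno
and installs the data itself instead of returning a pointer):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct elevator_like {
    void *elevator_data;
};

static int my_init_fn(struct elevator_like *e)
{
    void *data = malloc(64);

    if (!data)
        return -ENOMEM;         /* conventional error return */
    e->elevator_data = data;    /* set directly, caller checks rc only */
    return 0;
}

int main(void)
{
    struct elevator_like e = { 0 };
    int ret = my_init_fn(&e);

    printf("init %s\n", ret ? "failed" : "succeeded");
    free(e.elevator_data);
    return 0;
}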
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator switch tries hard to keep as much context as possible until
the new elevator is ready so that it can revert to the original state if
initializing the new elevator fails for some reason. Unfortunately,
with more auxiliary contexts to manage, this makes elevator init and
exit paths too complex and fragile.
This patch makes elevator_switch() unregister the current elevator and
flush icq's before start initializing the new one. As we still keep
the old elevator itself, the only difference is that we lose icq's on
rare occasions of switching failure, which isn't critical at all.
Note that this makes explicit elevator parameter to
elevator_init_queue() and __elv_register_queue() unnecessary as they
always can use the current elevator.
This patch enables block cgroup cleanups.
-v2: blk_add_trace_msg() prints elevator name from @new_e instead of
@e->type as the local variable no longer exists. This caused
build failure on CONFIG_BLK_DEV_IO_TRACE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq has been registering a zeroed blkio_policy_cfq if CFQ_GROUP_IOSCHED
is disabled. This fortunately doesn't collide with blk-throtl as
BLKIO_POLICY_PROP is zero but is unnecessary and risky. Just don't
register it if not enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block cgroup core can be built as module; however, it isn't too useful
as blk-throttle can only be built-in and cfq-iosched is usually the
default built-in scheduler. Scheduled blkcg cleanup requires calling
into blkcg from block core. To simplify that, disallow building blkcg
as module by making CONFIG_BLK_CGROUP bool.
If building blkcg core as module really matters, which I doubt, we can
revisit it after blkcg API cleanup.
-v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
depend on BLK_CGROUP. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if
q->elevator doesn't exist; however, bio based drivers don't have
elevator initialized but can still use blk-throttle. This patch moves
q->elevator test inside blk_drain_queue() such that only
elv_drain_elevator() is skipped if !q->elevator.
-v2: loop can have a registered queue which has a NULL request_fn. Make
sure we don't call into __blk_run_queue() in such cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Fold in bug fix from Vivek.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following situation might occur:
__blkdev_get:                    add_disk:
                                   register_disk()
  get_gendisk()
  disk_block_events()
    disk->ev == NULL
                                   disk_add_events()
  __disk_unblock_events()
    disk->ev != NULL
    --ev->block
Then we unblock events, when they are supposed to be blocked. This can
trigger the events related warnings in block/genhd.c, but it can also
crash in sd_check_events() or other places.
I'm able to reproduce crashes with the following script (with a
USB dongle connected as the sdb disk).
<snip>
DEV=/dev/sdb
ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue
function stop_me()
{
for i in `jobs -p` ; do kill $i 2> /dev/null ; done
exit
}
trap stop_me SIGHUP SIGINT SIGTERM
for ((i = 0; i < 10; i++)) ; do
while true; do fdisk -l $DEV 2>&1 > /dev/null ; done &
done
while true ; do
echo 1 > $ENABLE
sleep 1
echo 0 > $ENABLE
done
</snip>
I use the script to verify patch fixing oops in sd_revalidate_disk
http://marc.info/?l=linux-scsi&m=132935572512352&w=2
Without Jun'ichi Nomura's patch titled "Fix NULL pointer dereference in
sd_revalidate_disk" or this one, the script easily crashes the kernel
within a few seconds. With both patches applied I do not observe a crash.
Unfortunately, after some time (a dozen minutes or so), the script will
hang in:
[ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
[ 1563.906437] [<c04532d5>] msleep+0x15/0x20
[ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0
[ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
[ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60
[ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
[ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60
[ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0
[ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
[ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]
Anyway, I think this patch is a step forward.
As a drawback, I do not tear down on sysfs file creation error, because
I do not know how to nullify disk->ev (since it can already be in use).
However, add_disk error handling practically does not exist either, and
things will work without this sysfs file, except that events will not be
exported to user space.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise a KOBJ_CHANGE uevent.
However it ends up calling the driver's revalidate_disk without open
and can cause an oops.
In the case of SCSI:
process A                      process B
----------------------------------------------
sys_open
  __blkdev_get
    sd_open
      returns -ENOMEDIUM
                               scsi_remove_device
                                 <scsi_device torn down>
    rescan_partitions
      sd_revalidate_disk
        <oops>
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052
This patch separates the partition invalidation from rescan_partitions()
and uses it for the -ENOMEDIUM case.
Reported-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From: Ben Hutchings <ben@decadent.org.uk>
Extended VBLKs (those larger than the preset VBLK size) are divided
into fragments, each with its own VBLK header. Our LDM implementation
generally assumes that each VBLK is contiguous in memory, so these
fragments must be assembled before further processing.
Currently the reassembly seems to be done quite wrongly - no VBLK
header is copied into the contiguous buffer, and the length of the
header is subtracted twice from each fragment. Also the total
length of the reassembled VBLK is calculated incorrectly.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
While updating locking, b2efa05265 "block, cfq: unlink
cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation
from exit_io_context() to the final ioc put. While this doesn't cause
catastrophic failure, it effectively removes task exit notification to
the elevator and causes noticeable IO performance degradation with CFQ.
On task exit, CFQ used to immediately expire the slice if it was being
used by the exiting task as no more IO would be issued by the task;
however, after b2efa05265, the notification is lost and disk could sit
idle needlessly, leading to noticeable IO performance degradation for
certain workloads.
This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
from exit_io_context(). ICQ_EXITED flag is added to avoid invoking
the callback more than once for the same icq.
Walking icq_list from ioc side and invoking elevator callback requires
reverse double locking. This may be better implemented using RCU;
unfortunately, using RCU isn't trivial. e.g. RCU protection would
need to cover request_queue and queue_lock switch on cleanup makes
grabbing queue_lock from RCU unsafe. Reverse double locking should
do, at least for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-bisected-by: Shaohua Li <shli@kernel.org>
LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com>
Tested-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reverse double lock dancing in ioc_release_fn() can be simplified by
just using trylock on the queue_lock and backing out of the ioc lock on
trylock failure. Simplify it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and
access it under ioc->lock instead of using atomic bitops.
ioc_get_changed() is added so that the changed part can be fetched and
cleared as before.
icq->flags will be used to carry other flags.
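A userspace sketch of the lock-protected fetch-and-clear described above
(illustrative names; the flag bits and the lock are stand-ins, not the
kernel definitions):

#include <pthread.h>
#include <stdio.h>

#define FLAG_IOPRIO_CHANGED    (1u << 0)
#define FLAG_CGROUP_CHANGED    (1u << 1)

struct icq_like {
    pthread_mutex_t lock;    /* stands in for ioc->lock */
    unsigned int flags;
};

static unsigned int get_changed(struct icq_like *icq, unsigned int mask)
{
    unsigned int changed;

    pthread_mutex_lock(&icq->lock);
    changed = icq->flags & mask;
    icq->flags &= ~mask;        /* fetched bits are cleared */
    pthread_mutex_unlock(&icq->lock);
    return changed;
}

int main(void)
{
    struct icq_like icq = { PTHREAD_MUTEX_INITIALIZER, FLAG_IOPRIO_CHANGED };

    printf("first read:  %#x\n", get_changed(&icq, ~0u));    /* 0x1 */
    printf("second read: %#x\n", get_changed(&icq, ~0u));    /* 0 */
    return 0;
}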
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11a3122f6c "block: strip out locking optimization in put_io_context()"
removed the ioc_lock depth lockdep annotation along with the locking
optimization; however, while recursing from put_io_context() is no
longer possible, ioc_release_fn() may still end up putting the last
reference of another ioc through the elevator, which will grab ioc->lock
triggering a spurious (as the ioc is always a different one) A-A deadlock
warning.
As this can only happen one time from ioc_release_fn(), using non-zero
subclass from ioc_release_fn() is enough. Use subclass 1.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We create "bsg" link if q->kobj.sd is not NULL, so remove it only
when the same condition is true.
Fixes:
WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
sysfs: can not remove 'bsg', no directory
Call Trace:
[<c0429683>] warn_slowpath_common+0x6a/0x7f
[<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
[<c042970b>] warn_slowpath_fmt+0x2b/0x2f
[<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
[<c053969a>] sysfs_remove_link+0x20/0x23
[<c05d88f1>] bsg_unregister_queue+0x40/0x6d
[<c0692263>] __scsi_remove_device+0x31/0x9d
[<c069149f>] scsi_forget_host+0x41/0x52
[<c0689fa9>] scsi_remove_host+0x71/0xe0
[<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
[<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
[<c06c29de>] usb_unbind_interface+0x4e/0x109
[<c067a80f>] __device_release_driver+0x6b/0xa6
[<c067a861>] device_release_driver+0x17/0x22
[<c067a46a>] bus_remove_device+0xd6/0xe6
[<c06785e2>] device_del+0xf2/0x137
[<c06c101f>] usb_disable_device+0x94/0x1a0
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that the elevator is guaranteed to be
there through the existing request on the plug list, nothing prevents
plug merge from calling into a dying or initializing elevator.
For regular merges, bypass ensures that the elvpriv count reaches zero,
which in turn prevents merges as all !ELVPRIV requests get
REQ_SOFTBARRIER from forced back insertion. Plug merge doesn't check
ELVPRIV, and, as the requests haven't gone through elevator insertion
yet, they don't have SOFTBARRIER set, allowing merges on a bypassed
queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.
elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.
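The two-step test can be sketched as follows (a toy userspace analog;
the eligibility check and the sector math are simplified stand-ins for
the real checks):

#include <stdbool.h>
#include <stdio.h>

enum merge_dir { NO_MERGE, FRONT_MERGE, BACK_MERGE };

struct req_like { unsigned long start, len; };
struct bio_like { unsigned long start, len; };

/* stand-in for the elevator-neutral eligibility checks (flags, limits, ...) */
static bool rq_merge_ok(const struct req_like *rq, const struct bio_like *bio)
{
    return rq->len && bio->len;
}

/* only decides the merge direction; assumes eligibility was checked */
static enum merge_dir try_merge(const struct req_like *rq,
                                const struct bio_like *bio)
{
    if (rq->start + rq->len == bio->start)
        return BACK_MERGE;
    if (bio->start + bio->len == rq->start)
        return FRONT_MERGE;
    return NO_MERGE;
}

int main(void)
{
    struct req_like rq = { 100, 8 };
    struct bio_like bio = { 108, 8 };

    if (rq_merge_ok(&rq, &bio))
        printf("direction: %d\n", try_merge(&rq, &bio));    /* BACK_MERGE */
    return 0;
}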
This is to prepare for plug merge update and doesn't introduce any
behavior change.
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
put_io_context() performed complex trylock dancing to avoid
deferring ioc release to a workqueue. It was also broken on UP because
trylock was always assumed to succeed, which resulted in an unbalanced
preemption count.
While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.
Now only ->populate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().
So we reduce a few lines of code, though the shrinking of object size
is minimal.
16 files changed, 113 insertions(+), 162 deletions(-)
text data bss dec hex filename
5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
5486170 656987 7039960 13183117 c9288d vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The block layer has some code trying to determine if two CPUs share a
cache; the scheduler has a similar function. Expose the function used
by the scheduler and make the block layer use it, thereby removing the
block layer's usage of CONFIG_SCHED* and topology bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins
cfq_slice_expired will change saved_workload_slice. It should be called
first so saved_workload_slice is correctly set to 0 after workload type
is changed.
This fixes the code order changed by 54b466e44b.
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...
Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)
This reverts commit 274193224c.
We have some problems related to the selection of empty queues
that need to be resolved; evidence so far points to the
recursive merge logic either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
will pass the command to the underlying block device. This is
well-known, but it is also a large security problem when (via Unix
permissions, ACLs, SELinux or a combination thereof) a program or user
needs to be granted access only to part of the disk.
This patch lets partitions forward a small set of harmless ioctls;
others are logged with printk so that we can see which ioctls are
actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred.
Of course it was being sent to a (partition on a) hard disk, so it would
have failed with ENOTTY and the patch isn't changing anything in
practice. Still, I'm treating it specially to avoid spamming the logs.
In principle, this restriction should include programs running with
CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and
/dev/sdb, it still should not be able to read/write outside the
boundaries of /dev/sda2 independent of the capabilities. However, for
now programs with CAP_SYS_RAWIO will still be allowed to send the
ioctls. Their actions will still be logged.
This patch does not affect the non-libata IDE driver. That driver
however already tests for bd != bd->bd_contains before issuing some
ioctl; it could be restricted further to forbid these ioctls even for
programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Make it also print the command name when warning - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a wrapper around scsi_cmd_ioctl that takes a block device.
The function will then be enhanced to detect partition block devices
and, in that case, subject the ioctls to whitelisting.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce an ioctl which permits applications to query whether a block
device is rotational.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.
This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.
Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
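A sketch of what such a helper could look like; the exact set of relaxed
fields is illustrative:

    void blk_set_stacking_limits(struct queue_limits *lim)
    {
        blk_set_default_limits(lim);

        /* a stacking driver imposes no ceilings of its own; the
         * effective limits come from the component devices below */
        lim->max_hw_sectors = UINT_MAX;
        lim->max_sectors = UINT_MAX;
        lim->max_segments = USHRT_MAX;
        lim->max_segment_size = UINT_MAX;
    }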
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.
ENOIOCTLCMD is not an error return that should ever reach user mode
from the "ioctl()" system call, and when it does have to be translated
it should *not* become EINVAL ("Invalid argument") but ENOTTY
("Inappropriate ioctl for device").
That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad. We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually. In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
both callers of device_get_devnode() are only interested in the lower
16 bits, and nobody tries to return anything wider than 16 bits anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.
Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 5e081591 "block: warn if tag is greater than real_max_depth"
cleaned up blk_queue_end_tag() to warn when the tag is truly invalid
(greater than real_max_depth). However, it changed behavior in the tag <
max_depth case to not end the request, leading to the
BUG_ON(blk_queued_rq(rq)) in the request completion path being triggered:
http://marc.info/?l=linux-kernel&m=132204370518629&w=2
In order to allow blk_queue_resize_tags() to shrink the tag space
blk_queue_end_tag() must always complete tags with a value less than
real_max_depth regardless of the current max_depth. The comment about
"handling the shrink case" seems to be what prompted changes in this
space, so remove it and BUG on all invalid tags (made even simpler by
Matthew's suggestion to use an unsigned compare).
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Reported-by: Meelis Roos <mroos@ut.ee>
Reported-by: Ed Nadolski <edmund.nadolski@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6e736be7 "block: make ioc get/put interface more conventional and fix
race on alloction" added WARN_ON_ONCE() in exit_io_context() which
triggers if !PF_EXITING. All tasks hitting exit_io_context() from
task exit should have PF_EXITING set but task struct tearing down
after fork failure calls into the function without PF_EXITING,
triggering the condition.
WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
Call Trace:
[<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2
[<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a
[<ffffffff8181a7a2>] exit_io_context+0x40/0x92
[<ffffffff810b58c9>] copy_process+0x126f/0x1453
[<ffffffff810b5c1b>] do_fork+0x120/0x3e9
[<ffffffff8106242f>] sys_clone+0x26/0x28
[<ffffffff82425803>] stub_clone+0x13/0x20
---[ end trace a2e4eb670b375238 ]---
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context. This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.
Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.
Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
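The idea can be sketched as a narrowed check in the ioc creation path
(placement and exact form illustrative): only refuse creation for an
exiting task when somebody else is asking on its behalf.

    if (unlikely((task->flags & PF_EXITING) && task != current))
        return NULL;    /* an exiting %current may still need an ioc
                           for IOs issued from its own exit path */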
With the ioc changed, ioc_cgroup_changed() can be used by modular
code. So ensure that it is exported.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All requests of a queue could be merged into requests of another queue.
Such a queue then has no requests of its own but is still on the service
tree, which causes a kernel oops.
I encountered a BUG_ON() in cfq_dispatch_request() with the next patch,
but the issue exists even without it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In my workload, thread 1 accesses a, a+2, ..., and thread 2 accesses a+1,
a+3, .... When the requests are flushed to the queue, a and a+1 are merged
into (a, a+1) and a+2 and a+3 into (a+2, a+3), but (a, a+1) and (a+2, a+3)
are not merged with each other.
With the recursive merge below, workload throughput improves by 20% and
context switches drop by 60%.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All requests of a queue could be merged into requests of another queue.
Such a queue then has no requests of its own but is still on the service
tree, which causes a kernel oops.
I encountered a BUG_ON() in cfq_dispatch_request() with the next patch,
but the issue exists even without it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While probing, fd sets up the queue, probes the hardware and tears down
the queue if probing fails. In the process, blk_drain_queue() kicks the
queue which failed to finish initialization, and fd is unhappy about
that.
floppy0: no floppy controllers found
------------[ cut here ]------------
WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
Hardware name: To Be Filled By O.E.M.
VFS: do_fd_request called on non-open device
Modules linked in:
Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
Call Trace:
[<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
[<ffffffff813d657f>] do_fd_request+0xbf/0xd0
[<ffffffff81322b95>] blk_drain_queue+0x65/0x80
[<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
[<ffffffff818a809d>] floppy_init+0xdeb/0xe28
[<ffffffff818a72b2>] ? daring+0x6b/0x6b
[<ffffffff810002af>] do_one_initcall+0x3f/0x170
[<ffffffff81884b34>] kernel_init+0x9d/0x11e
[<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
[<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
[<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
[<ffffffff815dbb10>] ? gs_change+0xb/0xb
Avoid it by making blk_drain_queue() kick the queue iff the dispatch
queue has something on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
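The fix boils down to a conditional kick inside the drain loop (sketch,
with queue_lock held):

    /* a queue that failed driver init may have never been started;
     * only kick it if the dispatch list actually has work on it */
    if (!list_empty(&q->queue_head))
        __blk_run_queue(q);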
Now block layer knows everything necessary to create and associate
icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request(). io_context reference is also managed by block core
on request alloc/free.
* Only ioprio/cgroup changed handling remains from cfq_get_cic().
Collapsed into cfq_set_request().
* This removes queue kicking on icq allocation failure (for now). As
icq allocation failure is rare and the only effect queue kicking
achieved was possibly accelerating queue processing, this change
shouldn't be noticeable.
There is a larger underlying problem. Unlike request allocation,
icq allocation is not guaranteed to succeed eventually after
retries. The number of icqs is unbounded and thus mempool can't be the
solution either. This effectively adds allocation dependency on
memory free path and thus possibility of deadlock.
This usually wouldn't happen because icq allocation is not a hot
path and, even when the condition triggers, it's highly unlikely
that none of the writeback workers already has icq.
However, this is still possible especially if elevator is being
switched under high memory pressure, so we better get it fixed.
Probably the only solution is just bypassing elevator and appending
to dispatch queue on any elevator allocation failure.
* Comment added to explain how icq's are managed and synchronized.
This completes cleanup of io_context interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add elevator_ops->elevator_init_icq_fn() and restructure
cfq_create_cic() and rename it to ioc_create_icq().
The new function expects its caller to pass in io_context, uses
elevator_type->icq_cache, handles generic init, calls the new elevator
operation for elevator specific initialization, and returns pointer to
created or looked up icq. This leaves cfq_icq_pool variable without
any user. Removed.
This prepares for io_context interface cleanup and doesn't introduce
any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc. The elevator
operation only needs to perform the exit operation specific to the
elevator - in cfq's case, exiting the cfqq's.
Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.
Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released. New field
io_cq->__rcu_icq_cache is added for this purpose. As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let elevators set ->icq_size and ->icq_align in elevator_type, and have
elv_register() and elv_unregister() respectively create and destroy the
kmem_cache for icq.
* elv_register() now can return failure. All callers updated.
* icq caches are automatically named "ELVNAME_io_cq".
* cfq_slab_setup/kill() are collapsed into cfq_init/exit().
* While at it, minor indentation change for iosched_cfq.elevator_name
for consistency.
This will help moving icq management to block core. This doesn't
introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
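The cache setup in elv_register() can be sketched as below, assuming the
new icq_size/icq_align fields described above (the name buffer handling
is illustrative):

    /* inside elv_register(), before adding the elevator to elv_list */
    if (e->icq_size) {
        snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
                 "%s_io_cq", e->elevator_name);
        e->icq_cache = kmem_cache_create(e->icq_cache_name, e->icq_size,
                                         e->icq_align, 0, NULL);
        if (!e->icq_cache)
            return -ENOMEM;
    }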
Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.
Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes a thin wrapper around
ioc_lookup_icq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.
* Move cfqd->icq_list to request_queue->icq_list
* Make request explicitly point to icq instead of through elevator
private data. ->elevator_private[3] is replaced with sub struct elv
which contains icq pointer and priv[2]. cfq is updated accordingly.
* Meaningless clearing of ->elevator_private[0] removed from
elv_set_request(). At that point in code, the field was guaranteed
to be %NULL anyway.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io_context and cfq logics are mixed without clear boundary.
Most of io_context is independent from cfq but cfq_io_context handling
logic is dispersed between generic ioc code and cfq.
cfq_io_context represents association between an io_context and a
request_queue, which is a concept useful outside of cfq, but it also
contains fields which are useful only to cfq.
This patch takes out the generic part and puts it into io_cq (io
context-queue) and the rest into cfq_io_cq (cic moniker remains the
same) which contains io_cq. The following changes are made together.
* cfq_ttime and cfq_io_cq now live in cfq-iosched.c.
* All related fields, functions and constants are renamed accordingly.
* ioc->ioc_data is now "struct io_cq *" instead of "void *" and
renamed to icq_hint.
This prepares for io_context API cleanup. Documentation is currently
sparse. It will be added later.
Changes in this patch are mechanical and don't cause functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_queue->ops points to the same ops struct that
->elevator_type.ops is pointing to. The only effect of caching it in
elevator_queue is shorter notation - it doesn't save any indirect
dereference.
Relocate elevator_type->list, which is used only during module
init/exit, to the end of the structure, rename
elevator_queue->elevator_type to ->type, and replace
elevator_queue->ops with elevator_queue->type.ops.
This doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator. However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence. It can just do "register,
detach, attach".
* elevator_init_queue() is updated to set ->elevator_data directly and
return 0 / -errno. This allows elevator_exit() on an unattached
elevator.
* __elv_unregister_queue() which was necessary to unregister
unattached q is removed in favor of __elv_register_queue() which can
register unattached q.
* elevator_attach() becomes a single assignment and obscures more than
it helps. Dropped.
This will help cleaning up io_context handling across elevator switch.
This patch doesn't introduce visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path. This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.
Given the restriction, accessor + creator rolled into one doesn't work
too well. Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().
Future ioc updates will further consolidate ioc access and the create
interface will be unexported.
While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.
This patch does not introduce functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that lazy paths are removed, cfqd_dead_key() is meaningless and
cic->q can be used wherever cic->key is used. Kill cic->key.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that cic's are immediately unlinked under both locks, there's no
need to count and drain cic's before module unload. RCU callback
completion is waited for with rcu_barrier().
While at it, remove residual RCU operations on cic_list.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary. Kill them and remove now unused elevator_ops->trim().
This also leaves call_for_each_cic() without any user. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cic is association between io_context and request_queue. A cic is
linked from both ioc and q and should be destroyed when either one
goes away. As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.
Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock but the radix tree and cic's are
also protected by RCU allowing either side to walk their lists without
grabbing lock.
This rather unconventional use of RCU quickly devolves into an
extremely fragile convolution. For example, the following is from a
cfqd going away too soon after ioc and q exits raced.
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
[ 88.503444]
Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
...
Call Trace:
[<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
[<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
[<ffffffff81389130>] exit_io_context+0x100/0x140
[<ffffffff81098a29>] do_exit+0x579/0x850
[<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
[<ffffffff81098de7>] sys_exit_group+0x17/0x20
[<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
The only real hot path here is cic lookup during request
initialization and avoiding extra locking requires very confined use
of RCU. This patch makes cic removal from both ioc and request_queue
perform double-locking and unlink immediately.
* From q side, the change is almost trivial as ioc->lock nests inside
queue_lock. It just needs to grab each ioc->lock as it walks
cic_list and unlink it.
* From ioc side, it's a bit more difficult because of the inverted lock
order. ioc needs its lock to walk its cic_list but can't grab the
matching queue_lock and needs to perform unlock-relock dancing.
Unlinking is now wholly done from put_io_context() and fast path is
optimized by using the queue_lock the caller already holds, which is
by far the most common case. If the ioc accessed multiple devices,
it tries with trylock. In unlikely cases of fast path failure, it
falls back to full double-locking dance from workqueue.
Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than RCU trick without adding any
meaningful overhead.
This still leaves a lot of now unnecessary RCU logics. Future patches
will trim them.
-v2: Vivek pointed out that cic->q was being dereferenced after
cic->release() was called. Updated to use local variable @this_q
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* cfq_cic_lookup() may be called without queue_lock and multiple tasks
can execute it simultaneously for the same shared ioc. Nothing
prevents them racing each other and trying to drop the same dead cic
entry multiple times.
* smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing
prevents cfq_cic_lookup() seeing stale cic->key. This usually
doesn't blow up because by the time cic is exited, all requests have
been drained and new requests are terminated before going through
elevator. However, it can still be triggered by plug merge path
which doesn't grab queue_lock and thus can't check DEAD state
reliably.
This patch updates lookup locking such that,
* Lookup is always performed under queue_lock. This doesn't add any
more locking. The only issue is cfq_allow_merge() which can be
called from plug merge path without holding any lock. For now, this
is worked around by using cic of the request to merge into, which is
guaranteed to have the same ioc. For longer term, I think it would
be best to separate out plug merge method from regular one.
* Spurious ioc->lock locking around cic lookup hint assignment
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq_get_io_context() would fail if multiple tasks race to insert cic's
for the same association. This patch restructures
cfq_get_io_context() such that slow path insertion race is handled
properly.
Note that the restructuring also causes cfq_get_io_context() to be
called under queue_lock and to perform both ioc and cfqd insertions
while holding both ioc and queue locks. This is part of on-going locking
tightening and will be used to simplify synchronization rules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ioprio/cgroup change was handled by marking the changed state in ioc
and, on the following access to the ioc, performing RCU-protected
iteration through all cic's grabbing the matching queue_lock.
This patch moves the changed state to each cic. When ioprio or cgroup
changes, the respective bit is set on all cic's of the ioc and, when
each of those cic's (not the ioc) is accessed, the change is applied
for that specific ioc-queue pair.
This also fixes the following two race conditions between setting and
clearing of changed states.
* Missing barrier between assign/load of ioprio and ioprio_changed
allowed applying old ioprio.
* Change requests could happen between application of change and
clearing of changed variables.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
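The marking side can be sketched as below; the per-cic 'changed' bitmask
and the list/field names follow the description above but should be
treated as illustrative:

    void ioc_set_changed(struct io_context *ioc, int which)
    {
        struct cfq_io_context *cic;
        struct hlist_node *n;

        /* set the bit on every cic of this ioc; each cic applies the
         * change lazily when it is next accessed for its own queue */
        hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
            set_bit(which, &cic->changed);
    }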
Make the following changes to prepare for ioc/cic management cleanup.
* Add cic->q so that ioc can determine the associated queue without
querying cfq. This will eventually replace ->key.
* Factor out cfq_release_cic() from cic_free_func(). This function
assumes that the caller handled locking.
* Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
take only @cic.
* Restructure cfq_cic_link() for future updates.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
failure instead of 0 / -errno or boolean. Update it such that it
returns %true on success and %false on failure.
* Make sure the caller checks for the return value.
* Separate out __blk_get_queue() which doesn't check whether @q is
dead and put it in blk.h. This will be used later.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio(). The former is
always called from the local task while the latter can be called from
a different task. The synchronization between them is peculiar and
dubious.
* current_io_context() doesn't grab task_lock() and assumes that if it
saw %NULL ->io_context, it would stay that way until allocation and
assignment is complete. It has smp_wmb() between alloc/init and
assignment.
* set_task_ioprio() grabs task_lock() for assignment and does
smp_read_barrier_depends() between "ioc = task->io_context" and "if
(ioc)". Unfortunately, this doesn't achieve anything - the latter
is not a dependent load of the former. ie, if ioc itself were being
dereferenced "ioc->xxx", it would mean something (not sure what,
though) but, as the code currently stands, the dependent read barrier
is a noop.
As only one of the two test-assignment sequences is task_lock()
protected, task_lock() can't do much about the race between the two.
Nothing prevents current_io_context() and set_task_ioprio() from each
allocating its own ioc for the same task and overwriting the other's.
Also, set_task_ioprio() can race with an exiting task and create a new
ioc after exit_io_context() is finished.
ioc get/put doesn't have any reason to be complex. The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive. All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.
This patch updates ioc get/put so that it becomes more conventional.
* alloc_io_context() is replaced with get_task_io_context(). This is
the only interface which can acquire access to ioc of another task.
On return, the caller has an explicit reference to the object which
should be put using put_io_context() afterwards.
* The functionality of current_io_context() remains the same but when
creating a new ioc, it shares the code path with
get_task_io_context() and always goes through task_lock().
* get_io_context() now means incrementing ref on an ioc which the
caller already has access to (be that an explicit refcnt or implicit
%current one).
* PF_EXITING inhibits creation of new io_context and once
exit_io_context() is finished, it's guaranteed that both ioc
acquisition functions return %NULL.
* All users are updated. Most are trivial but
smp_read_barrier_depends() removal from cfq_get_io_context() needs a
bit of explanation. I suppose the original intention was to ensure
ioc->ioprio is visible when set_task_ioprio() allocates new
io_context and installs it; however, this wouldn't have worked
because set_task_ioprio() doesn't have wmb between init and install.
There are other problems with this which will be fixed in another
patch.
* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
specification.
-v2: Vivek spotted contamination from debug patch. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* int return from put_io_context() wasn't used by anybody. Make it
return void like other put functions and docbook-fy the function
comment.
* Reorder dummy declarations for !CONFIG_BLOCK case a bit.
* Make alloc_io_context() use __GFP_ZERO allocation, take init out of
if block and drop 0'ing.
* Docbook-fy current_io_context() comment.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context. Move it to q->id and allocate on queue init and
free on queue release. This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_insert_cloned_request(), blk_execute_rq_nowait() and
blk_flush_plug_list() either didn't check whether the queue was dead
or did it without holding queue_lock. Update them so that dead state
is checked while holding queue_lock.
AFAICS, this plugs all holes (requeue doesn't matter as the request is
transitioning atomically from in_flight to queued).
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When trying to drain all requests, blk_drain_queue() checked only
q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This
patch updates blk_drain_queue() such that it looks at all the counters
and queues so that request_queue is actually empty on completion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
macro and use it.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
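The macro is a one-liner along these lines (sketch):

    #define blk_queue_dead(q)    test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)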
The only user left for blk_insert_request() is sx8 and it can be
trivially switched to use blk_execute_rq_nowait() - special requests
aren't included in io stat and sx8 doesn't use block layer tagging.
Switch sx8 and kill blk_insert_request().
This patch doesn't introduce any functional difference.
Only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations. Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead. Most conversions are straightforward.
Noteworthy changes are,
* In cgroup_freezer, remove unnecessary NULL assignments to unused
methods. It's useless and very prone to get out of sync, which
already happened.
* In cpuset, PF_THREAD_BOUND test is checked for each task. This
doesn't make any practical difference but is conceptually cleaner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
cfq_cic_link() has a race condition. When processes sharing an ioc issue
I/O to the same block device simultaneously, cfq_cic_link() sometimes
returns -EEXIST. The race condition can stop I/O via the following steps:
step 1: Process A: Issue an I/O to /dev/sda
step 2: Process A: Get an ioc (iocA here) in get_io_context() which is not
yet linked with a cic for the device
step 3: Process A: Get a new cic for the device (cicA here) in
cfq_alloc_io_context()
step 4: Process B: Issue an I/O to /dev/sda
step 5: Process B: Get iocA in get_io_context() since process A and B share the
same ioc
step 6: Process B: Get a new cic for the device (cicB here) in
cfq_alloc_io_context() since iocA has not been linked with a
cic for the device yet
step 7: Process A: Link cicA to iocA in cfq_cic_link()
step 8: Process A: Dispatch I/O to driver and finish it
step 9: Process B: Try to link cicB to iocA in cfq_cic_link()
But it fails, printing the "cfq: cic link failed!" kernel
message, since iocA was already linked with cicA at step 7.
step 10: Process B: Wait for the I/O to finish in get_request_wait()
The function does not wake up, because there is no I/O to the
device.
When cfq_cic_link() returns -EEXIST, it means the ioc has already been
linked with a cic. So when cfq_cic_link() returns -EEXIST, retry
cfq_cic_lookup().
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
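The core of the fix in cfq_get_io_context() can be sketched like this
(the retry label sits just before the cfq_cic_lookup() call; treat the
snippet as illustrative rather than the exact patch):

    ret = cfq_cic_link(cfqd, ioc, cic, gfp_mask);
    if (ret == -EEXIST) {
        /* another task already linked a cic for this ioc/device;
         * drop ours and retry - cfq_cic_lookup() will now find it */
        cfq_cic_free(cic);
        goto retry;
    }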
If we fail allocating the blkpg stats, we free cfqd and cfqq.
But we need to free the IDA cfqd->cic_index as well.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct request_queue is allocated with __GFP_ZERO so its "node" field is
zero before initialization. This causes an oops if node 0 is offline in
the page allocator because its zonelists are not initialized. From Dave
Young's dmesg:
SRAT: Node 1 PXM 2 0-d0000000
SRAT: Node 1 PXM 2 100000000-330000000
SRAT: Node 0 PXM 1 330000000-630000000
Initmem setup node 1 0000000000000000-000000000affb000
...
Built 1 zonelists in Node order, mobility grouping on.
...
BUG: unable to handle kernel paging request at 0000000000001c08
IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870
and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
zonelist->_zonerefs.
The fix is to initialize q->node at the time of allocation so the correct
node is passed to the slab allocator later.
Since blk_init_allocated_queue_node() is no longer needed, merge it with
blk_init_allocated_queue().
[rientjes@google.com: changelog, initializing q->node]
Cc: stable@vger.kernel.org [2.6.37+]
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
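The fix amounts to recording the node when the queue itself is allocated,
roughly (in blk_alloc_queue_node()):

    q = kmem_cache_alloc_node(blk_requestq_cachep,
                              gfp_mask | __GFP_ZERO, node_id);
    if (!q)
        return NULL;

    q->node = node_id;    /* later allocations (request pool etc.) use this */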
After flushing the plug list, the list has no requests, so we need to
add a trace_block_plug().
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
get_request_wait() could sleep and flush the plug list. If the list is
already flushed, don't flush again.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even after commit 5478755616
("block: check for proper length of iov entries earlier ...")
we still won't check for zero-length entries after an unaligned
entry. Remove the break-statement, so all entries are checked.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
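The validation loop then looks roughly like this - an unaligned entry no
longer terminates the scan, so a zero-length entry after it is still
rejected:

    for (i = 0; i < iov_count; i++) {
        unsigned long uaddr = (unsigned long)iov[i].iov_base;

        if (!iov[i].iov_len)
            return -EINVAL;

        /* remember misalignment, but keep checking the remaining entries */
        if (uaddr & queue_dma_alignment(q))
            unaligned = 1;
    }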
This reverts commit a72c5e5eb7.
The commit introduced alias for block devices which is intended to be
used during logging although actual usage hasn't been committed yet.
This approach adds very limited benefit (raw log might be easier to
follow) which can be trivially implemented in userland but has a lot
of problems.
It is much worse than netif renames because it doesn't rename the
actual device but just adds a convenience name which isn't used
universally or enforced. Everything internal including device lookup
and sysfs still uses the internal name and nothing prevents two
devices from using conflicting alias - ie. sda can have sdb as its
alias.
This has been nacked by people working on device driver core, block
layer and kernel-userland interface and shouldn't have been
upstreamed. Revert it.
http://thread.gmane.org/gmane.linux.kernel/1155104
http://thread.gmane.org/gmane.linux.scsi/68632
http://thread.gmane.org/gmane.linux.scsi/69776
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...
Fix up trivial header file inclusion conflict in drivers/block/loop.c
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
blk_cleanup_queue() may be called before elevator is set up on a
queue which triggers the following oops.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
...
Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
Bochs Bochs
RIP: 0010:[<ffffffff8125a69c>] [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
...
Call Trace:
[<ffffffff8125da92>] blk_drain_queue+0x42/0x70
[<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
[<ffffffff81469640>] md_free+0x50/0x70
[<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
[<ffffffff81270d56>] kref_put+0x36/0xa0
[<ffffffff8126f2b7>] kobject_put+0x27/0x60
[<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
[<ffffffff81083450>] process_one_work+0x100/0x3b0
[<ffffffff8108527f>] worker_thread+0x15f/0x3a0
[<ffffffff81089937>] kthread+0x87/0x90
[<ffffffff81621834>] kernel_thread_helper+0x4/0x10
Fix it by making blk_cleanup_queue() check whether q->elevator is set
up before invoking blk_drain_queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This file isn't using full modular functionality, and hence
can be "downgraded" to just using the export.h header.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
blkcg->policy_list is protected by blkcg->lock. It is not an RCU
protected list, so even readers need to take blkcg->lock. There are a
few functions which were reading the list without taking the lock. Fix
them.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
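The reader-side rule is then simply the following (sketch; struct and
field names as used by blk-cgroup at the time):

    struct blkio_policy_node *pn;

    spin_lock_irq(&blkcg->lock);
    list_for_each_entry(pn, &blkcg->policy_list, node) {
        /* pn is only stable while blkcg->lock is held */
    }
    spin_unlock_irq(&blkcg->lock);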
If a rule is being deleted, free up associated policy node. Otherwise
that memory is leaked.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the tag depth is reduced, it is max_depth that shrinks, not
real_max_depth. So we should still allow a request with tag >=
max_depth, but a tag >= real_max_depth really does indicate a problem.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following command sequence triggers an oops.
# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
Call Trace:
[<ffffffff810d2845>] lock_acquire+0x95/0x140
[<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
[<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
[<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
[<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
[<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
[<ffffffff811c40df>] blkdev_put+0x5f/0x190
[<ffffffff8118f18d>] kill_block_super+0x4d/0x80
[<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
[<ffffffff8119003a>] deactivate_super+0x4a/0x70
[<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
[<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
[<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
This is because bdev holds on to disk but disk doesn't pin the
associated queue. If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.
Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.
Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A dm-multipath user reported[1] a problem when trying to boot
a kernel with commit 4853abaae7
(block: fix flush machinery for stacking drivers with differring
flush flags) applied. It turns out that an empty flush request
can be sent into blk_insert_flush. When the BUG_ON was fixed
to allow for this, I/O on the underlying device would stall. The
reason is that blk_insert_cloned_request does not kick the queue.
In the aforementioned commit, I had added a special case to
kick the queue if data was sent down but the queue flags did
not require a flush. A better solution is to push the queue
kick up into blk_insert_cloned_request.
This patch, along with a follow-on which fixes the BUG_ON, fixes
the issue reported.
[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html
Reported-by: Christophe Saout <christophe@saout.de>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A user reported a regression due to commit
4853abaae7 (block: fix flush
machinery for stacking drivers with differring flush flags).
Part of the problem is that blk_insert_flush required a
single bio be attached to the request. In reality, having
no attached bio is also a valid case, as can be observed with
an empty flush.
[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html
Reported-by: Christophe Saout <christophe@saout.de>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio originally had the functionality to set the completion CPU, but
it is broken.
Christoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken. The only thing keeping
it serves is creating more confusion and possibly more bugs."
And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".
So this patch tries to remove all the work of controlling the
completion CPU from a bio.
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), the block
layer expects that there's no request passing through the
request_queue and that no new one will be issued.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_cleanup_queue() destroys the queue and
elevator whether IOs are in progress or not, and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of the teardown
happens on release.
This patch makes the following changes to make blk_cleanup_queue()
behave as a proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
attempt_plug_merge() accesses the elevator without holding queue_lock
and may call into ->elevator_bio_merge_fn(). The elevator is
guaranteed to be valid because it's accessed iff the plugged list has
requests and the elevator is never exited with live requests, so as
long as the elevator method can deal with unlocked access, this is
safe.
Explain the sync rules around attempt_plug_merge() and drop the
unnecessary @tsk parameter.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently get_request[_wait]() allocates request whether queue is dead
or not. This patch makes get_request[_wait]() return NULL if @q is
dead. blk_queue_bio() is updated to fail the submitted bio if request
allocation fails. While at it, add docbook comments for
get_request[_wait]().
Note that the current code has a rather unclear assumption (there are
spurious DEAD tests scattered around) that the owner of a queue
guarantees that no request travels the block layer if the queue is
dead. This patch in itself doesn't change much; however, it will allow
fixing the broken assumption in the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
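The check itself is small; a sketch of what sits at the top of
get_request(), with blk_queue_bio() failing the bio when NULL comes back:

    if (unlikely(blk_queue_dead(q)))
        return NULL;    /* caller ends the bio with an error */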
blk_throtl_bio() and throtl_get_tg() have rather unusual interface.
* throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
and drops queue_lock in the latter case. Different locking context
depending on return value is error-prone and DEAD state is scheduled
to be protected by queue_lock anyway. Move DEAD check inside
queue_lock and return valid tg or NULL.
* blk_throtl_bio() indicates return status both with its return value
and the in/out param **@bio. The former indicates whether the queue
was found to be dead during throtl processing; the latter whether
the bio was throttled.
There's no point in returning DEAD check result from
blk_throtl_bio(). The queue can die after blk_throtl_bio() is
finished but before make_request_fn() grabs queue lock.
Make it take *@bio instead and return boolean result indicating
whether the request is throttled or not.
This patch doesn't cause any visible functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
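The submission-path side then reduces to a plain boolean test, roughly:

    /* in the make_request path (sketch) */
    if (blk_throtl_bio(q, bio))
        return;    /* bio was throttled; blk-throtl dispatches it later */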
Reorganize queue draining related code in preparation of queue exit
changes.
* Factor out actual draining from elv_quiesce_start() to
blk_drain_queue().
* Make elv_quiesce_start/end() responsible for their own locking.
* Replace open-coded ELVSWITCH clearing in elevator_switch() with
elv_quiesce_end().
This patch doesn't cause any visible functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
completely bogus. The caller must have a reference to the queue on
entry and taking an extra reference doesn't change anything.
For scsi_cmd_ioctl(), the only effect is that it ends up checking
QUEUE_FLAG_DEAD on entry; however, this is bogus as the queue can die
right after blk_get_queue(). A dead queue should be, and is, handled
in the request issue path (it's somewhat broken now but that's a
separate problem and doesn't affect this one much).
throtl_get_tg() incorrectly assumes that q is rcu freed. Also, it
doesn't check return value of blk_get_queue(). If the queue is
already dead, it ends up doing an extra put.
Drop them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_alloc_request() and freed_request() take different combinations of
REQ_* @flags, @priv and @is_sync, where @flags is a superset of the
latter two. Make them take @flags only. This cleans up the code a bit and
will ease updating allocation related REQ_* flags.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_throtl interface is block internal and there's no reason to have
them in linux/blkdev.h. Move them to block/blk.h.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkio_policy_parse_and_set() calls blkio_check_dev_num() to check
whether the given dev_t is valid. blkio_check_dev_num() uses
get_gendisk() for verification but never puts the returned genhd,
leaking the reference.
This patch collapses blkio_check_dev_num() into its caller and updates
it such that the genhd is put before returning.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following command sequence triggers an oops.
# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
Call Trace:
[<ffffffff810d2845>] lock_acquire+0x95/0x140
[<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
[<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
[<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
[<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
[<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
[<ffffffff811c40df>] blkdev_put+0x5f/0x190
[<ffffffff8118f18d>] kill_block_super+0x4d/0x80
[<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
[<ffffffff8119003a>] deactivate_super+0x4a/0x70
[<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
[<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
[<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
This is because bdev holds on to disk but disk doesn't pin the
associated queue. If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.
Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.
Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A kernel crash is observed when a mounted ext3/ext4 filesystem is
physically removed. The problem is that blk_cleanup_queue() frees up
some resources, e.g. by calling elevator_exit(), which are not checked for
in normal operation. So we should rather move these calls to the
destructor function blk_release_queue(), as at that point all remaining
references are gone. However, in doing so we have to ensure that any
externally supplied queue_lock is disconnected, as the driver might free
up the lock after the call to blk_cleanup_queue().
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bug is we're not able to remove the device from blkio cgroup's
per-device control files if it gets unplugged.
To reproduce the bug:
# mount -t cgroup -o blkio xxx /cgroup
# cd /cgroup
# echo "8:0 1000" > blkio.throttle.read_bps_device
# unplug the device
# cat blkio.throttle.read_bps_device
8:0 1000
# echo "8:0 0" > blkio.throttle.read_bps_device
-bash: echo: write error: No such device
After patching, the device removal will succeed.
Thanks for the comments of Paul, Zefan, and Vivek.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kerneldoc for blk_release_queue() is referring to blk_cleanup_queue().
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thus spake Andrew Morton:
"And I have the usual maintainability whine. If someone comes up to
vmscan.c and sees it calling blk_start_plug(), how are they supposed to
work out why that call is there? They go look at the blk_start_plug()
definition and it is undocumented. I think we can do better than this?"
Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens
Axboe and from an earlier attempt by Shaohua Li to document blk-plug.
[akpm@linux-foundation.org: grammatical and spelling tweaks]
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move all the checks performed on a bio into a new helper, and call it as
soon as a bio is submitted, even if it is a re-submission from ->make_request.
We explicitly mark the new helper as non-inlined, as the stack
usage for printing the block device name in the failure case is quite
high and this is a patch where we have to be extremely conservative about
stack usage.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In __blk_complete_request, we check both QUEUE_FLAG_SAME_COMP and req->cpu
to decide whether we should use req->cpu. Actually the user can also
select the completion cpu by either setting BIO_CPU_AFFINE or by calling
bio_set_completion_cpu. The current solution makes these two ways no longer
work. So we'd better just check req->cpu.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is very little benefit in allowing a ->make_request
instance to update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same by calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.
Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Avoid the hacks needed for request-based device mappers currently by simply
exporting the symbol instead of trying to get it through the back door.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We have ELV_NAME_MAX defined to 16, and hence we should use it
instead of the magic number 16 for the elevator's name string.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch allows the user to set an "alias" of the disk via sysfs interface.
This patch only adds a new attribute "alias" in gendisk structure.
To show the alias instead of the device name in kernel messages,
we need to revise printk messages and use alias_name() in them.
Example:
(current) printk("disk name is %s\n", disk->disk_name);
(new) printk("disk name is %s\n", alias_name(disk));
Users can use letters, numbers, '-' and '_' in the "alias" attribute. A disk can
have an "alias" whose length is up to 255 bytes. This attribute is write-once.
Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nao Nishijima <nao.nishijima.xt@hitachi.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Clean up the code a little bit. attempt_plug_merge() traverses the plug
list anyway, so we can do the request counting there and reduce the stack
size a little bit.
The motivation here is that I suspect we should count the requests for each
queue (a task could handle multiple disks in the meantime), but my test doesn't
show it's worth doing. If somebody proves we should do it, the change below
will make that easier.
Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Do blk_flush_plug_list() first and then add new request at the tail. New
request can't be merged to existing requests, but later new requests might
be merged with this new one. If blk_flush_plug_list() is done later, the
merge doesn't happen.
Believe it or not, this fixes a 10% regression running sysbench workload.
Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit 5757a6d76c added the QUEUE_FLAG_SAME_FORCE flag, but fails to
clear that flag when the current state is '2' (SAME_COMP + SAME_FORCE)
and the new state is '1' (SAME_COMP).
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Eric Seppanen <eric@purestorage.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).
This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions. disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.
-v2: Updated to make it clear that only partition scan is suppressed
not partitioning itself as suggested by Kay Sievers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add a new REQ_PRIO to let requests preempt others in the cfq I/O scheduler,
and leave REQ_META purely for marking requests as metadata in blktrace.
All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
Revert "cfq: Remove special treatment for metadata rqs."
block: fix flush machinery for stacking drivers with differring flush flags
block: improve rq_affinity placement
blktrace: add FLUSH/FUA support
Move some REQ flags to the common bio/request area
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
xen/blkback: Make description more obvious.
cfq-iosched: Add documentation about idling
block: Make rq_affinity = 1 work as expected
block: swim3: fix unterminated of_device_id table
block/genhd.c: remove useless cast in diskstats_show()
drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
bsg-lib: add module.h include
cfq-iosched: Reduce linked group count upon group destruction
blk-throttle: correctly determine sync bio
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
loop: add management interface for on-demand device allocation
loop: replace linked list of allocated devices with an idr index
...
We have had a kernel build regression since 3.1-rc1, which is about a 10%
regression. The kernel source is in an ext3 filesystem.
Alex Shi bisected it to commit:
commit a07405b780
Author: Justin TerAvest <teravest@google.com>
Date: Sun Jul 10 22:09:19 2011 +0200
cfq: Remove special treatment for metadata rqs.
Apparently this is caused by the lack of metadata preemption, where ext3/ext4
do use READ_META. I didn't see a way to fix the issue, so I suggest
reverting the patch.
This reverts commit a07405b780.
Reported-by: Alex Shi<alex.shi@intel.com>
Reported-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit ae1b153962, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch reverts commit 35ae66e0a09ab70ed (block: Make rq_affinity = 1
work as expected). The purpose is to avoid an unnecessary IPI.
Let's take an example. My test box has cpus 0-7, one socket. Say a request is
added from CPU 1 and blk_complete_request() occurs at CPU 7. Without the reverted
patch, the softirq will be done at CPU 7. With it, an IPI will be directed to CPU
0, and the softirq will be done at CPU 0. In this case, doing the softirq at CPU 0
or at CPU 7 makes no difference from a cache sharing point of view, and we can
avoid an IPI by doing it at CPU 7.
An immediate concern is that this looks just like QUEUE_FLAG_SAME_FORCE, but it
actually is not. blk_complete_request() runs in the interrupt handler, and
currently the I/O controller doesn't support multiple interrupts (I checked
several LSI cards and AHCI), so only one CPU can run blk_complete_request().
This is still quite different from QUEUE_FLAG_SAME_FORCE.
Since only one CPU runs the softirq, the only difference with the patch below is
that the softirq does not always run at the first CPU of a group.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk_insert_flush has the following check:
/*
* If there's data but flush is not necessary, the request can be
* processed directly without going through flush machinery. Queue
* for normal execution.
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
However, blk_flush_policy will not return with policy set to only
REQ_FSEQ_DATA:
static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
unsigned int policy = 0;
if (fflags & REQ_FLUSH) {
if (rq->cmd_flags & REQ_FLUSH)
policy |= REQ_FSEQ_PREFLUSH;
if (blk_rq_sectors(rq))
policy |= REQ_FSEQ_DATA;
if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
policy |= REQ_FSEQ_POSTFLUSH;
}
return policy;
}
Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set. Fix this
mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH
check.
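With the fix, the function would look roughly like this, with REQ_FSEQ_DATA keyed
purely on the presence of data:

	static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
	{
		unsigned int policy = 0;

		if (blk_rq_sectors(rq))
			policy |= REQ_FSEQ_DATA;

		if (fflags & REQ_FLUSH) {
			if (rq->cmd_flags & REQ_FLUSH)
				policy |= REQ_FSEQ_PREFLUSH;
			if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
				policy |= REQ_FSEQ_POSTFLUSH;
		}
		return policy;
	}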
Tejun notes:
Hmmm... yes, this can become a correctness issue if (and only if)
blk_queue_flush() is called to change q->flush_flags while requests
are in-flight; otherwise, requests wouldn't reach the function at all.
Also, I think it would be a generally good idea to always set
FSEQ_DATA if the request has data.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit 5757a6d76c introduced a new rq_affinity = 2 so as to complete
the request on the __make_request cpu. But it makes the
old rq_affinity = 1 no longer work. The root cause is that
if 'cpu' and 'req->cpu' are in the same group and cpu != req->cpu,
ccpu will be the same as group_cpu, so the completion will be
executed on 'cpu' and not 'group_cpu'.
This patch fixes the problem by simply removing group_cpu, and the code
is more explicit now. If ccpu == cpu, we complete on cpu, otherwise
we raise_blk_irq to ccpu.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
init_fault_attr_dentries() is used to export fault_attr via debugfs,
but it can only export it in the debugfs root directory.
Per Forlin is working on mmc_fail_request, which adds support for injecting
data errors after a completed host transfer in the MMC subsystem.
The fault_attr for mmc_fail_request should be defined per mmc host and
exported in a per-host debugfs directory like
/sys/kernel/debug/mmc0/mmc_fail_request.
init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
introduces fault_create_debugfs_attr(), which is able to create the
directory in an arbitrary directory, and replaces
init_fault_attr_dentries().
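A hedged usage sketch; the mmc-side names (fail_mmc_request, host_root) are
purely illustrative:

	#include <linux/fault-inject.h>
	#include <linux/err.h>

	static DECLARE_FAULT_ATTR(fail_mmc_request);

	static int example_add_host_debugfs(struct dentry *host_root)
	{
		struct dentry *dir;

		/* create the fault-injection knobs under the per-host dir,
		 * not under the debugfs root */
		dir = fault_create_debugfs_attr("fail_mmc_request", host_root,
						&fail_mmc_request);
		return IS_ERR(dir) ? PTR_ERR(dir) : 0;
	}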
[akpm@linux-foundation.org: extraneous semicolon, per Randy]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Tested-by: Per Forlin <per.forlin@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the (unsigned long long) cast in diskstats_show() and adjust the
seq_printf() format string to 'unsigned long'.
diskstats_show() uses part_stat_read() to get the stats, which either
accesses the specified field in the struct disk_stats directly (non SMP)
or sums up the per CPU values in a variable of the same type as the field,
so in any case the result will have the same type and range as the
specified field, which for all disk_stats entries is unsigned long.
Also, for unsigned long ranges the output of %lu should be identical to
that of %llu, so there is no change in the actual proc entry contents.
Signed-off-by: Herbert Poetzl <herbert@13thfloor.at>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Due to conflicts with the moduleh tree in linux-next, we
run into an include file mess. We really need export.h
in that tree, but if we add module.h locally then the
issue is easier to resolve.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
CFQ keeps track of the number of groups which are linked on blkcg->blkg_list.
This is useful to avoid races between the queue exit and cgroup exit code
paths. So if at request queue exit time the linked group count is not
zero, that means there are some groups out there which are yet to be
deleted under the rcu read period, and the queue exit code should wait for
an rcu period.
In my previous patch I forgot to decrease the group count.
So in its current form, nr_blkcg_linked_grps is always non-zero and
we will always wait one rcu period (if BLK_CGROUP=y). The side effect
of this is that it can increase boot time. I am surprised nobody has
complained so far.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
A read request is always sync. Use rw_is_sync() to determine
whether a bio is sync.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This moves the FC classes bsg code to the block layer and
makes it a lib so that other classes like iscsi and SAS can use it.
It is helpful because working with the request queue, bios,
creating scatterlists, etc are a pain that the LLD does not
have to worry about with normal IOs and should not have to
worry about for bsg requests.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This changes should_fail_request() to a more usable wrapper function around
should_fail(). It avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
the middle of a function.
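The wrapper idea, sketched (the exact parameters in blk-core.c may differ):

	#ifdef CONFIG_FAIL_MAKE_REQUEST
	static DECLARE_FAULT_ATTR(fail_make_request);

	static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
	{
		return part->make_it_fail && should_fail(&fail_make_request, bytes);
	}
	#else /* CONFIG_FAIL_MAKE_REQUEST */
	static inline bool should_fail_request(struct hd_struct *part,
					       unsigned int bytes)
	{
		return false;
	}
	#endif

Callers can then use it unconditionally, with no #ifdef in the middle of a function.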
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 5757a6d7 introduced an unsafe call to
smp_processor_id(), with preempt debugging turned on we spew a lot of:
BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
caller is __make_request+0x1b8/0x308
[<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0)
[<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308)
[<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558)
[<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138)
[<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c)
[<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8)
[<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688)
[<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224)
[<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94)
[<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8)
Fix this by just using raw_smp_processor_id(), it's just a hint
after all. There's no pinning of the CPU or accessing per-cpu
structures involved.
Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-3.1/drivers' of git://git.kernel.dk/linux-block:
cciss: do not attempt to read from a write-only register
xen/blkback: Add module alias for autoloading
xen/blkback: Don't let in-flight requests defer pending ones.
bsg: fix address space warning from sparse
bsg: remove unnecessary conditional expressions
bsg: fix bsg_poll() to return POLLOUT properly
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
block: strict rq_affinity
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
block: fix patch import error in max_discard_sectors check
block: reorder request_queue to remove 64 bit alignment padding
CFQ: add think time check for group
CFQ: add think time check for service tree
CFQ: move think time check variables to a separate struct
fixlet: Remove fs_excl from struct task.
cfq: Remove special treatment for metadata rqs.
block: document blk_plug list access
block: avoid building too big plug list
compat_ioctl: fix make headers_check regression
block: eliminate potential for infinite loop in blkdev_issue_discard
compat_ioctl: fix warning caused by qemu
block: flush MEDIA_CHANGE from drivers on close(2)
blk-throttle: Make total_nr_queued unsigned
block: Add __attribute__((format(printf...) and fix fallout
fs/partitions/check.c: make local symbols static
block:remove some spare spaces in genhd.c
block:fix the comment error in blkdev.h
...
Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) has power left
to spare.
Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.
echo 2 > /sys/block/<bdev>/queue/rq_affinity
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
A '!' snuck in before the unlikely, rendering it useless.
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits)
[SCSI] fix crash in scsi_dispatch_cmd()
[SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise
[SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu
[SCSI] bfa: Update the driver version to 3.0.2.1
[SCSI] bfa: Driver and BSG enhancements.
[SCSI] bfa: Added support to query PHY.
[SCSI] bfa: Added HBA diagnostics support.
[SCSI] bfa: Added support for flash configuration
[SCSI] bfa: Added support to obtain SFP info.
[SCSI] bfa: Added support for CEE info and stats query.
[SCSI] bfa: Extend BSG interface.
[SCSI] bfa: FCS bug fixes.
[SCSI] bfa: DMA memory allocation enhancement.
[SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support.
[SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes.
[SCSI] bfa: Added Fabric Assigned Address(FAA) support
[SCSI] bfa: IOC bug fixes.
[SCSI] bfa: Enable ASIC block configuration and query.
[SCSI] bnx2i: Updated copyright and bump version
[SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported
...
Fix up some trivial conflicts in:
- drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}:
Crazy broadcom version number conflicts
- drivers/target/tcm_fc/tfc_cmd.c
Just trivial cleanups done on adjacent lines
USB surprise removal of sr is triggering an oops in
scsi_dispatch_command(). What seems to be happening is that USB is
hanging on to a queue reference until the last close of the upper
device, so the crash is caused by surprise remove of a mounted CD
followed by attempted unmount.
The problem is that USB doesn't issue its final commands as part of
the SCSI teardown path, but on last close when the block queue is long
gone. The long term fix is probably to make sr do the teardown in the
same way as sd (so remove all the lower bits on ejection, but keep the
upper disk alive until last close of user space). However, the
current oops can be simply fixed by not allowing any commands to be
sent to a dead queue.
Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently when the last queue of a group has no request, we don't expire
the queue in the hope that a request from the group comes soon, so the group
doesn't miss its share. But if the think time is big, the assumption isn't
correct and we just waste bandwidth. In such a case, we don't idle.
[global]
runtime=30
direct=1
[test1]
cgroup=test1
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file1
thinktime=9000
[test2]
cgroup=test2
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file2
patched base
test1 64k 39k
test2 548k 540k
total 604k 578k
group1 gets much better throughput because it waits less time.
To check whether the patch changes the behavior of queues without think time, I also
tried giving test1 a 2ms think time or no think time. The test result is stable.
The throughput doesn't change with/without the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently when the last queue of a service tree has no request, we don't
expire the queue in the hope that a request from the service tree comes soon,
so the service tree doesn't miss its share. But if the think time is big, the
assumption isn't correct and we just waste bandwidth. In such a case, we
don't idle.
[global]
runtime=10
direct=1
[test1]
rw=randread
ioengine=libaio
size=500m
directory=/mnt
filename=file1
thinktime=9000
[test2]
rw=read
ioengine=libaio
size=1G
directory=/mnt
filename=file2
patched base
test1 41k/s 33k/s
test2 15868k/s 15789k/s
total 15902k/s 15817k/s
Slightly better overall.
To check whether the patch changes the behavior of queues without think time, I also
tried giving test1 a 2ms think time or no think time. The test has variation
even without the patch, but the average throughput doesn't change with/without
the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Move the variables used for the think time check to a separate struct. This is
in preparation for adding the think time check for the service tree and group. No
functional change.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
fs_excl is a poor man's priority inheritance for filesystems to hint to
the block layer that an operation is important. It was never clearly
specified, not widely adopted, and will not prevent starvation in many
cases (like across cgroups).
fs_excl was introduced with the time sliced CFQ IO scheduler, to
indicate when a process held FS exclusive resources and thus needed
a boost.
It doesn't cover all file systems, and it was never fully complete.
Let's kill it.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There is no consistency among filesystems as to which bios (or requests)
are marked as being metadata. It's interesting to expose this in traces,
but we shouldn't schedule the requests differently based on whether or
not they're marked as being metadata.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When I test a fio script with big I/O depth, I found the total throughput drops
compared to a relatively small I/O depth. The reason is that the thread accumulates
big requests in its plug list and causes some delays (surely this depends
on CPU speed).
I thought we'd better have a threshold for requests. When the threshold is reached,
this means there is no request merge and queue lock contention isn't severe
when pushing per-task requests to the queue, so the main advantages of the blk plug
don't exist. We can force a plug list flush in this case.
With this, my test throughput actually increases and almost equals that of the small
I/O depth. Another side effect is that irq-off time decreases in blk_flush_plug_list()
for big I/O depth.
BLK_MAX_REQUEST_COUNT is chosen arbitrarily, but 16 is efficient enough to
reduce lock contention for me. But I'm open here, 32 is ok in my test too.
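As a sketch of the idea (the surrounding names follow the description above and
are illustrative rather than the exact diff):

	#define BLK_MAX_REQUEST_COUNT	16

	/* in the plugged submission path, after merging into the plug failed */
	if (plug) {
		if (request_count >= BLK_MAX_REQUEST_COUNT)
			blk_flush_plug_list(plug, false);	/* flush, then keep plugging */
		list_add_tail(&req->queuelist, &plug->list);
	}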
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Due to the recently identified overflow in read_capacity_16() it was
possible for max_discard_sectors to be zero but still have discards
enabled on the associated device's queue.
Eliminate the possibility for blkdev_issue_discard to infinitely loop.
Interestingly this issue wasn't identified until a device, whose
discard_granularity was 0 due to read_capacity_16 overflow, was consumed
by blk_stack_limits() to construct limits for a higher-level DM
multipath device. The multipath device's resulting limits never had the
discard limits stacked because blk_stack_limits() will only do so if
the bottom device's discard_granularity != 0. This resulted in the
multipath device's limits.max_discard_sectors being 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.
The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files.
Signed-off-by: Johannes Stezenbach <js@sig21.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, only open(2) is defined as the 'clearing' point. It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events. Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.
The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage. Among other
things, GET_EVENT can't detect media removal while TUR can.
This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers. disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.
This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).
Note: Given the current usage, it seems @clearing hint is needlessly
complex. disk_clear_events() can simply clear all events and the hint
can be boolean @flush.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
ioc->ioc_data is rcu protected, so use the correct API to access it.
This doesn't change any behavior, it just makes the code consistent.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I got an rcu warning at boot. ioc->ioc_data is rcu_dereference()d, but
we don't hold rcu_read_lock.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The second condition in the OR always implies the first condition is false,
thus bytes_read in the second is not needed. The same goes for
bytes_written.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
POLLOUT should be returned only if bd->queued_cmds < bd->max_queue
so that bsg_alloc_command() can proceed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The total of two unsigned values should also be unsigned.
Update throtl_log output to unsigned.
Update the total_nr_queued test to check for non-zero, to be the
same as the other total_nr_queued tests.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Use the compiler to verify format strings and arguments.
Fix fallout.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Use the compiler to verify format strings and arguments.
Fix fallout.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.
Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.
This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.
https://bugzilla.kernel.org/show_bug.cgi?id=34662
Fix it by adding an outer mutex which protects both block count
manipulation and work cancellation.
-v2: Use outer mutex instead of bit waitqueue per Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
After the previous update to disk_check_events(), nobody is using
non-syncing __disk_block_events(). Remove @sync and, as this makes
__disk_block_events() virtually identical to disk_block_events(),
remove the underscore prefixed version.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch is part of fix for triggering of WARN_ON_ONCE() in
disk_clear_events() reported in bug#34662.
https://bugzilla.kernel.org/show_bug.cgi?id=34662
disk_clear_events() blocks events, schedules and flushes the event
work. It expects the work to have started execution on schedule and
finished on return from flush. WARN_ON_ONCE() triggers if the event
work hasn't executed as expected. This problem happens because
__disk_block_events() fails to guarantee that the event work item is
not in flight on return from the function in race-free manner. The
problem is two-fold and this patch addresses one of them.
When __disk_block_events() is called with @sync == %false, it bumps
event block count, calls cancel_delayed_work() and returns. This makes
it impossible to guarantee that event polling is not in flight on
return from syncing __disk_block_events() - if the first blocker was
non-syncing, polling could still be in progress and later syncing ones
would assume that the first blocker already canceled it.
Making __disk_block_events() cancel_sync regardless of block count
isn't feasible either as it may race with forced event checking in
disk_clear_events().
As disk_check_events() is the only user of non-syncing
__disk_block_events(), updating it to directly cancel and schedule
event work is the easiest way to solve the issue.
Note that there's another bug in __disk_block_events() and this patch
doesn't fix the issue completely. Later patch will fix the other bug.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If we rename the return of alloc_io_context() and get_io_context() from
"ret" to "ioc" the code get's (a bit) more readable and (a lot) more
grepable.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.
This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.
Tracked in RH bugzilla here:
https://bugzilla.redhat.com/show_bug.cgi?id=577968
Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.
This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.
Tracked in RH bugzilla here:
https://bugzilla.redhat.com/show_bug.cgi?id=577968
Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Hi, Jens,
If you recall, I posted an RFC patch for this back in July of last year:
http://lkml.org/lkml/2010/7/13/279
The basic problem is that a process can issue a never-ending stream of
async direct I/Os to the same sector on a device, thus starving out
other I/O in the system (due to the way the alias handling works in both
cfq and deadline). The solution I proposed back then was to start
dispatching from the fifo after a certain number of aliases had been
dispatched. Vivek asked why we had to treat aliases differently at all,
and I never had a good answer. So, I put together a simple patch which
allows aliases to be added to the rb tree (it adds them to the right,
though that doesn't matter as the order isn't guaranteed anyway). I
think this is the preferred solution, as it doesn't break up time slices
in CFQ or batches in deadline. I've tested it, and it does solve the
starvation issue. Let me know what you think.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
list_entry() and hlist_entry() are both simply aliases for
container_of(), but since io_context.cic_list.first is an hlist_node one
should at least use the correct alias.
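For illustration (the member names follow the cfq code of that time):

	/* both expand to container_of(), but io_context.cic_list.first is a
	 * struct hlist_node *, so hlist_entry() is the matching alias */
	struct cfq_io_context *cic;

	cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context, cic_list);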
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
queue_fail can only be reached if cic is NULL, so its check for cic must
be bogus.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.
Also, the old "bool threadgroup" interface is removed, as replaced by
this. All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.
This is a pre-patch for cgroup-procs-writable.patch.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
drivers which have inadequate ->check_events(). Combined with earlier
change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to
userland), this enables using ->check_events() for internal processing
while avoiding enabling in-kernel block event polling which can lead
to infinite event loop.
Unfortunately, this left many drivers, including floppy, without any bit
set in disk->events and ->async_events, in which case disk_add_events()
simply skipped allocation of disk->ev, which disables the whole event
handling. As ->check_events() is still used during open processing
for revalidation, this can lead to open failure.
This patch always allocates disk->ev if ->check_events is implemented.
In the long term, it would make sense to simply include the event
structure inline into genhd as it's now used by virtually all block
devices.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When struct cfq_data allocation fails, cic_index needs to be freed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The 'group_changed' variable is initialized to 0 and never changed, so
checking the variable is meaningless.
It is a leftover from 0bbfeb8320 ("cfq-iosched: Always provide group
iosolation."). Let's get rid of it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Reduce the number of bit operations in cfq_choose_req() on average
(and worst) cases.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Simplify the calculation in cfq_prio_to_maxrq(), plus replace CFQ_PRIO_LISTS to
IOPRIO_BE_NR since they are the same and IOPRIO_BE_NR looks more reasonable in
this context IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If we don't explicitly initialize it to zero, CFQ might think that
the cgroup of the ioc has changed and generate lots of unnecessary calls
to call_for_each_cic(changed_cgroup). Fix it.
cfq_get_io_context()
cfq_ioc_set_cgroup()
call_for_each_cic(ioc, changed_cgroup)
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit 73c1010119 ("block: initial patch for on-stack per-task plugging")
removed calls to elv_bio_merged() when @bio merged with @req. Re-add them.
This in turn will update merged stats in associated group. That
should be safe as long as request has got reference to the blkio_group.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Divyesh Shah <dpshah@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Make BLKIO_STAT_MERGED per cpu, hence getting rid of the need to take
blkg->stats_lock.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We allocated per cpu stats struct for root group but did not free it.
Fix it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We don't need them anymore, so kill:
- REQ_ON_PLUG checks in various places
- !rq_mergeable() check in plug merging
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we take the queue lock on each bio to check if there are any
throttling rules associated with the group and also to update the stats.
Now access the group under rcu and update the stats without taking
the queue lock. The queue lock is taken only if there are throttling rules
associated with the group.
So in the common case of the root group, when there are no rules, we save
the unnecessary pounding of the request queue lock.
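A very rough sketch of the resulting fast path; the helper names here are
illustrative, not the exact ones in blk-throttle.c:

	rcu_read_lock();
	blkcg = task_blkio_cgroup(current);
	tg = throtl_lookup_tg(td, blkcg);	/* lookup only, no allocation */
	if (tg && !throtl_tg_has_rules(tg, rw)) {
		/* per-cpu stats update, no queue_lock needed */
		blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
		rcu_read_unlock();
		return false;			/* not throttled */
	}
	rcu_read_unlock();

	spin_lock_irq(q->queue_lock);		/* rules exist: take the slow path */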
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Now the dispatch stats update is lock free. But the reset of these stats still
takes blkg->stats_lock and is dependent on that. As the stats are per cpu,
we should be able to just reset the stats on each cpu without any locks
(at least for 64bit arches).
On 32bit arches there is a small race where 64bit updates are not atomic.
The result of this race can be that in the presence of other writers,
one might not get a 0 value after the reset of a stat and might see something
intermediate.
One can write more complicated code to cover this race, like sending an IPI
to other cpus to reset stats and, for offline cpus, resetting these directly.
Right now I am not taking that path because stats reset is more of a
debug feature, it can happen only on 32bit arches, and the possibility of
it happening is small. Will fix it if it becomes a real problem. For
the time being, going for code simplicity.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Some of the stats are 64bit and updates will be non-atomic on 32bit
architectures. Use sequence counters on 32bit arches to make reading
of the stats safe.
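The mechanism is essentially the u64_stats_sync pattern; a minimal sketch with
illustrative field names:

	#include <linux/u64_stats_sync.h>

	struct example_stats_cpu {
		u64 sectors;
		struct u64_stats_sync syncp;	/* a real seqcount only on 32-bit SMP */
	};

	static void example_add(struct example_stats_cpu *s, u64 nr)
	{
		u64_stats_update_begin(&s->syncp);
		s->sectors += nr;
		u64_stats_update_end(&s->syncp);
	}

	static u64 example_read(struct example_stats_cpu *s)
	{
		unsigned int start;
		u64 v;

		do {
			start = u64_stats_fetch_begin(&s->syncp);
			v = s->sectors;
		} while (u64_stats_fetch_retry(&s->syncp, start));
		return v;
	}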
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we take the blkg_stat lock even just for updating the stats. So even if
a group has no throttling rules (the common case for the root group), we end
up taking the blkg_lock for updating the stats.
Make dispatch stats per cpu so that these can be updated without taking
blkg lock.
If cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
start freeing up throtl_grp after one rcu grace period.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Use same helper function for root group as we use with dynamically
allocated groups to add it to various lists.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
A helper function for code which is used in 2-3 places. Makes reading
the code a little easier.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, we allocate root throtl_grp statically. But as we will be
introducing per cpu stat pointers and that will be allocated
dynamically even for root group, we might as well make whole root
throtl_grp allocation dynamic and treat it in same manner as other
groups.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, all the cfq_group or throtl_group allocations happen while
we are holding ->queue_lock and sleeping is not allowed.
Soon, we will move to per cpu stats and also need to allocate the
per group stats. As one can not call alloc_percpu() from atomic
context as it can sleep, we need to drop ->queue_lock, allocate the
group, retake the lock and continue processing.
In the throttling code, I check the queue DEAD flag again to make sure
that the driver did not call blk_cleanup_queue() in the meantime.
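Roughly, and simplified from the actual allocation paths:

	spin_unlock_irq(q->queue_lock);

	/* may sleep now: group and per-cpu stats allocation */
	tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);
	/* ... alloc_percpu() for the per-cpu stats would go here ... */

	spin_lock_irq(q->queue_lock);

	/* the driver may have cleaned up the queue while we slept */
	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
		kfree(tg);
		return NULL;
	}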
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blkg->key = cfqd is an rcu protected pointer and hence we used to do
call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.
The problem here is that even though cfqd is around, there are no
guarantees that the associated request queue (td->queue) or q->queue_lock
is still around. A driver might have called blk_cleanup_queue() and
released the lock.
It might happen that after the lock has been freed we call
blkg->key->queue->queue_lock and crash. This is possible in the following
path.
blkiocg_destroy()
blkio_unlink_group_fn()
cfq_unlink_blkio_group()
Hence, wait for an rcu period if there are groups which have not
been unlinked from blkcg->blkg_list. That way, any groups
which are taking the cfq_unlink_blkio_group() path can safely take the
queue lock.
This is how we have taken care of race in throttling logic also.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Nobody seems to be using cfq_find_alloc_cfqg() function parameter "create".
Get rid of that.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y.
There are some fields which are outside this config option. Fix that.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Group initialization code seems to be in two places: root group
initialization in blk_throtl_init() and dynamically allocated groups
in throtl_find_alloc_tg(). Create a common function and use it in both
places.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.
Also fixes up conflicts in the below files.
Conflicts:
drivers/block/paride/pcd.c
drivers/cdrom/viocd.c
drivers/ide/ide-cd.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk_cleanup_queue() calls elevator_exit() and after this, we can't
touch the elevator without oopsing. __elv_next_request() must check
for this state because in the refcounted queue model, we can still
call it after blk_cleanup_queue() has been called.
This was reported as causing an oops attributable to scsi.
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Let's check a scenario:
1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
2. blk_run_queue_async();
the second one will become a noop, because q->delay_work already has
WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after
SCSI_QUEUE_DELAY. But blk_run_queue_async actually wants the delayed
work to run immediately.
Fix this by doing a cancel on potentially pending delayed work
before queuing an immediate run of the workqueue.
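The resulting function is essentially (a sketch; the exact cancel helper may differ):

	void blk_run_queue_async(struct request_queue *q)
	{
		if (likely(!blk_queue_stopped(q))) {
			/* drop a pending delayed run so the 0-delay one isn't a noop */
			__cancel_delayed_work(&q->delay_work);
			queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
		}
	}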
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In some cases we would end up stacking discard_zeroes_data incorrectly.
Fix this by enabling the feature by default for stacking drivers and
clearing it for low-level drivers. Incorporating a device that does not
support dzd will then cause the feature to be disabled in the stacking
driver.
Also ensure that the maximum discard value does not overflow when
exported in sysfs and return 0 in the alignment and dzd fields for
devices that don't support discard.
Reported-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we first map the task to a cgroup and then the cgroup to the
blkio_cgroup. There is a more direct way to get to the blkio_cgroup
from a task using task_subsys_state(). Use that.
The real reason for the fix is that it also avoids a race in the generic
cgroup code. During remount/umount rebind_subsystems() is called and
it can do the following without any rcu protection:
cgrp->subsys[i] = NULL;
That means if somebody got hold of the cgroup under rcu and then tried
to do cgroup->subsys[] to get to the blkio_cgroup, it would get NULL, which
is wrong. I was running into this race condition with ltp running on an
upstream derived kernel and that led to a crash.
So ideally we should also fix the generic cgroup code to wait for an rcu
grace period before setting the pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down mount/remount/umount operations.
So for the time being at least fix the kernel crash by taking a more
direct route to the blkio_cgroup.
One tester had reported a crash while running LTP on a derived kernel,
and with this fix the crash is no longer seen while the test has been
running for over 6 days.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
bios fails due to the underlying device not supporting the discard request.
However, if the device is, for example, a dm device composed of devices
of which some support discard and some do not, it is ok
for some bios to fail with EOPNOTSUPP, but it does not mean that discard
is not supported at all.
This commit removes the check for bios failed with EOPNOTSUPP and changes
blkdev_issue_discard() to return operation not supported if and only if
the device does not actually support it, not just part of the device as
some bios might indicate.
This change also fixes problem with BLKDISCARD ioctl() which now works
correctly on such dm devices.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
CC: Jens Axboe <jaxboe@fusionio.com>
CC: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do
not need to check for -EOPNOTSUPP specifically in case of error. Also
there is no need to have the submit: label, because there is no way to jump
out of the while loop without an error and we really want to exit,
rather than try again. And also remove the check for (sz == 0) since at
that point sz can never be zero.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
CC: Dmitry Monakhov <dmonakhov@openvz.org>
CC: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we wait for every submitted REQ_DISCARD bio separately,
but this has the unwanted consequence of repeatedly flushing the queue,
so instead submit the bios in batches and wait for the entire batch,
narrowing the window in which other IO can slip in.
Use bio_batch_end_io() and struct bio_batch for that purpose, the same
as is used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we
always clear BIO_UPTODATE in the case of error, and remove the check for
bb, since we are the only user of this function and we always set it.
Remove bio_get()/bio_put() from blkdev_issue_discard() since
bio_alloc() and bio_batch_end_io() already do the same thing, so it is
not needed anymore.
I have done simple dd testing with surprising results. The script I have
used is:
for i in $(seq 10); do
echo $i
dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
sleep 5
done
/usr/bin/time -f %e ./blkdiscard /dev/sdc1
Running time of BLKDISCARD on the whole device:
with patch without patch
0.95 15.58
So we can see that in this artificial test the kernel with the patch
applied is approx 16x faster in discarding the device.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
CC: Dmitry Monakhov <dmonakhov@openvz.org>
CC: Jens Axboe <jaxboe@fusionio.com>
CC: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In some drives, flush requests are non-queueable: while a flush request is
running, normal read/write requests can't run. If the block layer dispatches
such a request, the driver can't handle it and has to requeue it. Tejun
suggested we can hold the queue while a flush is running, which avoids the
unnecessary requeue and can also improve performance. For example, say we
have requests flush1, write1, flush2. flush1 is dispatched, then the queue
is held and write1 isn't inserted into the queue. After flush1 finishes,
flush2 is dispatched. Since the disk cache is already clean, flush2 finishes
very soon, so it looks as if flush2 is folded into flush1.
In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0d:
block: make the flush insertion use the tail of the dispatch list
It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
which causes about 20% regression running a sysbench fileio
workload.
Stable: 2.6.39 only
Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Flush requests aren't queueable in some drives. Add a flag to let the driver
notify the block layer about this; with that knowledge we can optimize
flush performance.
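As a rough illustration (not taken from the patch itself), a driver whose
hardware cannot accept other commands while a flush is in flight could
advertise that at queue setup time roughly like this:

	/* Hedged sketch: mark this queue's flushes as non-queueable. */
	blk_queue_flush(q, REQ_FLUSH);		/* drive has a volatile write cache */
	blk_queue_flush_queueable(q, false);	/* flush blocks other commands */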
Stable: 2.6.39 only
Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
After the anticipatory scheduler was dropped, there was no need to
special-case the request_module string. As such, drop the redundant
sprintf and stack variable.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The unplug call has been replaced with blk_run_queue() in
blk_execute_rq_nowait(), so change the comment accordingly.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
DISK_EVENT_MEDIA_CHANGE is used both for the userland-visible event and
for the internal event that triggers revalidation of removable devices.
Some legacy drivers don't implement proper event detection and continuously
generate events under certain circumstances. For example, ide-cd
generates media-changed continuously if there's no media in the drive,
which can lead to an infinite loop of events bouncing back and forth
between the driver and the userland event handler.
This patch updates the disk event infrastructure so that it never
propagates events not listed in disk->events to userland. Those
events are processed the same way for internal purposes, but uevent
generation is suppressed.
This also ensures that userland only gets events which are advertised
in the @events sysfs node, lowering the risk of confusion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The sort insert is the one that goes to the IO scheduler. With
the SORT_MERGE addition, we could bypass IO scheduler setup
but still ask the IO scheduler to insert the request. This would
cause an oops on switching IO schedulers through the sysfs
interface, unless the disk just happened to be idle while it
occurred.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In queue_requests_store(), the code looks like:
	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
		blk_set_queue_full(q, BLK_RW_SYNC);
	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
		blk_clear_queue_full(q, BLK_RW_SYNC);
		wake_up(&rl->wait[BLK_RW_SYNC]);
	}
If we don't satisfy the "if" condition, then
rl->count[BLK_RW_SYNC] < q->nr_requests, which is the same as
rl->count[BLK_RW_SYNC]+1 <= q->nr_requests.
So everything reaching the "else" already satisfies the "else if" check,
and the check isn't actually needed.
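The simplified form then reads roughly as follows (a sketch of the result,
not the full patch):

	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
		blk_set_queue_full(q, BLK_RW_SYNC);
	} else {
		blk_clear_queue_full(q, BLK_RW_SYNC);
		wake_up(&rl->wait[BLK_RW_SYNC]);
	}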
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We do not call blk_trace_remove_sysfs() in the error return path
if kobject_add() fails. This patch fixes it.
Cc: stable@kernel.org
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We don't pass in a 'force_kblockd' anymore, get rid of the
stale comment.
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We are currently using this flag to check whether it's safe
to call into ->request_fn(). If it is set, we punt to kblockd.
But we get a lot of false positives and excessive punts to
kblockd, which hurts performance.
The only real abuser of this infrastructure is SCSI. So export
the async queue run and convert SCSI over to use that. There's
room for improvement in that SCSI need not always use the async
call, but this fixes our performance issue and they can fix that
up in due time.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
For some configurations of CONFIG_PREEMPT that is not true. So
get rid of __call_for_each_cic() and always use the explicitly
rcu_read_lock()-protected call_for_each_cic() instead.
This fixes a potential bug related to IO scheduler removal or
online switching.
Thanks to Paul McKenney for clarifying this.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With all drivers and file systems converted, we only have
in-core use of this function. So remove the export.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Instead of overloading __blk_run_queue() to force an offload to kblockd,
add a new blk_run_queue_async() helper to do it explicitly. I've kept
the blk_queue_stopped() check for now, but I suspect it's not needed,
as the check we do when the workqueue item runs should be enough.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If we know we are going to punt to kblockd, we can drop the queue
lock before calling into __blk_run_queue() since it only does a
safe bit test and a workqueue call. Since kblockd needs to grab
this very lock as one of the first things it does, it's a good
optimization to drop the lock before waking kblockd.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
MD can't use this since it really requires us to be able to
keep more than a single piece of state for the unplug. Commit
048c9374 added the required support for MD, so get rid of this
now unused code.
This reverts commit f75664570d.
Conflicts:
block/blk-core.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
md/raid requires an unplug callback, but as it does not use
requests the current code cannot provide one.
So allow arbitrary callbacks to be attached to the blk_plug.
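Very roughly, and assuming the struct and field names below (a blk_plug_cb
with 'list' and 'callback' members hanging off a cb_list on the plug), a
submitter would hook in along these lines:

	/*
	 * Hedged sketch of the callback hook: the submitter embeds a
	 * blk_plug_cb in its own state and links it onto the current plug;
	 * the block layer invokes ->callback when the plug is flushed,
	 * either explicitly or when the task schedules out.
	 */
	struct my_plug_cb {
		struct blk_plug_cb cb;	/* ->list and ->callback */
		void *owner;		/* hypothetical per-submitter state */
	};

	static void my_unplug(struct blk_plug_cb *cb)
	{
		struct my_plug_cb *mcb = container_of(cb, struct my_plug_cb, cb);
		/* kick whatever 'owner' batched while the plug was held */
	}

	/* attach while a plug is active: */
	mcb->cb.callback = my_unplug;
	list_add(&mcb->cb.list, &current->plug->cb_list);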
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time; in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
and an implicit unplug (timer unplug, i.e. we scheduled with pending IO
queued).
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
For the explicit unplugging, we'd prefer to kick things off
immediately and not pay the penalty of the latency to switch
to kblockd. So let blk_finish_plug() do the run inline, while
the implicit-on-schedule-out unplug will punt to kblockd.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It's a bit of a mess currently. task->plug is being cleared
and reset in __blk_finish_plug(), and blk_finish_plug() is
testing for a NULL plug which cannot happen even from schedule()
anymore since it uses blk_needs_flush_plug() to determine
whether to call into this function at all.
So get rid of some of the cruft.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In blk_register_queue(), the variable 'dev' is already assigned by
disk_to_dev(), so use it directly instead of calling disk_to_dev() again.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Modified by me to delete an empty line in the same function while
in there anyway.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are worries that we are now consuming a lot more stack in
some cases, since we potentially call into IO dispatch from
schedule() or io_schedule(). We can reduce this problem by moving
the running of the queue to kblockd, like the old plugging scheme
did as well.
This may or may not be a good idea from a performance perspective,
depending on how many tasks have queue plugs running at the same
time. For even the slightly contended case, doing just a single
queue run from kblockd instead of multiple runs directly from the
unpluggers will be faster.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The original use for this dates back to when we had to track write
requests for serializing around barriers. That's not needed anymore,
so kill it.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This was removed with the queue plug state. But we can easily re-add it
by checking whether this is the first request going to this queue. It's
good information to have when tracing, to see how effective the
plugging is.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
MD would like to know when a queue is unplugged, so it can flush
its bitmap writes. Add such a callback.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It was removed with the on-stack plugging; re-add it and track the
depth of requests added when flushing the plug.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the request_fn ends up blocking, we could be re-entering
the plug flush. Since the list is protected by explicitly
not allowing schedule events, this isn't a terribly good idea.
Additionally, it can cause us to recurse. As request_fn called by
__blk_run_queue is allowed to 'schedule()' (after dropping the queue
lock of course), it is possible to get a recursive call:
schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
-> __blk_run_queue -> request_fn -> schedule
We must make sure that the second schedule does not call into
blk_flush_plug again. So instead of leaving the list of requests on
blk_plug->list, move them to a separate list leaving blk_plug->list
empty.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The comparison function for list_sort() must be anticommutative,
otherwise it is not sorting in the ordinary sense.
But fortunately list_sort() only ever checks ((*cmp)(priv, a, b) <= 0),
i.e. it does not distinguish negative from zero, so the comparison function
can implement just less-or-equal instead of a full three-way comparison.
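For illustration, a less-or-equal style comparator of the kind this relies
on (a sketch; the sort key here is the request's queue pointer, as used when
sorting a plug list):

	/* returns <= 0 when 'a' should sort before or together with 'b' */
	static int plug_rq_cmp(void *priv, struct list_head *a,
			       struct list_head *b)
	{
		struct request *rqa = container_of(a, struct request, queuelist);
		struct request *rqb = container_of(b, struct request, queuelist);

		return !(rqa->q <= rqb->q);
	}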
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The current block integrity (DIF/DIX) support in DM is verifying that
all devices' integrity profiles match during DM device resume (which
is past the point of no return). To some degree that is unavoidable
(stacked DM devices force this late checking). But for most DM
devices (which aren't stacking on other DM devices) the ideal time to
verify all integrity profiles match is during table load.
Introduce the notion of an "initialized" integrity profile: a profile
that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
template. Add blk_integrity_is_initialized() to allow checking if a
profile was initialized.
Update DM integrity support to:
- check all devices with _initialized_ integrity profiles match
during table load; uninitialized profiles (e.g. for underlying DM
device(s) of a stacked DM device) are ignored.
- disallow a table load that would result in an integrity profile that
conflicts with a DM device's existing (in-use) integrity profile
- avoid clearing an existing integrity profile
- validate all integrity profiles match during resume; but if they
don't all we can do is report the mismatch (during resume we're past
the point of no return)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
xchg does not work portably with smaller than 32bit types.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Merge it with __elv_add_request(), it's pretty pointless to
have a function with only two callers. The main interface
is elv_add_request()/__elv_add_request().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we just dump a non-informative 'request botched' message.
Let's actually try and print something sane to help debug issues
around this.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When the queue work handler was converted to delayed work, the
stopping was inadvertently made sync as well. Change this back
to being async stop, using __cancel_delayed_work() instead of
cancel_delayed_work().
Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reported-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With the introduction of the on-stack plugging, we would assume
that any request being inserted was a normal file system request.
As flush/fua requires a special insert mode, this caused problems.
Fix this up by checking for this in flush_plug_list() and using
the appropriate insert mechanism.
Big thanks go to Markus Trippelsdorf for tirelessly testing
patches, and to Sergey Senozhatsky for helping find the real
issue.
Reported-by: Markus Tripplesdorf <markus@trippelsdorf.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
Remove the think time checking. A high-thinktime queue might mean the queue
dispatches several requests and then goes away. Limiting such a queue seems
meaningless. This also simplifies the code. Suggested by Vivek.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
For v2, I added back lines to cfq_preempt_queue() that were removed
during updates for accounting unaccounted_time. Thanks for pointing out
that I'd missed these, Vivek.
Previous commit "cfq-iosched: Don't set active queue in preempt" wrongly
cleared stats for preempting queues when it shouldn't have, because when
we choose a queue to preempt, it still isn't necessarily scheduled next.
Thanks to Vivek Goyal for figuring this out and understanding how the
preemption code works.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Lina reported that if throttle limits are initially very high and then
dropped, no new bio might be dispatched for a long time. The reason is
that after dropping the limits we don't reset the existing slice; we do
the rate calculation with the new low rate while accounting the bios
already dispatched at the high rate. To fix it, reset the slice upon rate change.
https://lkml.org/lkml/2011/3/10/298
Another problem with a very high limit is that we never queue the
bio on the throtl service tree. That means we keep extending the
group slice but never trim it. Fix that too by regularly
trimming the slice even if a bio is not being queued up.
Reported-by: Lina Lu <lulina_nuaa@foxmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This change moves unaccounted_time to only be reported when
CONFIG_DEBUG_BLK_CGROUP is true.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit "Add unaccounted time to timeslice_used" changed the behavior of
cfq_preempt_queue to set cfqq active. Vivek pointed out that other
preemption rules might get involved, so we shouldn't manually set which
queue is active.
This cleans up the code to just clear the queue stats at preemption
time.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
One of the disadvantages of on-stack plugging is that we potentially
lose out on merging since all pending IO isn't always visible to
everybody. When we flush the on-stack plugs, right now we don't do
any checks to see if potential merge candidates could be utilized.
Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
It works just like ELEVATOR_INSERT_SORT, but first checks whether we can
merge with an existing request; the insertion is only done if merging
fails.
This fixes a regression with multiple processes issuing IO that
can be merged.
Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
an accounting bug.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits)
[SCSI] scsi_dh_rdac: Add MD36xxf into device list
[SCSI] scsi_debug: add consecutive medium errors
[SCSI] libsas: fix ata list corruption issue
[SCSI] hpsa: export resettable host attribute
[SCSI] hpsa: move device attributes to avoid forward declarations
[SCSI] scsi_debug: Logical Block Provisioning (SBC3r26)
[SCSI] sd: Logical Block Provisioning update
[SCSI] Include protection operation in SCSI command trace
[SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try)
[SCSI] target: Fix volume size misreporting for volumes > 2TB
[SCSI] bnx2fc: Broadcom FCoE offload driver
[SCSI] fcoe: fix broken fcoe interface reset
[SCSI] fcoe: precedence bug in fcoe_filter_frames()
[SCSI] libfcoe: Remove stale fcoe-netdev entries
[SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h
[SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument
[SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs
[SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out"
[SCSI] libfc: Fixing a memory leak when destroying an interface
[SCSI] megaraid_sas: Version and Changelog update
...
Fix up trivial conflicts due to whitespace differences in
drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}
Version 3 is updated to apply to for-2.6.39/core.
For version 2, I took Vivek's advice and made sure we update the group
weight from cfq_group_service_tree_add().
If a weight was updated while a group is on the service tree, the
calculation for the total weight of the service tree can be adjusted
improperly, which either leads to bad service tree weights, or
potentially crashes (if total_weight becomes 0).
This patch defers updates to the weight until a group is off the service
tree.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are two kinds of time that tasks are not charged for: the first
seek and the extra time slice used over the allocated timeslice. Both
of these are exported as a new unaccounted_time stat.
I think it would be good to have this reported in 'time' as well, but
that is probably a separate discussion.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The barrier code has already been removed, so remove the obsolete comments
in blkdev_issue_zeroout().
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
BZ29402
https://bugzilla.kernel.org/show_bug.cgi?id=29402
We can hit a serious mis-synchronization in the bio completion path of
blkdev_issue_zeroout(), leading to a panic.
The problem is that before calling wait_for_completion() in
blkdev_issue_zeroout() we check whether bb.done equals issued (the number
of submitted bios). If it does, we can skip the wait_for_completion()
and just exit the function, since there is nothing to wait for.
However, there is an ordering problem: bio_batch_end_io() calls
atomic_inc(&bb->done) before complete(), so it might appear to
blkdev_issue_zeroout() that all bios have been completed, and it exits. At
the point when bio_batch_end_io() then calls complete(bb->wait),
bb and its wait completion no longer exist, since they were allocated on
the stack in blkdev_issue_zeroout() ==> panic!
(thread 1) (thread 2)
bio_batch_end_io() blkdev_issue_zeroout()
if(bb) { ...
if (bb->end_io) ...
bb->end_io(bio, err); ...
atomic_inc(&bb->done); ...
... while (issued != atomic_read(&bb.done))
... (let issued == bb.done)
... (do the rest of the function)
... return ret;
complete(bb->wait);
^^^^^^^^
panic
We can fix this easily by simplifying bio_batch and completion counting.
Also remove bio_end_io_t *end_io since it is not used.
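A sketch of the simplified counting (assuming the bio_batch layout described
above): start ->done at 1 so the submitter effectively holds the last
reference, and only the final decrement, whichever side performs it, fires
the completion:

	/* completion side */
	static void bio_batch_end_io(struct bio *bio, int err)
	{
		struct bio_batch *bb = bio->bi_private;

		if (err && (err != -EOPNOTSUPP))
			clear_bit(BIO_UPTODATE, &bb->flags);
		if (atomic_dec_and_test(&bb->done))
			complete(bb->wait);
		bio_put(bio);
	}

	/* submitter side */
	atomic_set(&bb.done, 1);
	/* ... submit bios, doing atomic_inc(&bb.done) for each ... */
	if (!atomic_dec_and_test(&bb.done))
		wait_for_completion(&wait);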
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Eric Whitney <eric.whitney@hp.com>
Tested-by: Eric Whitney <eric.whitney@hp.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Use a plug in throttle dispatch as well, since we are dispatching a bunch of
bios in throttle context and some of them might merge.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch adds support for creating a queuing context outside
of the queue itself. This enables us to batch up pieces of IO
before grabbing the block device queue lock and submitting them to
the IO scheduler.
The context is created on the stack of the process and assigned in
the task structure, so that we can auto-unplug it if we hit a schedule
event.
The current queue plugging happens implicitly if IO is submitted to
an empty device, yet callers have to remember to unplug that IO when
they are going to wait for it. This is an ugly API and has caused bugs
in the past. Additionally, it requires hacks in the vm (->sync_page()
callback) to handle that logic. By switching to an explicit plugging
scheme we make the API a lot nicer and can get rid of the ->sync_page()
hack in the vm.
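The resulting API, in rough outline, looks like this for a submitter:

	struct blk_plug plug;

	blk_start_plug(&plug);		/* plug lives on the caller's stack */
	/* ... submit a batch of IO; requests are held back on the plug ... */
	blk_finish_plug(&plug);		/* explicit unplug; scheduling out
					 * would also flush it implicitly */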
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, disk_unblock_events() implicitly kicks an event check if the
block count reaches zero. This behavior is not described in the
comment and hinders future changes. Make the unblocker
explicitly check events by calling disk_check_events() as necessary.
This patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
We've found that we still get good, useful isolation at weights this
low. I'd like to adjust the minimum so that any other changes can take
these values into account.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When throttle group limits are updated through cgroups, a thread is
woken up to process these updates. While reviewing that code, Oleg noted
that a couple of race conditions existed in the code, and he also suggested
that the code could be simplified.
This patch fixes the races and simplifies the code based on Oleg's suggestions:
- Use xchg().
- Introduce a common function throtl_update_blkio_group_common()
which is now shared by all iops/bps update functions.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixed a merge issue, throtl_schedule_delayed_work() takes throtl_data
as the argument now, not the queue.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With the help of the cgroup interface one can update the bps/iops
limits of an existing group. Once the limits are updated, a thread is
woken up to see if some blocked group needs recalculation based on the new
limits and needs to be requeued.
There was also a piece of code that checked for a group limit
update when a fresh bio comes in. This patch gets rid of that piece of
code and keeps processing the limit change in one place,
throtl_process_limit_change(). It just keeps the code simple and easy
to understand.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The update_vdisktime logic has been broken since commit
b54ce60eb7: st->min_vdisktime never makes
progress. Fix it.
Thanks to Vivek for pointing it out.
Signed-off-by: Gui Jianfeng <guijianfen@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If there are a sync and an async queue and the sync queue's think time
is small, we can ignore the sync queue's dispatch quantum. Because the
sync queue will always preempt the async queue, we don't need to care
about async's latency. This can fix a performance regression of
aiostress test, which is introduced by commit f8ae6e3eb8. The issue
should exist even without the commit, but the commit amplifies the
impact.
The initial post did the same optimization for the RT queue too, but since
I have no real workload for it, Vivek suggested dropping it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This merge creates two sets of conflicts. One is simple context
conflicts caused by the removal of throtl_scheduled_delayed_work() in
for-linus and the removal of throtl_shutdown_timer_wq() in
for-2.6.39/core.
The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
call directly into q->request_fn() __blk_run_queue()) in for-linus
clashing with the FLUSH reimplementation in for-2.6.39/core. The conflict
isn't trivial but the resolution is straightforward.
* __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
should be called with @force_kblockd set to %true.
* elv_insert() in blk_kick_flush() should use
%ELEVATOR_INSERT_REQUEUE.
Both changes are to avoid invoking ->request_fn() directly from
request completion path and closely match the changes in the commit
255bb490c8.
Signed-off-by: Tejun Heo <tj@kernel.org>
Move blk_throtl_exit() into blk_cleanup_queue(), as blk_throtl_exit() is
written in such a way that it needs the queue lock. In blk_release_queue()
there is no guarantee that ->queue_lock is still around.
Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
one problem.
https://lkml.org/lkml/2010/10/23/86
And a quick fix moved blk_throtl_exit() to blk_release_queue().
commit 7ad58c0286
Author: Jens Axboe <jaxboe@fusionio.com>
Date: Sat Oct 23 20:40:26 2010 +0200
block: fix use-after-free bug in blk throttle code
This patch reverts the above change and does not try to shut down the
throtl work in blk_sync_queue(). By avoiding the call to
throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
the problem reported by Ingo.
blk_sync_queue() seems to be used only by the md driver, which appears to
use it to make sure q->unplug_fn is not called, as md registers its
own unplug functions and is about to free up the data structures
used by unplug_fn(). Block throttling does not call back into unplug_fn()
or into md, so there is no need to cancel the blk throttle work.
In fact I think cancelling the block throttle work is bad, because it might
happen that some bios are throttled and scheduled to be dispatched later
by the pending work; if the work is cancelled, these bios might
never be dispatched.
Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
blk_release_queue() time. That should be safe as we are also calling
blk_throtl_exit() which should make sure all the throttling related
data structures are cleaned up.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There does not seem to be a clear convention on whether q->queue_lock is
initialized or not when blk_cleanup_queue() is called. In the past it
was not necessary, but now blk_throtl_exit() takes the queue lock by
default and needs the queue lock to be available.
In fact elevator_exit() code also has similar requirement just that it
is less stringent in the sense that elevator_exit() is called only if
elevator is initialized.
Two problems have been noticed because of ambiguity about spin lock
status.
- If a driver calls blk_alloc_queue() and then calls
blk_cleanup_queue() almost immediately (because some other
driver structure allocation failed or some other error happened),
then blk_throtl_exit() will run into issues as the queue lock is not
initialized. The loop driver ran into this issue recently and I
noticed such error paths in the md driver too. Similar error paths
probably exist in other drivers as well.
- If some driver provided an external spin lock and zapped the lock
before blk_cleanup_queue(), it can lead to issues.
So this patch initializes the default queue lock at queue allocation time.
The block throttling code is one of the users of the queue lock and is set
up at queue allocation time, so it makes sense to also initialize
->queue_lock to the internal lock at that point. A driver can override that
lock later. This takes care of the issue, so a driver no longer has
to worry about initializing the queue lock to the default before calling
blk_cleanup_queue().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Rename the numerals in the diskstats_show() into the macros.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk-flush decomposes a flush into sequence of multiple requests. On
completion of a request, the next one is queued; however, block layer
must not implicitly call into q->request_fn() directly from completion
path. This makes the queue behave unexpectedly when seen from the
drivers and violates the assumption that q->request_fn() is called
with process context + queue_lock.
This patch makes the following two changes to blk-flush to make sure
q->request_fn() is not called directly from the request completion path.
- blk_flush_complete_seq_end_io() now asks __blk_run_queue() to always
use kblockd instead of calling directly into q->request_fn().
- queue_next_fseq() uses ELEVATOR_INSERT_REQUEUE instead of
ELEVATOR_INSERT_FRONT so that elv_insert() doesn't try to unplug the
request queue directly.
Reported by Jan in the following threads:
http://thread.gmane.org/gmane.linux.ide/48778
http://thread.gmane.org/gmane.linux.ide/48786
stable: applicable to v2.6.37.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
__blk_run_queue() automatically either calls q->request_fn() directly
or schedules kblockd depending on whether the function is recursed.
blk-flush implementation needs to be able to explicitly choose
kblockd. Add @force_kblockd.
All the current users are converted to specify %false for the
parameter and this patch doesn't introduce any behavior change.
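In other words (a sketch of the call sites after this change):

	__blk_run_queue(q, false);	/* existing callers: old behaviour */
	__blk_run_queue(q, true);	/* blk-flush completion: always punt
					 * the actual queue run to kblockd */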
stable: This is prerequisite for fixing ide oops caused by the new
blk-flush implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Effectively, make group_isolation=1 the default and remove the tunable.
The setting group_isolation=0 was because by default we idle on
sync-noidle tree and on fast devices, this can be very harmful for
throughput.
However, this problem can also be addressed by tuning slice_idle and
possibly group_idle on faster storage devices.
This change simplifies the CFQ code by removing the feature entirely.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Dominik Klein reported a system hang issue while doing some blkio
throttling testing.
https://lkml.org/lkml/2011/2/24/173
o Some tracing revealed that CFQ was not dispatching any more jobs because
the queue unplug was not happening. And the queue unplug was not happening
because the unplug work was not being called, as there was one throttling
work item on the same cpu which had not finished yet. And the throttling
work had not finished because it was trying to dispatch a bio to CFQ but
all the request descriptors were consumed, so it was put to sleep.
o So basically it is a cyclic dependency between the CFQ unplug work and
the throtl dispatch work. Tejun suggested using a separate workqueue for
such cases.
o This patch uses a separate workqueue for throttle-related work and
does not rely on the kblockd workqueue anymore.
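A minimal sketch of such a dedicated workqueue, with the name 'kthrotld'
used here purely for illustration:

	/* Hedged sketch: throttle dispatch gets its own workqueue so it can
	 * never block behind (or deadlock against) kblockd's unplug work. */
	static struct workqueue_struct *kthrotld_workqueue;

	static int __init throtl_init(void)
	{
		kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
		if (!kthrotld_workqueue)
			panic("Failed to create kthrotld\n");
		return 0;
	}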
Cc: stable@kernel.org
Reported-by: Dominik Klein <dk@in-telegence.net>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://neil.brown.name/md:
md: Fix - again - partition detection when array becomes active
Fix over-zealous flush_disk when changing device size.
md: avoid spinlock problem in blk_throtl_exit
md: correctly handle probe of an 'mdp' device.
md: don't set_capacity before array is active.
md: Fix raid1->raid0 takeover
Adam Kovari and others reported that disconnecting a USB drive with
an ntfs-3g filesystem would cause "kernel BUG at fs/inode.c:1421!" to
be triggered.
The BUG could be traced back to ioctl(BLKBSZSET), which would
erroneously decrement the refcount on the bdev. This is because
blkdev_get() expects the refcount to be already incremented and either
returns success or decrements the refcount and returns an error.
The bug was introduced by e525fd89 (block: make blkdev_get/put()
handle exclusive access), which didn't take into account this behavior
of blkdev_get().
This fixes
https://bugzilla.kernel.org/show_bug.cgi?id=29202
(and likely 29792 too)
Reported-by: Adam Kovari <kovariadam@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data we hold becomes irrelevant.
In the other, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.
In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.
In the former case it also makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it does *not* make sense to discard dirty buffers
as that will lead to filesystem corruption when you simply enlarge
the containing device.
flush_disk calls __invalidate_device.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.
invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.
invalidate_bdev does *not* discard dirty pages, but I don't really care
about that at present.
So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.
flush_disk then passes true from check_disk_change and false from
check_disk_size_change.
dm avoids tripping over this problem by calling i_size_write directly
rather than using check_disk_size_change.
md does use check_disk_size_change and so is affected.
This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.
Cc: stable@kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
Classify severity of I/O errors for target, nexus, and
transport errors.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Flush requests are never put on the IO scheduler. Convert request
structure's elevator_private* into an array and have the flush fields
share a union with it.
Reclaim the space lost in 'struct request' by moving 'completion_data'
back in the union with 'rb_node'.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Skip elevator initialization for flush requests by passing priv=0 to
blk_alloc_request() in get_request(). As such elv_set_request() is
never called for flush requests.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cdrom: support devices that have check_events but not media_changed
cfq-iosched: Don't wait if queue already has requests.
blkio-throttle: Avoid calling blkiocg_lookup_group() for root group
cfq: rename a function to give it more appropriate name
cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily.
drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y
loop: queue_lock NULL pointer derefence in blk_throtl_exit
drivers/block/Makefile: replace the use of <module>-objs with <module>-y
blktrace: Don't output messages if NOTIFY isn't set.
Commit 7667aa0630 added logic to wait for
the last queue of the group to become busy (have at least one request),
so that the group does not lose out for not being continuously
backlogged. The commit did not check for the condition that the last
queue already has some requests. As a result, if the queue already has
requests, wait_busy is set. Later on, cfq_select_queue() checks the
flag, and decides that since the queue has a request now and wait_busy
is set, the queue is expired. This results in early expiration of the
queue.
This patch fixes the problem by adding a check to see if queue already
has requests. If it does, wait_busy is not set. As a result, time slices
do not expire early.
The queues with more than one request are usually buffered writers.
Testing shows improvement in isolation between buffered writers.
Cc: stable@kernel.org
Signed-off-by: Justin TerAvest <teravest@google.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining. As such, sequencing is done
queue-wide one flush request after another. However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.
This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually. The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue. Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.
This allows arbitrary merging of different types of flushes. How they
are merged can primarily be controlled and tuned by adjusting the
above-mentioned 'conditions' used to determine when to issue the next
flush.
This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.
* As flush requests are never put on the IO scheduler, request fields
used for flush share space with rq->rb_node. rq->completion_data is
moved out of the union. This increases the request size by one
pointer.
As rq->elevator_private* are used only by the iosched too, it is
possible to reduce the request size further. However, to do that,
we need to modify request allocation path such that iosched data is
not allocated for flush requests.
* FLUSH/FUA processing happens on insertion now instead of dispatch.
- Comments updated as per Vivek and Mike.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bio's for flush are completed twice - once during the data phase and
one more time after the whole sequence is complete. The first
completion shouldn't notify completion to the issuer.
This was achieved by skipping all bio completion steps in
req_bio_endio() for the first completion; however, this has two
drawbacks.
* Error is not recorded in bio and must be tracked somewhere else.
* Partial completion is not supported.
Both don't cause problems for the current users; however, they make
further improvements difficult. Change req_bio_endio() such that it
only skips the actual notification part for the first completion. bio
completion is implemented with partial completions in mind anyway, so
this is as simple as moving the REQ_FLUSH_SEQ conditional such that
only the call to bio_endio() is skipped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
rq == &q->flush_rq was used to determine whether a rq is part of a
flush sequence, which worked because all requests in a flush sequence
were sequenced using the single dedicated request. This is about to
change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
requests.
This patch doesn't cause any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.
This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel. A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).
Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o Jeff Moyer was doing some testing on a RAM backed disk and
blkiocg_lookup_group() showed up high overhead after memcpy(). Similarly
somebody else reported that blkiocg_lookup_group() is eating 6% extra
cpu. Though looking at the code I can't think why the overhead of
this function is so high. One thing is that it is called with very high
frequency (once for every IO).
o For a lot of folks the blkio controller will be compiled in but they might
not have actually created cgroups. Hence optimize the root cgroup case:
we can avoid calling blkiocg_lookup_group() if the IO is happening
in the root group (the common case).
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Rename a function to give it a more appropriate name. We are calculating
the cfq queue slice, but the function name gives the impression that the
cfq group slice length is being calculated.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If a queue is preempted before it gets a slice assigned, the queue doesn't get
any compensation, which looks unfair. For such a queue, we compensate it with
a whole slice.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I got this:
fio-874 [007] 2157.724514: 8,32 m N cfq874 preempt
fio-874 [007] 2157.724519: 8,32 m N cfq830 slice expired t=1
fio-874 [007] 2157.724520: 8,32 m N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0
fio-874 [007] 2157.724521: 8,32 m N cfq830 set_active wl_prio:0 wl_type:0
fio-874 [007] 2157.724522: 8,32 m N cfq830 Not idling. st->count:1
cfq830 is an async queue, and it is preempted by a sync queue, cfq874. But
since we have the cfqg->saved_workload_slice mechanism, the preempt is a nop.
It looks like our preemption is currently totally broken if the two queues
are not from the same workload type.
The patch below fixes it. This might cause async queue starvation, but it's
what our old code did before cgroup support was added.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
cfq_group->ref is used with queue_lock held; the only exception is
cfq_set_request, which looks like a bug to me. So ref doesn't need
to be atomic, and the atomic operation is slower.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
cfq_queue->ref is used with queue_lock held, so ref doesn't need to be atomic,
and the atomic operation is slower.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We can't use krefs since they're apparently restricted to very basic
reference counting.
This reverts commit e4a683c8.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
The reason is incorrect accounting of hd_struct->in_flight when a bio is
merged into a request belonging to a different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belonging to sda1 is issued and is merged into the request mentioned
in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request is changed
from the sda2 region to the sda1 region. However, the two partitions'
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight should be decremented, but sda1's is decremented instead.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
kblockd is used for unplugging and may affect IO latency and
throughput, and the max number of concurrent work items is bounded by
the number of block devices. Make it a HIGHPRI workqueue w/ default max
concurrency.
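Concretely, that amounts to creating kblockd along these lines (a sketch,
not the exact hunk):

	kblockd_workqueue = alloc_workqueue("kblockd",
					    WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	if (!kblockd_workqueue)
		panic("Failed to create kblockd\n");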
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
This patch fixes a spelling error in a source code comment and removes
superfluous braces in the function exit_io_context().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix cciss_revalidate panic
block: max hardware sectors limit wrapper
block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead
blk-throttle: Correct the placement of smp_rmb()
blk-throttle: Trim/adjust slice_end once a bio has been dispatched
block: check for proper length of iov entries earlier in blk_rq_map_user_iov()
drbd: fix for spin_lock_irqsave in endio callback
drbd: don't recvmsg with zero length
The major/minor device numbers are always defined and used as `unsigned'.
Signed-off-by: Yang Zhang <kthreadd@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When cfq_choose_cfqg() is called in select_queue(), there must be at least one
backlogged CFQ queue waiting for dispatching, hence there must be at least one
backlogged CFQ group on service tree. So we never call choose_service_tree()
with cfqg == NULL.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Implement blk_limits_max_hw_sectors() and make
blk_queue_max_hw_sectors() a wrapper around it.
DM needs this to avoid setting queue_limits' max_hw_sectors and
max_sectors directly. dm_set_device_limits() now leverages
blk_limits_max_hw_sectors() logic to establish the appropriate
max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was
incorrectly setting max_sectors rather than max_hw_sectors (which
caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
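A sketch of how a stacking target could use the new helper on a queue_limits
it is building (the function name here is hypothetical):

	static void my_set_device_limits(struct queue_limits *limits,
					 struct block_device *bdev)
	{
		struct request_queue *q = bdev_get_queue(bdev);

		/* clamp max_hw_sectors in 'limits' via the shared helper,
		 * instead of poking max_sectors/max_hw_sectors directly */
		blk_limits_max_hw_sectors(limits, queue_max_hw_sectors(q));
	}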
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.
There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.
The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.
Reported-by: Ed Lin <ed.lin@promise.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows, which only issues a
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and return it if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called concurrently (a sketch of a converted driver follows after this list).
* gendisk->events and ->async_events are added. These should be
initialized by the block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are 'clearing' events. For example, both of the currently
defined events are cleared after the device has been successfully
opened. This information is passed to the ->check_events() callback
using the @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
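A hedged driver-side sketch of how the new hooks fit together; struct mydev,
mydev_media_changed() and mydev_eject_pressed() are made-up names used purely
for illustration:

#include <linux/module.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>

static unsigned int my_disk_check_events(struct gendisk *disk,
					 unsigned int clearing)
{
	struct mydev *dev = disk->private_data;
	unsigned int events = 0;

	/* poll the hardware; only report events declared in disk->events */
	if (mydev_media_changed(dev))
		events |= DISK_EVENT_MEDIA_CHANGE;
	if (mydev_eject_pressed(dev))
		events |= DISK_EVENT_EJECT_REQUEST;

	return events;
}

static const struct block_device_operations my_fops = {
	.owner		= THIS_MODULE,
	.check_events	= my_disk_check_events,
};

/* before add_disk(): declare supported and async-capable events */
static void my_disk_setup(struct gendisk *disk)
{
	disk->events = DISK_EVENT_MEDIA_CHANGE | DISK_EVENT_EJECT_REQUEST;
	disk->async_events = 0;		/* everything needs polling */
	disk->fops = &my_fops;
}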
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There's no reason for register_disk() and del_gendisk() to be in
fs/partitions/check.c. Move both to genhd.c. While at it, collapse
unlink_gendisk(), which was artificially in a separate function due to
genhd.c / check.c split, into del_gendisk().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If priority is changed, continuing to check workload_expires and service tree
count of the previous workload does not make sense. We should always choose
the workload with lowest key of new priority in such case.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch corrects an issue in bsg that results in a general protection
fault if an LLD is removed while an application is using an open file
handle to a bsg device, and the application issues an ioctl. The fault
occurs because the class_dev is NULL, having been cleared in
bsg_unregister_queue() when the driver was removed. With this
patch, a check is made for the class_dev, and the application
will receive ENXIO if the related object is gone.
Signed-off-by: Carl Lajeunesse <carl.lajeunesse@emulex.com>
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
o While discussing which variables are updated without a spin lock and
why we need barriers, Oleg pointed out that the smp_rmb() should sit
between the read of td->limits_changed and the read of tg->limits_changed.
This patch fixes it.
o Following is one possible sequence of events. Say cpu0 is executing
throtl_update_blkio_group_read_bps() and cpu1 is executing
throtl_process_limit_change().
cpu0:
  tg->limits_changed = true;
  smp_mb__before_atomic_inc();
  atomic_inc(&td->limits_changed);
cpu1:
  if (!atomic_read(&td->limits_changed))
          return;
  if (tg->limits_changed)
          do_something;
If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
make sure that if update to td->limits_changed is visible on cpu1, then
update to tg->limits_changed should also be visible.
Oleg pointed out that to ensure this we need an smp_rmb() between the
td->limits_changed read and the tg->limits_changed read (a sketch follows
below).
o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed).
This patch fixes it.
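A minimal sketch of the corrected read-side ordering, using the field names
from the description above; the surrounding function body is illustrative:

static void process_limit_change(struct throtl_data *td, struct throtl_grp *tg)
{
	if (!atomic_read(&td->limits_changed))
		return;

	/* pairs with smp_mb__before_atomic_inc() on the update side */
	smp_rmb();

	if (tg->limits_changed) {
		tg->limits_changed = false;
		/* recalculate the group's dispatch time with the new limits */
	}
}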
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o During some testing I did the following and noticed that throttling stops working.
- Put a very low limit on a cgroup, say 1 byte per second.
- Start some reads, this will set slice_end to a very high value.
- Change the limit to a higher value, say 1MB/s.
- Now IO unthrottles and finishes as expected.
- Try to do the read again but IO is not limited to 1MB/s as expected.
o What is happening:
- Initially the low value of the limit sets slice_end to a very high value.
- When the limit is updated, slice_end is not truncated.
- A very high value of slice_end keeps the existing slice valid for a
very long time and a new slice does not start.
- tg_may_dispatch() is called in blk_throtl_bio(), and trim_slice()
is not called in this path. So slice_start is some old value and
practically we are able to do a huge amount of IO.
o There are many ways it can be fixed. I have fixed it by adjusting/cleaning
up slice_end in trim_slice(). Generally we extend slices if a bio is big
and can't be dispatched in one slice. After dispatch of the bio, readjust
slice_end to make sure we don't end up with huge values.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Whether a CFQ group is on a service tree can be determined by checking
"cfqg->rb_node". There's no need to maintain an extra flag here.
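For illustration, the check can be expressed as a small helper along these
lines (a sketch; the actual patch may simply open-code it):

#include <linux/rbtree.h>

static inline bool cfqg_on_service_tree(struct cfq_group *cfqg)
{
	/* RB_CLEAR_NODE() is done on dequeue, so an empty node
	 * means the group is not on any service tree */
	return !RB_EMPTY_NODE(&cfqg->rb_node);
}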
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When a cfq group is running, it won't be dequeued from the service tree, so
there's no need to store the active one in st->active. Just get rid of it.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
commit 9284bcf checks for proper length of iov entries in
blk_rq_map_user_iov(). But if the map is unaligned, the kernel
will break out of the loop without checking the length.
So we need to check the length before the alignment check.
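Roughly, the validation loop needs to look like this (an illustrative fragment
wrapped in a helper, not the exact upstream code):

static int validate_iov(struct request_queue *q,
			const struct iovec *iov, int iov_count)
{
	int i, unaligned = 0;

	for (i = 0; i < iov_count; i++) {
		unsigned long uaddr = (unsigned long)iov[i].iov_base;

		/* reject zero-length entries before any early exit
		 * taken for misaligned buffers */
		if (!iov[i].iov_len)
			return -EINVAL;

		if (uaddr & queue_dma_alignment(q))
			unaligned = 1;
	}

	return unaligned;
}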
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix build for PROC_FS disabled
block: fix amiga and atari floppy driver compile warning
blk-throttle: Fix calculation of max number of WRITES to be dispatched
ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests
xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER
xen/blkfront: change blk_shadow.request to proper pointer
xen/blkfront: map REQ_FLUSH into a full barrier
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o Allow hierarchical cgroup creation for blkio controller
o Currently we disallow it as both the io controller policies (throttling
as well as proportional bandwidth) do not support hierarchical accounting
and control. But the flip side is that the blkio controller cannot be used
with libvirt, as libvirt creates a cgroup hierarchy deeper than 1 level.
<top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
o So this patch will allow creation of cgroup hierarchies, but at the backend
everything will be treated as flat. So if somebody creates a hierarchy
like the following:
root
/ \
test1 test2
|
test3
CFQ and throttling will practically treat all groups at the same level:
pivot
/ | \ \
root test1 test2 test3
o Once we have actual support for hierarchical accounting and control
then we can introduce another cgroup tunable file "blkio.use_hierarchy"
which will be 0 by default, but if the user wants to enforce hierarchical
control then it can be set to 1. This way there should not be any
ABI problems down the line.
o The only not so pretty part is the introduction of the extra file
"use_hierarchy" down the line. Kame-san had mentioned that hierarchical
accounting is expensive in the memory controller, hence they keep it off
by default. I suspect the same will be the case for the IO controller also,
as for each IO completion we shall have to account IO through the hierarchy
up to the root. If so, then it probably is not a bad idea to introduce this
extra file so that it will be used only when somebody needs it, and some
people might enable hierarchy only in part of the hierarchy.
o This is basically how the memory controller also uses "use_hierarchy";
it too allowed creation of hierarchies when actual backend support
was not available.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently we try to dispatch more READS and less WRITES (75%, 25%) in one
dispatch round. ummy pointed out that there is a bug in max_nr_writes
calculation. This patch fixes it.
Reported-by: ummy y <yummylln@yahoo.com.cn>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access, as an
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open is for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
A @holder argument is added and, if @FMODE_EXCL is specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at open time (as
they should be). This cleans up the code and removes some corner cases
which, valid or not, were unnecessary all the same.
open_bdev_exclusive() and open_by_devnum() could use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straightforward. The drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
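A hedged example of the resulting in-kernel usage; error handling is minimal
and the holder cookie is whatever unique pointer the claimer owns:

#include <linux/fs.h>
#include <linux/blkdev.h>

static int use_bdev_exclusively(struct block_device *bdev, void *holder)
{
	int ret;

	/* open and claim in one atomic step; fails if another
	 * exclusive holder already exists */
	ret = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, holder);
	if (ret)
		return ret;

	/* ... do IO ... */

	/* passing FMODE_EXCL on put releases the exclusive claim and,
	 * for the last claim, removes the holder/slave symlinks */
	blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
	return 0;
}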
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
at this point is:
- various checks inside the block layer.
- sanity checks in bio based drivers.
- now unused bio_empty_barrier helper.
- Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's been dead for a while,
but Xen really needs to sort out its barrier situation.
- setting of ordered tags in uas - dead code copied from old scsi
drivers.
- scsi different retry for barriers - it's dead and should have been
removed when flushes were converted to FS requests.
- blktrace handling of barriers - removed. Someone who knows blktrace
better should add support for REQ_FLUSH and REQ_FUA, though.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The hd_geometry structure is copied to userland with 4 padding bytes
between the cylinders and start fields left uninitialized on 64-bit
platforms. This leaks the contents of kernel stack memory.
Currently there is no memset() in real implementations of getgeo()
in drivers/block/, so it makes sense to have a memset() in blkdev_ioctl().
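The HDIO_GETGEO path in blkdev_ioctl() then looks roughly like this (a sketch
of the fixed flow; ret, disk, bdev and argp are the ioctl handler's own
variables):

struct hd_geometry geo;

/* zero everything, including the 64-bit padding holes,
 * before the driver fills in its fields */
memset(&geo, 0, sizeof(geo));
geo.start = get_start_sect(bdev);
ret = disk->fops->getgeo(bdev, &geo);
if (ret)
	return ret;
if (copy_to_user(argp, &geo, sizeof(geo)))
	return -EFAULT;
return 0;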
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Convert direct reads of an inode's i_size to using i_size_read().
i_size_{read,write} use a seqcount to protect reads from accessing
incomplete writes. Concurrent i_size_write()s require mutual exclusion
to protect the seqcount that is used by i_size_{read,write}. But
i_size_read() callers do not need to use additional locking.
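For example (a sketch of the pattern this conversion uses; bdev and new_size
are illustrative):

/* readers: no extra locking needed, the seqcount retries internally */
loff_t size = i_size_read(bdev->bd_inode);

/* writers: must already be serialized against each other, e.g. under
 * bd_mutex, before calling i_size_write() */
i_size_write(bdev->bd_inode, new_size);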
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Ensure that we pass down properly validated iov segments before
calling into the mapping or copy functions.
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek suggests we don't need to schedule a dispatch when an idle queue
becomes nonidle. And he is right, cfq_should_preempt already covers
the logic.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If a deep-seek queue delivers requests slowly but the disk is much faster,
idling for the queue just wastes disk throughput. If the queue delivers all
its requests before half its slice is used, the patch disables idling for it.
In my test the application delivers 32 requests at a time, the disk can accept
128 requests at maximum and the disk is fast. Without the patch the throughput
is just around 30MB/s, while with it the speed is about 80MB/s. The disk is
an SSD, but is detected as a rotational disk. I could configure it as an SSD,
but I thought the deep-seek queue logic should be fixed too, for example
considering a fast RAID.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
A queue is idle at cfq_dispatch_requests(), but it becomes noidle later.
Unless another task explicitly does an unplug or all requests are drained,
we will not deliver requests to the disk even though cfq_arm_slice_timer
doesn't make the queue idle. For example, cfq_should_idle() returns true
because of
service_tree->count == 1, and then other queues are added. Note, I didn't
see obvious performance impacts so far with the patch, but just thought
this could be a problem.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This reverts commit 7681bfeecc.
Conflicts:
include/linux/genhd.h
It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk_throtl_exit() frees the throttle data hanging off the queue
in blk_cleanup_queue(), but blk_put_queue() will indirectly
dereference this data when calling blk_sync_queue(), which in
turn calls throtl_shutdown_timer_wq().
Fix this by moving the freeing of the throttle data to when
the queue is truly being released, and post the call to
blk_sync_queue().
Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits)
driver core: Display error codes when class suspend fails
Driver core: Add section count to memory_block struct
Driver core: Add mutex for adding/removing memory blocks
Driver core: Move find_memory_block routine
hpilo: Despecificate driver from iLO generation
driver core: Convert link_mem_sections to use find_memory_block_hinted.
driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted.
kobject: Introduce kset_find_obj_hinted.
driver core: fix build for CONFIG_BLOCK not enabled
driver-core: base: change to new flag variable
sysfs: only access bin file vm_ops with the active lock
sysfs: Fail bin file mmap if vma close is implemented.
FW_LOADER: fix kconfig dependency warning on HOTPLUG
uio: Statically allocate uio_class and use class .dev_attrs.
uio: Support 2^MINOR_BITS minors
uio: Cleanup irq handling.
uio: Don't clear driver data
uio: Fix lack of locking in init_uio_class
SYSFS: Allow boot time switching between deprecated and modern sysfs layout
driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices
...
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...
Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
block: autoconvert trivial BKL users to private mutex
drivers: autoconvert trivial BKL users to private mutex
ipmi: autoconvert trivial BKL users to private mutex
mac: autoconvert trivial BKL users to private mutex
mtd: autoconvert trivial BKL users to private mutex
scsi: autoconvert trivial BKL users to private mutex
Fix up trivial conflicts (due to addition of private mutex right next to
deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c
I have some systems which need legacy sysfs due to old tools that are
making assumptions that a directory can never be a symlink to another
directory, and it's a big hassle to compile separate kernels for them.
This patch turns CONFIG_SYSFS_DEPRECATED into a runtime option
that can be switched on/off on the kernel command line. This way
the same binary can be used in both cases with just an option
on the command line.
The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
the default. I kept the weird name to not break existing
config files.
Also the compat code can still be completely disabled by undefining
CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
care of this now instead of lots of ifdefs. This makes the code
look nicer.
v2: This is an updated version on top of Kay's patch to only
handle the block devices. I tested it on my old systems
and that seems to work.
Cc: axboe@kernel.dk
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
- Andi encountered the following warning with gcc 4.5
linux/block/cfq-iosched.c: In function ‘cfq_dispatch_requests’:
linux/block/cfq-iosched.c:2156:3: warning: array subscript is above array
bounds
- The warning happens due to the following code.
slice = group_slice * count /
max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
gcc is complaining about cfqg->busy_queues_avg[] being indexed by CFQ
prio classes (RT, BE and IDLE) while the array size is only 2.
- At run time, we never access cfqg->busy_queues_avg[IDLE] and return from
function before this code hits.
- To fix the warning, increase the array size even though the extra entry
will remain unused. This patch also puts in some comments to clarify some
of the confusion.
- I have taken Jens's patch and modified it a bit.
- Compile tested with gcc 4.4 and boot tested. I don't have gcc 4.5
running; Andi, can you please test it with gcc 4.5 to make sure it
works.
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
The reason is the wrong way of accounting hd_struct->in_flight when a bio
is merged into a request that belongs to a different partition by
ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partitions, sda1 and sda2.
1. A request for sda2 is in the request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belonging to sda1 is issued and is merged into the request mentioned
in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request is changed
from the sda2 region to the sda1 region. However, neither partition's
hd_struct->in_flight is changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda1's hd_struct->in_flight, not sda2's, is decremented, even though it was
sda2's that was incremented in step 1.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exist. When it is safe
to free the partition table, the IO for that device is restarted
again.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bsg incorrectly returns sg's masked_status value for device_status.
[jejb: fix up expression logic]
Reported-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Commit 3839e4b introduced a kobject_put but failed to remove the
kmem_cache_free beneath it, leading to a double free.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Physical block size was declared unsigned int to accommodate the maximum
size reported by READ CAPACITY(16). Make sure we use the right type in
the related functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2.6.36 introduces an API for drivers to switch the IO scheduler
instead of manually calling the elevator exit and init functions.
This API was added since q->elevator must be cleared in between
those two calls. And since we already had this functionality
for the sysfs interface that switches schedulers online, it was
prudent to reuse it internally too.
But this API needs the queue to be in a fully initialized state
before it is called, or it will attempt to unregister elevator
kobjects before they have been added. This results in an oops
like this:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
PGD 47ddfc067 PUD 47c6a1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
CPU 2
Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb
Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
RIP: 0010:[<ffffffff8116f15e>] [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
RSP: 0018:ffff88027da25d08 EFLAGS: 00010246
RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
Stack:
ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
<0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
<0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
Call Trace:
[<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
[<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
[<ffffffff8123feb9>] kobject_add+0x69/0x90
[<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
[<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
[<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
[<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
[<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
[<ffffffff81224204>] elv_register_queue+0x34/0xa0
[<ffffffff81224aad>] elevator_change+0xfd/0x250
[<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
[<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
[<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
[<ffffffff810001de>] do_one_initcall+0x3e/0x170
[<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
[<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
RIP [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
RSP <ffff88027da25d08>
CR2: 0000000000000051
---[ end trace a6541d3bf07945df ]---
Fix this by adding a registered bit to the elevator queue, which is
set when the sysfs kobjects have been registered.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.
This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.
file=$1
name=$2
if grep -q lock_kernel ${file} ; then
if grep -q 'include.*linux.mutex.h' ${file} ; then
sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
else
sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
fi
sed -i ${file} \
-e "/^#include.*linux.mutex.h/,$ {
1,/^\(static\|int\|long\)/ {
/^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
} }" \
-e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
-e '/[ ]*cycle_kernel_lock();/d'
else
sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \
-e '/cycle_kernel_lock()/d'
fi
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
o A user can specify a max iops value of 32 bits (UINT_MAX) through the cgroup
interface. If a user has specified say 4294967294 (UINT_MAX - 2), then
on a 32bit platform the following multiplication can overflow.
io_allowed = (tg->iops[rw] * jiffy_elapsed_rnd)
o Explicitly cast the multiplication to 64bit, then perform the division and
check whether the result is still greater than UINT_MAX (see the sketch below).
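A sketch of the overflow-safe form (the helper name and surrounding code are
illustrative):

#include <asm/div64.h>

static unsigned int tg_iops_allowed(struct throtl_grp *tg, int rw,
				    unsigned long jiffy_elapsed_rnd)
{
	/* widen before multiplying so a near-UINT_MAX rate can't wrap */
	u64 tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;

	do_div(tmp, HZ);	/* 64-bit divide that also works on 32-bit */
	return tmp > UINT_MAX ? UINT_MAX : tmp;
}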
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
- Limit max iops value to UINT_MAX and return error to user if value is more
than that instead of accepting bigger values and truncating implicitly.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Do not convert jiffies to milliseconds as it is not required. Just work
with jiffies and HZ.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Randy Dunlap reported the following linux-next failure. This patch fixes it.
on i386:
blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3'
blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3'
o The bytes_per_second interface is 64bit and I was continuing to do 64 bit
division even on 32bit platforms without the help of special macros/functions,
hence the failure.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently any cgroup throttle limit changes are processed asynchronously and
the change does not take effect till a new bio is dispatched from the same group.
o It might happen that a user sets a ridiculously low limit on throttling.
Say 1 byte per second on reads. In such cases simple operations like mounting
a disk can wait for a very long time.
o Once a bio is throttled, there is no easy way to come out of that wait even if
the user increases the read limit later.
o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
the bio dispatch time according to the new limits.
o We can't take the queue lock under blkcg_lock, hence after the change I wake
up the dispatch thread again which recalculates the time. So there are some
variables being synchronized across two threads without a lock and I had to
make use of barriers. Hoping I have used barriers correctly. Any review of
the memory barrier code especially will help.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently all the dynamically allocated groups, except the root group, are
added to td->tg_list. This was not a problem so far, but in the next patch I
will traverse td->tg_list to process any updates of limits on the groups.
If the root group is not in tg_list, then the root group's updates are not
processed.
o It is better to add the root group to tg_list as well, instead of doing
special processing for it during limit updates.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Now a cgroup's list of blkg elements can contain blkgs from multiple policies.
Before sending an unlink event, make sure the blkg belongs to the policy. If
the policy does not own the blkg, do not send an update for it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, throttling-related files were visible even if the user had disabled
throttling using config options. That switched off background throttling
of bios but not the cgroup files. This patch fixes it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The bounce_pfn of the request queue in 64 bit systems is set to the
current max_low_pfn. Adding more memory later makes this incorrect.
Memory allocated beyond this boot time max_low_pfn appears to require
bounce buffers (bounce buffers are actually not allocated but used in
calculating segments that may result in "over max segments limit"
errors).
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Revert "block: set the bounce_pfn to the actual DMA limit rather than to max memory"
This reverts commit c49825facf.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add logic to prevent two I/O requests being merged if
only one of them is a discard. Ditto secure discard.
Without this fix, it is possible for write requests
to transform into discard requests. For example:
Submit bio 1 to discard 8 sectors from sector n
Submit bio 2 to write 8 sectors from sector n + 16
Submit bio 3 to write 8 sectors from sector n + 8
Bio 1 becomes request 1. Bio 2 becomes request 2.
Bio 3 is merged with request 2, and then subsequently
request 2 is merged with request 1 resulting in just
one I/O request which discards all 24 sectors.
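A sketch of the guard, using the flag names of this era (REQ_DISCARD and
REQ_SECURE); as noted above, it sits above the position checks in the merge
path:

static inline bool discard_flags_match(struct request *rq, struct bio *bio)
{
	/* a discard may only merge with a discard, and a secure
	 * discard only with another secure discard */
	if ((rq->cmd_flags & REQ_DISCARD) != (bio->bi_rw & REQ_DISCARD))
		return false;
	if ((rq->cmd_flags & REQ_SECURE) != (bio->bi_rw & REQ_SECURE))
		return false;
	return true;
}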
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
(Moved the checks above the position checks /Jens)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The bounce_pfn of the request queue in 64 bit systems is set to the
current max_low_pfn. Adding more memory later makes this incorrect.
Memory allocated beyond this boot time max_low_pfn appears to require
bounce buffers (bounce buffers are actually not allocated but used in
calculating segments that may result in "over max segments limit"
errors).
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
During long I/O operations, the hang_check timer may fire and
trigger stack dumps that unnecessarily alarm the user.
Eg. hdparm --security-erase NULL /dev/sdb ## can take *hours* to complete
So, if hang_check is armed, we should wake up periodically
to prevent it from triggering. This patch uses a wake-up interval
equal to half the hang_check timer period, which keeps overhead low enough.
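The wait in blk_execute_rq() then becomes, roughly (a sketch; it assumes the
hung-task timeout is available as sysctl_hung_task_timeout_secs and 'wait' is
the completion the caller is blocked on):

unsigned long hang_check = sysctl_hung_task_timeout_secs;

if (hang_check)
	/* wake up every half hang_check period so the watchdog
	 * never sees this task blocked for a full period */
	while (!wait_for_completion_timeout(&wait, hang_check * (HZ / 2)))
		;
else
	wait_for_completion(&wait);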
Signed-off-by: Mark Lord <mlord@pobox.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Mike reported a kernel crash when a usb key hotplug is performed while all
kernel threads are not in the root cgroup and are running in one of the child
cgroups of the blkio controller.
BUG: unable to handle kernel NULL pointer dereference at 0000002c
IP: [<c11c7b08>] cfq_get_queue+0x232/0x412
*pde = 00000000
Oops: 0000 [#1] PREEMPT
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb2/2-1/2-1:1.0/host3/scsi_host/host3/uevent
[..]
Pid: 30039, comm: scsi_scan_3 Not tainted 2.6.35.2-fg.roam #1 Volvi2 /Aspire 4315
EIP: 0060:[<c11c7b08>] EFLAGS: 00010086 CPU: 0
EIP is at cfq_get_queue+0x232/0x412
EAX: f705f9c0 EBX: e977abac ECX: 00000000 EDX: 00000000
ESI: f00da400 EDI: f00da4ec EBP: e977a800 ESP: dff8fd00
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process scsi_scan_3 (pid: 30039, ti=dff8e000 task=f6b6c9a0 task.ti=dff8e000)
Stack:
00000000 00000000 00000001 01ff0000 f00da508 00000000 f00da524 f00da540
<0> e7994940 dd631750 f705f9c0 e977a820 e977ac44 f00da4d0 00000001 f6b6c9a0
<0> 00000010 00008010 0000000b 00000000 00000001 e977a800 dd76fac0 00000246
Call Trace:
[<c11c7f10>] ? cfq_set_request+0x228/0x34c
[<c11c7ce8>] ? cfq_set_request+0x0/0x34c
[<c11bb3b9>] ? elv_set_request+0xf/0x1c
[<c11bdd51>] ? get_request+0x1ad/0x22f
[<c11bddf2>] ? get_request_wait+0x1f/0x11a
[<c11d013b>] ? kvasprintf+0x33/0x3b
[<c127b537>] ? scsi_execute+0x1d/0x103
[<c127b675>] ? scsi_execute_req+0x58/0x83
[<c127c391>] ? scsi_probe_and_add_lun+0x188/0x7c2
[<c12718c6>] ? attribute_container_add_device+0x15/0xfa
[<c11c95d1>] ? kobject_get+0xf/0x13
[<c126d1db>] ? get_device+0x10/0x14
[<c127be93>] ? scsi_alloc_target+0x217/0x24d
[<c127cbd8>] ? __scsi_scan_target+0x95/0x480
[<c10204eb>] ? dequeue_entity+0x14/0x1fe
[<c1020491>] ? update_curr+0x165/0x1ab
[<c1020491>] ? update_curr+0x165/0x1ab
[<c127d00d>] ? scsi_scan_channel+0x4a/0x76
[<c127d0b0>] ? scsi_scan_host_selected+0x77/0xad
[<c127d13c>] ? do_scan_async+0x0/0x11a
[<c127d137>] ? do_scsi_scan_host+0x51/0x56
[<c127d13c>] ? do_scan_async+0x0/0x11a
[<c127d14a>] ? do_scan_async+0xe/0x11a
[<c127d13c>] ? do_scan_async+0x0/0x11a
[<c10354c5>] ? kthread+0x5e/0x63
[<c1035467>] ? kthread+0x0/0x63
[<c1002af6>] ? kernel_thread_helper+0x6/0x10
Code: 44 24 1c 54 83 44 24 18 54 83 fa 03 75 94 8b 06 c7 86 64 02 00 00 01 00 00 00 83 e0 03 09 f0 89 06 8b 44 24 28 8b 90 58 01 00 00 <8b> 42 2c 85 c0 75 03 8b 42 08 8d 54 24 48 52 8d 4c 24 50 51 68
EIP: [<c11c7b08>] cfq_get_queue+0x232/0x412 SS:ESP 0068:dff8fd00
CR2: 000000000000002c
---[ end trace 9a88306573f69b12 ]---
The problem here is that we don't have bdi->dev information available when
the thread does some IO. Hence when dev_name() tries to access bdi->dev, it
crashes.
This problem does not happen if kernel threads are in root group as root
group is statically allocated at device initialization time and we don't
hit this piece of code.
Fix it by delaying the filling of the major and minor number information of
the device in the blk_group. Initially a blk_group is created with 0 as the
device information and this information is filled in later once some more IO
comes in from the same group.
Reported-by: Mike Kazantsev <mk.fraggod@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This bug was introduced in 7b6d91daee
"block: unify flags for struct bio and struct request"
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fsync performance for small files achieved by cfq on high-end disks is
lower than what deadline can achieve, due to idling introduced between
the sync write happening in process context and the journal commit.
Moreover, when competing with a sequential reader, a process writing
small files and fsync-ing them is starved.
This patch fixes the two problems by:
- marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
flag set,
- force all queues that have REQ_NOIDLE requests to be put in the noidle
tree.
Having the queue associated to the fsync-ing process and the one associated
to journal commits in the noidle tree allows:
- switching between them without idling,
- fairness vs. competing idling queues, since they will be serviced only
after the noidle tree expires its slice.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
All the blkdev_issue_* helpers can only sanely be used by synchronous
callers. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When a new disk is being discovered, add_disk() first ties the bdev to gendisk
(via register_disk()->blkdev_get()) and only after that calls
bdi_register_bdev(). Because register_disk() also creates disk's kobject, it
can happen that userspace manages to open and modify the device's data (or
inode) before its BDI is properly initialized leading to a warning in
__mark_inode_dirty().
Fix the problem by registering BDI early enough.
This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=16312
Cc: stable@kernel.org
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Actual implementation of throttling policy in block layer. Currently it
implements READ and WRITE bytes per second throttling logic. IOPS throttling
comes in later patches.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o cgroup changes for the throttle policy.
o Introduces READ and WRITE bytes per second throttling rules.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o This patch prepares the base for introducing new IO control policies.
Currently all the code is written knowing there is only one policy
and that is proportional bandwidth. Creating infrastructure for newer
policies to come in.
o Also there were many functions which were generated using macros. It was
very confusing. Got rid of those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Kill the extra "dev weight" header which is printed when somebody reads the
blkio.weight_device file. This really seems to be out of convention. No other
blkio file prints any header at the start of the file. I think it is ok
to just print values; how to interpret the values should be part of the
documentation.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This is the third patch in a series which adds support for
storing partition metadata, optionally, off of the hd_struct.
One major use for that data is being able to resolve a partition
by identities other than just its index on a block device. Device
enumeration varies by platform and there's a benefit to being able
to use something like EFI GPT's GUIDs to determine the correct
block device and partition to mount as the root.
This change adds that support to root= by adding support for
the following syntax:
root=PARTUUID=hex-uuid
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I'm reposting this patch series as v4 since there have been no additional
comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
are against Linus's tree: 2bfc96a127
(2.6.36-rc3).
Would this patchset be suitable for inclusion in an mm branch?
This change adds a partition_meta_info struct which itself contains a
union of structures that provide partition table specific metadata.
This change leaves the union empty. The subsequent patch includes an
implementation for CONFIG_EFI_PARTITION-based metadata.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Change the type of the 2nd parameter of blk_rq_aligned() to unsigned long
and remove the unnecessary casting. Now we can call it with 'uaddr'
instead of 'ubuf' in __blk_rq_map_user() so that it removes the
following warnings from sparse:
block/blk-map.c:57:31: warning: incorrect type in argument 2 (different address spaces)
block/blk-map.c:57:31: expected void *addr
block/blk-map.c:57:31: got void [noderef] <asn:1>*ubuf
However blk_rq_map_kern() needs one more local variable to handle it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Some controllers have a hardware limit on the number of protection
information scatter-gather list segments they can handle.
Introduce a max_integrity_segments limit in the block layer and provide
a new scsi_host_template setting that allows HBA drivers to provide a
value suitable for the hardware.
Add support for honoring the integrity segment limit when merging both
bios and requests.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
We have several users of min_not_zero, each of them using their own
definition. Move the define to kernel.h.
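The centralized helper is, in essence (shown as a sketch of the usual form of
such a macro, so callers can write min_not_zero(a, b) instead of open-coding
the zero checks):

#define min_not_zero(x, y) ({			\
	typeof(x) __x = (x);			\
	typeof(y) __y = (y);			\
	__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })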
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
Remove support for barriers on discards, which is unused now. Also
remove the DISCARD_NOBARRIER I/O type in favour of just setting the
rw flags up locally in blkdev_issue_discard.
tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
There's no reason to omit other command flags and REQ_FUA needs to be
copied to implement FUA support in request-based dm.
REQ_COMMON_MASK which specifies flags to be copied from bio to request
already identifies all the command flags. Define REQ_CLONE_MASK to be
the same as REQ_COMMON_MASK for clarity and make __blk_rq_prep_clone()
copy all flags in the mask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Update blkdev_issue_flush() to use new REQ_FLUSH interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has
passed through remapping drivers. FSEQ_DATA request incorrectly
followed bio->bi_bdev->bd_disk ending up being issued w/ mismatching
rq_disk. Make it follow orig_rq->rq_disk.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
While completing a request from a REQ_FLUSH/FUA sequence, another
request can be pushed to the request queue. If a driver tests
elv_queue_empty() before completing a request and runs the queue again
only if the queue wasn't empty, this may lead to hang. Please note
that most drivers either kick the queue unconditionally or test queue
emptiness after completing the current request and don't have this
problem.
This patch removes this possibility by making REQ_FLUSH/FUA sequence
code kick the queue if the queue was empty before completing a request
from REQ_FLUSH/FUA sequence.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
init_flush_request() only set REQ_FLUSH when initializing flush
requests making them READ requests. Use WRITE_FLUSH instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so move these calls into common code. Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that its old name doesn't apply anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are a number of make_request based drivers which don't support
cache flushes. Filter out flush bios in __generic_make_request() so
that they don't have to worry about them. All FLUSH/FUA requests with
data are converted to regular IO requests and empty ones are completed
immediately.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
device cache should be flushed before executing the request. REQ_FUA
means that the data in the request should be on non-volatile media on
completion.
The block layer will choose the correct way of implementing the semantics
and execute it. The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests. Devices will never see REQ_FLUSH and/or FUA
flags which they don't support.
Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
never failed with -EOPNOTSUPP. If the underlying device doesn't
support FLUSH/FUA, the block layer simply makes them no-ops. IOW, it no
longer distinguishes between writeback cache which doesn't support
cache flush and writethrough/no cache. Devices which have WB cache
w/o flush are very difficult to come by these days and there's nothing
much we can do anyway, so it doesn't make sense to require everyone to
implement -EOPNOTSUPP handling. This will simplify filesystems and
block drivers as they can drop -EOPNOTSUPP retry logic for barriers.
* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
blk-flush.c.
* REQ_FLUSH w/o data can also be directly passed to drivers without
sequencing but some drivers assume that zero length requests don't
have rq->bio which isn't true for these requests requiring the use
of proxy requests.
* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
copied from bio to request.
* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
WRITE_FLUSH_FUA are added.
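As a hedged illustration of the filesystem-facing side, a commit write that needs a preceding cache flush and FUA completion semantics could be submitted roughly like this (sketch only, error handling omitted):

	/* flush the device cache, then write this bio with FUA semantics;
	 * the block layer sequences proxy requests if the device lacks
	 * native FLUSH/FUA and never fails the bio with -EOPNOTSUPP */
	submit_bio(WRITE_FLUSH_FUA, bio);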
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With ordering requirements dropped, barrier and ordered are misnomers.
Now all the block layer does is sequence FLUSH and FUA. Rename them to
flush.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Without ordering requirements, barrier and ordering are misnomers.
Rename block/blk-barrier.c to block/blk-flush.c. Rename of symbols
will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Filesystems will take full responsibility for ordering requests
around commit writes and will only indicate how the commit writes
themselves should be handled by the block layer. This patch drops
barrier ordering by queue draining from the block layer. The
ordering-by-draining implementation was somewhat invasive to request
handling. Notable changes follow.
* Each queue has 1 bit color which is flipped on each barrier issue.
This is used to track whether a given request is issued before the
current barrier or not. REQ_ORDERED_COLOR flag and coloring
implementation in __elv_add_request() are removed.
* Requests which shouldn't be processed yet for draining were stalled
by returning -EAGAIN from blk_do_ordered() according to the test
result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
This logic is removed.
* Draining completion logic in elv_completed_request() removed.
* All barrier sequence requests were queued to request queue and then
trickled to the lower layer according to progress and thus maintaining
request orders during requeue was necessary. This is replaced by
queueing the next request in the barrier sequence only after the
current one is complete from blk_ordered_complete_seq(), which
removes the need for multiple proxy requests in struct request_queue
and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
elv_insert().
* As barriers no longer have ordering constraints, there's no need to
dump the whole elevator onto the dispatch queue on each barrier.
Insert barriers at the front instead.
* If other barrier requests come to the front of the dispatch queue
while one is already in progress, they are stored in
q->pending_barriers and restored to dispatch queue one-by-one after
each barrier completion from blk_ordered_complete_seq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Make the following cleanups in preparation of barrier/flush update.
* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
block/blk.h.
* blk_do_ordered() now returns pointer to struct request, with %NULL
meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
later". The third case will be dropped with further changes.
* In the initialization of proxy barrier request, data direction is
already set by init_request_from_bio(). Drop unnecessary explicit
REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
flag setting.
* add_request() is collapsed into __make_request().
These changes don't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().
blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
device has write cache and can flush it, it should set REQ_FLUSH. If
the device can handle FUA writes, it should also set REQ_FUA.
All blk_queue_ordered() users are converted.
* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
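For example, a driver whose device has a volatile write cache it can flush, and which also handles FUA writes, would advertise both (sketch; q is the driver's request queue):

	/* writeback cache that can be flushed, plus native FUA writes */
	blk_queue_flush(q, REQ_FLUSH | REQ_FUA);

	/* write-through or cache-less device: advertise nothing */
	blk_queue_flush(q, 0);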
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon which will render
the advantage of tag ordering moot. Kill ORDERED_BY_TAG. The
following users are affected.
* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated. Removed.
* xen-blkfront: ORDERED_TAG case dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
While testing CPU DLPAR, the following problem was discovered.
We were DLPAR removing the first CPU, which in this case was
logical CPUs 0-3. CPUs 0-2 were already marked offline and
we were in the process of offlining CPU 3. After marking
the CPU inactive and offline in cpu_disable, but before the
cpu was completely idle (cpu_die), we ended up in __make_request
on CPU 3. There we looked at the topology map to see which CPU
to complete the I/O on and found no CPUs in the cpu_sibling_map.
This resulted in the block layer setting the completion cpu
to be NR_CPUS, which then caused an oops when we tried to
complete the I/O.
Fix this by sanity checking the value we return from blk_cpu_to_group
to be a valid cpu value.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently drivers must do an elevator_exit() + elevator_init()
to switch IO schedulers. There are a few problems with this:
- Since commit 1abec4fdbb,
elevator_init() requires a zeroed out q->elevator
pointer. The two existing in-kernel users don't do that.
- It will only work at initialization time, since using the
above two-stage construct does not properly quiesce the queue.
So add elevator_change() which takes care of this, and convert
the elv_iosched_store() sysfs interface to use this helper as well.
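A hedged sketch of the helper in use, assuming the int elevator_change(struct request_queue *, const char *) signature this patch exports:

	int err;

	/* switch schedulers in one call instead of open-coding
	 * elevator_exit() + elevator_init() */
	err = elevator_change(q, "deadline");
	if (err)
		pr_err("failed to switch elevator: %d\n", err);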
Reported-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reported-by: Kevin Vigor <kevin@vigor.nu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The return value of the bi_rw tests is no longer bool after commit 74450be1,
but the results of such tests are still stored in bools. This doesn't fit
for some compilers (gcc 4.5 here), so either use the !! idiom to get real
bools or use an unsigned long where the result is assigned.
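A small userspace illustration of the general hazard (not kernel code): a flag above bit 31 survives a !! test but is silently dropped when the raw AND result is stored in a narrower type.

#include <stdio.h>

#define EXAMPLE_FLAG (1ULL << 33)	/* hypothetical flag above bit 31 */

int main(void)
{
	unsigned long long rw = EXAMPLE_FLAG;
	unsigned int truncated = rw & EXAMPLE_FLAG;	/* high bit lost: 0 */
	unsigned int normalized = !!(rw & EXAMPLE_FLAG);	/* real bool: 1 */

	printf("truncated=%u normalized=%u\n", truncated, normalized);
	return 0;
}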
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Divyesh had gotten rid of this code in the past. I want to reintroduce it
as it helps me a lot during debugging.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Divyesh Shah <dpshah@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Implement a new tunable group_idle, which allows idling on the group
instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
on the individual queues but idle on the group. This way, on fast storage,
we can get fairness between groups while overall throughput
improves.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Implement another CFQ mode where we charge group in terms of number
of requests dispatched instead of measuring the time. Measuring in terms
of time is not possible when we are driving deeper queue depths and there
are requests from multiple cfq queues in the request queue.
o This mode currently gets activated if one sets slice_idle=0 and associated
disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
most of the queues will dispatch 1 or more requests and then cfq queue
expiry happens and we don't have a way to measure time. So start providing
fairness in terms of IOPS.
o Currently IOPS mode works only with cfq group scheduling. CFQ is following
different scheduling algorithms for queue and group scheduling. These IOPS
stats are used only for group scheduling, hence in non-group mode nothing
should change.
o For CFQ group scheduling one can disable slice idling so that we don't idle
on queue and drive deeper request queue depths (achieving better throughput),
at the same time group idle is enabled so one should get service
differentiation among groups.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Do not idle on either the cfq queue or the service tree if slice_idle=0; the
user does not want any queue or service tree idling. Currently, even with
slice_idle=0, we were waiting for a request to finish before expiring the
queue, and that can lead to lower queue depths.
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the cgroup hierarchy for blkio control groups is deeper than two
levels, the kernel should not allow the creation of further levels. The mkdir
system call does not expect EINVAL as a return value. This patch
replaces EINVAL with the more appropriate EPERM.
Signed-off-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid more patches, I also fixed other spelling
and grammar bugs when they were in the same or
following line:
successfull -> successful
parse -> parses
controler -> controller
controlers -> controllers
Cc: Jiri Kosina <trivial@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
- If the function is called without the barrier option, the return value is incorrect.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Propagate REQ_DISCARD in cmd_flags when cloning a discard request.
Skip blk_rq_check_limits's existing checks for discard requests because
discard limits will have already been checked in blkdev_issue_discard.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
q->bar_rq.rq_disk is NULL. Use the rq_disk of the original request
instead.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The block layer doesn't set rq->cmd_type on flush requests. By
definition, it should be REQ_TYPE_FS (the lower layers build a command
and interpret the result of it, that is, the block layer doesn't know
the details).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the queue doesn't have a limit set, or it is just set to UINT_MAX like
we default to, we could be sending down a discard request that isn't
of the correct granularity if the block size is > 512b.
Fix this by adjusting max_discard_sectors down to the proper
alignment.
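A hedged userspace sketch of the adjustment: clamp the per-request discard size and round it down to a whole number of discard granules (names and values are illustrative, not the actual blkdev_issue_discard() code):

#include <stdio.h>

int main(void)
{
	unsigned int max_discard_sectors = 0xffffffffu >> 9;	/* UINT_MAX bytes in sectors */
	unsigned int granularity_sectors = 4096 >> 9;		/* e.g. 4k discard granularity */

	/* round down so every discard covers whole granules */
	max_discard_sectors -= max_discard_sectors % granularity_sectors;

	printf("aligned max_discard_sectors = %u\n", max_discard_sectors);
	return 0;
}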
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Issuing a blkdev_issue_flush() on an unconfigured loop device causes a panic as
q->make_request_fn is not configured. This can occur when trying to mount the
unconfigured loop device as an XFS filesystem. There are no guards that catch
the bio before the request function is called because we don't add a payload to
the bio. Instead, manually check this case as soon as we have a pointer to the
queue to flush.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The blkpg_ioctl and blkdev_reread_part access fields of
the bdev and gendisk structures, yet they always do so
under the protection of bdev->bd_mutex, which seems
sufficient.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We only call the functions set_device_ro(),
invalidate_bdev(), sync_filesystem() and sync_blockdev()
while holding the BKL in these commands. All
of these are also done in other code paths without
the BKL, which leads me to the conclusion that
the BKL is not needed here either.
The reason we hold it here is that it was originally
pushed down into the ioctl function from vfs_ioctl.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The blktrace driver currently needs the BKL, but
we should not need to take that in the block layer,
so just push it down into the driver itself.
It is quite likely that the BKL is not actually
required in blktrace code and could be removed
in a follow-on patch.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
As a preparation for the removal of the big kernel
lock in the block layer, this removes the BKL
from the common ioctl handling code, moving it
into every single driver still using it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This is preparation for removing q->prepare_flush_fn.
Temporarily, blk_queue_ordered() permits QUEUE_ORDERED_DO_PREFLUSH and
QUEUE_ORDERED_DO_POSTFLUSH without prepare_flush_fn.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
SCSI-ml needs a way to mark a request as flush request in
q->prepare_flush_fn because it needs to identify them later (e.g. in
q->request_fn or prep_rq_fn).
queue_flush sets REQ_HARDBARRIER in rq->cmd_flags; however, the block
layer also sends normal REQ_TYPE_FS requests with REQ_HARDBARRIER. So
SCSI-ml can't use REQ_HARDBARRIER to identify flush requests.
We could change the block layer to clear the REQ_HARDBARRIER bit before
sending non-flush requests to the lower layers. However, introducing
the new flag looks cleaner (and is surely easier).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Just some dead code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Allocating a fixed payload for discard requests always was a horrible hack,
and it's now coming to bite us when adding support for discard in DM/MD.
So change the code to leave the allocation of a payload to the low-level
driver. Unfortunately that means we'll need another hack, which allows
us to update the various block layer length fields indicating that we
have a payload. Instead of hiding this in sd.c, which we already partially
do for UNMAP support, add a documented helper in the core block layer for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows us to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superfluous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct request. This allows much easier grepping for different request
types instead of unwinding through macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are two reasons for doing this:
- On SSD disks, the completion times aren't as random as they
are for rotational drives. So it's questionable whether they
should contribute to the random pool in the first place.
- Calling add_disk_randomness() has a lot of overhead.
This adds /sys/block/<dev>/queue/add_random that will allow you to
switch off on a per-device basis. The default setting is on, so there
should be no functional changes from this patch.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In submit_bio, we count vm events by checking READ/WRITE.
But actually DISCARD_NOBARRIER also has the WRITE flag set.
And in blkdev_issue_discard, we also add a
page as the payload, so the bio_has_data check isn't enough.
So add another check for discard bios.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
e98ef89b has a typo, causing cfq_blkiocg_update_completion_stats()
to call itself instead of blkiocg_update_completion_stats().
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Hi,
A user reported a kernel bug when running a particular program that did
the following:
created 32 threads
- each thread took a mutex, grabbed a global offset, added a buffer size
to that offset, released the lock
- read from the given offset in the file
- created a new thread to do the same
- exited
The result is that cfq's close cooperator logic would trigger, as the
threads were issuing I/O within the mean seek distance of one another.
This workload managed to routinely trigger a use after free bug when
walking the list of merge candidates for a particular cfqq
(cfqq->new_cfqq). The logic used for merging queues looks like this:
static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
{
	int process_refs, new_process_refs;
	struct cfq_queue *__cfqq;

	/* Avoid a circular list and skip interim queue merges */
	while ((__cfqq = new_cfqq->new_cfqq)) {
		if (__cfqq == cfqq)
			return;
		new_cfqq = __cfqq;
	}

	process_refs = cfqq_process_refs(cfqq);
	/*
	 * If the process for the cfqq has gone away, there is no
	 * sense in merging the queues.
	 */
	if (process_refs == 0)
		return;

	/*
	 * Merge in the direction of the lesser amount of work.
	 */
	new_process_refs = cfqq_process_refs(new_cfqq);
	if (new_process_refs >= process_refs) {
		cfqq->new_cfqq = new_cfqq;
		atomic_add(process_refs, &new_cfqq->ref);
	} else {
		new_cfqq->new_cfqq = cfqq;
		atomic_add(new_process_refs, &cfqq->ref);
	}
}
When a merge candidate is found, we add the process references for the
queue with less references to the queue with more. The actual merging
of queues happens when a new request is issued for a given cfqq. In the
case of the test program, it only does a single pread call to read in
1MB, so the actual merge never happens.
Normally, this is fine, as when the queue exits, we simply drop the
references we took on the other cfqqs in the merge chain:
	/*
	 * If this queue was scheduled to merge with another queue, be
	 * sure to drop the reference taken on that queue (and others in
	 * the merge chain). See cfq_setup_merge and cfq_merge_cfqqs.
	 */
	__cfqq = cfqq->new_cfqq;
	while (__cfqq) {
		if (__cfqq == cfqq) {
			WARN(1, "cfqq->new_cfqq loop detected\n");
			break;
		}
		next = __cfqq->new_cfqq;
		cfq_put_queue(__cfqq);
		__cfqq = next;
	}
However, there is a hole in this logic. Consider the following (and
keep in mind that each I/O keeps a reference to the cfqq):
q1->new_cfqq = q2 // q2 now has 2 process references
q3->new_cfqq = q2 // q2 now has 3 process references
// the process associated with q2 exits
// q2 now has 2 process references
// queue 1 exits, drops its reference on q2
// q2 now has 1 process reference
// q3 exits, so has 0 process references, and hence drops its references
// to q2, which leaves q2 also with 0 process references
q4 comes along and wants to merge with q3
q3->new_cfqq still points at q2! We follow that link and end up at an
already freed cfqq.
So, the fix is to not follow a merge chain if the top-most queue does
not have a process reference, otherwise any queue in the chain could be
already freed. I also changed the logic to disallow merging with a
queue that does not have any process references. Previously, we did
this check for one of the merge candidates, but not the other. That
doesn't really make sense.
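A hedged sketch of the corrected guard in cfq_setup_merge() (abridged; see the actual patch for the full context):

	process_refs = cfqq_process_refs(cfqq);
	new_process_refs = cfqq_process_refs(new_cfqq);
	/*
	 * If the process for either cfqq has gone away, there is no
	 * sense in merging the queues.
	 */
	if (process_refs == 0 || new_process_refs == 0)
		return;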
Without the attached patch, my system would BUG within a couple of
seconds of running the reproducer program. With the patch applied, my
system ran the program for over an hour without issues.
This addresses the following bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=16217
Thanks a ton to Phil Carns for providing the bug report and an excellent
reproducer.
[ Note for stable: this applies to 2.6.32/33/34 ].
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Phil Carns <carns@mcs.anl.gov>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Filesystems assume that DISCARD_BARRIER requests are full barriers, so that they
don't have to track in-progress discard operations when submitting new I/O.
But currently we only treat them as elevator barriers, which don't
actually do the necessary queue drains.
Also remove the unlikely around both the DISCARD and BARRIER requests -
they happen far too often for a static mispredict.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk_init_allocated_queue_node may fail and the caller _could_ retry.
Accommodate the unlikely event that blk_init_allocated_queue_node is
called on an already (possibly partially) initialized request_queue.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
On blk_init_allocated_queue_node failure, only free the request_queue if
it wasn't previously allocated outside the block layer
(e.g. blk_init_queue_node was the blk_init_allocated_queue_node caller).
This addresses an interface bug introduced by the following commit:
01effb0 block: allow initialization of previously allocated
request_queue
Otherwise the request_queue may be freed out from underneath a caller
that is managing the request_queue directly (e.g. caller uses
blk_alloc_queue + blk_init_allocated_queue_node).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Bio-based DM doesn't use an elevator (queue is !blk_queue_stackable()).
Longer-term DM will not allocate an elevator for bio-based DM. But even
then there will be small potential for an elevator to be allocated for
a request-based DM table only to have a bio-based table be loaded in the
end.
Displaying "none" for bio-based DM will help avoid user confusion.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Use small consecutive indexes as radix tree keys instead of the sparse cfqd
address. This change reduces the radix tree depth from 11 (6 for 32-bit hosts)
to 1 if the host has <=64 disks under cfq control, or to 0 if there is only one disk.
So, this patch saves 10*560 bytes for each process (5*296 for 32-bit hosts).
For each cfqd, allocate a cic index from an ida.
To unlink a dead cic from the tree without access to the cfqd, store the index
in ->key (bit 0 -- dead mark, bits 1..30 -- index; the ida produces ids in range 0..2^31-1).
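A small standalone sketch of the key encoding described above (illustrative only; the real code keeps this value in cic->key):

#include <stdio.h>

#define CIC_DEAD	0x1UL	/* bit 0: dead mark */
#define CIC_INDEX_SHIFT	1	/* bits 1..30: index from the ida */

int main(void)
{
	unsigned long index = 42;			/* id handed out by the ida */
	unsigned long key = index << CIC_INDEX_SHIFT;	/* encode the index */

	key |= CIC_DEAD;				/* mark the cic dead */

	printf("index=%lu dead=%lu\n", key >> CIC_INDEX_SHIFT, key & CIC_DEAD);
	return 0;
}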
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove ->dead_key field from cfq_io_context to shrink its size to 128 bytes.
(64 bytes for 32-bit hosts)
Use the lower bit in ->key as a dead-mark, instead of moving the key to a separate field.
After this, for a dead cfq_io_context we get cic->key != cfqd automatically.
Thus, io_context's last-hit cache should work without changes.
Now, to check ->key for the non-dead state, compare it with cfqd
instead of checking ->key for a non-NULL value as before.
Also remove the obsolete race protection in cfq_cic_lookup;
this race has been gone since v2.6.24-1728-g4ac845a.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_init_queue() allocates the request_queue structure and then
initializes it as needed (request_fn, elevator, etc).
Split initialization out to blk_init_allocated_queue_node.
Introduce blk_init_allocated_queue wrapper function to model existing
blk_init_queue and blk_init_queue_node interfaces.
Export elv_register_queue to allow a newly added elevator to be
registered with sysfs. Export elv_unregister_queue for symmetry.
These changes allow DM to initialize a device's request_queue with more
precision. In particular, DM no longer unconditionally initializes a
full request_queue (elevator et al). It only does so for a
request-based DM device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With CONFIG_PROVE_RCU=y, a warning can be triggered:
# mount -t cgroup -o blkio xxx /mnt
# mkdir /mnt/subgroup
...
kernel/cgroup.c:4442 invoked rcu_dereference_check() without protection!
...
To fix this, we avoid calling css_depth() here, which is a bit simpler
than the original code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- Add the bio_batch helper primitive. This is a rather generic primitive
for submitting/waiting on a complex request which consists of several
bios.
- blkdev_issue_zeroout() generates a number of zero-filled write bios.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move blkdev_issue_discard from blk-barrier.c because it is
not barrier related.
Later the file will be populated by other helpers.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In some places the caller doesn't want to wait for a request to complete.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch just converts all blkdev_issue_xxx functions to a common
set of flags. Wait/allocation semantics are preserved.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes a few usability and configurability issues.
o All the cgroup based controller options are configurable from the
"General Setup/Control Group Support/" menu. blkio is the only exception.
Hence make this option visible in the above menu and make it configurable from
there to bring it in line with the rest of the cgroup based controllers.
o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.
This option currently does two things.
- Enable printing of cgroup paths in blktrace
- Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional stat
files in cgroup.
If we are using group scheduling, blktrace data is not really of much use
if cgroup information is not present. To get this data, currently one has to
also enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings the overhead of
all the additional debug stat files which is not desired.
Hence, this patch moves printing of cgroup paths under
CONFIG_CFQ_GROUP_IOSCHED.
This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP which
can be enabled through config menu.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Divyesh Shah <dpshah@google.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_timed_out_timer() relied on blk_add_timer() never returning a
timer value of zero, but commit 7838c15b8d
removed the code that bumped this value when it was zero.
Therefore when jiffies is near wrap we could get unlucky & not set the
timeout value correctly.
This patch uses a flag to indicate that the timeout value was set and so
handles jiffies wrap correctly, and it keeps all the logic in one
function so should be easier to maintain in the future.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After merging the block tree, 20100414's linux-next build (x86_64
allmodconfig) failed like this:
ERROR: "get_gendisk" [block/blk-cgroup.ko] undefined!
ERROR: "sched_clock" [block/blk-cgroup.ko] undefined!
This happens because the two symbols aren't exported and hence not available
when blk-cgroup code is built as a module. I've tried to stay consistent with
the use of EXPORT_SYMBOL or EXPORT_SYMBOL_GPL with the other symbols in the
respective files.
Signed-off-by: Divyesh Shah <dpshah@google.com>
Acked-by: Gui Jianfeng <guijianfeng@cn.fujitsu.cn>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_timed_out_timer() relied on blk_add_timer() never returning a
timer value of zero, but commit 7838c15b8d
removed the code that bumped this value when it was zero.
Therefore when jiffies is near wrap we could get unlucky & not set the
timeout value correctly.
This patch uses a flag to indicate that the timeout value was set and so
handles jiffies wrap correctly, and it keeps all the logic in one
function so should be easier to maintain in the future.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fixes compile errors in blk-cgroup code for empty_time stat and a merge fix in
CFQ. The first error was when CONFIG_DEBUG_CFQ_IOSCHED is not set.
Signed-off-by: Divyesh Shah <dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Changelog from v1:
o Call blkiocg_update_idle_time_stats() at cfq_rq_enqueued() instead of at
dispatch time.
Changelog from original patchset: (in response to Vivek Goyal's comments)
o group blkiocg_update_blkio_group_dequeue_stats() with other DEBUG functions
o rename blkiocg_update_set_active_queue_stats() to
blkiocg_update_avg_queue_size_stats()
o s/request/io/ in blkiocg_update_request_add_stats() and
blkiocg_update_request_remove_stats()
o Call cfq_del_timer() at request dispatch() instead of
blkiocg_update_idle_time_stats()
Signed-off-by: Divyesh Shah<dpshah@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, IO Controller makes use of blkio.weight to assign weight for
all devices. Here a new user interface "blkio.weight_device" is introduced to
assign different weights for different devices. blkio.weight becomes the
default value for devices which are not configured by "blkio.weight_device"
You can use the following format to assign a specific weight to a given
device:
#echo "major:minor weight" > blkio.weight_device
major:minor represents the device number.
And you can remove the weight for a given device as follows:
#echo "major:minor 0" > blkio.weight_device
V1->V2 changes:
- use user interface "weight_device" instead of "policy" suggested by Vivek
- rename some struct suggested by Vivek
- rebase to 2.6-block "for-linus" branch
- remove a useless list_empty check pointed out by Li Zefan
- some trivial typo fix
V2->V3 changes:
- Move policy_*_node() functions up to get rid of forward declarations
- rename related functions by adding prefix "blkio_"
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits)
cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch
loop: Update mtime when writing using aops
block: expose the statistics in blkio.time and blkio.sectors for the root cgroup
backing-dev: Handle class_create() failure
Block: Fix block/elevator.c elevator_get() off-by-one error
drbd: lc_element_by_index() never returns NULL
cciss: unlock on error path
cfq-iosched: Do not merge queues of BE and IDLE classes
cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging
i2o: Remove the dangerous kobj_to_i2o_device macro
block: remove 16 bytes of padding from struct request on 64bits
cfq-iosched: fix a kbuild regression
block: make CONFIG_BLK_CGROUP visible
Remove GENHD_FL_DRIVERFS
block: Export max number of segments and max segment size in sysfs
block: Finalize conversion of block limits functions
block: Fix overrun in lcm() and move it to lib
vfs: improve writeback_inodes_wb()
paride: fix off-by-one test
drbd: fix al-to-on-disk-bitmap for 4k logical_block_size
...
When CFQ dispatches requests forcefully due to a barrier or changing iosched,
it runs through all cfqq's dispatching requests and then expires each queue.
However, it does not activate a cfqq before flushing its IOs, resulting in
stale values being used for computing slice_used.
This patch fixes it by activating the queue before flushing requests from
each queue.
This is useful mostly for barrier requests, because when the iosched is changing
it really doesn't matter if we have incorrect accounting since we're going to
break down all structures anyway.
We also now expire the current timeslice before moving on with the dispatch
to accurately account slice used for that cfqq.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
1) group_wait_time - This is the amount of time the cgroup had to wait to get a
timeslice for one of its queues from when it became busy, i.e., went from 0
to 1 request queued. This is different from the io_wait_time which is the
cumulative total of the amount of time spent by each IO in that cgroup waiting
in the scheduler queue. This stat is a great way to find out any jobs in the
fleet that are being starved or waiting for longer than what is expected (due
to an IO controller bug or any other issue).
2) empty_time - This is the amount of time a cgroup spends w/o any pending
requests. This stat is useful when a job does not seem to be able to use its
assigned disk share by helping check if that is happening due to an IO
controller bug or because the job is not submitting enough IOs.
3) idle_time - This is the amount of time spent by the IO scheduler idling
for a given cgroup in anticipation of a better request than the existing ones
from other queues/cgroups.
All these stats are recorded using start and stop events. When reading these
stats, we do not add the delta between the current time and the last start time
if we're between the start and stop events. We avoid doing this to make sure
that these numbers are always monotonically increasing when read. Since we're
using sched_clock() which may use the tsc as its source, it may induce some
inconsistency (due to tsc resync across cpus) if we included the current delta.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These stats are useful for getting a feel for the queue depth of the cgroup,
i.e., how filled up its queues are at a given instant and over the existence of
the cgroup. This ability is useful when debugging problems in the wild as it
helps understand the application's IO pattern w/o having to read through the
userspace code (because it's tedious or just not available) or w/o the ability
to run blktrace (since you may not have root access and/or not want to disturb
performance).
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This includes both the number of bios merged into requests belonging to this
cgroup as well as the number of requests merged together.
In the past, we've observed different merging behavior across upstream kernels,
some by design, some actual bugs. This stat helps a lot in debugging such
problems when applications report decreased throughput with a new kernel
version.
This needed adding an extra elevator function to capture bios being merged as I
did not want to pollute elevator code with blkiocg knowledge and hence needed
the accounting invocation to come from CFQ.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
that include some minor fixes and addresses all comments.
Changelog: (most based on Vivek Goyal's comments)
o renamed blkiocg_reset_write to blkiocg_reset_stats
o more clarification in the documentation on io_service_time and io_wait_time
o Initialize blkg->stats_lock
o rename io_add_stat to blkio_add_stat and declare it static
o use bool for direction and sync
o derive direction and sync info from existing rq methods
o use 12 for major:minor string length
o define io_service_time better to cover the NCQ case
o add a separate reset_stats interface
o make the indexed stats a 2d array to simplify macro and function pointer code
o blkio.time now exports in jiffies as before
o Added stats description in patch description and
Documentation/cgroup/blkio-controller.txt
o Prefix all stats functions with blkio and make them static as applicable
o replace IO_TYPE_MAX with IO_TYPE_TOTAL
o Moved #define constant to top of blk-cgroup.c
o Pass dev_t around instead of char *
o Add note to documentation file about resetting stats
o use BLK_CGROUP_MODULE in addition to BLK_CGROUP config option in #ifdef
statements
o Avoid struct request specific knowledge in blk-cgroup. blk-cgroup.h now has
rq_direction() and rq_sync() functions which are used by CFQ and when using
io-controller at a higher level, bio_* functions can be added.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
One of the features of laptop-mode is that it forces a writeout of dirty
pages if something else triggers a physical read or write from a device.
The current implementation flushes pages on all devices, rather than only
the one that triggered the flush. This patch alters the behaviour so that
only the recently accessed block device is flushed, preventing other
disks being spun up for no terribly good reason.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, the io statistics for the root cgroup are maintained, but
they are not shown because the device information is not available at
the point that the root blkio cgroup is created. This patch updates
the device information when the statistics are updated so that the
statistics become visible.
Signed-off-by: Ricky Benitez <rickyb@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We also add start_time_ns and io_start_time_ns fields to struct request
here to record the time when a request is created and when it is
dispatched to device. We use ns uints here as ms and jiffies are
not very useful for non-rotational media.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- io_service_time
- io_wait_time
- io_serviced
- io_service_bytes
These stats are accumulated per operation type helping us to distinguish between
read and write, and sync and async IO. This patch does not increment any of
these stats.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
that info at request dispatch with other stats now. This patch removes the
existing support for accounting sectors for a blkio_group. This will be added
back differently in the next two patches.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
elevator_get() does not check the name length; if the name length > sizeof(elv),
elv will miss the trailing '\0', and the "-iosched" suffix in the elv buffer will
be replaced by something like "aaaaaaaa". The following request_module() call can
then load an untrusted module.
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Even if they are found to be co-operating.
The prio_trees do not have any IDLE cfqqs on them. cfq_close_cooperator()
is called from cfq_select_queue() and cfq_completed_request(). The latter
ensures that the close cooperator code does not get invoked if the current
cfqq is of class IDLE but the former doesn't seem to have any such checks.
So an IDLE cfqq may get merged with a BE cfqq from the same group which
should be avoided.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These have helped us debug some issues we've noticed in earlier IO
controller versions and should be useful now as well. The extra logging
covers:
- idling behavior. Since there are so many conditions based on which we decide
to idle or not, this patch adds a log message for some conditions that we've
found useful.
- workload slices and current prio and workload type
Changelog from v1:
o moved log message from cfq_set_active_queue() to __cfq_set_active_queue()
o changed queue_count to st->count
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Alex Shi reported a kbuild regression which is about a 10% performance loss.
He bisected it to this commit: 3dde36ddea.
The reason is cfqq_close() can't find a close cooperator. Restoring
cfq_rq_close()'s threshold to its original value makes the regression go away.
Since the for_preempt parameter isn't used anymore, this patch deletes it.
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make the config visible, so we can choose from CONFIG_BLK_CGROUP=y
and CONFIG_BLK_CGROUP=m when CONFIG_IOSCHED_CFQ=m.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These two values are useful when debugging issues surrounding maximum
I/O size. Put them in sysfs with the rest of the queue limits.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
lcm() was defined to take integer-sized arguments. The supplied
arguments are multiplied, however, causing us to overflow given
sufficiently large input. That in turn led to incorrect optimal I/O
size reporting in some cases (RAID over RAID).
Switch lcm() over to unsigned long similar to gcd() and move the
function from blk-settings.c to lib.
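A short userspace demonstration of the difference (illustrative values; the kernel fix switches lcm() to unsigned long and keeps it alongside gcd() in lib/):

#include <stdio.h>

static unsigned long long gcd_ull(unsigned long long a, unsigned long long b)
{
	while (b) {
		unsigned long long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned int a = 3 * 1024 * 1024;	/* 3 MiB optimal I/O size */
	unsigned int b = 2 * 1024 * 1024;	/* 2 MiB optimal I/O size */

	/* multiply first in 32 bits: a * b wraps and the lcm is nonsense */
	unsigned int broken = (a * b) / (unsigned int)gcd_ull(a, b);

	/* divide first in a wide type: correct 6 MiB result */
	unsigned long long fixed = (unsigned long long)a / gcd_ull(a, b) * b;

	printf("broken=%u fixed=%llu\n", broken, fixed);
	return 0;
}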
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
doc: fix typo in comment explaining rb_tree usage
Remove fs/ntfs/ChangeLog
doc: fix console doc typo
doc: cpuset: Update the cpuset flag file
Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
Remove drivers/parport/ChangeLog
Remove drivers/char/ChangeLog
doc: typo - Table 1-2 should refer to "status", not "statm"
tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
devres/irq: Fix devm_irq_match comment
Remove reference to kthread_create_on_cpu
tree-wide: Assorted spelling fixes
tree-wide: fix 'lenght' typo in comments and code
drm/kms: fix spelling in error message
doc: capitalization and other minor fixes in pnp doc
devres: typo fix s/dev/devm/
Remove redundant trailing semicolons from macros
fix typo "definetly" -> "definitely" in comment
tree-wide: s/widht/width/g typo in comments
...
Fix trivial conflict in Documentation/laptops/00-INDEX
Modify the Block I/O cgroup subsystem to be able to be built as a module.
As the CFQ disk scheduler optionally depends on blk-cgroup, config options
in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
enhanced to support the new module dependency.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Constify struct sysfs_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
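As a representative example of the pattern, the block layer's queue attribute ops become (hedged sketch, names as used in block/blk-sysfs.c):

static const struct sysfs_ops queue_sysfs_ops = {
	.show	= queue_attr_show,
	.store	= queue_attr_store,
};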
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As the comment says the initial value of last_waited is never used, so
there is no need to initialise it with the current jiffies. Jiffies is
hot enough without accessing it for no reason.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Reorder cfq_rb_root to remove 8 bytes of padding on 64 bit builds.
Consequently removing 56 bytes from cfq_group and 64 bytes from
cfq_data.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal; the device can handle more requests, for example, AHCI can
handle 31 requests. I can understand the limit is for fairness, but we could
do a tweak: if the queue still has a lot of slice left, it sounds like we could
ignore the limit. Tests show this boosts my workload (two-thread randread of
an SSD) from 78m/s to 100m/s.
Thanks for suggestions from Corrado and Vivek for the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Counters for requests "in flight" and "in driver" are used asymmetrically
in cfq_may_dispatch, and have slightly different meanings.
We split the rq_in_flight counter (was sync_flight) to count both sync
and async requests, in order to use this one, which is more accurate in
some corner cases.
The rq_in_driver counter is coalesced, since individual sync/async counts
are not used any more.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ currently applies the same logic of detecting seeky queues and
grouping them together for rotational disks as well as SSDs.
For SSDs, the time to complete a request doesn't depend on the
request location, but only on the size.
This patch therefore changes the criterion to group queues by
request size in case of SSDs, in order to achieve better fairness.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Current seeky detection is based on average seek length.
This is suboptimal, since the average will not distinguish between:
* a process doing medium sized seeks
* a process doing some sequential requests interleaved with larger seeks
and even a medium seek can take a lot of time, if the requested sector
happens to be behind the disk head in the rotation (50% probability).
Therefore, we change the seeky queue detection to work as follows:
* each request can be classified as sequential if it is very close to
the current head position, i.e. it is likely in the disk cache (disks
usually read more data than requested, and put it in cache for
subsequent reads). Otherwise, the request is classified as seeky.
* a history window of the last 32 requests is kept, storing the
classification result.
* A queue is marked as seeky if more than 1/8 of the last 32 requests
were seeky (see the sketch below).
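A hedged userspace sketch of the scheme (the real patch keeps the window in a per-queue field; the constants here mirror the 32-request window and the 1/8 threshold described above):

#include <stdio.h>

#define SEEK_WINDOW	32
#define SEEKY_THRESH	(SEEK_WINDOW / 8)	/* more than 4 seeky samples => seeky */
#define CLOSE_DIST	(8 * 1024)		/* "close to the head" cutoff, illustrative */

int main(void)
{
	unsigned int history = 0;	/* one bit per recent request, 1 = seeky */
	unsigned long long head = 0;
	unsigned long long sectors[] = { 8, 16, 500000, 24, 900000, 32 };
	int seeky_count = 0;
	unsigned int i;

	for (i = 0; i < sizeof(sectors) / sizeof(sectors[0]); i++) {
		unsigned long long dist = sectors[i] > head ?
				sectors[i] - head : head - sectors[i];
		/* shift in 1 if the request is far from the head, 0 if close */
		history = (history << 1) | (dist > CLOSE_DIST);
		head = sectors[i];
	}

	for (i = 0; i < SEEK_WINDOW; i++)
		seeky_count += (history >> i) & 1;

	printf("queue is %s\n", seeky_count > SEEKY_THRESH ? "seeky" : "not seeky");
	return 0;
}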
This patch fixes a regression reported by Yanmin, on mmap 64k random
reads.
Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Except for SCSI no device drivers distinguish between physical and
hardware segment limits. Consolidate the two into a single segment
limit.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.
Also introduce a temporary wrapper for backwards compatibility. This can
be removed after the merge window is closed.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add a BLK_ prefix to block layer constants.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_queue_max_hw_sectors is no longer called by any subsystem and can be
removed.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Clarify blk_queue_max_sectors and update documentation.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There's no need to take a css reference here, for the caller
has already called rcu_read_lock() to prevent the cgroup from
being removed.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that the bio list management stuff is generic, convert
generic_make_request to use bio lists instead of its own private bio
list implementation.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This reverts commit fb1e75389b.
"Benjamin S." <sbenni@gmx.de> reports that the patch in question
causes a big drop in sequential throughput for him, dropping from
200MB/sec down to only 70MB/sec.
Needs to be investigated more fully, for now let's just revert the
offending commit.
Conflicts:
include/linux/blkdev.h
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This removes 8 bytes of padding from struct cfq_queue on 64 bit builds,
shrinking its size to 256 bytes, so it fits into one fewer cacheline and
allows one more object/slab in its kmem_cache.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
----
patch against 2.6.33-rc8
tested on x86_64 AMDX2
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In particular, several occurrences of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently we split seeky coop queues after 1s, which is too big. The patch
below sets the split_coop flag on a seeky coop queue after one slice. After
that, if new requests come in, the queues will be split. The patch was
suggested by Corrado.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
A few weeks back, Shaohua Li posted a similar patch. I am reposting it
with more test results.
This patch does two things.
- Do not idle on async queues.
- It also changes the write queue depth CFQ drives (cfq_may_dispatch()).
Currently, we seem to be driving a queue depth of 1 for WRITES. This is
true even if there is only one write queue in the system, and all the logic
of infinite queue depth in the case of a single busy queue, as well as slowly
increasing the queue depth based on the last delayed sync request, does not
seem to be kicking in at all.
This patch will allow deeper WRITE queue depths (subject to the other
WRITE queue depth constraints like cfq_quantum and the last delayed sync
request).
Shaohua Li had reported getting more out of his SSD. For me, I have got
one Lun exported from an HP EVA and when pure buffered writes are on, I
can get more out of the system. Following are test results of pure
buffered writes (with end_fsync=1) with vanilla and patched kernel. These
results are average of 3 sets of run with increasing number of threads.
AVERAGE[bufwfs][vanilla]
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bufwfs 3 1 0 0 95349 474141
bufwfs 3 2 0 0 100282 806926
bufwfs 3 4 0 0 109989 2.7301e+06
bufwfs 3 8 0 0 116642 3762231
bufwfs 3 16 0 0 118230 6902970
AVERAGE[bufwfs] [patched kernel]
-------
bufwfs 3 1 0 0 270722 404352
bufwfs 3 2 0 0 206770 1.06552e+06
bufwfs 3 4 0 0 195277 1.62283e+06
bufwfs 3 8 0 0 260960 2.62979e+06
bufwfs 3 16 0 0 299260 1.70731e+06
I also ran buffered writes along with some sequential and some buffered
reads going on in the system on a SATA disk, because the potential risk is
that we should not be driving the queue depth higher in the presence of
sync IO, in order to keep the max clat low.
With some random and sequential reads going on in the system on one SATA
disk I did not see any significant increase in max clat. So it looks like
other WRITE queue depth control logic is doing its job. Here are the
results.
AVERAGE[brr, bsr, bufw together] [vanilla]
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
brr 3 1 850 546345 0 0
bsr 3 1 14650 729543 0 0
bufw 3 1 0 0 23908 8274517
brr 3 2 981.333 579395 0 0
bsr 3 2 14149.7 1175689 0 0
bufw 3 2 0 0 21921 1.28108e+07
brr 3 4 898.333 1.75527e+06 0 0
bsr 3 4 12230.7 1.40072e+06 0 0
bufw 3 4 0 0 19722.3 2.4901e+07
brr 3 8 900 3160594 0 0
bsr 3 8 9282.33 1.91314e+06 0 0
bufw 3 8 0 0 18789.3 23890622
AVERAGE[brr, bsr, bufw mixed] [patched kernel]
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
brr 3 1 837 417973 0 0
bsr 3 1 14357.7 591275 0 0
bufw 3 1 0 0 24869.7 8910662
brr 3 2 1038.33 543434 0 0
bsr 3 2 13351.3 1205858 0 0
bufw 3 2 0 0 18626.3 13280370
brr 3 4 913 1.86861e+06 0 0
bsr 3 4 12652.3 1430974 0 0
bufw 3 4 0 0 15343.3 2.81305e+07
brr 3 8 890 2.92695e+06 0 0
bsr 3 8 9635.33 1.90244e+06 0 0
bufw 3 8 0 0 17200.3 24424392
So it looks like it might make sense to include this patch.
Thanks
Vivek
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
merges at all are to be attempted (not even the simple one-hit cache).
The following table illustrates the additional benefit - 5 minute runs of
a random I/O load were applied to a dozen devices on a 16-way x86_64 system.
nomerges Throughput %System Improvement (tput / %sys)
-------- ------------ ----------- -------------------------
0 12.45 MB/sec 0.669365609
1 12.50 MB/sec 0.641519199 0.40% / 2.71%
2 12.52 MB/sec 0.639849750 0.56% / 2.96%
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In cfq_should_preempt(), we currently allow some cases where a non-RT request
can preempt an ongoing RT cfqq timeslice. This should not happen.
Examples include:
o A sync_noidle wl type non-RT request pre-empting a sync_noidle wl type cfqq
on which we are idling.
o Once we have per-cgroup async queues, a non-RT sync request pre-empting a RT
async cfqq.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
All callers of the stacking functions use 512-byte sector units rather
than byte offsets. Simplify the code so the stacking functions take
sectors when specifying data offsets.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DM does not want to know about partition offsets. Add a partition-aware
wrapper that DM can use when stacking block devices.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Discard alignment reporting for partitions was incorrect. Update to
match the algorithm used elsewhere.
The alignment can be negative (misaligned). Fix format string
accordingly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The top device misalignment flag would not be set if the added bottom
device was already misaligned as opposed to causing a stacking failure.
Also massage the reporting so that an error is only returned if adding
the bottom device caused the misalignment. I.e. don't return an error
if the top is already flagged as misaligned.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
queue_sector_alignment_offset returned the wrong value which caused
partitions to report an incorrect alignment_offset. Since offset
alignment calculation is needed several places it has been split into a
separate helper function. The topology stacking function has been
updated accordingly.
Furthermore, comments have been added to clarify how the stacking
function works.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
seek_mean can sometimes be very big; using it as the close criterion is then
meaningless, as it doesn't improve performance. So if it's big, let's fall
back to the default value.
Reviewed-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The stacking code incorrectly scaled up the data offset in some cases
causing misaligned devices to report alignment. Rewrite the stacking
algorithm to remedy this and apply the same alignment principles to the
discard handling.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o CFQ now internally divides cfq queues into three workload categories:
sync-idle, sync-noidle and async. Which workload to run depends primarily
on the rb_key offset across the three service trees, which is a combination
of multiple things including the time at which the queue was queued on the
service tree.
There is one exception though: if we switched the prio class, say we served
some RT tasks and then started serving the BE class again, then within the
BE class we always started with the sync-noidle workload, irrespective of the
rb_key offset in the service trees.
This can provide better latencies for the sync-noidle workload in the presence
of RT tasks.
o This patch gets rid of that exception, so the workload to run within a class
always depends on the lowest rb_key across the service trees. The reason is
that we now have multiple BE class groups, and if we always switch to the
sync-noidle workload within a group, we can potentially starve a sync-idle
workload within that group. The same is true for the async workload, which
will be in the root group. Also, the workload switching within a group would
become very unpredictable, as it would depend on whether some RT workload was
running in the system or not.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Currently the code does not seem to be using cfqd->nr_groups. Get rid of it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o allow_merge() already checks whether the submitting task is pointing to the
same cfqq as the rq has been queued in. If everything is fine, we should not
have a task in one cgroup holding a pointer to a cfqq in another cgroup.
In some situations it can happen, though: when a random IO queue has been
moved into the root cgroup with group_isolation=0. In this case the task's
cgroup/group is different from where the cfqq actually is, but this is
intentional and merging should be allowed.
The second situation is where, due to the close cooperator patches, multiple
processes can be sharing a cfqq. If everything is implemented right, we should
not end up in a situation where tasks from different processes in different
groups share the same cfqq, as we allow merging of cooperating queues only if
they are in the same group.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit 86b3728141 adds a check for
misaligned stacking offsets, but it's buggy since the defaults are 0.
Hence all dm devices that pass in a non-zero starting offset will
be marked as misaligned and dm will complain.
A real fix is coming; in the meantime, disable the discard granularity
check so that users don't see dm reporting misaligned devices.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block:
cfq: set workload as expired if it doesn't have any slice left
Fix a CFQ crash in "for-2.6.33" branch of block tree
cfq: Remove wait_request flag when idle time is being deleted
cfq-iosched: commenting non-obvious initialization
cfq-iosched: Take care of corner cases of group losing share due to deletion
cfq-iosched: Get rid of cfqq wait_busy_done flag
cfq: Optimization for close cooperating queue searching
block,xd: Delay allocation of DMA buffers until device is known
drbd: Following the hmac change to SHASH (see linux commit 8bd1209cff)
cfq-iosched: reduce write depth only if sync was delayed
When a group is resumed, if it doesn't have any workload slice left,
we should mark workload_expires as expired. Otherwise, we might
mistakenly start from where we left off in the previous group.
Thanks to Corrado for the idea.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I think my previous patch introduced a bug which can lead to CFQ hitting
BUG_ON().
The offending commit in the for-2.6.33 branch is:
commit 7667aa0630
Author: Vivek Goyal <vgoyal@redhat.com>
Date: Tue Dec 8 17:52:58 2009 -0500
cfq-iosched: Take care of corner cases of group losing share due to deletion
While doing some stress testing on my box, I encountered the following.
login: [ 3165.148841] BUG: scheduling while
atomic: swapper/0/0x10000100
[ 3165.149821] Modules linked in: cfq_iosched dm_multipath qla2xxx igb
scsi_transport_fc dm_snapshot [last unloaded: scsi_wait_scan]
[ 3165.149821] Pid: 0, comm: swapper Not tainted
2.6.32-block-for-33-merged-new #3
[ 3165.149821] Call Trace:
[ 3165.149821] <IRQ> [<ffffffff8103fab8>] __schedule_bug+0x5c/0x60
[ 3165.149821] [<ffffffff8103afd7>] ? __wake_up+0x44/0x4d
[ 3165.149821] [<ffffffff8153a979>] schedule+0xe3/0x7bc
[ 3165.149821] [<ffffffff8103a796>] ? cpumask_next+0x1d/0x1f
[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
[cfq_iosched]
[ 3165.149821] [<ffffffff810422d8>] __cond_resched+0x2a/0x35
[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
[cfq_iosched]
[ 3165.149821] [<ffffffff8153b1ee>] _cond_resched+0x2c/0x37
[ 3165.149821] [<ffffffff8100e2db>] is_valid_bugaddr+0x16/0x2f
[ 3165.149821] [<ffffffff811e4161>] report_bug+0x18/0xac
[ 3165.149821] [<ffffffff8100f1fc>] die+0x39/0x63
[ 3165.149821] [<ffffffff8153cde1>] do_trap+0x11a/0x129
[ 3165.149821] [<ffffffff8100d470>] do_invalid_op+0x96/0x9f
[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
[cfq_iosched]
[ 3165.149821] [<ffffffff81034b4d>] ? enqueue_task+0x5c/0x67
[ 3165.149821] [<ffffffff8103ae83>] ? task_rq_unlock+0x11/0x13
[ 3165.149821] [<ffffffff81041aae>] ? try_to_wake_up+0x292/0x2a4
[ 3165.149821] [<ffffffff8100c935>] invalid_op+0x15/0x20
[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
[cfq_iosched]
[ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
[ 3165.149821] [<ffffffff811d8c2a>] blk_peek_request+0x191/0x1a7
[ 3165.149821] [<ffffffff811e5b8d>] ? kobject_get+0x1a/0x21
[ 3165.149821] [<ffffffff812c8d4c>] scsi_request_fn+0x82/0x3df
[ 3165.149821] [<ffffffff8110b2de>] ? bio_fs_destructor+0x15/0x17
[ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
[ 3165.149821] [<ffffffff811d931f>] __blk_run_queue+0x42/0x71
[ 3165.149821] [<ffffffff811d9403>] blk_run_queue+0x26/0x3a
[ 3165.149821] [<ffffffff812c8761>] scsi_run_queue+0x2de/0x375
[ 3165.149821] [<ffffffff812b60ac>] ? put_device+0x17/0x19
[ 3165.149821] [<ffffffff812c92d7>] scsi_next_command+0x3b/0x4b
[ 3165.149821] [<ffffffff812c9b9f>] scsi_io_completion+0x1c9/0x3f5
[ 3165.149821] [<ffffffff812c3c36>] scsi_finish_command+0xb5/0xbe
I think I have hit the following BUG_ON() in cfq_dispatch_request().
BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
Please find attached the patch to fix it. I have done some stress testing
with it and have not seen it happening again.
o We should wait on a queue even after slice expiry only if it is empty. If
the queue is not empty, then continue to expire it.
o If we decide to keep the queue, then set cfqq=NULL. Otherwise select_queue()
will return a valid cfqq and cfq_dispatch_request() can hit the following
BUG_ON().
BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list))
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove wait_request flag when idle time is being deleted, otherwise
it'll hit this path every time when a request is enqueued.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Added a comment to explain the initialization of last_delayed_sync.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If there is a sequential reader running in a group, we wait for the next
request to come in that group after slice expiry, and once the new request is
in, we expire the queue. Otherwise we delete the group from the service tree
and the group loses its fair share.
So far I was marking a queue as wait_busy if it had consumed its slice and it
was the last queue in the group. But this condition did not cover the
following two cases:
1. A request completed and the slice has not expired yet. The next request
comes in and is dispatched to disk. Now select_queue() hits and the slice has
expired. This group will be deleted. Because the request is still in the
disk, this queue will never get a chance to wait_busy.
2. A request completed and the slice has not expired yet. Before the next
request comes in (delay due to think time), select_queue() hits and expires
the queue, and hence the group. This queue never got a chance to wait busy.
Gui was hitting boundary condition 1 and not getting fairness numbers
proportional to weight.
This patch adds checks for the above two conditions and improves the fairness
numbers for a sequential workload on rotational media. The check in
select_queue() takes care of case 1 and an additional check in
should_wait_busy() takes care of case 2.
Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Get rid of the wait_busy_done flag. This flag only tells us that we were
doing wait busy on a queue and that the queue got a request, so expire it.
That information can easily be obtained from (cfq_cfqq_wait_busy() &&
queue_is_not_empty). So remove this flag and keep the code simple.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It doesn't make any sense to try to find a close cooperating
queue if the current cfqq is the only one in the group.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The introduction of ramp-up formula for async queue depths has
slowed down dirty page reclaim, by reducing async write performance.
This patch makes sure the formula kicks in only when a sync request
was recently delayed.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fix a crash during boot reported by Jeff Moyer. Fix the issue of accessing
cfqq after freeing it.
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
After the merge of the IO controller patches, booting on my megaraid
box ran much slower. Vivek Goyal traced it down to megaraid discovery
creating tons of devices, each suffering a grace period when they later
kill that queue (if no device is found).
So let's use call_rcu() to batch these deferred frees, instead of taking
the grace period hit for each one.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Now issues of blkio controller and CFQ in module mode should be fixed.
Enable the cfq group scheduling support in module mode.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o One of the goals of the block IO controller is that it should be able to
support multiple IO control policies, some of which may be operational at a
higher level in the storage hierarchy.
o To begin with, we had one IO controlling policy implemented by CFQ, and
I hard-coded the CFQ functions called by blkio. This created issues when
CFQ is compiled as a module.
o This patch implements basic dynamic IO controlling policy registration
functionality in blkio. This is similar to the elevator functionality where
IO schedulers register their functions dynamically.
o In the future, when more IO controlling policies are implemented, these
can dynamically register with the block IO controller.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o blkio controller is inside the kernel and cfq makes use of interfaces
exported by blkio. CFQ can be a module too, hence export symbols used
by CFQ.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With CLONE_IO, parent's io_context->nr_tasks is incremented, but never
decremented whenever copy_process() fails afterwards, which prevents
exit_io_context() from calling IO schedulers exit functions.
Give a task_struct to exit_io_context(), and call exit_io_context() instead of
put_io_context() in copy_process() cleanup path.
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With CLONE_IO, copy_io() increments both ioc->refcount and ioc->nr_tasks.
However exit_io_context() only decrements ioc->refcount if ioc->nr_tasks
reaches 0.
Always call put_io_context() in exit_io_context().
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_arm_slice_timer() has logic to disable the idle window for SSD devices.
The same thing should be done in cfq_select_queue() too, otherwise we will
still see the idle window. This makes the nonrot check logic consistent in cfq.
In tests on an Intel SSD with the low_latency knob off, the patch below can
triple disk throughput for multi-threaded sequential reads.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o rq_noidle() is supposed to tell cfq not to expect a request after this
one, hence don't idle. But this does not seem to work very well. For example,
for direct random readers, rq_noidle = 1 but there is a next request coming
after this one. Not idling leads to a group not getting its share even if
group_isolation=1.
o The right solution for this issue is to scan the higher layers and set the
right flag (WRITE_SYNC or WRITE_ODIRECT). For the time being, this single-line
fix helps. It should not have any significant impact when we are not using
cgroups. I will later figure out the IO paths in the higher layers and fix it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o If a group is running only a random reader, then it will not have enough
traffic to keep the disk busy and we will reduce overall throughput. This
should result in better latencies for the random reader, though. If we don't
idle on the random reader service tree, then this random reader will
experience large latencies if there are other groups present in the system
with sequential readers running in them.
o One solution suggested by Corrado is to keep random readers, i.e. the
sync-noidle workload, in the root group by default, so that during one
dispatch round we idle only once on the sync-noidle tree. This means that
all sync-idle workload queues will be in their respective groups and we will
see service differentiation for those, but not for the sync-noidle workload.
o Provide a tunable, group_isolation. If set, it makes sure that even
sync-noidle queues go in their respective groups and we wait on them. This
provides stronger isolation between groups, but at the expense of throughput
if a group does not have enough traffic to keep the disk busy.
o By default group_isolation = 0
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Async queues are not per group. Instead they are system wide and maintained
in the root group. Hence their workload slice length should be calculated
based on the total number of queues in the system and not just the queues in
the root group.
o As the root group's default weight is 1000, make sure to charge async
queues more in terms of vtime so that they do not get more disk time just
because the root group has a higher weight.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o If a queue consumes its slice and then gets deleted from the service tree,
its associated group will also get deleted from the service tree if this was
the only queue in the group. That will make the group lose its share.
o For queues on which we have idling enabled, if they have used up their
slice, wait a bit for them to get backlogged again and then expire them so
that the group does not lose its share.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o If a task changes cgroup, drop the reference to the cfqq associated with the
io context and set the cfqq pointer stored in the ioc to NULL so that upon
the next request arrival we will allocate a new queue in the new group.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Do not allow the following three operations across groups, for isolation:
- selection of co-operating queues
- preemptions across groups
- request merging across groups
o Async queues are currently global and not per group. Allow preemption of
an async queue if a sync queue in another group gets backlogged.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Export the disk time and sectors used by a group to user space through the
cgroup interface.
o Also export a "dequeue" interface to cgroup which keeps track of how many
times a group was deleted from the service tree. Helps in debugging.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o One can choose to change elevator or delete a cgroup. Implement group
reference counting so that both elevator exit and cgroup deletion can
take place gracefully.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Determine the cgroup the IO-submitting task belongs to and create the cfq
group if it does not already exist.
o Also link cfqq and associated cfq group.
o Currently all async IO is mapped to root group.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o This patch introduces the functionality to account group time when a queue
expires. The time used decides which group goes next.
o Also introduce the functionality to save and restore the workload type
context within a group. It might happen that once we expire the cfq queue
and group, a different group will schedule in and we will lose the workload
type context. Hence save and restore it upon queue expiry.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o So far we had a 300ms soft target latency system wide. Now, with the
introduction of cfq groups, divide that latency by the number of groups so
that one can come up with a group target latency, which is helpful in
determining the workload slice within a group and also the dynamic
slice length of the cfq queue.
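As a rough illustration of the arithmetic (not the actual cfq code; the
floor value is an assumption), the per-group target latency could be derived
like this:

#define TARGET_LATENCY_MS	300	/* system-wide soft target */
#define MIN_GROUP_LATENCY_MS	25	/* assumed lower bound */

static unsigned int group_target_latency_ms(unsigned int nr_groups)
{
	unsigned int lat = TARGET_LATENCY_MS / (nr_groups ? nr_groups : 1);

	return lat < MIN_GROUP_LATENCY_MS ? MIN_GROUP_LATENCY_MS : lat;
}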
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Bring in the per cfq group weight and how vdisktime is calculated for the
group. Also bring in the functionality of updating the min_vdisktime of
the group service tree.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o This is basic implementation of blkio controller cgroup interface. This is
the common interface visible to user space and should be used by different
IO control policies as we implement those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o So far we just had one cfq_group in cfq_data. To create space for more than
one cfq_group, we need to have a service tree of groups where all the groups
can be queued if they have active cfq queues backlogged in them.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Currently cfq deletes a queue from the service tree if it is empty (even if
we might idle on the queue). This patch keeps the queue on the service tree,
so the associated group remains on the service tree, until we decide that
we are not going to idle on the queue and expire it.
o This just helps with time accounting for the queue/group and with the
implementation of the rest of the patches.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Implement a macro to traverse each service tree in the group. This avoids
the use of a double for loop and a special condition for the idle tree four
times.
o The macro is a little twisted because of the special handling of the idle
class service tree.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o This patch introduces the notion of cfq groups. Soon we will be able to
have multiple groups of different weights in the system.
o The various service trees (prioclass and workload type trees) will become
per cfq group. So the hierarchy looks as follows:
cfq_groups
|
workload type
|
cfq queue
o When a scheduling decision has to be taken, first we select the cfq group,
then the workload within the group and then the cfq queue within the workload
type.
o This patch just makes the various workload service trees per cfq group and
introduces the function to choose a group for scheduling.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o must_dispatch flag should be set only if we decided not to run the queue
and dispatch the request.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since commit 2f5cb7381b, each queue can send
up to 4 * 4 requests if only one queue exists. I wonder why we have such a
limit. A device that supports tagging can send more requests; for example,
AHCI can send 31 requests. A test (direct aio randread) shows the limit
reduces disk throughput by about 4%.
On the other hand, since we send one request at a time, if another queue pops
up while the current one is sending more than cfq_quantum requests, the
current queue will stop sending requests soon after one request, so it sounds
like there is no big latency.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The discard ioctl is used by mkfs utilities to clear a block device
prior to putting metadata down. However, not all devices return zeroed
blocks after a discard. Some drives return stale data, potentially
containing old superblocks. It is therefore important to know whether
discarded blocks are properly zeroed.
Both ATA and SCSI drives have configuration bits that indicate whether
zeroes are returned after a discard operation. Implement a block level
interface that allows this information to be bubbled up the stack and
queried via a new block device ioctl.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This reverts commit 3586e917f2.
Corrado Zoccolo <czoccolo@gmail.com> correctly points out, that we need
consistency of rb_key offset across groups. This means we cannot properly
use the per-service_tree service count. Revert this change.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Idling logic was disabled in some corner cases, leading to unfair share
for noidle queues.
* the idle timer was not armed if there were other requests in the
driver. Unfortunately, those requests could come from other workloads,
or from queues for which we don't enable idling. So we will check only
pending requests from the active queue.
* rq_noidle check on no-idle queue could disable the end of tree idle if
the last completed request was rq_noidle. Now, we will disable that
idle only if all the queues served in the no-idle tree had rq_noidle
requests.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Seeky sync queues with large depth can gain unfairly big share of disk
time, at the expense of other seeky queues. This patch ensures that
idling will be enabled for queues with I/O depth at least 4, and small
think time. The decision to enable idling is sticky, until an idle
window times out without seeing a new request.
The reasoning behind the decision is that, if an application is using
large I/O depth, it is already optimized to make full utilization of
the hardware, and therefore we reserve a slice of exclusive use for it.
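A minimal sketch of that decision, with assumed field names and thresholds
(not the actual cfq code): the deep flag is set once depth and think time
qualify, and stays set until an idle window expires without a new request.

#include <stdbool.h>

#define DEEP_DEPTH		4
#define SMALL_THINK_TIME_US	1000	/* assumed cutoff */

struct queue_state {
	unsigned int in_flight;		/* requests currently dispatched */
	unsigned long think_time_us;	/* mean think time */
	bool deep;			/* sticky "deep queue" flag */
};

static bool should_idle_for_deep_queue(struct queue_state *q)
{
	if (q->in_flight >= DEEP_DEPTH && q->think_time_us < SMALL_THINK_TIME_US)
		q->deep = true;
	return q->deep;
}

/* Called when an idle window times out without seeing a new request. */
static void idle_window_expired(struct queue_state *q)
{
	q->deep = false;
}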
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
An incoming no-idle queue should preempt the active no-idle queue
only if the active queue is idling due to service tree empty.
Previous code was buggy in two ways:
* it relied on service_tree field to be set on the active queue, while
it is not set when the code is idling for a new request
* it didn't check for the service tree empty condition, so could lead to
LIFO behaviour if multiple queues with depth > 1 were preempting each
other on an non-NCQ device.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ's detection of queueing devices initially assumes a queuing device
and detects if the queue depth reaches a certain threshold.
However, it will reconsider this choice periodically.
Unfortunately, if the device is considered non-queuing, CFQ will force a
unit queue depth for some workloads, thus defeating the detection logic.
This leads to poor performance on queuing hardware,
since the idle window remains enabled.
Given this premise, switching to hw_tag = 0 after we have proved at
least once that the device is NCQ capable is not a good choice.
The new detection code starts in an indeterminate state, in which CFQ behaves
as if hw_tag = 1, and then, if for a long observation period we never saw
large depth, we switch to hw_tag = 0, otherwise we stick to hw_tag = 1,
without reconsidering it again.
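Roughly, the new state machine looks like the sketch below (illustrative
only; constants and names are assumptions, not the cfq implementation):

enum hw_tag_state { HW_TAG_UNKNOWN = -1, HW_TAG_OFF = 0, HW_TAG_ON = 1 };

#define DEPTH_THRESHOLD		4	/* assumed "large depth" cutoff */
#define OBSERVATION_SAMPLES	256	/* assumed observation period */

static int update_hw_tag(enum hw_tag_state *state, unsigned int depth,
			 unsigned int *samples)
{
	if (*state != HW_TAG_UNKNOWN)
		return *state;			/* the decision is final */

	if (depth >= DEPTH_THRESHOLD) {
		*state = HW_TAG_ON;		/* proved queuing capable once */
		return 1;
	}

	if (++(*samples) >= OBSERVATION_SAMPLES)
		*state = HW_TAG_OFF;		/* never saw a large depth */

	/* while undecided, behave as if hw_tag = 1 */
	return *state == HW_TAG_OFF ? 0 : 1;
}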
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_should_idle returns false for no-idle queues that are not the last,
so the control flow will never reach the removed code in a state that
satisfies the if condition.
The unreachable code was added to emulate previous cfq behaviour for
non-NCQ rotational devices. My tests show that even without it, the
performance and fairness are comparable with the previous cfq, thanks to
the fact that all seeky queues are grouped together, and that we idle at
the end of the tree.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Mtdblock driver doesn't call flush_dcache_page for pages in request. So,
this causes problems on architectures where the icache doesn't fill from
the dcache or with dcache aliases. The patch fixes this.
The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
pointless empty cache-thrashing loops on architectures for which
flush_dcache_page() is a no-op. Every architecture provides this symbol;
pages are flushed on architectures where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
equals 1, and nothing is done otherwise.
See "fix mtd_blkdevs problem with caches on some architectures" discussion
on LKML for more information.
Signed-off-by: Ilya Loginov <isloginov@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Peter Horton <phorton@bitbox.co.uk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For the moment, different workload cfq queues are put into different
service trees. But CFQ still uses "busy_queues" to estimate rb_key
offset when inserting a cfq queue into a service tree. I think this
isn't appropriate, and it should make use of the service tree count to do
this estimation. This patch is for the for-2.6.33 branch.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Use HZ-independent calculation of milliseconds.
Add jiffies.h where it was missing since functions or macros
from it are used.
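As a small example of the kind of change this refers to (a sketch, not a
quote from the patch), a timeout that hard-codes a jiffies count only works
for HZ=1000, while msecs_to_jiffies() from <linux/jiffies.h> is correct for
any HZ:

#include <linux/jiffies.h>

/* Before: correct only when HZ == 1000. */
static unsigned long timeout_hardcoded(void)
{
	return jiffies + 100;
}

/* After: 100ms regardless of the configured HZ. */
static unsigned long timeout_hz_independent(void)
{
	return jiffies + msecs_to_jiffies(100);
}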
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
While SSDs track block usage on a per-sector basis, RAID arrays often
have allocation blocks that are bigger. Allow the discard granularity
and alignment to be set and teach the topology stacking logic how to
handle them.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Cfq has a bug in computation of next_rq, that affects transition
between multiple sequential request streams in a single queue
(e.g.: two sequential buffered writers of the same priority),
causing the alternation between the two streams for a transient period.
8,0 1 18737 0.260400660 5312 D W 141653311 + 256
8,0 1 20839 0.273239461 5400 D W 141653567 + 256
8,0 1 20841 0.276343885 5394 D W 142803919 + 256
8,0 1 20843 0.279490878 5394 D W 141668927 + 256
8,0 1 20845 0.292459993 5400 D W 142804175 + 256
8,0 1 20847 0.295537247 5400 D W 141668671 + 256
8,0 1 20849 0.298656337 5400 D W 142804431 + 256
8,0 1 20851 0.311481148 5394 D W 141668415 + 256
8,0 1 20853 0.314421305 5394 D W 142804687 + 256
8,0 1 20855 0.318960112 5400 D W 142804943 + 256
The fix makes sure that the next_rq is computed from the last
dispatched request, and not affected by merging.
8,0 1 37776 4.305161306 0 D W 141738087 + 256
8,0 1 37778 4.308298091 0 D W 141738343 + 256
8,0 1 37780 4.312885190 0 D W 141738599 + 256
8,0 1 37782 4.315933291 0 D W 141738855 + 256
8,0 1 37784 4.319064459 0 D W 141739111 + 256
8,0 1 37786 4.331918431 5672 D W 142803007 + 256
8,0 1 37788 4.334930332 5672 D W 142803263 + 256
8,0 1 37790 4.337902723 5672 D W 142803519 + 256
8,0 1 37792 4.342359774 5672 D W 142803775 + 256
8,0 1 37794 4.345318286 0 D W 142804031 + 256
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Quiet sparse noise about symbols not being declared.
Symbol blk_default_cmd_filter is only used locally and should be static.
The function blk_scsi_ioctl_init() is a fs_initcall and should also be
static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We need to rework this logic post the cooperating cfq_queue merging,
for now just get rid of it and Jeff Moyer will fix the fall out.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ has an optimization for cooperating applications: if several io contexts
have close requests, they get a boost. But the optimization can be abused.
Consider threads a and b, which work on one file. a reads sectors s, s+2,
s+4, ...; b reads sectors s+1, s+3, s+5, ... Both a and b are sequential
readers, so they can open the idle window. a reads sector s, goes into the
idle window and wakes up b. b reads sector s+1; since in the current
implementation cfq_should_preempt() thinks a and b are cooperators, b will
preempt a. b then reads sector s+1, goes into the idle window and wakes up a.
For the same reason, a will preempt b and read s+2. a and b will continue
this cycle. The cycle can be very long, and a and b will occupy the whole
disk queue, so other applications will have nearly no chance to run.
Fix this by limiting coop preemption until a queue is scheduled normally
again.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit a6151c3a5c inadvertently reversed
a preempt condition check, potentially causing a performance regression.
Make the meta check correct again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently no-idle queues in cfq are not serviced fairly:
even if they can only dispatch a small number of requests at a time,
they have to compete with idling queues to be serviced, experiencing
large latencies.
We should notice, instead, that no-idle queues are the ones that would
benefit most from having low latency, in fact they are any of:
* processes with large think times (e.g. interactive ones like file
managers)
* seeky (e.g. programs faulting in their code at startup)
* or marked as no-idle from upper levels, to improve latencies of those
requests.
This patch improves the fairness and latency for those queues, by:
* separating sync idle, sync no-idle and async queues in separate
service_trees, for each priority
* service all no-idle queues together
* and idling when the last no-idle queue has been serviced, to
anticipate for more no-idle work
* the timeslices allotted for idle and no-idle service_trees are
computed proportionally to the number of processes in each set.
Servicing all no-idle queues together should have a performance boost
for NCQ-capable drives, without compromising fairness.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq can disable idling for queues in various circumstances.
When workloads of different priorities are competing, if the higher
priority queue has idling disabled, lower priority queues may steal
its disk share. For example, in a scenario with an RT process
performing seeky reads vs a BE process performing sequential reads,
on an NCQ enabled hardware, with low_latency unset,
the RT process will dispatch only the few pending requests every full
slice of service for the BE process.
The patch solves this issue by always performing idle on the last
queue at a given priority class > idle. If the same process, or one
that can pre-empt it (so at the same priority or higher), submits a
new request within the idle window, the lower priority queue won't
dispatch, saving the disk bandwidth for higher priority ones.
Note: this doesn't touch the non_rotational + NCQ case (no hardware
to test if this is a benefit in that case).
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We use different service trees for different priority classes.
This allows a simplification in the service tree insertion code, that no
longer has to consider priority while walking the tree.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We embed a pointer to the service tree in each queue, to handle multiple
service trees easily.
Service trees are enriched with a counter.
cfq_add_rq_rb is invoked after putting the rq in the fifo, to ensure
that all fields in rq are properly initialized.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When the number of processes performing I/O concurrently increases,
a fixed time slice per process will cause large latencies.
This patch, if low_latency mode is enabled, will scale the time slice
assigned to each process according to a 300ms target latency.
In order to keep fairness among processes:
* The number of active processes is computed using a special form of
running average that quickly follows sudden increases (to keep latency low)
and decreases slowly (to have fairness in spite of rapid decreases of this
value).
To safeguard sequential bandwidth, we impose a minimum time slice
(computed using 2*cfq_slice_idle as base, adjusted according to priority
and async-ness).
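An illustrative sketch of the scaling (names and base values are assumed,
this is not the cfq implementation): shrink the slice so that the busy
processes fit the 300ms target, but never below the 2*cfq_slice_idle floor.

#define TARGET_LATENCY_MS	300
#define BASE_SLICE_MS		100	/* assumed default sync slice */
#define SLICE_IDLE_MS		8	/* assumed cfq_slice_idle */

static unsigned int scaled_slice_ms(unsigned int busy_processes, int low_latency)
{
	unsigned int slice = BASE_SLICE_MS;
	unsigned int min_slice = 2 * SLICE_IDLE_MS;

	if (low_latency && busy_processes) {
		unsigned int fair = TARGET_LATENCY_MS / busy_processes;

		if (fair < slice)
			slice = fair;
		if (slice < min_slice)
			slice = min_slice;	/* safeguard sequential bandwidth */
	}
	return slice;
}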
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the active queue doesn't have enough requests and the idle window opens,
cfq will not dispatch sufficient requests to the hardware. In such a
situation the current code will zero hw_tag. But this is because cfq doesn't
dispatch enough requests, not because the hardware queue doesn't work. Don't
zero hw_tag in this case.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_queues are merged if they are issuing requests within the mean seek
distance of one another. This patch detects when the cooperating stops and
breaks the queues back up.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The flag used to indicate that a cfqq was allowed to jump ahead in the
scheduling order due to submitting a request close to the queue that
just executed. Since closely cooperating queues are now merged, the flag
holds little meaning. Change it to indicate that multiple queues were
merged. This will later be used to allow the breaking up of merged queues
when they are no longer cooperating.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When cooperating cfq_queues are detected currently, they are allowed to
skip ahead in the scheduling order. It is much more efficient to
automatically share the cfq_queue data structure between cooperating processes.
Performance of the read-test2 benchmark (which is written to emulate the
dump(8) utility) went from 12MB/s to 90MB/s on my SATA disk. NFS servers
with multiple nfsd threads also saw performance increases.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
async cfq_queue's are already shared between processes within the same
priority, and forthcoming patches will change the mapping of cic to sync
cfq_queue from 1:1 to 1:N. So, calculate the seekiness of a process
based on the cfq_queue instead of the cfq_io_context.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With 2.6.32-rc5 in a KVM guest using dm and virtio_blk, we see the
following errors:
end_request: I/O error, dev vda, sector 0
end_request: I/O error, dev vda, sector 0
The errors go away if dm stops submitting empty barriers, by reverting:
commit 52b1fd5a27
Author: Mikulas Patocka <mpatocka@redhat.com>
dm: send empty barriers to targets in dm_flush
We should silently error all barriers, even empty barriers, on devices
like virtio_blk which don't support them.
See also:
https://bugzilla.redhat.com/514901
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the average think time is larger than the remaining time slice
for any given queue, don't allow it to idle. A successful idle also
means that we need to dispatch and complete a request, so if we don't
even have time left for the idle process, we would overrun the slice
in any case.
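The check amounts to something like the sketch below (names assumed, not
the actual cfq code):

#include <stdbool.h>

static bool worth_idling(unsigned long mean_think_time_us,
			 unsigned long slice_remaining_us)
{
	/* An idle period that outlives the slice would overrun it anyway. */
	return mean_think_time_us <= slice_remaining_us;
}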
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Saves 16 bytes of text, woohoo. But the more important point is
that it makes the code more readable when returning bool for 0/1
cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ enables idle only for processes that think less than the allowed
idle time. Since idle time is lower for seeky queues, we should use the
correct value in the comparison.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We should subtract the slice residual from the rb tree key, since
a negative residual count indicates that the cfqq overran its slice
the last time. Hence we want to add the overrun time, to position
it a bit further away in the service tree.
Reported-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit a9327cac44 added separate read
and write statistics for in_flight requests, and exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It was briefly introduced to allow CFQ to do delayed scheduling,
but we ended up removing that feature again. So let's kill the
function and its export, and just switch CFQ back to the normal work
schedule, since it now passes in a '0' delay from all call
sites.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The RR service tree is indexed by a key that is relative to current jiffies.
This can cause problems on jiffies wraparound.
The patch fixes it by using a time_before comparison, and by changing
the add_front path to use a relative number, too.
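For illustration, a wraparound-safe comparison uses time_before() from
<linux/jiffies.h>, which works modulo the word size, instead of a plain
less-than on absolute jiffies values:

#include <linux/jiffies.h>

static int key_is_earlier(unsigned long key_a, unsigned long key_b)
{
	/* A plain "key_a < key_b" breaks across a jiffies wraparound. */
	return time_before(key_a, key_b);
}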
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq uses rq->start_time as the fifo indicator, but that field may
get modified prior to cfq doing its fifo list adjustment when
a request gets merged with another request. This can cause the
fifo list to become unordered.
Reported-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We cannot delay for the first dispatch of the async queue if it
hasn't dispatched at all, since that could present a local user
DoS attack vector using an app that just did slow timed sync reads
while filling memory.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Not all users of the topology information want to use libblkid. Provide
the topology information through bdev ioctls.
Also clarify sector size comments for existing BLK ioctls.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Don't think that's necessarily a perfect description of what this
option fiddles with, but it's probably better than 'desktop'.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This slowly ramps up the async queue depth based on the time
passed since the sync IO, and doesn't allow async at all until
a sync slice period has passed.
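Roughly, the idea is the sketch below (values and names are assumptions,
not the cfq code): no async dispatch until a sync slice has passed since
the last sync IO, then a depth that grows with the elapsed time.

#define SYNC_SLICE_MS	100	/* assumed sync slice length */
#define MAX_ASYNC_DEPTH	4	/* assumed async depth ceiling */

static unsigned int allowed_async_depth(unsigned long ms_since_last_sync)
{
	unsigned int depth;

	if (ms_since_last_sync < SYNC_SLICE_MS)
		return 0;			/* no async right after sync IO */

	depth = ms_since_last_sync / SYNC_SLICE_MS;	/* ramp up slowly */
	return depth > MAX_ASYNC_DEPTH ? MAX_ASYNC_DEPTH : depth;
}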
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Do not allow more than max_dispatch requests from an async queue, if some
sync request has finished recently. This is in the hope that sync activity
is still going on in the system and we might receive a sync request soon.
Most likely from a sync queue which finished a request and we did not enable
idling on it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
AS is mostly a subset of CFQ, so there's little point in still
providing this separate IO scheduler. Hopefully at some point we
can get down to one single IO scheduler again, at least this brings
us closer by having only one intelligent IO scheduler.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This is basically identical to what Vivek Goyal posted, but combined
into one and labelled 'desktop' instead of 'fairness'. The goal
is to continue to improve on the latency side of things as it relates
to interactiveness, keeping the questionable bits under this sysfs
tunable so it would be easy for throughput-only people to turn off.
Apart from adding the interactive sysfs knob, it also adds the
behavioural change of allowing slice idling even if the hardware
does tagged command queuing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since 2.6.31 now has request-based device-mapper, it's useful to have
a tracepoint for request-remapping as well as bio-remapping.
This patch adds a tracepoint for request-remapping, trace_block_rq_remap().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently we set the bio size to the byte equivalent of the blocks to
be trimmed when submitting the initial DISCARD ioctl. That means it
is subject to the max_hw_sectors limitation of the HBA which is
much lower than the size of a DISCARD request we can support.
Add a separate max_discard_sectors tunable to limit the size for discard
requests.
We limit the max discard request size in bytes to 32bit as that is the
limit for bio->bi_size. This could be much larger if we had a way to pass
that information through the block layer.
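A driver would then cap discards with the new setter; the snippet below is
a hedged example from memory rather than a quote from the patch:

#include <linux/blkdev.h>

static void example_configure_discard(struct request_queue *q)
{
	/* Allow discard requests of up to 65536 sectors (32 MiB). */
	blk_queue_max_discard_sectors(q, 65536);
}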
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
prepare_discard_fn() was being called in a place where memory allocation
was effectively impossible. This makes it inappropriate for all but
the most trivial translations of Linux's DISCARD operation to the block
command set. Additionally adding a payload there makes the ownership
of the bio backing unclear as it's now allocated by the device driver
and not the submitter as usual.
It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
the queue supports discard operations or not. blkdev_issue_discard now
allocates a one-page, sector-length payload which is the right thing
for the common ATA and SCSI implementations.
The mtd implementation of prepare_discard_fn() is replaced with simply
checking for the request being a discard.
Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx>
which did the prepare_discard_fn but not the different payload allocation
yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Stacking devices do not have an inherent max_hw_sector limit. Set the
default to INT_MAX so we are bounded only by capabilities of the
underlying storage.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The topology changes unintentionally caused SAFE_MAX_SECTORS to be set
for stacking devices. Set the default limit to BLK_DEF_MAX_SECTORS and
provide SAFE_MAX_SECTORS in blk_queue_make_request() for legacy hw
drivers that depend on the old behavior.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This allows subsystems to provide devtmpfs with non-default permissions
for the device node. Instead of the default mode of 0600, null, zero,
random, urandom, full, tty, ptmx now have a mode of 0666, which allows
non-privileged processes to access standard device nodes in case no
other userspace process applies the expected permissions.
This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complaint.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
debugfs: Modify default debugfs directory for debugging pktcdvd.
debugfs: Modified default dir of debugfs for debugging UHCI.
debugfs: Change debugfs directory of IWMC3200
debugfs: Change debugfs directory of trace-events-sample.h
debugfs: Fix mount directory of debugfs by default in events.txt
hpilo: add poll f_op
hpilo: add interrupt handler
hpilo: staging for interrupt handling
driver core: platform_device_add_data(): use kmemdup()
Driver core: Add support for compatibility classes
uio: add generic driver for PCI 2.3 devices
driver-core: move dma-coherent.c from kernel to driver/base
mem_class: fix bug
mem_class: use minor as index instead of searching the array
driver model: constify attribute groups
UIO: remove 'default n' from Kconfig
Driver core: Add accessor for device platform data
Driver core: move dev_get/set_drvdata to drivers/base/dd.c
Driver core: add new device to bus's list before probing
Let attribute group vectors be declared "const". We'd
like to let most attribute metadata live in read-only
sections... this is a start.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict as by Tejun Heo in kernel/sched.c
* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
block: use blkdev_issue_discard in blk_ioctl_discard
Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
block: don't assume device has a request list backing in nr_requests store
block: Optimal I/O limit wrapper
cfq: choose a new next_req when a request is dispatched
Separate read and write statistics of in_flight requests
aoe: end barrier bios with EOPNOTSUPP
block: trace bio queueing trial only when it occurs
block: enable rq CPU completion affinity by default
cfq: fix the log message after dispatched a request
block: use printk_once
cciss: memory leak in cciss_init_one()
splice: update mtime and atime on files
block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
cfq-iosched: get rid of must_alloc flag
block: use interrupts disabled version of raise_softirq_irqoff()
block: fix comment in blk-iopoll.c
block: adjust default budget for blk-iopoll
block: fix long lines in block/blk-iopoll.c
block: add blk-iopoll, a NAPI like approach for block devices
...
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request. To facilitate this,
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour. This will be very useful later on for using the waiting
functionality for other callers.
Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
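As a hedged sketch of the resulting call sites (the DISCARD_FL_WAIT and
DISCARD_FL_BARRIER flag names are assumed from the description above):
	/* ioctl path: non-barrier discard, wait for completion */
	err = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
				   DISCARD_FL_WAIT);

	/* filesystem path: barrier discard, no waiting required */
	err = blkdev_issue_discard(bdev, block, nr_sects, GFP_NOFS,
				   DISCARD_FL_BARRIER);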
Stacked devices do not. For now, just error out with -EINVAL. Later
we could make the limit apply on stacked devices too, for throttling
reasons.
This fixes
5a54cd13353bb3b88887604e2c980aa01e314309
and should go into 2.6.31 stable as well.
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper
around it. DM needs this to avoid poking at the queue_limits directly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
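A minimal sketch of the wrapper relationship this describes:
	void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt)
	{
		limits->io_opt = opt;
	}

	void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
	{
		/* DM can call blk_limits_io_opt() on a bare queue_limits instead */
		blk_limits_io_opt(&q->limits, opt);
	}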
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two separate counters for read and write.
This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
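A hedged sketch of the idea: one counter per direction, indexed by
rq_data_dir(), with the sum used where the combined value is still needed
(helper names are illustrative):
	static inline void part_inc_in_flight(struct hd_struct *part, int rw)
	{
		part->in_flight[rw]++;		/* rw == 0 for reads, 1 for writes */
	}

	static inline void part_dec_in_flight(struct hd_struct *part, int rw)
	{
		part->in_flight[rw]--;
	}

	static inline int part_in_flight(struct hd_struct *part)
	{
		return part->in_flight[0] + part->in_flight[1];
	}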
If a BIO is discarded or crosses over the end of the device,
the BIO queueing trial doesn't occur.
Actually the trace was called just before make_request at first:
[PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4
And then 2 patches added some checks between them:
[PATCH] md: check bio address after mapping through partitions
5ddfe9691c91a244e8d1be597b6428fcefd58103,
[BLOCK] Don't allow empty barriers to be passed down to
queues that don't grok them
51fd77bd9f512ab6cc9df0733ba1caaab89eb957
This breaks the original goal.
Let's trace it only when it happens.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The blktrace tools can show the process id when cfq dispatches a request
by using cfq_log_cfqq() instead of cfq_log().
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's not currently used, as pointed out by
Gui Jianfeng <guijianfeng@cn.fujitsu.com>. We already check the
wait_request flag to allow an idling queue priority allocation access,
so we don't need this extra flag.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This borrows some code from NAPI and implements a polled completion
mode for block devices. The idea is the same as NAPI - instead of
doing the command completion when the irq occurs, schedule a dedicated
softirq in the hopes that we will complete more IO when the iopoll
handler is invoked. Devices have a budget of commands assigned, and will
stay in polled mode as long as they continue to consume their budget
from the iopoll softirq handler. If they do not, the device is set back
to interrupt completion mode.
This patch holds the core bits for blk-iopoll, device driver support
sold separately.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
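A hedged sketch of how a driver might hook into blk-iopoll; the hba structure
and the example_* helpers are hypothetical, and the prep/sched return
convention follows the description above:
	struct example_hba {				/* hypothetical driver context */
		struct blk_iopoll iopoll;
	};

	/* softirq-side handler: consume up to 'budget' completions */
	static int example_iopoll(struct blk_iopoll *iop, int budget)
	{
		struct example_hba *hba = container_of(iop, struct example_hba, iopoll);
		int done = example_process_completions(hba, budget);	/* hypothetical */

		if (done < budget) {
			blk_iopoll_complete(iop);	/* budget not used up: back to irq mode */
			example_enable_irq(hba);	/* hypothetical */
		}
		return done;
	}

	static irqreturn_t example_irq(int irq, void *data)
	{
		struct example_hba *hba = data;

		if (blk_iopoll_sched_prep(&hba->iopoll)) {
			example_disable_irq(hba);	/* hypothetical */
			blk_iopoll_sched(&hba->iopoll);	/* defer completions to the softirq */
		}
		return IRQ_HANDLED;
	}

	/* at probe time: blk_iopoll_init(&hba->iopoll, 64, example_iopoll); */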
Instead of just checking whether this device uses block layer
tagging, we can improve the detection by looking at the maximum
queue depth it has reached. If that crosses 4, then deem it a
queuing device.
This is important on high IOPS devices, since plugging hurts
the performance there (it can be as much as 10-15% of the sys
time).
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Whenever a block device changes its read-only attribute,
notify userspace about it.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Get rid of busy_rt_queues infrastructure. Looks like it is redundant.
o Once an RT queue gets a request it will preempt any of the BE or IDLE queues
immediately. Otherwise this queue will be put on the service tree and the
scheduler will select it before any of the BE or IDLE queues anyway. Hence
there seems to be no need to keep track of how many busy RT queues are
currently on the service tree.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
To lessen the impact of async IO on sync IO, let the device drain of
any async IO in progress when switching to a sync cfqq that has idling
enabled.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Update scsi_io_completion() such that it only fails requests till the
next error boundary and retries the leftover. This enables the block layer
to merge requests with different failfast settings and still behave
correctly on errors. Allow merge of requests of different failfast
settings.
As SCSI is currently the only subsystem which follows failfast status,
there's no need to worry about other block drivers for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Failfast has different characteristics from other attributes. When issuing,
executing and successfully completing requests, failfast doesn't make
any difference. It only affects how a request is handled on failure.
Allowing requests with different failfast settings to be merged causes
normal IOs to fail prematurely, while disallowing it has performance
penalties, as failfast is used for readaheads which are likely to be
located near in-flight or to-be-issued normal IOs.
This patch introduces the concept of 'mixed merge'. A request is a
mixed merge if it is merge of segments which require different
handling on failure. Currently the only mixable attributes are
failfast ones (or lack thereof).
When a bio with different failfast settings is added to an existing
request or requests of different failfast settings are merged, the
merged request is marked mixed. Each bio carries failfast settings
and the request always tracks failfast state of the first bio. When
the request fails, blk_rq_err_bytes() can be used to determine how
many bytes can be safely failed without crossing into an area which
requires further retrials.
This allows request merging regardless of failfast settings while
keeping the failure handling correct.
This patch only implements mixed merge but doesn't enable it. The
next one will update SCSI to make use of mixed merge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
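A hedged sketch of the error path this enables (driver details are
illustrative):
	/* fail only the bytes up to the next error boundary */
	unsigned int nr_bytes = blk_rq_err_bytes(rq);

	if (blk_end_request(rq, error, nr_bytes))
		example_retry_remainder(rq);	/* hypothetical: leftover is retried */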
bio and request use the same set of failfast bits. This patch makes
the following changes to simplify things.
* enumify BIO_RW* bits and reorder bits such that BIO_RW_FAILFAST_*
bits coincide with __REQ_FAILFAST_* bits.
* The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
but the matching is useless anyway. init_request_from_bio() is
responsible for setting FAILFAST bits on FS requests and non-FS
requests never use BIO_RW_AHEAD. Drop the code and comment from
blk_rq_bio_prep().
* Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
simplify FAILFAST flags handling in init_request_from_bio().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
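A hedged sketch of the mask and the simplified copy in
init_request_from_bio() (surrounding code is abbreviated):
	#define REQ_FAILFAST_MASK \
		(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)

	/* with bio and request failfast bits sharing values, the flags can
	 * be copied straight across */
	req->cmd_flags |= bio->bi_rw & REQ_FAILFAST_MASK;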
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The patch "block: Use accessor functions for queue limits"
(ae03bf639a) changed queue_max_sectors_store()
to use blk_queue_max_sectors() instead of directly assigning the value.
But blk_queue_max_sectors() differs a bit
1. It sets both max_sectors_kb, and max_hw_sectors_kb
2. It never allows one to raise max_sectors_kb above BLK_DEF_MAX_SECTORS. If one
specifies a greater value, max_hw_sectors is set to that value but
max_sectors is capped at BLK_DEF_MAX_SECTORS.
I am not sure whether blk_queue_max_sectors() should be changed, as it
has been that way for a long time, and there may be callers that depend on
that behaviour.
This patch simply reverts to the older way of directly assigning the value to
max_sectors as it was before.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Make Block Layer SG support v4 the default, since recent udev versions
depend on this to access serial numbers and other low level info properly.
This should be backported to older kernels as well, since most distros have
enabled this for a long time.
Signed-off-by: John Stoffel <john@stoffel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Update topology comments and sysfs documentation based upon discussions
with Neil Brown.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When stacking block devices ensure that optimal I/O size is scaled
accordingly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Introduce blk_limits_io_min() and make blk_queue_io_min() call it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_queue_stack_limits() has been superseded by blk_stack_limits() and
disk_stack_limits(). Wrap the function call for now, we'll deprecate it
later.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Prior to the change for more sane end_io functions, we exported
the helpers with the normal EXPORT_SYMBOL(). That got changed
to _GPL() for the new interface. Revert that particular change,
on the basis that this is basic functionality and doesn't dip
into internal structures. If these exports can't be non-GPL,
then we may as well make EXPORT_SYMBOL() imply GPL for
everything.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_integrity_unregister should use kobject_put to release the kobject,
otherwise after bi is freed, memory of bi->kobj->name is leaked.
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move the assignment of a default lock below blk_init_queue() to
blk_queue_make_request(), so we also get to set the default lock
for ->make_request_fn() based drivers. This is important since the
queue flag locking requires a lock to be in place.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In blk-sysfs.c, queue_var_store uses unsigned long to store data,
but queue_var_show uses unsigned int to show data. This causes,
# echo 70000000000 > /sys/block/<dev>/queue/read_ahead_kb
# cat /sys/block/<dev>/queue/read_ahead_kb => get wrong value
Fix it by using unsigned long.
While at it, convert queue_rq_affinity_show() such that it uses bool
variable instead of explicit != 0 testing.
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit ab0fd1debe tries to prevent merge
of requests with different failfast settings. In elv_rq_merge_ok(),
it compares new bio's failfast flags against the merge target
request's. However, the flag-testing accessors for bio and blk don't
return a boolean but the tested bit value directly, and the FAILFAST bit
positions on bio and blk don't match, so directly comparing them with ==
results in a false negative, unnecessarily preventing merges of readahead
requests.
This patch converts the results to boolean by negating them before
comparison.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jeff Garzik <jeff@garzik.org>
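A hedged sketch of the corrected comparison in elv_rq_merge_ok() (accessor
names are assumed from the era's bio/blk failfast helpers):
	/* normalise both sides to 0/1 before comparing */
	if (!bio_failfast_dev(bio)	 != !blk_failfast_dev(rq)	 ||
	    !bio_failfast_transport(bio) != !blk_failfast_transport(rq) ||
	    !bio_failfast_driver(bio)	 != !blk_failfast_driver(rq))
		return 0;	/* different failfast settings: don't merge */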
In case memory is scarce, we now default to oom_cfqq. Once memory is
available again, we should allocate a new cfqq and stop using oom_cfqq for
a particular io context.
Once a new request comes in, check if we are using oom_cfqq, and if yes,
try to allocate a new cfqq.
Tested the patch by forcing the use of oom_cfqq; upon the next request, the
thread realized that it was using oom_cfqq and allocated a new cfqq.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, blk_scsi_ioctl_init() is not called since it lacks
an initcall marking. This causes the command table to be
uninitialized, hence some commands are blocked when they should
not have been.
This fixes a regression introduced by commit
018e044689
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes. As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.
Conflicts:
arch/alpha/include/asm/percpu.h
arch/mn10300/kernel/vmlinux.lds.S
include/linux/percpu-defs.h
Block layer used to merge requests and bios with different failfast
settings. This caused regular IOs to fail prematurely when they were
merged into failfast requests for readahead.
Niel Lambrechts could trigger the problem semi-reliably on ext4 when
resuming from STR. ext4 uses readahead when reading inodes and
combined with the deterministic extra SATA PHY exception cycle during
resume on the specific configuration, non-readahead inode read would
fail causing ext4 errors. Please read the following thread for
details.
http://lkml.org/lkml/2009/5/23/21
This patch makes block layer reject merging if the failfast settings
don't match. This is correct but likely to lower IO performance by
preventing regular IOs from mingling into surrounding readahead
requests. Changes to allow such mixed merges and handle errors
correctly will be added later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
With the changes for falling back to an oom_cfqq, we never fail
to find/allocate a queue in cfq_get_queue(). So remove the check.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The next_ordered flag is only meaningful for devices that use __make_request.
So move the test against next_ordered out of generic code and into
__make_request.
Since this test was added, barriers have not worked on md or any
devices that don't use __make_request and so don't bother to set
next_ordered. (dm explicitly sets something other than
QUEUE_ORDERED_NONE since
commit 99360b4c18
but notes in the comments that it is otherwise meaningless).
Cc: Ken Milmore <ken.milmore@googlemail.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The initial patches to support this through sysfs export were broken
and have been #if 0'ed out in every release. So let's just kill the code
and reclaim some space in struct request_queue; if anyone would later
like to fix up the sysfs bits, the git history can easily restore
the removed bits.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs. Each bip slab
has an embedded bio_vec array at the end. This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version. Only the largest bip slab is backed by a mempool. The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
Setup an emergency fallback cfqq that we allocate at IO scheduler init
time. If the slab allocation fails in cfq_find_alloc_queue(), we'll just
punt IO to that cfqq instead. This ensures that cfq_find_alloc_queue()
never fails without having to ensure free memory.
On cfqq lookup, always try to allocate a new cfqq if the given cfq io
context has the oom_cfqq assigned. This ensures that we only temporarily
punt to this shared queue.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
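A minimal sketch of the fallback described above (allocation details are
illustrative):
	cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask | __GFP_ZERO,
				     cfqd->queue->node);
	if (cfqq)
		cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
	else
		cfqq = &cfqd->oom_cfqq;	/* allocation failed: punt to the static queue */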
We're going to be needing that init code outside of that function
to get rid of the __GFP_NOFAIL in cfqq allocation.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
The SMP handler (sas_smp_request) was fixed to use the block API
properly, so we don't need this workaround to avoid blk_put_request()
warning.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Warning(block/blk-settings.c:108): No description found for parameter 'lim'
Warning(block/blk-settings.c:108): Excess function parameter 'limits' description in 'blk_set_default_limits'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".
Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Correct stacking bounce_pfn limit setting and prevent warnings on
32-bit.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
debugfs: use specified mode to possibly mark files read/write only
debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
xen: remove driver_data direct access of struct device from more drivers
usb: gadget: at91_udc: remove driver_data direct access of struct device
uml: remove driver_data direct access of struct device
block/ps3: remove driver_data direct access of struct device
s390: remove driver_data direct access of struct device
parport: remove driver_data direct access of struct device
parisc: remove driver_data direct access of struct device
of_serial: remove driver_data direct access of struct device
mips: remove driver_data direct access of struct device
ipmi: remove driver_data direct access of struct device
infiniband: ehca: remove driver_data direct access of struct device
ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
hvcs: remove driver_data direct access of struct device
xen block: remove driver_data direct access of struct device
thermal: remove driver_data direct access of struct device
scsi: remove driver_data direct access of struct device
pcmcia: remove driver_data direct access of struct device
PCIE: remove driver_data direct access of struct device
...
Manually fix up trivial conflicts due to different driver_data direct
access fixups in drivers/block/{ps3disk.c,ps3vram.c}
When porting blktrace to tracepoints, we changed to trace/block.h
for trace probe declarations.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DM reuses the request queue when swapping in a new device table.
Introduce blk_set_default_limits() which can be used to reset the
queue_limits prior to stacking devices.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I noticed a blank line in blktrace output. This patch fixes that.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Actually, last_end_request in cfq_data isn't used now, so let's
just remove it.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds support to the BSG driver to report the proper device name to
userspace for the bsg devices.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This adds support for block drivers to report their requested nodename
to userspace. It also updates a number of block drivers to provide the
needed subdirectory and device name to be used for them.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (28 commits)
ide-tape: fix debug call
alim15x3: Remove historical hacks, re-enable init_hwif for PowerPC
ide-dma: don't reset request fields on dma_timeout_retry()
ide: drop rq->data handling from ide_map_sg()
ide-atapi: kill unused fields and callbacks
ide-tape: simplify read/write functions
ide-tape: use byte size instead of sectors on rw issue functions
ide-tape: unify r/w init paths
ide-tape: kill idetape_bh
ide-tape: use standard data transfer mechanism
ide-tape: use single continuous buffer
ide-atapi,tape,floppy: allow ->pc_callback() to change rq->data_len
ide-tape,floppy: fix failed command completion after request sense
ide-pm: don't abuse rq->data
ide-cd,atapi: use bio for internal commands
ide-atapi: convert ide-{floppy,tape} to using preallocated sense buffer
ide-cd: convert to using generic sense request
ide: add helpers for preparing sense requests
ide-cd: don't abuse rq->buffer
ide-atapi: don't abuse rq->buffer
...
This patch adds the following 2 interfaces for request-stacking drivers:
- blk_rq_prep_clone(struct request *clone, struct request *orig,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio*, void *),
void *data)
* Clones bios in the original request to the clone request
(bio_ctr is called for each cloned bios.)
* Copies attributes of the original request to the clone request.
The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
copied.
- blk_rq_unprep_clone(struct request *clone)
* Frees cloned bios from the clone request.
Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.
To allocate a request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.
For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.
NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request. Pages are shared between original bios and cloned bios.
So caller must not complete the original request before the clone
request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
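A hedged usage sketch for a request-stacking driver; the example_* callbacks
and target structure are illustrative:
	static int example_setup_clone(struct request *clone, struct request *orig,
				       struct example_target *tgt)	/* hypothetical */
	{
		int r;

		/* may run with irqs disabled, hence GFP_ATOMIC and a caller-allocated clone */
		r = blk_rq_prep_clone(clone, orig, tgt->bs, GFP_ATOMIC,
				      example_clone_bio_ctr, tgt);	/* hypothetical bio_ctr */
		if (r)
			return r;

		clone->end_io = example_clone_end_io;	/* hypothetical completion handler */
		clone->end_io_data = orig;
		return 0;
	}

	/* when tearing the clone down, release the cloned bios: */
	blk_rq_unprep_clone(clone);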
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
Currently io_context has an atomic_t (32-bit) as its refcount. In the case of
cfq, for each device against which a task does I/O, a reference to the
io_context would be taken. And when there are multiple processes sharing
io_contexts (CLONE_IO), they would also have a reference to the same io_context.
Theoretically the possible maximum number of processes sharing the same
io_context + the number of disks/cfq_data referring to the same io_context
can overflow the 32-bit counter on a very high-end machine.
Even though it is an improbable case, let us make it atomic_long_t.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
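A minimal sketch of the change (surrounding fields and helper names are
illustrative):
	struct io_context {
		atomic_long_t refcount;		/* was atomic_t */
		/* ... */
	};

	/* get/put paths switch to the _long atomic helpers */
	atomic_long_inc(&ioc->refcount);

	if (atomic_long_dec_and_test(&ioc->refcount))
		example_free_io_context(ioc);	/* hypothetical teardown */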
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print,
while blktrace does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      dd                 dd + ioctl blktrace   dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s   7.50s, 42.0 MB/s      7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s   7.48s, 42.1 MB/s      7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s   7.45s, 42.2 MB/s      7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Due to commit 1cd96c242a ("block: WARN
in __blk_put_request() for potential bio leak"), BSG SMP requests get
false warnings:
WARNING: at block/blk-core.c:1068 __blk_put_request+0x52/0xc0()
This sets rq->bio to NULL to avoid those false warnings.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DM no longer needs to set limits explicitly when calling blk_stack_limits.
Let the latter automatically deal with bounce_pfn scaling.
Fix kerneldoc variable names.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Tejun's "block: set rq->resid_len to blk_rq_bytes() on issue" patch
seems to be incomplete; it doesn't set rq->resid_len to blk_rq_bytes()
for a bidi request (req->next_rq). As a result, all bidi users are
broken.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_queue_bounce_limit() is more than a wrapper about the request queue
limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn() which can
be called by stacking drivers that wish to set the bounce limit
explicitly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I found one more mis-conversion to the 'request is always dequeued
when completing' model in elv_abort_queue() during code inspection.
Although I haven't hit any problem caused by this mis-conversion yet
and have only done compile/boot testing, please apply if you see no problem.
Request must be dequeued when it completes.
However, elv_abort_queue() completes requests without dequeueing.
This will cause oops in the __blk_end_request_all().
This patch fixes the oops.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Doing a bit of torture testing, I ran across a BUG in the block
subsystem (at blk-core.c:2048): the test for whether the request is queued.
It turns out the trigger was a BLKPREP_KILL coming out of the SCSI prep
function. Currently for BLKPREP_KILL requests, we send them straight
into __blk_end_request_all() with an error, but they've never been
dequeued, so they trip the bug. Fix this by starting requests before
killing them.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DM needs to use blk_stack_limits(), so it needs to be exported.
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The commit below in 2.6-block/for-2.6.31 causes a 'no diskstat' problem,
because the blk_discard_rq() check was added with '&&'.
It should be 'blk_fs_request() || blk_discard_rq()'.
This patch does that and fixes the 'no diskstat' problem.
Please review and apply.
------ /proc/diskstat without this patch -------------------------------------
8 0 sda 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------
----- /proc/diskstat with this patch applied ---------------------------------
8 0 sda 4186 303 373621 61600 9578 3859 107468 169479 2 89755 231059
------------------------------------------------------------------------------
--------------------------------------------------------------------------
commit c69d48540c
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Fri Apr 24 08:12:19 2009 +0200
block: include discard requests in IO accounting
We currently don't do merging on discard requests, but we potentially
could. If we do, then we need to include discard requests in the IO
accounting, or merging would end up decrementing in_flight IO counters
for an IO which never incremented them.
So enable accounting for discard requests.
<snip>
	static inline int blk_do_io_stat(struct request *rq)
	{
	-	return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq);
	+	return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq) &&
	+		blk_discard_rq(rq);
	}
--------------------------------------------------------------------------
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
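For clarity, the corrected helper as described above:
	static inline int blk_do_io_stat(struct request *rq)
	{
		return rq->rq_disk && blk_rq_io_stat(rq) &&
		       (blk_fs_request(rq) || blk_discard_rq(rq));
	}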
commit e8939a50466fd963eb1ba9118c34b9ffb7ff6aa6
Author: Tejun Heo <tj@kernel.org>
Date: Fri May 8 11:54:16 2009 +0900
block: implement and enforce request peek/start/fetch
Added a BUG_ON(blk_queued_rq(req)) to the top of blk_finish_req().
Unfortunately, this checks whether req->queuelist is empty. This list
is doing double duty both as the queue list and the tag list, so tagged
requests come in here with this not empty and boom (the tag list is
emptied by blk_queue_end_tag() lower down).
Fix this by moving the BUG_ON to below the end tag. We also seem
vulnerable to this in blk_requeue_request() as well. I think all uses
of blk_queued_rq() need auditing because the check is clearly wrong in
the tagged case.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
To support devices with physical block sizes bigger than 512 bytes we
need to ensure proper alignment. This patch adds support for exposing
I/O topology characteristics as devices are stacked.
logical_block_size is the smallest unit the device can address.
physical_block_size indicates the smallest I/O the device can write
without incurring a read-modify-write penalty.
The io_min parameter is the smallest preferred I/O size reported by
the device. In many cases this is the same as the physical block
size. However, the io_min parameter can be scaled up when stacking
(RAID5 chunk size > physical block size).
The io_opt characteristic indicates the optimal I/O size reported by
the device. This is usually the stripe width for arrays.
The alignment_offset parameter indicates the number of bytes the start
of the device/partition is offset from the device's natural alignment.
Partition tools and MD/DM utilities can use this to pad their offsets
so filesystems start on proper boundaries.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
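For illustration, a hedged sketch of how a driver could describe such a
device with the new setters (the values are arbitrary examples):
	blk_queue_logical_block_size(q, 512);	/* smallest addressable unit */
	blk_queue_physical_block_size(q, 4096);	/* no read-modify-write below this */
	blk_queue_alignment_offset(q, 0);	/* starts on a natural boundary */
	blk_queue_io_min(q, 4096);		/* smallest preferred I/O */
	blk_queue_io_opt(q, 4096 * 16);		/* e.g. RAID stripe width */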
Currently stacking devices do not have a queue directory in sysfs.
However, many of the I/O characteristics like sector size, maximum
request size, etc. are queue properties.
This patch enables the queue directory for MD/DM devices. The elevator
code has been modified to deal with queues that do not have an I/O
scheduler.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
To accommodate stacking drivers that do not have an associated request
queue we're moving the limits to a separate, embedded structure.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add a note about how one needs to be careful when setting up these bio
chains.
Extracted from Boaz's updated patch.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
OSD was the last in-tree user of blk_rq_append_bio(). Now
that it is fixed blk_rq_append_bio is un-exported and
is only used internally by block layer.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
New block API:
given a struct bio allocates a new request. This is the parallel of
generic_make_request for BLOCK_PC commands users.
The passed bio may be a chained-bio. The bio is bounced if needed
inside the call to this member.
This is in the effort of un-exporting blk_rq_append_bio().
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
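A hedged usage sketch for a BLOCK_PC user of the new call:
	struct request *req;

	req = blk_make_request(q, bio, GFP_KERNEL);	/* bio may be a chain */
	if (IS_ERR(req))
		return PTR_ERR(req);

	req->cmd_type = REQ_TYPE_BLOCK_PC;
	/* fill in cdb, timeout, etc., then execute the request */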
Use blk_rq_append_bio() internally instead of blk_rq_bio_prep()
so blk_rq_map_kern can be called multiple times, to map multiple
buffers.
This is in the effort to un-export blk_rq_append_bio()
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In commit c3a4d78c58, while introducing
rq->resid_len, the default value of residue count was changed from
full count to zero. The conversion was done under the assumption that
when a request fails residue count wasn't defined. However, Boaz and
James pointed out that this wasn't true and the residue count should
be preserved for failed requests too.
This patchset restores the original behavior by setting rq->resid_len
to blk_rq_bytes(rq) on request start and restoring explicit clearing
in affected drivers. While at it, take advantage of the fact that
rq->resid_len is set to full count where applicable.
* ide-cd: rq->resid_len cleared on pc success
* mptsas: req->resid_len cleared on success
* sas_expander: rsp/req->resid_len cleared on success
* mpt2sas_transport: req->resid_len cleared on success
* ide-cd, ide-tape, mptsas, sas_host_smp, mpt2sas_transport, ub: take
advantage of initial full count to simplify code
Boaz Harrosh spotted bug in resid_len initialization. Fixed as
suggested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Borislav Petkov <petkovbb@googlemail.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Current bio_vec array index out-of-bounds test within
__end_that_request_first() does not seem correct.
It checks bio->bi_idx against bio->bi_vcnt, but the subsequent code
uses idx (which is bio->bi_idx + next_idx) as the array index into the
bio_vec array. This means that the test really makes sense only at
the first iteration of the !(nr_bytes >= bio->bi_size) case (when next_idx
== zero). Fix this by replacing bio->bi_idx with idx.
(This patch applies to 2.6.30-rc4.)
Signed-off-by: Kazuhisa Ichikawa <ki@epsilou.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Let's put the completion related functions back to block/blk-core.c
where they have lived. We can also unexport blk_end_bidi_request() and
__blk_end_bidi_request(), which nobody uses.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Till now block layer allowed two separate modes of request execution.
A request is always acquired from the request queue via
elv_next_request(). After that, drivers are free to either dequeue it
or process it without dequeueing. Dequeue allows elv_next_request()
to return the next request so that multiple requests can be in flight.
Executing requests without dequeueing has its merits mostly in
allowing drivers for simpler devices which can't do sg to deal with
segments only without considering request boundary. However, the
benefit this brings is dubious and declining while the cost of the API
ambiguity is increasing. Segment based drivers are usually for very
old or limited devices and as converting to dequeueing model isn't
difficult, it doesn't justify the API overhead it puts on block layer
and its more modern users.
Previous patches converted all block low level drivers to dequeueing
model. This patch completes the API transition by...
* renaming elv_next_request() to blk_peek_request()
* renaming blkdev_dequeue_request() to blk_start_request()
* adding blk_fetch_request() which is combination of peek and start
* disallowing completion of queued (not started) requests
* applying new API to all LLDs
Renamings are for consistency and to break out-of-tree code so that
it's apparent that out-of-tree drivers need updating.
[ Impact: block request issue API cleanup, no functional change ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: unsik Kim <donari75@gmail.com>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Laurent Vivier <Laurent@lvivier.info>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
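A minimal sketch of the new issue model for a simple ->request_fn driver
(the transfer helper is hypothetical):
	static void example_request_fn(struct request_queue *q)
	{
		struct request *rq;

		/* blk_fetch_request() == blk_peek_request() + blk_start_request() */
		while ((rq = blk_fetch_request(q)) != NULL) {
			int error = example_do_transfer(rq);	/* hypothetical */

			/* queue lock is held in ->request_fn, so use the __ variant */
			__blk_end_request_all(rq, error);
		}
	}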
Block low level drivers for some reason have been pretty good at
abusing block layer API. Especially struct request's fields tend to
get violated in all possible ways. Make it clear that low level
drivers MUST NOT access or manipulate rq->sector and rq->data_len
directly by prefixing them with double underscores.
This change is also necessary to break build of out-of-tree codes
which assume the previous block API where internal fields can be
manipulated and rq->data_len carries residual count on completion.
[ Impact: hide internal fields, block API change ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
struct request has had a few different ways to represent some
properties of a request. ->hard_* represent block layer's view of the
request progress (completion cursor) and the ones without the prefix
are supposed to represent the issue cursor and allowed to be updated
as necessary by the low level drivers. The thing is that as block
layer supports partial completion, the two cursors really aren't
necessary and only cause confusion. In addition, manual management of
request detail from low level drivers is cumbersome and error-prone at
the very least.
Other interesting duplicate fields are rq->[hard_]nr_sectors and
rq->{hard_cur|current}_nr_sectors against rq->data_len and
rq->bio->bi_size. This is more convoluted than the hard_ case.
rq->[hard_]nr_sectors are initialized for requests with bio but
blk_rq_bytes() uses it only for !pc requests. rq->data_len is
initialized for all request but blk_rq_bytes() uses it only for pc
requests. This causes good amount of confusion throughout block layer
and its drivers and determining the request length has been a bit of
black magic which may or may not work depending on circumstances and
what the specific LLD is actually doing.
rq->{hard_cur|current}_nr_sectors represent the number of sectors in
the contiguous data area at the front. This is mainly used by drivers
which transfers data by walking request segment-by-segment. This
value always equals rq->bio->bi_size >> 9. However, data length for
pc requests may not be multiple of 512 bytes and using this field
becomes a bit confusing.
In general, having multiple fields to represent the same property
leads only to confusion and subtle bugs. With recent block low level
driver cleanups, no driver is accessing or manipulating these
duplicate fields directly. Drop all the duplicates. Now rq->sector
means the current sector, rq->data_len the current total length and
rq->bio->bi_size the current segment length. Everything else is
defined in terms of these three and available only through accessors.
* blk_recalc_rq_sectors() is collapsed into blk_update_request() and
now handles pc and fs requests equally other than rq->sector update.
This means that now pc requests can use partial completion too (no
in-kernel user yet, though).
* bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
now uses byte count as the primary data length.
* blk_rq_pos() is now guaranteed to be always correct. In-block users
converted.
* blk_rq_bytes() is now guaranteed to be always valid as is
blk_rq_sectors(). In-block users converted.
* blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
More convenient one is used.
* blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
pointer to request.
[ Impact: API cleanup, single way to represent one property of a request ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With recent cleanups, there is no place where low level driver
directly manipulates request fields. This means that the 'hard'
request fields always equal the !hard fields. Convert all
rq->sectors, nr_sectors and current_nr_sectors references to
accessors.
While at it, drop the superfluous blk_rq_pos() < 0 test in swim.c.
[ Impact: use pos and nr_sectors accessors ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Mike Miller <mike.miller@hp.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Dario Ballabio <ballabio_dario@emc.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: unsik Kim <donari75@gmail.com>
Cc: Laurent Vivier <Laurent@lvivier.info>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement accessors - blk_rq_pos(), blk_rq_sectors() and
blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors
and rq->hard_cur_sectors respectively and convert direct references of
the said fields to the accessors.
This is in preparation of request data length handling cleanup.
Geert : suggested adding const to struct request * parameter to accessors
Sergei : spotted error in patch description
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
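A sketch matching the description above (these simply forward to the hard_*
fields at this point):
	static inline sector_t blk_rq_pos(const struct request *rq)
	{
		return rq->hard_sector;
	}

	static inline unsigned int blk_rq_sectors(const struct request *rq)
	{
		return rq->hard_nr_sectors;
	}

	static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
	{
		return rq->hard_cur_sectors;
	}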
rq->data_len served two purposes - the length of data buffer on issue
and the residual count on completion. This duality creates some
headaches.
First of all, block layer and low level drivers can't really determine
what rq->data_len contains while a request is executing. It could be
the total request length or it could be anything else one of the
lower layers is using to keep track of residual count. This
complicates things because blk_rq_bytes() and thus
[__]blk_end_request_all() relies on rq->data_len for PC commands.
Drivers which want to report residual count should first cache the
total request length, update rq->data_len and then complete the
request with the cached data length.
Secondly, it makes requests default to reporting full residual count,
ie. reporting that no data transfer occurred. The residual count is
an exception, not the norm; the driver must clear rq->data_len to zero
to signify the normal full-transfer case, while leaving it alone means
no data transfer occurred at all. This reverse default behavior
complicates code unnecessarily and renders block PC on some drivers
(ide-tape/floppy) unusable.
This patch adds rq->resid_len which is used only for residual count.
While at it, remove the now unnecessary blk_rq_bytes() caching in
ide_pc_intr() as rq->data_len is not changed anymore.
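As a rough illustration of the new convention, a hypothetical driver
completion path only has to fill in the residual field (driver name and
helper below are made up; [__]blk_end_request_all() is the completion
helper introduced elsewhere in this series):

static void example_complete_pc_rq(struct request *rq, int error,
				   unsigned int not_transferred)
{
	/* previously: cache blk_rq_bytes(), overwrite rq->data_len with
	 * the residual, then complete with the cached length */
	rq->resid_len = not_transferred;	/* 0 == full transfer */
	blk_end_request_all(rq, error);
}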
Boaz : spotted missing conversion in osd
Sergei : spotted too early conversion to blk_rq_bytes() in ide-tape
[ Impact: cleanup residual count handling, report 0 resid by default ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
a handful of tracing fixes present in .30-rc5-almost.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently don't do merging on discard requests, but we potentially
could. If we do, then we need to include discard requests in the IO
accounting, or merging would end up decrementing in_flight IO counters
for an IO which never incremented them.
So enable accounting for discard requests.
Problem found by Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We currently check for file system requests outside of blk_do_io_stat(rq),
but we may as well just include it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that all block request data transfer is done via bio, rq->data
isn't used. Kill it.
While at it, make the roles of rq->special and buffer clear.
[ Impact: drop now unnecessary field from struct request ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
There are many [__]blk_end_request() call sites which call it with
full request length and expect full completion. Many of them ensure
that the request actually completes by doing BUG_ON() the return
value, which is awkward and error-prone.
This patch adds [__]blk_end_request_all() which takes @rq and @error
and fully completes the request. BUG_ON() is added to ensure that
this actually happens.
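The new helper is roughly of the following shape (a simplified sketch;
the real version also handles bidi requests):

static inline void blk_end_request_all(struct request *rq, int error)
{
	bool pending;

	pending = blk_end_request(rq, error, blk_rq_bytes(rq));
	/* full completion was requested, nothing may be left over */
	BUG_ON(pending);
}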
Most conversions are simple but there are a few noteworthy ones.
* cdrom/viocd: viocd_end_request() replaced with direct calls to
__blk_end_request_all().
* s390/block/dasd: dasd_end_request() replaced with direct calls to
__blk_end_request_all().
* s390/char/tape_block: tapeblock_end_request() replaced with direct
calls to blk_end_request_all().
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
rq->start_time was initialized in init_request_from_bio() so special
requests didn't have start_time set. This has been okay as start_time
has been used only for fs requests; however, there is no indication of
whether this actually is the case or not. Set rq->start_time in blk_rq_init()
and guarantee that all initialized rq's have its start_time set. This
improves consistency at virtually no cost and future changes will make
use of the timestamp for !bio requests.
[ Impact: rq->start_time is valid for all requests ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Request completion has gone through several changes and became a bit
messy over time. Clean it up.
1. end_that_request_data() is a thin wrapper around
end_that_request_data_first() which checks whether bio is NULL
before doing anything and handles bidi completion.
blk_update_request() is a thin wrapper around
end_that_request_data() which clears nr_sectors on the last
iteration but doesn't use the bidi completion.
Clean it up by moving the initial bio NULL check and nr_sectors
clearing on the last iteration into end_that_request_data() and
renaming it to blk_update_request(), which makes blk_end_io() the
only user of end_that_request_data(). Collapse
end_that_request_data() into blk_end_io().
2. There are four visible completion variants - blk_end_request(),
__blk_end_request(), blk_end_bidi_request() and end_request().
blk_end_request() and blk_end_bidi_request() use blk_end_io()
as the backend but __blk_end_request() and end_request() use
separate implementation in __blk_end_request() due to different
locking rules.
blk_end_bidi_request() is identical to blk_end_io(). Collapse
blk_end_io() into blk_end_bidi_request(), separate out request
update into internal helper blk_update_bidi_request() and add
__blk_end_bidi_request(). Redefine [__]blk_end_request() as thin
inline wrappers around [__]blk_end_bidi_request().
3. As the whole request issue/completion usages are about to be
modified and audited, it's a good chance to convert completion
functions to return bool, which better indicates the intended meaning
of return values.
4. The function name end_that_request_last() is from the days when it
was a public interface and slightly confusing. Give it a proper
internal name - blk_finish_request().
5. Add description explaining that blk_end_bidi_request() can be safely
used for uni requests as suggested by Boaz Harrosh.
The only visible behavior change is from #1. nr_sectors counts are
cleared after the final iteration no matter which function is used to
complete the request. I couldn't find any place where the code
assumes those nr_sectors counters contain the values for the last
segment, and this change is good as it makes the API much more
consistent: the end result is now the same whether a request is
completed using [__]blk_end_request() alone or in combination with
blk_update_request().
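After #2 the visible helpers boil down to something like the following
(signatures approximate):

static inline bool blk_end_request(struct request *rq, int error,
				   unsigned int nr_bytes)
{
	/* the bidi helper is safe for uni requests: pass 0 bidi bytes */
	return blk_end_bidi_request(rq, error, nr_bytes, 0);
}

static inline bool __blk_end_request(struct request *rq, int error,
				     unsigned int nr_bytes)
{
	/* caller must already hold the queue lock */
	return __blk_end_bidi_request(rq, error, nr_bytes, 0);
}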
API further cleaned up per Christoph's suggestion.
[ Impact: cleanup, rq->*nr_sectors always updated after req completion ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Christoph Hellwig <hch@infradead.org>
With recent IDE updates, blk_end_request_callback() doesn't have any
user now. Kill it.
[ Impact: removal of unused convoluted interface ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: code reorganization
elv_next_request() and elv_dequeue_request() are public block layer
interfaces rather than actual elevator implementation. They mostly deal
with how requests interact with the block layer and low level drivers at
the beginning of request processing, whereas __elv_next_request() is the
actual elevator request fetching interface.
Move the two functions to blk-core.c. This prepares for further
interface cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reorder request completion functions such that
* All request completion functions are located together.
* Functions which are used by only one caller are put right above the
caller.
* end_request() is put after other completion functions but before
blk_update_request().
This change is for completion function cleanup which will follow.
[ Impact: cleanup, code reorganization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
* In blk_rq_timed_out_timer(), else { if } to else if
* In blk_add_timer(), simplify if/else block
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
blk_insert_request() doesn't need to worry about REQ_SOFTBARRIER.
Don't set it. Combined with recent ide updates, REQ_SOFTBARRIER is
now only used in elevator proper and for discard requests.
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
RQ_NOMERGE_FLAGS already defines which REQ flags aren't
mergeable. There is no reason to specify it superfluously. It only
adds to confusion. Don't set REQ_NOMERGE for barriers and requests
with specific queueing directive. REQ_NOMERGE is now exclusively used
by the merging code.
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
blk_start_queueing() is identical to __blk_run_queue() except that it
doesn't check for recursion. None of the current users depends on
blk_start_queueing() running request_fn directly. Replace usages of
blk_start_queueing() with [__]blk_run_queue() and kill it.
[ Impact: removal of mostly duplicate interface function ]
Signed-off-by: Tejun Heo <tj@kernel.org>
__blk_run_queue wraps blk_invoke_request_fn() such that it
additionally removes plug and bails out early if the queue is empty.
Both extra operations have their own pending mechanisms and don't
cause any harm correctness-wise when they are done superfluously.
The only user of blk_invoke_request_fn() being blk_start_queue(),
there isn't much reason to keep both functions around. Merge
blk_invoke_request_fn() into __blk_run_queue() and make
blk_start_queue() use __blk_run_queue() instead.
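The resulting shape is approximately the following (recursion protection
and stopped-queue checks are elided from this sketch):

void __blk_run_queue(struct request_queue *q)
{
	blk_remove_plug(q);

	if (elv_queue_empty(q))
		return;

	q->request_fn(q);
}

void blk_start_queue(struct request_queue *q)
{
	queue_flag_clear(QUEUE_FLAG_STOPPED, q);
	__blk_run_queue(q);
}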
[ Impact: merge two subtly different internal functions ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Enable by default support for large devices and files (CONFIG_LBD):
- With 1TB disks being a commodity hardware it is quite easy to hit 2TB
limitation while building RAIDs etc. and many distros have been using
CONFIG_LBD=y by default already (at least Fedora 10 and openSUSE 11.1).
- This should also prevent a subtle ext4 filesystem compatibility issue:
mke2fs.ext4 defaults to creating filesystems with huge_files feature
enabled and such filesystems cannot be later mounted read-write on
machines with CONFIG_LBD=n (it should be quite easy to hit this issue
when trying to use filesystem created using distro kernel on system
running a self-built kernel, think about USB disk enclosures & co.).
While at it:
- Clarify config option help text w.r.t. mounting ext4 filesystems
(they can be mounted with CONFIG_LBD=n but in the read-only mode).
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: subtle behavior change
For fs requests, rq is only carrier of bios and rq error status as a
whole doesn't mean much. This is the reason why rq->errors is being
cleared on each partial completion of a request as on each partial
completion the error status is transferred to the respective bios.
For pc requests, rq->errors is used to carry error status to the
issuer and thus __end_that_request_first() doesn't clear it on such
cases.
The condition was fine till now as only fs and pc requests have used
bio and thus the bio completion path. However, future changes will
unify data accesses to bio and all non fs users care about rq error
status. Clear rq->errors on bio completion only for fs requests.
In general, the implicit clearing is a bit too subtle especially as
the meaning of rq->errors is completely dependent on low level
drivers. Unifying / cleaning up rq->errors usage and letting llds
manage it would be better. TODO comment added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Currently we look it up from ->ioprio, but ->ioprio can change if
either the process gets its IO priority changed explicitly, or if
cfq decides to temporarily boost it. So if we are unlucky, we can
end up attempting to remove a node from a different rbtree root than
where it was added.
Fix this by using ->org_ioprio as the prio_tree index, since that
will only change for explicit IO priority settings (not for a boost).
Additionally cache the rbtree root inside the cfqq, then we don't have
to add code to reinsert the cfqq in the prio_tree if IO priority changes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_prio_tree_lookup() should return the direct match, yet it always
returns zero. Fix that.
cfq_prio_tree_add() assumes that we don't get a direct match, while
it is very possible that we do. Using O_DIRECT, you can have different
cfqq with matching requests, since you don't have the page cache
to serialize things for you. Fix this bug by only adding the cfqq if
there isn't an existing match.
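The fixed insertion logic follows this pattern (a sketch with simplified,
hypothetical prototypes rather than the actual cfq signatures):

static void prio_tree_add_sketch(struct rb_root *root, struct cfq_queue *cfqq,
				 sector_t sector)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;
	struct cfq_queue *__cfqq;

	/* lookup fills in @p/@parent for the insertion point */
	__cfqq = prio_tree_lookup_sketch(root, sector, &parent, &p);
	if (__cfqq)
		return;	/* direct match (possible with O_DIRECT): don't add */

	rb_link_node(&cfqq->p_node, parent, p);
	rb_insert_color(&cfqq->p_node, root);
}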
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Very rarely under stress testing of dm, oopses are occurring as
something tampers with an old stack frame. This has been traced back
to blk_abort_queue() leaving a timeout_list pointing to the stack.
The reason is that sometimes blk_abort_request() won't delete the
timer (if the request is marked as complete but before the timer has
been removed, a small race window). Fix this by splicing back from
the usually empty list to the q->timeout_list.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This simplifies I/O stat accounting switching code and separates it
completely from I/O scheduler switch code.
Requests are accounted according to the state of their request queue
at the time of the request allocation. There is no need anymore to
flush the request queue when switching I/O accounting state.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the cfq io context doesn't have enough samples yet to provide a mean
seek distance, then use the default threshold we have for seeky IO instead
of defaulting to 0.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Right now, depending on the first sector to which a process issues I/O,
the seek time may start out way out of whack. So make sure we start
with 0 sectors in seek, instead of the offset of the first request
issued.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
/proc/diskstats used to show stats for all disks whether they're
zero-sized or not and their non-zero partitions. Commit
074a7aca7a accidentally changed the
behavior such that it doesn't print out zero sized disks. This patch
implements DISK_PITER_INCL_EMPTY_PART0 flag to partition iterator and
uses it in diskstats_show() such that empty part0 is shown in
/proc/diskstats.
Reported and bisected by Daniel Collins.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Collins <solemnwarning@solemnwarning.no-ip.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: don't set GFP_DMA in q->bounce_gfp unnecessarily
All DMA address limits are expressed in terms of the last addressable
unit (byte or page) instead of one plus that. However, when
determining bounce_gfp for 64bit machines in blk_queue_bounce_limit(),
it compares the specified limit against 0x100000000UL to determine
whether it's below 4G ending up falsely setting GFP_DMA in
q->bounce_gfp.
As DMA zone is very small on x86_64, this makes larger SG_IO transfers
very eager to trigger OOM killer. Fix it. While at it, rename the
parameter to @dma_mask for clarity and convert comment to proper
winged style.
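The essence of the fix, abbreviated (error handling and the 32-bit branch
are left out of this sketch):

void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
{
	unsigned long b_pfn = dma_mask >> PAGE_SHIFT;

	q->bounce_gfp = GFP_NOIO;
#if BITS_PER_LONG == 64
	/* @dma_mask is the last addressable byte, so a device with a
	 * full 32-bit mask reaches all of the low 4G: compare against
	 * 0xffffffff, not 0x100000000, before falling back to GFP_DMA */
	if (b_pfn < (min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT)) {
		init_emergency_isa_pool();
		q->bounce_gfp = GFP_NOIO | GFP_DMA;
	}
#endif
	q->bounce_pfn = b_pfn;
}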
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: fix SG_IO behavior such that it matches the documentation
SG_IO howto says that if ->dxfer_len and the sum of iovec disagree, the
shorter one wins. However, the current implementation returns -EINVAL
for such cases. Trim iovec if it's longer than ->dxfer_len.
This patch uses iov_*() helpers which take struct iovec * by casting
struct sg_iovec * to it. sg_iovec is always identical to iovec and
this will be further cleaned up with later patches.
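The trimming step looks roughly like this (a fragment from the sg_io()
path, simplified):

size_t iov_data_len = iov_length((struct iovec *)iov, hdr->iovec_count);

/* SG_IO howto: the shorter of ->dxfer_len and the iovec wins */
if (iov_data_len > hdr->dxfer_len)
	hdr->iovec_count = iov_shorten((struct iovec *)iov,
				       hdr->iovec_count,
				       hdr->dxfer_len);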
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Though one can specify '-d /dev/sda1' when using blktrace, it still
traces the whole sda.
To support per-partition tracing, when we start tracing, we initialize
bt->start_lba and bt->end_lba to the start and end sector of that
partition.
Note some actions are per device, thus we don't filter 0-sector events.
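The setup side amounts to something like the following sketch
(surrounding code and the whole-device ioctl path omitted):

struct hd_struct *part = bdev ? bdev->bd_part : NULL;

if (part && bdev != bdev->bd_contains) {
	/* tracing a partition: remember its boundaries for filtering */
	bt->start_lba = part->start_sect;
	bt->end_lba = part->start_sect + part->nr_sects;
} else {
	bt->end_lba = -1ULL;		/* whole device, no filtering */
}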
The original patch and discussion can be found here:
http://marc.info/?l=linux-btrace&m=122949374214540&w=2
Signed-off-by: Shawn Du <duyuyang@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49E42620.4050701@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we have processes that are working in close proximity to each
other on disk, we don't want to idle wait. Instead allow the close
process to issue a request, getting better aggregate bandwidth.
The anticipatory scheduler has similar checks, noop and deadline do
not need it since they don't care about process <-> io mappings.
The code for CFQ is a little more involved though, since we split
request queues into per-process contexts.
This fixes a performance problem with eg dump(8), since it uses
several processes in some silly attempt to speed IO up. Even if
dump(8) isn't really a valid case (it should be fixed by using
CLONE_IO), there are other cases where we see close processes
and where idling ends up hurting performance.
Credit goes to Jeff Moyer <jmoyer@redhat.com> for writing the
initial implementation.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We only kick the dispatch for an idling queue, if we think it's a
(somewhat) fully merged request. Also allow a kick if we have other
busy queues in the system, since we don't want to risk waiting for
a potential merge in that case. It's better to get some work done and
proceed.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's called from the workqueue handlers from process context, so
we always have irqs enabled when entered.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_unmap_user() returns -EFAULT if a program passes an invalid
address to kernel. SG_IO path needs to pass the returned value to user
space instead of ignoring it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> reports that commit
b029195dda introduced a regression
of about 50% with sequential threaded read workloads. The test
case is:
tiotest -k0 -k1 -k3 -f 80 -t 32
which starts 32 threads each reading a 80MB file. Twiddle the kick
queue logic so that we do start IO immediately, if it appears to be
a fully merged request. We can't really detect that, so just check
if the request is bigger than a page or not. The assumption is that
since single bio issues will first queue a single request with just
one page attached and then later do merges on that, if we already
have more than a page worth of data in the request, then the request
is most likely good to go.
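The heuristic itself is tiny; as a sketch (hypothetical helper name, the
real check lives in the cfq insert path):

static bool rq_probably_merged(struct request *rq)
{
	/* a single-bio issue starts out with one page and grows only by
	 * merging, so anything bigger than a page is likely "done" */
	return blk_rq_bytes(rq) > PAGE_SIZE;
}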
Verified that this doesn't cause a regression with the test case that
commit b029195dda was fixing. It does not,
we still see maximum sized requests for the queue-then-merge cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y
branch tracer: Fix for enabling branch profiling makes sparse unusable
ftrace: Correct a text align for event format output
Update /debug/tracing/README
tracing/ftrace: alloc the started cpumask for the trace file
tracing, x86: remove duplicated #include
ftrace: Add check of sched_stopped for probe_sched_wakeup
function-graph: add proper initialization for init task
tracing/ftrace: fix missing include string.h
tracing: fix incorrect return type of ns2usecs()
tracing: remove CALLER_ADDR2 from wakeup tracer
blktrace: fix pdu_len when tracing packet command requests
blktrace: small cleanup in blk_msg_write()
blktrace: NUL-terminate user space messages
tracing: move scripts/trace/power.pl to scripts/tracing/power.pl
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
loop: mutex already unlocked in loop_clr_fd()
cfq-iosched: don't let idling interfere with plugging
block: remove unused REQ_UNPLUG
cfq-iosched: kill two unused cfqq flags
cfq-iosched: change dispatch logic to deal with single requests at the time
mflash: initial support
cciss: change to discover first memory BAR
cciss: kernel scan thread for MSA2012
cciss: fix residual count for block pc requests
block: fix inconsistency in I/O stat accounting code
block: elevator quiescing helpers
When CFQ is waiting for a new request from a process, currently it'll
immediately restart queuing when it sees such a request. This doesn't
work very well with streamed IO, since we then end up splitting IO
that would otherwise have been merged nicely. For a simple dd test,
this causes 10x as many requests to be issued as we should have.
Normally this goes unnoticed due to the low overhead of requests
at the device side, but some hardware is very sensitive to request
sizes, and there it can cause big slowdowns.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The request inherits the unplug flag from the bio, but it isn't actually
used. The bio flag stops at __make_request(), which tells it to unplug
after submission. Passing it on to the request doesn't make any sense.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We only manipulate the must_dispatch and queue_new flags, they are not
tested anymore. So get rid of them.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The IO scheduler core calls into the IO scheduler dispatch_request hook
to move requests from the IO scheduler and into the driver dispatch
list. It only does so when the dispatch list is empty. CFQ moves several
requests to the dispatch list, which can cause higher latencies if we
suddenly have to switch to some important sync IO. Change the logic to
move one request at a time instead.
This should almost be functionally equivalent to what we did before,
except that we now honor 'quantum' as the maximum queue depth at the
device side from any single cfqq. If there's just a single active
cfqq, we allow up to 4 times the normal quantum.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This forces in_flight to be zero when turning off or on the I/O stat
accounting and stops updating I/O stats in attempt_merge() when
accounting is turned off.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Simple helper functions to quiesce the request queue. These are
currently only used for switching IO schedulers on-the-fly, but
we can use them to properly switch IO accounting on and off as well.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fix a typo (this was in the original patch but was not merged when the
code fixes were, for some reason).
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
By default, CFQ will anticipate more IO from a given io context if the
previously completed IO was sync. This used to be fine, since the only
sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
being used now, we don't want to anticipate for those.
Add a bio/request flag that informs the IO scheduler that this is a sync
request that we should not idle for. Introduce WRITE_ODIRECT specifically
for O_DIRECT writes, and make sure that the other sync writes set this
flag.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the older SSD devices that don't do command queuing, we do want to
enable plugging to get better merging.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes sure that we never wait on async IO for sync requests, instead
of doing the split on writes vs reads.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
tracing, net: fix net tree and tracing tree merge interaction
tracing, powerpc: fix powerpc tree and tracing tree interaction
ring-buffer: do not remove reader page from list on ring buffer free
function-graph: allow unregistering twice
trace: make argument 'mem' of trace_seq_putmem() const
tracing: add missing 'extern' keywords to trace_output.h
tracing: provide trace_seq_reserve()
blktrace: print out BLK_TN_MESSAGE properly
blktrace: extract duplidate code
blktrace: fix memory leak when freeing struct blk_io_trace
blktrace: fix blk_probes_ref chaos
blktrace: make classic output more classic
blktrace: fix off-by-one bug
blktrace: fix the original blktrace
blktrace: fix a race when creating blk_tree_root in debugfs
blktrace: fix timestamp in binary output
tracing, Text Edit Lock: cleanup
tracing: filter fix for TRACE_EVENT_FORMAT events
ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
x86: kretprobe-booster interrupt emulation code fix
...
Fix up trivial conflicts in
arch/parisc/include/asm/ftrace.h
include/linux/memory.h
kernel/extable.c
kernel/module.c
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
cpumask: remove cpumask allocation from idle_balance, fix
numa, cpumask: move numa_node_id default implementation to topology.h, fix
cpumask: remove cpumask allocation from idle_balance
x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
x86: microcode: cleanup
x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
numa, cpumask: move numa_node_id default implementation to topology.h
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
cpumask: remove x86 cpumask_t uses.
cpumask: use cpumask_var_t in uv_flush_tlb_others.
cpumask: remove cpumask_t assignment from vector_allocation_domain()
cpumask: make Xen use the new operators.
cpumask: clean up summit's send_IPI functions
cpumask: use new cpumask functions throughout x86
x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
x86: unify 32 and 64-bit node_to_cpumask_map
...
* 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
s390: remove arch specific smp_send_stop()
panic: clean up kernel/panic.c
panic, smp: provide smp_send_stop() wrapper on UP too
panic: decrease oops_in_progress only after having done the panic
generic-ipi: eliminate WARN_ON()s during oops/panic
generic-ipi: cleanups
generic-ipi: remove CSD_FLAG_WAIT
generic-ipi: remove kmalloc()
generic IPI: simplify barriers and locking
Impact: output all of packet commands - not just the first 4 / 8 bytes
Since commit d7e3c3249e ("block: add
large command support"), struct request->cmd has been changed from
unsigned char cmd[BLK_MAX_CDB] to unsigned char *cmd.
v1 -> v2: by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
- make sure rq->cmd_len is always initialized, and then we can use
rq->cmd_len instead of BLK_MAX_CDB.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49D4507E.2060602@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'percpu-cpumask-x86-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (682 commits)
percpu: fix spurious alignment WARN in legacy SMP percpu allocator
percpu: generalize embedding first chunk setup helper
percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk()
percpu: make x86 addr <-> pcpu ptr conversion macros generic
linker script: define __per_cpu_load on all SMP capable archs
x86: UV: remove uv_flush_tlb_others() WARN_ON
percpu: finer grained locking to break deadlock and allow atomic free
percpu: move fully free chunk reclamation into a work
percpu: move chunk area map extension out of area allocation
percpu: replace pcpu_realloc() with pcpu_mem_alloc() and pcpu_mem_free()
x86, percpu: setup reserved percpu area for x86_64
percpu, module: implement reserved allocation and use it for module percpu variables
percpu: add an indirection ptr for chunk page map access
x86: make embedding percpu allocator return excessive free space
percpu: use negative for auto for pcpu_setup_first_chunk() arguments
percpu: improve first chunk initial area map handling
percpu: cosmetic renames in pcpu_setup_first_chunk()
percpu: clean up percpu constants
x86: un-__init fill_pud/pmd/pte
x86: remove vestigial fix_ioremap prototypes
...
Manually merge conflicts in arch/ia64/kernel/irq_ia64.c
bsg submits REQ_TYPE_BLOCK_PC so the right check is max_hw_sectors.
But I've removed this check because right after, bsg proceeds with
calling blk_rq_map_user() which does all the right checks.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Put a WARN_ON in __blk_put_request if it is about to
leak bio(s). This is a serious bug that can happen in error
handling code paths.
For this to work I have fixed a couple of places in block/ where
request->bio != NULL ownership was not honored. And a small cleanup
at sg_io() while at it.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently inherited from sg.c bsg will submit asynchronous request
at the head-of-the-queue, (using "at_head" set in the call to
blk_execute_rq_nowait()). This is bad in situation where the queues
are full, requests will execute out of order, and can cause
starvation of the first submitted requests.
The sg_io_v4->flags member is used and a bit is allocated to denote the
Q_AT_TAIL. Zero is to queue at_head as before, to be compatible with old
code at the write/read path. SG_IO code path behavior was changed to
be the same as write/read behavior. SG_IO was very rarely used and breaking
compatibility with it is OK at this stage.
sg_io_hdr at sg.h also has a flags member and uses 3 bits from the first
nibble and one bit from the last nibble. Even though none of these bits
are supported by bsg, the second nibble is allocated for use by bsg, just
in case.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
CC: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: cleanup
This is presumably what those definitions are for, and while all archs
define cpu_core_map/cpu_sibling map, that's changing (eg. x86 wants to
change it to a pointer).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This allows it to compile and be used on the ps3 platform that wants
to use the #define values in scsi.h without actually having
CONFIG_SCSI set.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Commit 1e42807918 introduced a bug where we
don't get front/back segment sizes in the bio in blk_recount_segments().
Fix this by tracking the back bio as well as the front bio in
__blk_recalc_rq_segments(), this also cleans up the interface by getting
rid of the segment size pointer passing.
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add documentation for register_blkdev() function and for the parameters.
Signed-off-by: Márton Németh <nm127@freemail.hu>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework
the code so that we can use CSD_FLAG_LOCK for both purposes.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This prepares for a real __alloc_percpu, by adding an alignment argument.
Only one place uses __alloc_percpu directly, and that's for a string.
tj: af_inet also uses __alloc_percpu(), update it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
blk_abort_queue() iterates the timeout list and aborts each request on the
list, but if the driver error handling readds a request to the timeout list
during this processing, we could be looping forever. Fix this by splicing
current entries to a local list and run over that list instead.
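The splice-to-local-list pattern, as a sketch (the real function does
this under q->queue_lock):

LIST_HEAD(list);
struct request *rq, *tmp;

/* move the current entries onto a private list; anything the error
 * handler re-adds to q->timeout_list is left for the next run */
list_splice_init(&q->timeout_list, &list);

list_for_each_entry_safe(rq, tmp, &list, timeout_list)
	blk_abort_request(rq);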
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Hi Tejun,
it looks like your commit:
block: don't depend on consecutive minor space
f331c0296f
broke a particular case for booting from partitioned md/raid devices.
That is the second time this has been broken recently. The previous
time was fixed by
block: do_mounts - accept root=<non-existant partition>
30f2f0eb4b
Because the data isn't available when an md device is first created
(we add disks and set it up after creation), the initial partition
scan finds nothing. It is not until the device is opened that
another partition scan happens and finds something.
So at the point where the kernel parameter "root=/dev/md_d0p1" is
being parsed, md_d0 exists, but md_d0p1 does not.
However if we let blk_lookup_devt return the correct device number
even though the device doesn't exist, then the attempt to mount it
will successfully find the partition.
I have tried in the past to find a way to get the partition table to
be read as soon as the array is assembled but that proved impossible
(at the time). I don't remember the details, and could possibly
revisit it. However it would be really nice if blk_lookup_devt
could be adjusted to again accept non-existent partitions.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.
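To illustrate the problem: the BIO_RW_* symbols are bit numbers, not
masks, so OR-ing them together and then shifting yields a single wrong
bit; each flag has to be shifted on its own:

/* wrong: (1 << (BIO_RW_SYNCIO | BIO_RW_UNPLUG)) sets one unrelated bit */

/* right: build the mask from each bit number separately */
bio->bi_rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG);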
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When submitting requests via SG_IO, which does a sync io, a
bsg_command is not allocated, so an in-kernel sense_buffer was not
set. However, when calling blk_execute_rq() with no sense buffer
one is provided from the stack. Now bsg at blk_complete_sgv4_hdr_rq()
would check rq->sense_len and, if a sense was requested by sg_io_v4,
the rq->sense was copied back to user space, but by now it is already
mangled stack memory.
I have fixed that by forcing a sense_buffer when calling bsg_map_hdr().
The bsg_command->sense is provided in the write/read path like before,
and on-the-stack buffer is provided when doing SG_IO.
I have also fixed a dprintk message to print rq->errors in hex because
of the scsi bit-field use of this member. For other block devices it
does not matter anyway.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: cleanup
To make it easy for ftrace plugin writers, as this was open coded in
the existing plugins
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API
These new functions do what previously was being open coded, reducing
the number of details ftrace plugin writers have to worry about.
It also standardizes the handling of stacktrace, userstacktrace and
other trace options we may introduce in the future.
With this patch, for instance, the blk tracer (and some others already
in the tree) can use the "userstacktrace" /d/tracing/trace_options
facility.
$ codiff /tmp/vmlinux.before /tmp/vmlinux.after
linux-2.6-tip/kernel/trace/trace.c:
trace_vprintk | -5
trace_graph_return | -22
trace_graph_entry | -26
trace_function | -45
__ftrace_trace_stack | -27
ftrace_trace_userstack | -29
tracing_sched_switch_trace | -66
tracing_stop | +1
trace_seq_to_user | -1
ftrace_trace_special | -63
ftrace_special | +1
tracing_sched_wakeup_trace | -70
tracing_reset_online_cpus | -1
13 functions changed, 2 bytes added, 355 bytes removed, diff: -353
linux-2.6-tip/block/blktrace.c:
__blk_add_trace | -58
1 function changed, 58 bytes removed, diff: -58
linux-2.6-tip/kernel/trace/trace.c:
trace_buffer_lock_reserve | +88
trace_buffer_unlock_commit | +86
2 functions changed, 174 bytes added, diff: +174
/tmp/vmlinux.after:
16 functions changed, 176 bytes added, 413 bytes removed, diff: -237
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: simplification of tracers
As all tracers are doing this we might as well do it in
register_ftrace_event and save one branch each time we call these
callbacks.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As they actually all return these enumerators.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: bugfix and cleanup
Some callsites were returning either TRACE_TYPE_PARTIAL_LINE if the
trace_seq routines (trace_seq_printf, etc) returned 0, meaning the buffer
was full, or zero otherwise.
But...
/* Return values for print_line callback */
enum print_line_t {
TRACE_TYPE_PARTIAL_LINE = 0, /* Retry after flushing the seq */
TRACE_TYPE_HANDLED = 1,
TRACE_TYPE_UNHANDLED = 2 /* Relay to other output functions */
};
In other cases the return value was not being relayed at all.
Most of the time it didn't hurt because the page didn't get filled, but
for correctness sake, handle the return values everywhere.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
With this and a blkrawverify modified not to verify the sequence numbers
we can start using the userspace tools to verify that the data produced
with the ftrace plugin works as expected.
Example:
[root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable
[root@f10-1 ~]# echo bin > /d/tracing/trace_options
[root@f10-1 ~]# echo blk > /d/tracing/current_tracer
[root@f10-1 ~]# cat /d/tracing/trace_pipe > sda1.blktrace.0
^C
[root@f10-1 ~]# ./blkrawverify --noseq sda1
Verifying sda1
CPU 0
Wrote output to sda1.verify.out
[root@f10-1 ~]# cat sda1.verify.out
---------------
Verifying sda1
---------------------
Summary for cpu 0:
1349 valid + 0 invalid (100.0%) processed
[root@f10-1 ~]#
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: API change
The trace_seq and trace_entry are in trace_iterator, where there are
more fields that may be needed by tracers, so just pass the
tracer_iterator as is already the case for struct tracer->print_line.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some initial probe requests don't have disk->queue mapped yet, so we
can't rely on a non-NULL queue in blk_queue_io_stat(). Wrap it in
blk_do_io_stat().
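The wrapper is approximately:

static inline int blk_do_io_stat(struct request *rq)
{
	struct gendisk *disk = rq->rq_disk;

	/* early probe requests may not have disk->queue wired up yet */
	if (!disk || !disk->queue)
		return 0;

	return blk_queue_io_stat(disk->queue);
}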
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds the ability to pre-empt an ongoing BE timeslice when a RT
request is waiting for the current timeslice to complete. This reduces the
wait time to disk for RT requests from an upper bound of 4 (current value
of cfq_quantum) to 1 disk request.
Applied Jens' suggested changes to avoid the rb lookup and use !cfq_class_rt()
and retested.
Latency(secs) for the RT task when doing sequential reads from 10G file.
| only RT | RT + BE | RT + BE + this patch
small (512 byte) reads | 143 | 163 | 145
large (1Mb) reads | 142 | 158 | 146
Signed-off-by: Divyesh Shah <dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This allows us to turn off disk stat accounting completely, for the cases
where the 0.5-1% reduction in system time is important.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This fixes a "regression" from 2.6.28, where the barrier probes that file
systems may do would trigger additional end request warnings in dmesg.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For some devices (i.e. CFA ATA) we can't reliably detect whether
the device is of rotational or non-rotational type so we need to
leave the final decision about this setting to the user-space.
As a bonus do a minor CodingStyle fixup in queue_nomerges_store().
Suggested-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Allow a block device to allocate and register an integrity profile
without providing a template. This allows DM to preallocate a profile
to avoid deadlocks during table reconfiguration.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: cleanup
Use tracing_reset_online_cpus instead of open coding it.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
Also mention in the help text that blktrace now can be used using
the ftrace interface.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: New way of using the blktrace infrastructure
This drops the requirement of userspace utilities to use the blktrace
facility.
Configuration is done thru sysfs, adding a "trace" directory to the
partition directory where blktrace can be enabled for the associated
request_queue.
The same filters present in the IOCTL interface are present as sysfs
device attributes.
The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
filters.
The other files in this directory: pid, act_mask, start_lba and end_lba
can be used with the same meaning as with the IOCTL interface.
Using the sysfs interface will only set up the request_queue->blk_trace
fields; tracing will only take place when the "blk" tracer is selected
via the ftrace interface, as in the following example:
To see the trace, one can use the /d/tracing/trace file or the
/d/tracing/trace_pipe file, with semantics defined in the ftrace
documentation in Documentation/ftrace.txt.
[root@f10-1 ~]# cat /t/trace
kjournald-305 [000] 3046.491224: 8,1 A WBS 6367 + 8 <- (8,1) 6304
kjournald-305 [000] 3046.491227: 8,1 Q R 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491236: 8,1 G RB 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491239: 8,1 P NS [kjournald]
kjournald-305 [000] 3046.491242: 8,1 I RBS 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491251: 8,1 D WB 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491610: 8,1 U WS [kjournald] 1
<idle>-0 [000] 3046.511914: 8,1 C RS 6367 + 8 [6367]
[root@f10-1 ~]#
The default line context (prefix) format is the one described in the ftrace
documentation, with the blktrace specific bits using its existing format,
described in blkparse(8).
If one wants to have the classic blktrace formatting, this is possible by
using:
[root@f10-1 ~]# echo blk_classic > /t/trace_options
[root@f10-1 ~]# cat /t/trace
8,1 0 3046.491224 305 A WBS 6367 + 8 <- (8,1) 6304
8,1 0 3046.491227 305 Q R 6367 + 8 [kjournald]
8,1 0 3046.491236 305 G RB 6367 + 8 [kjournald]
8,1 0 3046.491239 305 P NS [kjournald]
8,1 0 3046.491242 305 I RBS 6367 + 8 [kjournald]
8,1 0 3046.491251 305 D WB 6367 + 8 [kjournald]
8,1 0 3046.491610 305 U WS [kjournald] 1
8,1 0 3046.511914 0 C RS 6367 + 8 [6367]
[root@f10-1 ~]#
Using the ftrace standard format allows more flexibility, such
as the ability of asking for backtraces via trace_options:
[root@f10-1 ~]# echo noblk_classic > /t/trace_options
[root@f10-1 ~]# echo stacktrace > /t/trace_options
[root@f10-1 ~]# cat /t/trace
kjournald-305 [000] 3318.826779: 8,1 A WBS 6375 + 8 <- (8,1) 6312
kjournald-305 [000] 3318.826782:
<= submit_bio
<= submit_bh
<= sync_dirty_buffer
<= journal_commit_transaction
<= kjournald
<= kthread
<= child_rip
kjournald-305 [000] 3318.826836: 8,1 Q R 6375 + 8 [kjournald]
kjournald-305 [000] 3318.826837:
<= generic_make_request
<= submit_bio
<= submit_bh
<= sync_dirty_buffer
<= journal_commit_transaction
<= kjournald
<= kthread
Please read the ftrace documentation to use additional, standardized
tracing filters such as /d/tracing/trace_cpumask, etc.
See also /d/tracing/trace_mark to add comments in the trace stream,
that is equivalent to the /d/block/sdaN/msg interface.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
ext4: Remove "extents" mount option
block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
ext4: Make printk's consistently prefixed with "EXT4-fs: "
ext4: Add sanity checks for the superblock before mounting the filesystem
ext4: Add mount option to set kjournald's I/O priority
jbd2: Submit writes to the journal using WRITE_SYNC
jbd2: Add pid and journal device name to the "kjournald2 starting" message
ext4: Add markers for better debuggability
ext4: Remove code to create the journal inode
ext4: provide function to release metadata pages under memory pressure
ext3: provide function to release metadata pages under memory pressure
add releasepage hooks to block devices which can be used by file systems
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
ext4: Init the complete page while building buddy cache
ext4: Don't allow new groups to be added during block allocation
ext4: mark the blocks/inode bitmap beyond end of group as used
ext4: Use new buffer_head flag to check uninit group bitmaps initialization
ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
ext4: code cleanup
...
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (45 commits)
[SCSI] qla2xxx: Update version number to 8.03.00-k1.
[SCSI] qla2xxx: Add ISP81XX support.
[SCSI] qla2xxx: Use proper request/response queues with MQ instantiations.
[SCSI] qla2xxx: Correct MQ-chain information retrieval during a firmware dump.
[SCSI] qla2xxx: Collapse EFT/FCE copy procedures during a firmware dump.
[SCSI] qla2xxx: Don't pollute kernel logs with ZIO/RIO status messages.
[SCSI] qla2xxx: Don't fallback to interrupt-polling during re-initialization with MSI-X enabled.
[SCSI] qla2xxx: Remove support for reading/writing HW-event-log.
[SCSI] cxgb3i: add missing include
[SCSI] scsi_lib: fix DID_RESET status problems
[SCSI] fc transport: restore missing dev_loss_tmo callback to LLDD
[SCSI] aha152x_cs: Fix regression that keeps driver from using shared interrupts
[SCSI] sd: Correctly handle 6-byte commands with DIX
[SCSI] sd: DIF: Fix tagging on platforms with signed char
[SCSI] sd: DIF: Show app tag on error
[SCSI] Fix error handling for DIF/DIX
[SCSI] scsi_lib: don't decrement busy counters when inserting commands
[SCSI] libsas: fix test for negative unsigned and typos
[SCSI] a2091, gvp11: kill warn_unused_result warnings
[SCSI] fusion: Move a dereference below a NULL test
...
Fixed up trivial conflict due to moving the async part of sd_probe
around in the async probes vs using dev_set_name() in naming.
The commit 818827669d (block: make
blk_rq_map_user take a NULL user-space buffer) extended
blk_rq_map_user to accept a NULL user-space buffer with a READ
command. It was necessary to convert sg to use the block layer mapping
API.
This patch extends blk_rq_map_user again for a WRITE command. It is
necessary to convert st and osst drivers to use the block layer
mapping API.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This fixes bio_copy_user_iov to properly handle the partial mappings
with struct rq_map_data (which only sg uses for now but st and osst
will shortly). It adds the offset member to struct rq_map_data and
changes blk_rq_map_user to update it so that bio_copy_user_iov can add
an appropriate page frame via bio_add_pc_page().
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Original patch from Nikanth Karthikesan <knikanth@suse.de>
When a queue exits the queue lock is taken and cfq_exit_queue() would free all
the cic's associated with the queue.
But when a task exits, cfq_exit_io_context() gets cic one by one and then
locks the associated queue to call __cfq_exit_single_io_context. It looks like
between getting a cic from the ioc and locking the queue, the queue might have
exited on another cpu.
Fix this by rechecking the cfq_io_context queue key inside the queue lock
again, and not calling into __cfq_exit_single_io_context() if somebody
beat us to it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We have two separate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense, you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
zero is invalid for max_phys_segments, max_hw_segments, and
max_segment_size. It's better to use min_not_zero instead of
min. min() works though (because the commit 0e435ac makes sure that
these values are set to the default values, non zero, if a queue is
initialized properly).
With this patch, blk_queue_stack_limits does the almost same thing
that dm's combine_restrictions_low() does. I think that it's easy to
remove dm's combine_restrictions_low.
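Illustrating the change for a few of the stacked limits (zero means
"unset" here, so plain min() must not pick it up):

t->max_phys_segments = min_not_zero(t->max_phys_segments,
				    b->max_phys_segments);
t->max_hw_segments = min_not_zero(t->max_hw_segments,
				  b->max_hw_segments);
t->max_segment_size = min_not_zero(t->max_segment_size,
				   b->max_segment_size);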
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
disk_map_sector_rcu() returns a partition from a sector offset,
which we use for IO statistics on a per-partition basis. The
lookup itself is an O(N) list lookup, where N is the number of
partitions. This actually hurts performance quite a bit, even
on the lower end partitions. On higher numbered partitions,
it can get pretty bad.
Solve this by adding a one-hit cache for partition lookup.
This makes the lookup O(1) for the case where we do most IO to
one partition. Even for mixed partition workloads, amortized cost
is pretty close to O(1) since the natural IO batching makes the
one-hit cache last for lots of IOs.
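The lookup with the one-hit cache is roughly the following (field names
simplified, RCU locking assumed to be handled by the caller):

struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
{
	struct disk_part_tbl *ptbl = rcu_dereference(disk->part_tbl);
	struct hd_struct *part = rcu_dereference(ptbl->last_lookup);
	int i;

	/* one-hit cache: most IO goes to the partition we hit last time */
	if (part && sector >= part->start_sect &&
	    sector < part->start_sect + part->nr_sects)
		return part;

	for (i = 1; i < ptbl->len; i++) {
		part = rcu_dereference(ptbl->part[i]);
		if (part && sector >= part->start_sect &&
		    sector < part->start_sect + part->nr_sects) {
			rcu_assign_pointer(ptbl->last_lookup, part);
			return part;
		}
	}
	return &disk->part0;
}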
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This basically limits the hardware queue depth to 4*quantum at any
point in time, which is 16 with the default settings. As CFQ uses
other means to shrink the hardware queue when necessary in the first
place, there's really no need for this extra heuristic. Additionally,
it ends up hurting performance in some cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We just want to hand the first bits of IO to the device as fast
as possible. Gains a few percent on the IOPS rate.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Empty barrier on write-through (or no cache) w/ ordered tag has no
command to execute and without any command to execute ordered tag is
never issued to the device and the ordering is never achieved. Force
draining for such cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Empty barrier required special handling in __elv_next_request() to
complete it without letting the low level driver see it.
With previous changes, barrier code is now flexible enough to skip the
BAR step using the same barrier sequence selection mechanism. Drop
the special handling and mask off q->ordered from start_ordered().
Remove blk_empty_barrier() test which now has no user.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Barrier completion had the following assumptions.
* start_ordered() couldn't finish the whole sequence properly. If all
actions are to be skipped, q->ordseq is set correctly but the actual
completion was never triggered thus hanging the barrier request.
* Drain completion in elv_complete_request() assumed that there's
always at least one request in the queue when drain completes.
Both assumptions are true but these assumptions need to be removed to
improve empty barrier implementation. This patch makes the following
changes.
* Make start_ordered() use blk_ordered_complete_seq() to mark skipped
steps complete and notify __elv_next_request() that it should fetch
the next request if the whole barrier has completed inside
start_ordered().
* Make drain completion path in elv_complete_request() check whether
the queue is empty. Empty queue also indicates drain completion.
* While at it, convert 0/1 return from blk_do_ordered() to false/true.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In all barrier sequences, the barrier write itself was always assumed
to be issued and thus didn't have corresponding control flag. This
patch adds QUEUE_ORDERED_DO_BAR and unify action mask handling in
start_ordered() such that any barrier action can be skipped.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Because barrier mode can be changed dynamically, whether barrier is
supported or not can be determined only when actually issuing the
barrier and there is no point in checking it earlier. Drop barrier
support check in generic_make_request() and __make_request(), and
update comment around the support check in blk_do_ordered().
* There is no reason to check discard support in both
generic_make_request() and __make_request(). Drop the check in
__make_request(). While at it, move error action block to the end
of the function and add unlikely() to q existence test.
* Barrier request, be it empty or not, is never passed to low level
driver and thus it's meaningless to try to copy back req->sector to
bio->bi_sector on error. In addition, the notion of failed sector
doesn't make any sense for empty barrier to begin with. Drop the
code block from __end_that_request_first().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Separate out ordering type (drain,) and action masks (preflush,
postflush, fua) from visible ordering mode selectors
(QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_*
while action masks are named QUEUE_ORDERED_DO_*.
This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
optional to improve empty barrier implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After many improvements on kblockd_flush_work, it is now identical to
cancel_work_sync, so a direct call to cancel_work_sync is suggested.
The only difference is that cancel_work_sync is a GPL symbol,
so it is no longer usable from non-GPL modules.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Allow the scsi request REQ_QUIET flag to be propagated to the buffer
file system layer. The basic ideas is to pass the flag from the scsi
request to the bio (block IO) and then to the buffer layer. The buffer
layer can then suppress needless printks.
This patch declutters the kernel log by removing the 40-50 (per lun)
buffer io error messages seen during a boot in my multipath setup.
There is a good chance any real errors would be missed in the "noise"
in the logs without this patch.
During boot I see blocks of messages like
"
__ratelimit: 211 callbacks suppressed
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242847
Buffer I/O error on device sdm, logical block 1
Buffer I/O error on device sdm, logical block 5242878
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242872
"
in my logs.
My disk environment is multipath fiber channel using the SCSI_DH_RDAC
code and multipathd. This topology includes an "active" and "ghost"
path for each lun. IOs to the "ghost" path will never complete and the
SCSI layer, via the scsi device handler rdac code, quickly returns the
IOs to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
layer messages.
I want to extend the QUIET behavior to include the buffer file
system layer to deal with these errors as well. I have been running this
patch for a while now on several boxes without issue. A few runs of
bonnie++ show no noticeable difference in performance in my setup.
Thanks to John Stultz for the quiet_error finalization.
Submitted-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There's no need to take queue_lock or kernel_lock when modifying
bdi->ra_pages. So remove them. Also remove out of date comment for
queue_max_sectors_store().
Signed-off-by: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There is no argument named @tags in blk_init_tags, so
remove its comment.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Convert the timeout ioctl scaling to use the clock_t functions
which are much more accurate with some USER_HZ vs HZ combinations.
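As a rough illustration of the kind of conversion involved (a minimal
sketch, not the patch itself; the simplified argument handling and the
use of q->sg_timeout are assumptions):

  #include <linux/blkdev.h>
  #include <linux/errno.h>
  #include <linux/jiffies.h>
  #include <linux/times.h>
  #include <scsi/sg.h>

  /* Sketch: scale the SG timeout between USER_HZ (the ioctl ABI) and HZ
   * (internal jiffies) with the clock_t helpers instead of open-coded
   * multiplication/division, which loses precision for some HZ values. */
  static int example_sg_timeout_ioctl(struct request_queue *q,
                                      unsigned int cmd, unsigned long arg)
  {
          switch (cmd) {
          case SG_SET_TIMEOUT:
                  q->sg_timeout = clock_t_to_jiffies(arg);
                  return 0;
          case SG_GET_TIMEOUT:
                  return jiffies_to_clock_t(q->sg_timeout);
          }
          return -ENOTTY;
  }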
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For sync IO, we'll often do them serialized. This means we'll be touching
the queue timer for every IO, as opposed to only occasionally like we
do for queued IO. Instead of deleting the timer when the last request
is removed, just let it continue running. If a new request comes in soon,
we don't have to re-add the timer. If no new requests arrive,
the timer will expire without side effect later.
This improves high iops sync IO by ~1%.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now the rq->deadline can't be zero if the request is in the
timeout_list, so there is no need to have next_set. There is no need to
access a request's deadline field if blk_rq_timed_out is called on it.
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cpu_coregroup_map returned a cpumask_t: it's going away.
(Note, the sched part of this patch won't apply meaningfully to the
sched tree, but I'm posting it to show the goal).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
There's no point in having too short SG_IO timeouts, since if the
command does end up timing out, we'll end up through the reset sequence
that is several seconds long in order to abort the command that timed
out.
As a result, shorter timeouts than a few seconds simply do not make
sense, as the recovery would be longer than the timeout itself.
Add a BLK_MIN_SG_TIMEOUT to match the existing BLK_DEFAULT_SG_TIMEOUT.
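A minimal sketch of the intended floor (the 7-second value and the exact
place the clamp is applied are assumptions here):

  #include <linux/blkdev.h>
  #include <linux/kernel.h>

  /* Sketch: never let a user-supplied SG_IO timeout drop below a floor,
   * since recovery after an abort takes several seconds anyway. */
  #define EXAMPLE_BLK_MIN_SG_TIMEOUT      (7 * HZ)

  static unsigned int example_clamp_sg_timeout(unsigned int timeout)
  {
          return max_t(unsigned int, timeout, EXAMPLE_BLK_MIN_SG_TIMEOUT);
  }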
Suggested-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update FMODE_NDELAY before each ioctl call so that we can kill the
magic FMODE_NDELAY_NOW. It would be even better to do this directly
in setfl(), but for that we'd need to have FMODE_NDELAY for all files,
not just block special files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 33c2dca495 (trim file propagation
in block/compat_ioctl.c) removed the handling of some ioctls from
compat_blkdev_driver_ioctl. That caused them to be rejected as unknown
by the compat layer.
Signed-off-by: Andreas Schwab <schwab@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
devices.
When stacking devices (LVM over MD over SCSI) some of the request queue
parameters are not set up correctly in some cases by default, namely
max_segment_size and seg_boundary mask.
If you create an MD device over SCSI, these attributes are zeroed.
The problem appears when a device-mapper mapping is then stacked over
this one - queue attributes are set in DM this way:
request_queue max_segment_size seg_boundary_mask
SCSI 65536 0xffffffff
MD RAID1 0 0
LVM 65536 -1 (64bit)
Unfortunately bio_add_page (resp. bio_phys_segments) calculates the number
of physical segments according to these parameters.
During generic_make_request() the segment count is recalculated and can
increase bio->bi_phys_segments over the allowed limit (after bio_clone()
in the stacking operation).
This is especially a problem in the CCISS driver, where it produces an
OOPS here:
BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);
(MAXSGENTRIES is 31 by default.)
Sometimes even this command is enough to cause the oops:
dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10
This command generates bios with 250 sectors, allocated in 32 4k-pages
(the last page uses only 1024 bytes).
The LVM layer allocates a bio with 31 segments (still OK for CCISS), but
on the lower layer it is unfortunately recalculated to 32 segments, which
violates the CCISS restriction and triggers the BUG_ON().
The patch tries to fix it by:
* initializing attributes above in queue request constructor
blk_queue_make_request()
* make sure that blk_queue_stack_limits() inherits these settings
(DM uses its own function to set the limits because
blk_queue_stack_limits() was introduced later. It should probably switch
to the generic stack limit function too.)
* set the default seg_boundary value in one place (blkdev.h)
* use this mask as the default in DM (instead of -1, which differs on 64bit)
Bugs related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=471639
http://bugzilla.kernel.org/show_bug.cgi?id=8672
Signed-off-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
both start the timeout timer. Barrier code dequeues the original
barrier request but doesn't pass the request itself to the lower level
driver, only broken-down proxy requests; however, the original barrier
request goes through the same dequeue path and a timeout timer is
started on it. If the barrier sequence takes long enough, this timer
expires but the low level driver has no idea about this request and
an oops follows.
The timeout timer shouldn't have been started on the original barrier
request as it never goes through actual IO. This patch unexports
elv_dequeue_request(), which has no external user anyway, makes it
operate on the elevator proper without adding the timer, and makes
blkdev_dequeue_request() call elv_dequeue_request() and add the timer.
Internal users which don't pass the request to driver - barrier code
and end_that_request_last() - are converted to use
elv_dequeue_request().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
disk->node_id is used when allocating in disk_expand_part_tbl, so we
should set it before it is used.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
mapping. Which is good if the pages were mapped - but if they were provided
by someone else and just copied, then bad things happen - pages are
released once here, and once by the caller, leading to a user-triggerable
BUG at include/linux/mm.h:246.
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
sites. Spread them out to the usage sites, as suggested by
Mathieu Desnoyers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
This was a forward port of work done by Mathieu Desnoyers, I changed it to
encode the 'what' parameter on the tracepoint name, so that one can register
interest in specific events and not on classes of events to then check the
'what' parameter.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the size passed in is OK but we end up mapping too many segments,
we call the unmap path directly like from IO completion. But from IO
completion we have an extra reference to the bio, so this error case
goes OOPS when it attempts to free an already freed bio.
Fix it by getting an extra reference to the bio before calling the
unmap failure case.
Reported-by: Petr Vandrovec <vandrove@vc.cvut.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We run into system boot failure with kernel 2.6.28-rc. We found it on a
couple of machines, including T61 notebook, nehalem machine, and another
HPC NX6325 notebook. All the machines use FedoraCore 8 or FedoraCore 9.
With kernels prior to 2.6.28-rc, system boot doesn't fail.
I debugged it and located the root cause. Please see
http://bugzilla.kernel.org/show_bug.cgi?id=11899
https://bugzilla.redhat.com/show_bug.cgi?id=471517
As a matter of fact, there are 2 bugs.
1) root=/dev/sda1, system boot randomly fails. Mostly, it boots 5 times
and fails once. nash has a bug. Some of its functions misuse return
value 0. Sometimes, 0 means timeout and no uevent available. Sometimes,
0 means nash gets an uevent, but the uevent isn't block-related (for
example, usb). If by coincidence the kernel tells nash that uevents are
available, but the kernel also sets the timeout, nash might stop
collecting other uevents in the queue if the current uevent isn't
block-related. I worked out a patch for nash to fix it.
http://bugzilla.kernel.org/attachment.cgi?id=18858
2) root=LABEL=/, system can never boot. initrd init reports that
switchroot fails. Here is an execution branch of nash when booting:
(1) nash reads /sys/block/sda/dev; assume major is 8 (on my desktop);
(2) nash queries /proc/devices with the major number; it finds the line
"8 sd";
(3) nash uses 'sd' to search its own probe table to find the device (DISK)
type for the device and adds it to its own list;
(4) Later on, it probes all devices in its list to get filesystem
labels; scsi always registers "8 sd".
When the major is 259, nash fails to find the device (DISK) type. I enabled
CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling the kernel, so 259 is picked
up for device /dev/sda1, which causes nash to fail to find the device
(DISK) type.
To fix issue 2), I created a patch for nash and another patch for the
kernel.
http://bugzilla.kernel.org/attachment.cgi?id=18859
http://bugzilla.kernel.org/attachment.cgi?id=18837
Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
block device in proc/devices.
With 2 patches on nash and 1 patch on kernel, I boot my machines for
dozens of times without failure.
Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make add_partition() return pointer to the new hd_struct on success
and ERR_PTR() value on failure. This change will be used to fix md
autodetection bug.
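A hedged sketch of what a caller looks like after this change (the exact
parameter list of add_partition() is an assumption here):

  #include <linux/err.h>
  #include <linux/genhd.h>

  /* Sketch: with add_partition() returning a pointer, callers such as the
   * md autodetect path can use the usual ERR_PTR()/IS_ERR() idiom. */
  static int example_add_part(struct gendisk *disk, int partno,
                              sector_t start, sector_t len)
  {
          struct hd_struct *part = add_partition(disk, partno, start, len, 0);

          if (IS_ERR(part))
                  return PTR_ERR(part);
          return 0;
  }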
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch (as1159b) changes the timeout routines in the block core to
use round_jiffies_up(). There's no point in rounding the timer
deadline down, since if it expires too early we will have to restart
it.
The patch also removes some unnecessary tests when a request is
removed from the queue's timer list.
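A minimal sketch of the rounding direction this describes (field names
follow the block timeout code of this era but are assumptions here):

  #include <linux/blkdev.h>
  #include <linux/jiffies.h>
  #include <linux/timer.h>

  /* Sketch: arm the per-queue timeout timer, rounding the deadline *up*
   * so the timer never fires before the request's full timeout has
   * elapsed, which would only force a restart. */
  static void example_arm_timeout(struct request_queue *q, struct request *req)
  {
          req->deadline = jiffies + q->rq_timeout;
          mod_timer(&q->timeout, round_jiffies_up(req->deadline));
  }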
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move the call to blk_delete_timer to later in end_that_request_last to
address an issue where blkdev_dequeue_request may have added a timer for
the request.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Block queue supports two usage models - one where block driver peeks
at the front of queue using elv_next_request(), processes it and
finishes it and the other where block driver peeks at the front of
queue, dequeue the request using blkdev_dequeue_request() and finishes
it. The latter is more flexible as it allows the driver to process
multiple commands concurrently.
These two inconsistent usage models make the block layer
implementation confusing. Some consider elv_next_request() the issue
point while others consider blkdev_dequeue_request() the issue point.
Till now the inconsistency mostly affected only accounting, so it didn't
really break anything seriously; however, with the block layer timeout,
this inconsistency hits hard. The block layer considers
elv_next_request() the issue point and adds the timer, but the SCSI layer
thinks it was just peeking, and when it can't process the command right
away the request is just left there without further processing.
This leaves the request dangling on the timer list and, when the timer
goes off, the request which the SCSI layer and below think is still on
the block queue ends up in the EH queue, causing various problems - EH
hangs (failed count goes over busy count and EH never wakes up),
WARN_ON()s and oopses as the low level driver tries to handle the unknown
command, etc. depending on the timing.
As the SCSI midlayer is the only user of the block layer timer at the
moment, moving blk_add_timer() to elv_dequeue_request() fixes the problem;
however, these two usage models definitely need to be cleaned up in the
future.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We need to do bd_claim() only if the file hadn't been opened with O_EXCL,
and then we have no need to use the file itself as owner.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of that stuff doesn't need BKL at all; expand in the (only) caller,
merge the switch into one there and leave BKL only around the stuff that
might actually need it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Analog of blkdev_driver_ioctl() with sane arguments. For
now uses fake struct file, by the end of the series it won't
and blkdev_driver_ioctl() will become a wrapper around it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: remove __generic_unplug_device() from exports
block: move q->unplug_work initialization
blktrace: pass zfcp driver data
blktrace: add support for driver data
block: fix current kernel-doc warnings
block: only call ->request_fn when the queue is not stopped
block: simplify string handling in elv_iosched_store()
block: fix kernel-doc for blk_alloc_devt()
block: fix nr_phys_segments miscalculation bug
block: add partition attribute for partition number
block: add BIG FAT WARNING to CONFIG_DEBUG_BLOCK_EXT_DEVT
softirq: Add support for triggering softirq work on softirqs.
modprobe loop; rmmod loop effectively creates a blk_queue and destroys it
which results in q->unplug_work being canceled without it ever being
initialized.
Therefore, move the initialization of q->unplug_work from
blk_queue_make_request() to blk_alloc_queue*().
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fix block kernel-doc warnings:
Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path'
Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu'
Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part'
Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Callers should use either blk_run_queue/__blk_run_queue, or
blk_start_queueing() to invoke request handling instead of calling
->request_fn() directly as that does not take the queue stopped
flag into account.
Also add appropriate comments on the above functions to detail
their usage.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
strlcpy() guarantees the dest buffer is NULL terminated.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This fixes the bug reported by Nikanth Karthikesan <knikanth@suse.de>:
http://lkml.org/lkml/2008/10/2/203
The root cause of the bug is that blk_phys_contig_segment
miscalculates q->max_segment_size.
blk_phys_contig_segment checks:
req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size
But blk_recalc_rq_segments might expect that req->biotail and the
previous bio in the req are supposed to be merged into one
segment. blk_recalc_rq_segments might also expect that next_req->bio
and the next bio in the next_req are supposed to be merged into one
segment. In such a case, we merge two requests that can't be merged
here. Later, blk_rq_map_sg gives more segments than it should.
We need to keep track of segment size in blk_recalc_rq_segments and
use it to see if two requests can be merged. This patch implements it
in the similar way that we used to do for hw merging (virtual
merging).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Multipath is best at handling transport errors. If it gets a device
error then there is not much the multipath layer can do. It will just
access the same device but from a different path.
This patch breaks up failfast into device, transport and driver errors.
The multipath layers (md and dm multipath) only ask the lower levels to
fast fail transport errors. The user of failfast, read ahead, will ask
to fast fail on all errors.
Note that blk_noretry_request will return true if any failfast bit
is set. This allows drivers that do not support the multipath failfast
bits to continue to fail on any failfast error like before. Drivers
like scsi that are able to fail fast specific errors can check
for the specific fail fast type. In the next patch I will convert
scsi.
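A sketch of the "any failfast bit" test described above (the helper name
is illustrative; the flag names mirror the request flags this series
introduces):

  #include <linux/blkdev.h>

  /* Sketch: drivers that don't care about the distinction keep failing
   * fast on any of the three failfast variants. */
  static inline int example_noretry_request(struct request *rq)
  {
          return (rq->cmd_flags & (REQ_FAILFAST_DEV |
                                   REQ_FAILFAST_TRANSPORT |
                                   REQ_FAILFAST_DRIVER)) != 0;
  }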
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The DM and MD integrity support now depends on being able to use
gendisks instead of block_devices when comparing integrity profiles.
Change function parameters accordingly.
Also update comparison logic so that two NULL profiles are a valid
configuration.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- kobject_del already puts the parent.
- Set integrity profile to NULL to prevent stale data.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes end_queued_request() and end_dequeued_request(),
which are no longer used.
As a result, the only remaining user of __end_request() is end_request().
So the actual code in __end_request() is moved to end_request()
and __end_request() is removed.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch converts elevator to use __blk_end_request() directly
so that end_{queued|dequeued}_request() can be removed.
The related 'uptodate' arguments are converted to 'error'.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Define as 32, which is what BDEVNAME_SIZE is/was as well. This keeps
the user interface the same and gets rid of the difference between
kernel and user api here.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds a new interface, blk_lld_busy(), to check the lld's
busy state from the block layer.
blk_lld_busy() calls down into the low-level driver for the check if
the driver has set q->lld_busy_fn() using blk_queue_lld_busy().
This resolves the performance problem on request stacking devices
described below.
Some drivers like the scsi mid layer stop dispatching requests when
they detect a busy state on their low-level device like host/target/device.
It allows other requests to stay in the I/O scheduler's queue
for a chance of merging.
Request stacking drivers like request-based dm should follow
the same logic.
However, there is no generic interface for the stacked device
to check if the underlying device(s) are busy.
If the request stacking driver dispatches and submits requests to
the busy underlying device, the requests will stay in
the underlying device's queue without a chance of merging.
This causes a performance problem under burst I/O load.
With this patch, busy state of the underlying device is exported
via q->lld_busy_fn(). So the request stacking driver can check it
and stop dispatching requests if busy.
The underlying device driver must return the busy state appropriately:
1: when the device driver can't process requests immediately.
0: when the device driver can process requests immediately,
including abnormal situations where the device driver needs
to kill all requests.
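A minimal sketch of how a low-level driver might hook this up (the
driver-private structure and busy heuristic are hypothetical):

  #include <linux/blkdev.h>

  struct example_dev {                    /* hypothetical driver data */
          unsigned int outstanding;
          unsigned int queue_depth;
  };

  /* 1: can't take requests right now, 0: go ahead (or killing all) */
  static int example_lld_busy(struct request_queue *q)
  {
          struct example_dev *dev = q->queuedata;

          return dev->outstanding >= dev->queue_depth;
  }

  static void example_init_queue(struct request_queue *q)
  {
          blk_queue_lld_busy(q, example_lld_busy);
  }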
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_start_queueing() should act like the generic queue unplugging
and kicking and ignore a stopped queue. Such a queue may not be
run until after a call to blk_start_queue().
Signed-off-by: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
By only allowing async IO to consume 3/4ths of the tag depth, we
always have slots free to serve sync IO. This is important to avoid
having writes fill the entire tag queue, thus starving reads.
Original patch and idea from Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We really need to know about the hardware tagging support as well,
since if the SSD does not do tagging then we still want to idle.
Otherwise we have the same dependent sync IO vs flooding async IO
problem as on rotational media.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We don't want to idle in AS/CFQ if the device doesn't have a seek
penalty. So add a QUEUE_FLAG_NONROT to indicate a non-rotational
device, low level drivers should set this flag upon discovery of
an SSD or similar device type.
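For illustration, a driver-side sketch (the probe hook is hypothetical;
queue_flag_set_unlocked() is the usual unlocked flag helper):

  #include <linux/blkdev.h>

  /* Sketch: mark a freshly probed device as non-rotational so the I/O
   * schedulers skip idling on it. */
  static void example_probe_done(struct request_queue *q, bool is_ssd)
  {
          if (is_ssd)
                  queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
  }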
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds a queue flag to indicate the block device can be
used for request stacking.
Request stacking drivers need to stack their devices on top of
only devices of which q->request_fn is functional.
Since bio stacking drivers (e.g. md, loop) basically initialize
their queue using blk_alloc_queue() and don't set q->request_fn,
the check of (q->request_fn == NULL) looks enough for that purpose.
However, dm will become both types of stacking driver (bio-based and
request-based), and dm will always set q->request_fn even if the dm
device is bio-based, in which case q->request_fn is not actually
functional.
So we need something else to distinguish the type of the device.
Adding a queue flag is a solution for that.
The reason why dm always sets q->request_fn is to keep
the compatibility of dm user-space tools.
Currently, all dm user-space tools are using bio-based dm without
specifying the type of the dm device they use.
To use request-based dm without changing such tools, the kernel
must decide the type of the dm device automatically.
The automatic type decision can't be done at the device creation time
and needs to be deferred until such tools load a mapping table,
since the actual type is decided by dm target type included in
the mapping table.
So a dm device has to be initialized using blk_init_queue()
so that we can load either type of table.
Then, all queue fields are set (e.g. q->request_fn) and we have
no element to distinguish whether it is bio-based or request-based,
even after a table is loaded and the type of the device is decided.
By the way, some parts of the queue (e.g. request_list, elevator)
are not needed when the dm device is used as bio-based.
But the memory size is not so large (about 20[KB] per queue on ia64),
so I hope the memory loss can be acceptable for bio-based dm users.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds blk_insert_cloned_request(), a generic request
submission interface for request stacking drivers.
Request-based dm will use it to submit their clones to underlying
devices.
blk_rq_check_limits() is also added because it is possible that
the lower queue has stronger limitations than the upper queue
if multiple drivers are stacking at request-level.
Not only for blk_insert_cloned_request()'s internal use, the function
will be used by request-based dm when the queue limitation is
modified (e.g. by replacing dm's table).
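A hedged sketch of the dispatch side in a request stacking driver (error
handling reduced to a bare return; names are illustrative):

  #include <linux/blkdev.h>

  /* Sketch: validate a prepared clone against the lower queue's limits,
   * then hand it to the lower device for insertion and execution. */
  static int example_dispatch_clone(struct request_queue *lower_q,
                                    struct request *clone)
  {
          int ret = blk_rq_check_limits(lower_q, clone);

          if (ret)
                  return ret;
          return blk_insert_cloned_request(lower_q, clone);
  }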
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds blk_update_request(), which updates struct request
with completing its data part, but doesn't complete the struct
request itself.
Though it looks like end_that_request_first() of older kernels,
blk_update_request() should be used only by request stacking drivers.
Request-based dm will use it in bio->bi_end_io callback to update
the original request when a data part of a cloned request completes.
The following is additional background information on why request-based
dm needs this interface.
- Request stacking drivers can't use blk_end_request() directly from
the lower driver's completion context (bio->bi_end_io or rq->end_io),
because some device drivers (e.g. ide) may try to complete
their request with queue lock held, and it may cause deadlock.
See below for detailed description of possible deadlock:
<http://marc.info/?l=linux-kernel&m=120311479108569&w=2>
- To solve that, request-based dm offloads the completion of
cloned struct request to softirq context (i.e. using
blk_complete_request() from rq->end_io).
- Though it is possible to use the same solution from bio->bi_end_io,
it will delay the notification of bio completion to the original
submitter. Also, it will cause inefficient partial completion,
because the lower driver can't perform the cloned request anymore
and request-based dm needs to requeue and redispatch it to
the lower driver again later. That's not good.
- So request-based dm needs blk_update_request() to perform the bio
completion in the lower driver's completion context, which is more
efficient.
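A rough sketch of the completion hook shape this enables; the per-clone
bookkeeping structure and the exact blk_update_request() signature are
assumptions for illustration:

  #include <linux/bio.h>
  #include <linux/blkdev.h>
  #include <linux/slab.h>

  struct example_clone_ctx {              /* hypothetical per-clone state */
          struct request *orig_rq;
          unsigned int bytes;
  };

  /* Sketch: complete the data portion of the original request from the
   * clone's bi_end_io, without completing the struct request itself. */
  static void example_clone_bio_end_io(struct bio *clone, int error)
  {
          struct example_clone_ctx *ctx = clone->bi_private;

          blk_update_request(ctx->orig_rq, error, ctx->bytes);
          kfree(ctx);
          bio_put(clone);
  }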
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When a driver calls blk_cleanup_queue(), the device should be fully idle.
However, the block layer may have pending plugging timers and the IO
schedulers may have pending work in the work queues. So quiesce the device
by waiting for the timer and flushing the work queues.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We cannot abort a request if we raced with the timeout handler already,
or with the IO completion. So make blk_abort_request() mark the request
as complete, and only continue if we succeeded.
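A minimal sketch of the race-safe shape described here (the helper names
mirror the block timeout internals but should be read as assumptions):

  #include "blk.h"        /* block-layer internal helpers */

  /* Sketch: only the path that wins the atomic "mark complete" race is
   * allowed to push the request into timeout/EH handling. */
  static void example_abort_request(struct request *req)
  {
          if (blk_mark_rq_complete(req))
                  return;         /* completion or timeout already owns it */
          blk_delete_timer(req);
          blk_rq_timed_out(req);
  }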
Found and suggested by Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Only works for the generic request timer handling. Allows one to
sporadically ignore request completions, thus exercising the timeout
handling.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Two mods to blkdev_issue_discard(), thinking ahead to its use on swap:
1. Add gfp_mask argument, so swap allocation can use it where GFP_KERNEL
might deadlock but GFP_NOIO is safe.
2. Enlarge nr_sects argument from unsigned to sector_t: unsigned long is
enough to cover a whole swap area, but sector_t suits any partition.
Change sb_issue_discard()'s nr_blocks to sector_t too; but no need seen
for a gfp_mask there, just pass GFP_KERNEL down to blkdev_issue_discard().
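For example, a swap-side caller along these lines (a sketch; the helper
name and where it would live are assumptions):

  #include <linux/blkdev.h>

  /* Sketch: discard a range with GFP_NOIO, since GFP_KERNEL could recurse
   * into reclaim and deadlock on the swap path. */
  static int example_discard_swap_range(struct block_device *bdev,
                                        sector_t start, sector_t nr_sects)
  {
          return blkdev_issue_discard(bdev, start, nr_sects, GFP_NOIO);
  }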
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Right now SCSI and others do their own command timeout handling.
Move those bits to the block layer.
Instead of having a timer per command, we try to be a bit more clever
and simply have one per-queue. This avoids the overhead of having to
tear down and setup a timer for each command, so it will result in a lot
less timer fiddling.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
seqf can be started multiple times for a read and the header should be
printed only for the initial one. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch changes blk_rq_map_user to accept a NULL user-space buffer
with a READ command if rq_map_data is not NULL. Thus a caller can pass
page frames to blk_rq_map_user to just set up a request and bios with
page frames properly. bio_uncopy_user (called via blk_rq_unmap_user)
doesn't copy data to user space with such a request.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bdget_disk() and blk_lookup_devt() never cared whether the specified
partition (or disk) is zero sized or not. I got confused while
converting those not to depend on consecutive minor numbers in commit
5a6411b1178baf534aa9138052864dfa89d3eada, and later when dev0 was added
it broke callers which expected to get a valid return value for zero
sized disk devices.
So, they never needed nr_sects checks in the first place. Kill them.
This problem was spotted and debugged by Bartlomiej Zolnierkiewicz.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Noticed by sparse:
block/blk-softirq.c:156:12: warning: symbol 'blk_softirq_init' was not declared. Should it be static?
block/genhd.c:583:28: warning: function 'bdget_disk' with external linkage has definition
block/genhd.c:659:17: warning: incorrect type in argument 1 (different base types)
block/genhd.c:659:17: expected unsigned int [unsigned] [usertype] size
block/genhd.c:659:17: got restricted gfp_t
block/genhd.c:659:29: warning: incorrect type in argument 2 (different base types)
block/genhd.c:659:29: expected restricted gfp_t [usertype] flags
block/genhd.c:659:29: got unsigned int
block: kmalloc args reversed
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds the blk_rq_aligned helper function to see if the alignment and
padding requirements are satisfied for DMA transfer. This also converts
blk_rq_map_kern and __blk_rq_map_user to use the helper function.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch introduces struct rq_map_data to enable bio_copy_user_iov()
to use reserved pages.
Currently, bio_copy_user_iov allocates bounce pages but
drivers/scsi/sg.c wants to allocate pages by itself and use
them. struct rq_map_data can be used to pass allocated pages to
bio_copy_user_iov.
The current users of bio_copy_user_iov simply pass NULL (they don't
want to use pre-allocated pages).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, blk_rq_map_user and blk_rq_map_user_iov always do
GFP_KERNEL allocation.
This adds gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov
so sg can use it (sg always does GFP_ATOMIC allocation).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ's detection of queueing devices assumes a non-queuing device and detects
if the queue depth reaches a certain threshold. Under some workloads (e.g.
synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
the detection logic. This leads to poor performance on queuing hardware,
since the idle window remains enabled.
This patch inverts the sense of the logic: assume a queuing-capable device,
and detect if the depth does not exceed the threshold.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We should just check for rq->bio, as that is really the information
we are looking for. Even if the bio attached doesn't carry data,
we still need to do IO post processing on it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Somewhat incomplete, as we do allow merges of requests and bios
that have different completion CPUs given. This is done on the
assumption that a larger IO is still more beneficial than CPU
locality.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds support for controlling the IO completion CPU of
either all requests on a queue, or on a per-request basis. We export
a sysfs variable (rq_affinity) which, if set, migrates completions
of requests to the CPU that originally submitted it. A bio helper
(bio_set_completion_cpu()) is also added, so that queuers can ask
for completion on that specific CPU.
In testing, this has been shown to cut the system time by as much
as 20-40% on synthetic workloads where CPU affinity is desired.
This requires a little help from the architecture, so it'll only
work as designed for archs that are using the new generic smp
helper infrastructure.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that disk and partition handlings are mostly unified, it's easy to
allow disk to have extended device number. This patch makes
add_disk() use extended device number if disk->minors is zero. Both
sd and ide-disk are updated to use this.
* sd_format_disk_name() is implemented which can generically determine
the drive name (see the sketch after this list). This removes the disk
number restriction stemming from limited device names.
* If sd index goes over SD_MAX_DISKS (which can be increased now BTW),
sd simply doesn't initialize minors letting block layer choose
extended device number.
* If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set
minors to 0 and use extended device numbers.
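A sketch of the bijective base-26 naming scheme such a helper implements
(index 0 -> "sda", 25 -> "sdz", 26 -> "sdaa", ...); this is an
illustration, not the driver's actual code:

  #include <linux/errno.h>
  #include <linux/kernel.h>

  static int example_format_disk_name(const char *prefix, int index,
                                      char *buf, int buflen)
  {
          char tmp[8];
          int pos = sizeof(tmp);

          tmp[--pos] = '\0';
          do {
                  if (pos == 0)
                          return -EINVAL;
                  tmp[--pos] = 'a' + (index % 26);
                  index = index / 26 - 1;
          } while (index >= 0);

          if (snprintf(buf, buflen, "%s%s", prefix, &tmp[pos]) >= buflen)
                  return -EINVAL;
          return 0;
  }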
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With previous changes, it's meaningless to limit the number of
partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that
setting the flag allows the disk to have maximum number of allowed
partitions (only limited by the number of entries in parsed_partitions
as determined by MAX_PART constant).
This kills not-too-pretty alloc_disk_ext[_node]() functions and makes
@minors parameter to alloc_disk[_node]() unnecessary. The parameter
is left alone to avoid disturbing the users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
disk->__part used to be statically allocated to the maximum possible
number of partitions. This patch makes partition array allocation
dynamic. The added overhead is minimal as only real change is one
memory dereference changed to RCU one. This saves both a bit of
memory and cpu cycles iterating through unoccupied slots and makes
increasing partition limit easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similarly. It handles part0 stats
automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_flight() is implemented which automatically updates
part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part
stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and
use part0->make_it_fail. Sysfs node handling is unified too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Till now, bdev->bd_part is set only if the bdev was for parts other
than part0. This patch makes bdev->bd_part always set so that code
paths don't have to differentiate common handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move disk->holder_dir to part0->holder_dir. Kill the now mostly
superfluous bdev_get_holder().
While at it, kill superfluous kobject_get/put() around holder_dir,
slave_dir and cmd_filter creation and collapse
disk_sysfs_add_subdirs() into register_disk(). These serve no purpose
but obfuscating the code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that capacity and __dev are moved to part0, part0 and others can
share the same method.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move disk->__dev to part0->__dev. This simplifies bdget_disk() and
lookup_devt() and allows common sysfs attributes to be unified.
part_to_disk() is updated to handle part0 -> disk.
Updated to include a fix from Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>,
he writes:
"part0 is a "special" partition and doesn't need to have capacity set - this
fixes regression caused by "block: move __dev from disk to part0" commit."
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
genhd and partition code handled disk and partitions separately. All
information about the whole disk was in struct genhd and partitions in
struct hd_struct. However, the whole disk (part0) and other
partitions have a lot in common and the data structures end up having
good number of common fields and thus separate code paths doing the
same thing. Also, the partition array was indexed by partno - 1 which
gets pretty confusing at times.
This patch introduces partition 0 and makes the partition array
indexed by partno. Following patches will unify the handling of disk
and parts piece-by-piece.
This patch also implements disk_partitionable() which tests whether a
disk is partitionable. With coming dynamic partition array change,
the most common usage of disk_max_parts() will be testing whether a
disk is partitionable and the number of max partitions will become
much less important.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev. To make sure no
user is left behind, rename generic devices fields to __dev.
This is in preparation of unifying partition 0 handling with other
partitions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Extended devt introduces non-contiguous device numbers. This patch
implements a debug option which forces most devt allocations to be
from the extended area and spreads them out. This is enabled by
default if DEBUG_KERNEL is set and achieves...
1. Detects code paths in kernel or userland which expect predetermined
consecutive device numbers.
2. When something goes wrong, avoids corruption, as adding to the minor
of an earlier partition won't lead to a wrong but valid device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With extended minors and the soon-to-follow debug feature, large minor
numbers for block devices will be common. This patch does the
following to make printouts pretty.
* Adapt print formats such that large minors don't break the
formatting.
* For extended MAJ:MIN, %02x%02x for MAJ:MIN used in
printk_all_partitions() doesn't cut it anymore. Update it such that
%03x:%05x is used if either MAJ or MIN doesn't fit in %02x.
* Implement ext_range sysfs attribute which shows total minors the
device can use including both conventional minor space and the
extended one.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement extended device numbers. A block driver can tell block
layer that it wants to use extended device numbers. After the usual
minor space is used up, block layer automatically allocates devt's
from EXT_BLOCK_MAJOR.
Currently only one major number is allocated for this but as the
allocation is strictly on-demand, ~1mil minor space under it should
suffice unless the system actually has more than ~1mil partitions and
if that ever happens adding more majors to the extended devt area is
easy.
Due to internal implementation issues, the first partition can't be
allocated on the extended area. In other words, genhd->minors should
at least be 1. This limitation will be lifted by later changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters. It's unclear
whether the underbarred ones assume that preemption is disabled on
entry as some callers don't do that.
This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access). diskstats access
should always be enclosed between the two functions. As such, there's
no need for the versions which disables preemption. They're removed
and double underbars ones are renamed to drop the underbars. As an
extra argument is added, there's no danger of using the old version
unconverted.
disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now have a @cpu
argument to help RT.
This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others. Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.
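The resulting access pattern looks roughly like this (a sketch; the stat
macro names and argument order are assumptions based on the description
above):

  #include <linux/genhd.h>

  /* Sketch: take the stat lock (preemption off, RCU read lock), bump the
   * per-cpu counters for this cpu, then unlock. */
  static void example_account_io(struct gendisk *disk, int rw,
                                 unsigned int sectors)
  {
          int cpu = disk_stat_lock();

          disk_stat_inc(cpu, disk, ios[rw]);
          disk_stat_add(cpu, disk, sectors[rw], sectors);
          disk_stat_unlock();
  }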
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
disk->part[] is protected by its matching bdev's lock. However,
non-critical accesses like collecting stats and printing out sysfs and
proc information used to be performed without any locking. As
partitions can come and go dynamically, partitions can go away
underneath those non-critical accesses. As some of those accesses are
writes, this theoretically can lead to silent corruption.
This patch fixes the race by using RCU for the partition array and dev
reference counter to hold partitions.
* Rename disk->part[] to disk->__part[] to make sure no one outside
genhd layer proper accesses it directly.
* Use RCU for disk->__part[] dereferencing.
* Implement disk_{get|put}_part() which can be used to get and put
partitions from gendisk respectively.
* Iterators are implemented to help iterate through all partitions
safely (see the sketch after this list).
* Functions which require RCU readlock are marked with _rcu suffix.
* Use disk_put_part() in __blkdev_put() instead of directly putting
the contained kobject.
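A minimal sketch of the iterator usage pattern (names follow the helpers
this patch introduces; the flags argument and the printed fields are
assumptions):

  #include <linux/genhd.h>
  #include <linux/kernel.h>

  static void example_walk_partitions(struct gendisk *disk)
  {
          struct disk_part_iter piter;
          struct hd_struct *part;

          disk_part_iter_init(&piter, disk, 0);
          while ((part = disk_part_iter_next(&piter))) {
                  /* part is pinned while it is the current position */
                  pr_debug("part %d: %llu sectors\n", part->partno,
                           (unsigned long long)part->nr_sects);
          }
          disk_part_iter_exit(&piter);
  }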
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Implement disk_devt() and part_devt() and use them to directly
access devt instead of computing it from ->major and ->first_minor.
Note that all references to ->major and ->first_minor outside of the
block layer are used to determine the devt of the disk (the part0) and as
->major and ->first_minor will continue to represent devt for the
disk, converting these users aren't strictly necessary. However,
convert them for consistency.
* Implement disk_max_parts() to avoid directly dereferencing
genhd->minors.
* Update bdget_disk() such that it doesn't assume consecutive minor
space.
* Move devt computation from register_disk() to add_disk() and make it
the only one (all other usages use the initially determined value).
These changes clean up the code and will help disk->part dereference
fix and extended block device numbers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In hd_struct, @partno is used to denote partition number and a number
of other places use @part to denote hd_struct. Functions use @part
and @index instead. This causes confusion and makes it difficult to
use consistent variable names for hd_struct. Always use @partno if a
variable represents partition number.
Also, print out functions use @f or @part for seq_file argument. Use
@seqf uniformly instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch makes the following misc updates in preparation for
disk->part dereference fix and extended block devt support.
* implement part_to_disk()
* fix comment about gendisk->part indexing
* rename get_part() to disk_map_sector()
* don't use n which is always zero while printing disk information in
diskstats_show()
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
d805dda4 tried to fix error case handling in add_partition() but had a
few problems.
* disk->part[] entry is set early and left dangling if operation
fails.
* Once the device is initialized, the last put_device() is responsible
for freeing all the resources. The failure path freed part_stats and p
regardless of put_device(), causing a double free.
* The holders subdir holds a reference to the disk device, so the failure
path should remove it to release resources properly, which was missing.
This patch fixes the above problems and, while at it, moves the partition
slot busy check into add_partition() for completeness and inlines
holders subdirectory creation. Using a separate function for it just
obfuscates the code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Abdel Benamrouche <draconux@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
delete_partition() was a noop for zero length partitions. As the
addition code allows creating zero length partitions and deletion is
assumed to always succeed, this causes a memory leak for zero length
partitions. Allow zero length partitions to end their meaningless
lives.
While at it, allow deleting zero length partitions via the
BLKPG_DEL_PARTITION ioctl too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Recent block_class iteration updates 5c6f35c5..27f3025 converted all
class device iteration to class_for_each_device() and
class_find_device(), which are correct but a pain in the ass to use.
This patch converts them to the newly introduced class_dev_iterator so
that they can use more natural control structures instead of separate
callbacks and structs to pass parameters to them.
This results in smaller and easier code.
This patch also restores the original behavior of not printing header
in /proc/partitions if there's no partition to print. This is trivial
but still user-visible behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block_class_lock protects major_names array and bdev_map and doesn't
have anything to do with block class devices. Don't grab them while
iterating over block class devices.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Recent block_class iteration updates 5c6f35c5..27f3025 broke partition
info printouts.
* printk_all_partitions(): Partition printout stops when it meets a
partition hole. The partition printing inner loop should continue
instead of exiting on an empty partition slot.
* /proc/partitions and /proc/diskstats: If all information can't be
read in a single read(), the information is truncated. This is
because find_start() doesn't actually update the counter containing
the initial seek. It runs to the end and ends up always reporting
EOF on the second read.
This patch fixes both problems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Deadline currently only batches sector-contiguous requests, so except
for a few circumstances (e.g. requests in a single direction), it is
essentially first come first served. This is bad for throughput, so
change it to CSCAN, which means requests in a batch do not need to be
sequential and are issued in increasing sector order.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
But blkdev_issue_discard() still emits requests which are interpreted as
soft barriers, because naïve callers might otherwise issue subsequent
writes to those same sectors, which might cross on the queue (if they're
reallocated quickly enough).
Callers still _can_ issue non-barrier discard requests, but they have to
take care of queue ordering for themselves.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We may well want mkfs tools to use this to mark the whole device as
unwanted before they format it, for example.
The ioctl takes a pair of uint64_ts, which are start offset and length
in _bytes_. Although at the moment it might make sense for them both to
be in 512-byte sectors, I don't want to limit the ABI to that.
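A userspace sketch of the intended ABI (e.g. from an mkfs tool); both
values are in bytes as described above:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>           /* BLKDISCARD */

  static int example_discard_range(int fd, uint64_t start, uint64_t len)
  {
          uint64_t range[2] = { start, len };

          return ioctl(fd, BLKDISCARD, &range);
  }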
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Barriers should be submitted with the WRITE flag set.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Let the compiler see what's going on, and it can all get a lot simpler.
On PPC64 this reduces the size of the code calculating these bits by
about 60%. On x86_64 it's less of a win -- only 40%.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some block devices benefit from a hint that they can forget the contents
of certain sectors. Add basic support for this to the block core, along
with a 'blkdev_issue_discard()' helper function which issues such
requests.
The caller doesn't get to provide an end_io function, since
blkdev_issue_discard() will automatically split the request up into
multiple bios if appropriate. Neither does the function wait for
completion -- it's expected that callers won't care about when, or even
_if_, the request completes. It's only a hint to the device anyway. By
definition, the file system doesn't _care_ about these sectors any more.
[With feedback from OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> and
Jens Axboe <jens.axboe@oracle.com>]
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I have another request for the block filter SG_IO command whitelist,
specifically the MMC streaming command set SET READ AHEAD command.
The command applies only to MMC CDROM/DVDROM drives with the streaming
optional feature set. The command is useful to cdparanoia in that it
allows explicit cache control side effects that are, on many drives,
cdparanoia's most efficient way to flush/disable the media cache on
cdrom drives. I am aware of no reason why it should not be accessible
from userspace.
Also note that the command is already fully accessible through the
SCSI-native version of the SG_IO ioctl as well as the traditional SG
interface. The command is only being refused on block devices. That
means that on a typical stock distro, the command is available through
/dev/sg* but not /dev/scd* although both are typically available and
accessible. Filtering the command is not providing any protection,
only a confusing inconsistency.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We still have life time issues with the sysfs command filter kobject,
so disable it for 2.6.27 release. We can revisit this and make it work
properly for 2.6.28, for 2.6.27 release it's too risky.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
/proc/partitions didn't use to write out the header if there was no
partition. However, recent commit 66c64afe changed the behavior.
This is nothing major but there's no reason to change user visible
behavior without a good rationale. Restore the original behavior.
Note that 2.6.28 has clean up changes scheduled which will replace
this rather hacky implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes blk_register_filter and blk_unregister_filter from
gendisk, and adds them to sd.c, sr.c and ide-cd.c.
The commit abf5439370 moved cmdfilter
from gendisk to request_queue. It turned out that in some subsystems
multiple gendisks share a single request_queue. So we get:
Using physmap partition information
Creating 3 MTD partitions on "physmap-flash":
0x00000000-0x01c00000 : "User FS"
0x01c00000-0x01c40000 : "booter"
kobject (8511c410): tried to init an initialized object, something is seriously wrong.
Call Trace:
[<8036644c>] dump_stack+0x8/0x34
[<8021f050>] kobject_init+0x50/0xcc
[<8021fa18>] kobject_init_and_add+0x24/0x58
[<8021d20c>] blk_register_filter+0x4c/0x64
[<8021c194>] add_disk+0x78/0xe0
[<8027d14c>] add_mtd_blktrans_dev+0x254/0x278
[<8027c8f0>] blktrans_notify_add+0x40/0x78
[<80279c00>] add_mtd_device+0xd0/0x150
[<8027b090>] add_mtd_partitions+0x568/0x5d8
[<80285458>] physmap_flash_probe+0x2ac/0x334
[<802644f8>] driver_probe_device+0x12c/0x244
[<8026465c>] __driver_attach+0x4c/0x84
[<80263c64>] bus_for_each_dev+0x58/0xac
[<802633ec>] bus_add_driver+0xc4/0x24c
[<802648e0>] driver_register+0xcc/0x184
[<80100460>] _stext+0x60/0x1bc
In the long term, we need to fix such subsystems but we need a quick
fix now. This patch adds command filter support only to sd and sr,
though it might be useful for other SG_IO users (such as cciss).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reported-by: Manuel Lauss <mano@roarinelk.homelinux.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's not used for anything. On top of that, it's racy and can thus
trigger a faulty BUG_ON() in __blk_free_tags() on queue exit.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch changes the interface of the cmd filter to use a +/-
notation like:
echo -- +0x02 +0x03 -0x08
If neither + nor - is given, it defaults to + (allow command).
Note: The interface was added in 2.6.17-rc1 and is unused and
undocumented so far so it's safe to change it.
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: jens.axboe@oracle.com
Cc: James.Bottomley@hansenpartnership.com
Cc: dan.j.williams@intel.com
Cc: pjones@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: dougg@torque.net
Signed-off-by: Adel Gadllah <adel.gadllah@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Technically, the cmd_filter would be applied to other protocols though
it's unlikely to happen. Putting SCSI stuff in request_queue is a bit of a
layer violation. So let's rename it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cmd_filter works only for the block layer SG_IO with SCSI block
devices. It breaks scsi/sg.c, bsg, and the block layer SG_IO with SCSI
character devices (such as st). We hit a kernel crash with them.
The problem is that the cmd_filter code accesses gendisk (which has struct
blk_scsi_cmd_filter) via inode->i_bdev->bd_disk. It works for only
SCSI block device files. With character device files, inode->i_bdev
leads you to struct cdev. inode->i_bdev->bd_disk->blk_scsi_cmd_filter
isn't safe.
SCSI ULDs don't expose gendisk; they keep it private. bsg needs to be
independent of any particular protocol. We shouldn't change ULDs to expose their
gendisk.
This patch moves struct blk_scsi_cmd_filter from gendisk to
request_queue, a common object that everyone can access.
The user interface doesn't change; users can change the filters via
/sys/block/. gendisk has a pointer to request_queue, so the cmd_filter
code can still reach struct blk_scsi_cmd_filter through it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Otherwise we leak references, which is not a good thing to do.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The proc files get truncated if they do not fit into the buffer with
a single read(). We need to move the seq_file index from the callback
of class_find_device() to the caller of class_find_device(), to keep
its value across multiple invocations of the callback.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
blk_plug_device() must be called with the queue lock held, so callers
often just grab and release the lock for that purpose. Add a helper
that does just that.
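A minimal sketch of what such a helper might look like (the name
blk_plug_device_unlocked is assumed here):
#include <linux/blkdev.h>
#include <linux/spinlock.h>
/* Grab the queue lock just long enough to plug the queue. */
void blk_plug_device_unlocked(struct request_queue *q)
{
	unsigned long flags;
	spin_lock_irqsave(q->queue_lock, flags);
	blk_plug_device(q);
	spin_unlock_irqrestore(q->queue_lock, flags);
}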
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It seems cdrwtool in the udftools has been unusable on "modern" kernels
for some time. A Google search reveals many people with the same issue
but no solution (cdrwtool fails to format the disk). After spending some
time tracking down the issue, it comes down to the following:
The udftools still use the older CDROM_SEND_PACKET interface to send
things like FORMAT_UNIT through to the drive. They should really be
updated, but that's another story. Since most distros are using libata
now, the cd or dvd burner appears as a SCSI device, and we wind up in
block/scsi_ioctl.c. Here, the code tries to take the "struct
cdrom_generic_command" and translate it and stuff it into a "struct
sg_io_hdr" structure so it can pass it to the modern sg_io() routine
instead. Unfortunately, there is one error, or rather an omission, in the
translation. The timeout passed in via the "struct
cdrom_generic_command" is in HZ=100 units, and this is modified and
correctly converted to jiffies by use of clock_t_to_jiffies(). However,
a little further down, this cgc.timeout value in jiffies is simply
copied into the sg_io_hdr timeout, which should be in milliseconds.
Since most modern x86 kernels seem to be built with HZ=250, the
timeout that is passed to sg_io and eventually converted to the
timeout_per_command member of the scsi_cmnd structure is now four times
too small. Since cdrwtool tries to set the timeout to one hour for the
FORMAT_UNIT command, and it takes about 20 minutes to format a 4x CDRW,
the SCSI error-handler kicks in after the FORMAT_UNIT completes because
it took longer than the incorrectly-calculated timeout.
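The kind of fix this implies in the CDROM_SEND_PACKET translation path is a
unit conversion rather than a straight copy; a paraphrased sketch:
#include <linux/cdrom.h>
#include <linux/jiffies.h>
#include <scsi/sg.h>
/* cgc->timeout is in clock_t (HZ=100) units; sg_io_hdr.timeout wants
 * milliseconds, so convert instead of copying a jiffies value across. */
static void set_sg_timeout(struct sg_io_hdr *hdr,
			   const struct cdrom_generic_command *cgc)
{
	hdr->timeout = jiffies_to_msecs(clock_t_to_jiffies(cgc->timeout));
}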
[jejb: fix up whitespace]
Signed-off-by: Tim Wright <timw@splhi.com>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the proper class iterator function instead of mucking around in the
internals of the class structures.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The seq_start call is the better place for the header for the file, that
way we don't have to be mucking in the class structure to try to figure
out if this is the first partition or not.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Use the proper class iterator function instead of mucking around in the
internals of the class structures.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
These functions are only needed if CONFIG_PROC_FS is enabled, so save
the space when it is not.
This also makes it easier for a patch later in this series to work
properly if CONFIG_PROC_FS is not enabled.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Use the proper class iterator function instead of mucking around in the
internals of the class structures.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Use the proper class iterator function instead of mucking around in the
internals of the class structures.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Warn if something really bad happens if we can't create this link.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
device_create() is race-prone, so use the race-free
device_create_drvdata() instead as device_create() is going away.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Why?:
There are occasions where userspace would like to access sysfs
attributes for a device but it may not know how sysfs has named the
device or the path. For example what is the sysfs path for
/dev/disk/by-id/ata-ST3160827AS_5MT004CK? With this change a call to
stat(2) returns the major:minor then userspace can see that
/sys/dev/block/8:32 links to /sys/block/sdc.
What are the alternatives?:
1/ Add an ioctl to return the path: Doable, but sysfs is meant to reduce
the need to proliferate ioctl interfaces into the kernel, so this
seems counter productive.
2/ Use udev to create these symlinks: Also doable, but it adds a
udev dependency to utilities that might be running in a limited
environment like an initramfs.
3/ Do a full-tree search of sysfs.
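As an illustration of the stat(2) route, a small userspace program (the
device path below is just an example) can resolve a node to its sysfs
directory:
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
int main(int argc, char **argv)
{
	struct stat st;
	char link[64], target[256];
	ssize_t n;
	if (argc < 2 || stat(argv[1], &st) < 0)
		return 1;
	/* Build /sys/dev/block/<major>:<minor> from the device number. */
	snprintf(link, sizeof(link), "/sys/dev/block/%u:%u",
	    major(st.st_rdev), minor(st.st_rdev));
	n = readlink(link, target, sizeof(target) - 1);
	if (n < 0)
		return 1;
	target[n] = '\0';
	printf("%s -> %s\n", link, target);
	return 0;
}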
[kay.sievers@vrfy.org: fix duplicate registrations]
[kay.sievers@vrfy.org: cleanup suggestions]
Cc: Neil Brown <neilb@suse.de>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Reviewed-by: SL Baur <steve@xemacs.org>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Mark Lord <lkml@rtr.ca>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (102 commits)
[SCSI] scsi_dh: fix kconfig related build errors
[SCSI] sym53c8xx: Fix bogus sym_que_entry re-implementation of container_of
[SCSI] scsi_cmnd.h: remove double inclusion of linux/blkdev.h
[SCSI] make struct scsi_{host,target}_type static
[SCSI] fix locking in host use of blk_plug_device()
[SCSI] zfcp: Cleanup external header file
[SCSI] zfcp: Cleanup code in zfcp_erp.c
[SCSI] zfcp: zfcp_fsf cleanup.
[SCSI] zfcp: consolidate sysfs things into one file.
[SCSI] zfcp: Cleanup of code in zfcp_aux.c
[SCSI] zfcp: Cleanup of code in zfcp_scsi.c
[SCSI] zfcp: Move status accessors from zfcp to SCSI include file.
[SCSI] zfcp: Small QDIO cleanups
[SCSI] zfcp: Adapter reopen for large number of unsolicited status
[SCSI] zfcp: Fix error checking for ELS ADISC requests
[SCSI] zfcp: wait until adapter is finished with ERP during auto-port
[SCSI] ibmvfc: IBM Power Virtual Fibre Channel Adapter Client Driver
[SCSI] sg: Add target reset support
[SCSI] lib: Add support for the T10 (SCSI) Data Integrity Field CRC
[SCSI] sd: Move scsi_disk() accessor function to sd.h
...
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (80 commits)
ide-floppy: fix unfortunate function naming
ide-tape: unify idetape_create_read/write_cmd
ide: add ide_pc_intr() helper
ide-{floppy,scsi}: read Status Register before stopping DMA engine
ide-scsi: add more debugging to idescsi_pc_intr()
ide-scsi: use pc->callback
ide-floppy: add more debugging to idefloppy_pc_intr()
ide-tape: always log debug info in idetape_pc_intr() if debugging is enabled
ide-tape: add ide_tape_io_buffers() helper
ide-tape: factor out DSC handling from idetape_pc_intr()
ide-{floppy,tape}: move checking of ->failed_pc to ->callback
ide: add ide_issue_pc() helper
ide: add PC_FLAG_DRQ_INTERRUPT pc flag
ide-scsi: move idescsi_map_sg() call out from idescsi_issue_pc()
ide: add ide_transfer_pc() helper
ide-scsi: set drive->scsi flag for devices handled by the driver
ide-{cd,floppy,tape}: remove checking for drive->scsi
ide: add PC_FLAG_ZIP_DRIVE pc flag
ide-tape: factor out waiting for good ireason from idetape_transfer_pc()
ide-tape: set PC_FLAG_DMA_IN_PROGRESS flag in idetape_transfer_pc()
...
All the users of blk_end_sync_rq has gone (they are converted to use
blk_execute_rq). This unexports blk_end_sync_rq.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Some callers use blk_put_request asymmetrically, that is, they use it with
requests that were not allocated by blk_get_request. As a result,
blk_put_request has a hack to catch a NULL request_queue. Now such
callers are fixed (they use blk_get_request properly). So we can
safely remove the hack in blk_put_request.
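For reference, the symmetric usage the block layer now expects looks roughly
like this (a sketch, error handling trimmed):
#include <linux/blkdev.h>
/* Requests handed to blk_put_request() should have come from
 * blk_get_request() on the same queue. */
static void issue_and_free(struct request_queue *q)
{
	struct request *rq = blk_get_request(q, READ, GFP_KERNEL);
	if (!rq)
		return;
	/* ... set up and execute the request ... */
	blk_put_request(rq);
}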
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
For blk_pm_resume_request() requests (which are used only by IDE subsystem
currently) the queue is stopped so we need to call ->request_fn explicitly.
Thanks to:
- Rafael for reporting/bisecting the bug
- Borislav/Rafael for testing the fix
This is a preparation for converting IDE to use blk_execute_rq().
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Borislav Petkov <petkovbb@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softirq: remove irqs_disabled warning from local_bh_enable
softirq: remove initialization of static per-cpu variable
Remove argument from open_softirq which is always NULL
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
splice: fix generic_file_splice_read() race with page invalidation
ramfs: enable splice write
drivers/block/pktcdvd.c: avoid useless memset
cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
scsi: sr avoids useless buffer allocation
block: blk_rq_map_kern uses the bounce buffers for stack buffers
block: add blk_queue_update_dma_pad
DAC960: push down BKL
pktcdvd: push BKL down into driver
paride: push ioctl down into driver
block: use get_unaligned_* helpers
block: extend queue_flag bitops
block: request_module(): use format string
Add bvec_merge_data to handle stacked devices and ->merge_bvec()
block: integrity flags can't use bit ops on unsigned short
cmdfilter: extend default read filter
sg: fix odd style (extra parenthesis) introduced by cmd filter patch
block: add bounce support to blk_rq_map_user_iov
cfq-iosched: get rid of enable_idle being unused warning
allow userspace to modify scsi command filter on per device basis
...
If you do a modremove of any sas driver, you run into an oops on
shutdown when the host is removed (coming from the host bsg device).
The root cause seems to be that there's a use after free of the
bsg_class_device: In bsg_kref_release_function, this is used (to do a
put_device(bcg->parent) after bcg->release has been called. In sas (and
possibly many other things) bcd->release frees the queue which contains
the bsg_class_device, so we get a put_device on unreferenced memory.
Fix this by taking a copy of the pointer to the parent before releasing
bsg.
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
We don't need to hold bsg_mutex during bsg_complete_all_commands(). It
leads to a problem that we block bsg_unregister_queue during
bsg_complete_all_commands (until all the outstanding commands
complete).
Thanks to Pete Wyckoff for finding the bug and testing the patch.
The detailed bug report is:
http://marc.info/?l=linux-scsi&m=121182137132145&w=2
Tested-by: Pete Wyckoff <pw@osc.edu>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
blk_rq_map_kern is used for kernel internal I/Os. Some callers use
this function with stack buffers but DMA to/from the stack buffers
leads to memory corruption on a non-coherent platform.
This patch makes blk_rq_map_kern use the bounce buffers if a caller
passes a stack buffer (on all platforms, for simplicity).
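Caller-side, the safe (and copy-free) pattern remains a heap buffer; a
hedged sketch:
#include <linux/blkdev.h>
#include <linux/slab.h>
/* Map a kmalloc'ed buffer rather than a stack buffer: with this patch a
 * stack buffer would be bounced instead of corrupting memory on
 * non-coherent platforms, but a heap buffer avoids the copy entirely. */
static int map_heap_buffer(struct request_queue *q, struct request *rq,
			   unsigned int len)
{
	void *buf = kmalloc(len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	return blk_rq_map_kern(q, rq, buf, len, GFP_KERNEL);
}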
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds blk_queue_update_dma_pad to prevent LLDs from overwriting
the dma pad mask wrongly (we added blk_queue_update_dma_alignment due
to the same reason).
This also converts libata to use blk_queue_update_dma_pad instead of
blk_queue_dma_pad.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Avoid bad things happening if the module has a printk control string in
its name.
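The pattern being fixed, in outline (the '%s' form keeps a module name that
happens to contain format characters from being interpreted):
#include <linux/kmod.h>
/* Unsafe if 'name' can contain format specifiers: request_module(name);
 * Safe: treat the name purely as data. */
static int load_module_by_name(const char *name)
{
	return request_module("%s", name);
}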
Signed-off-by: maximilian attems <max@stro.at>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds the commands that the former sg filter allowed for read
access to the cmdfilter to keep userspace apps that rely on them working.
Signed-off-by: Adel Gadllah <adel.gadllah@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_map_user_iov can't handle the bounce buffer (it means that the
bio_map_user_iov path doesn't work with a LLD that needs GFP_DMA).
This patch fixes blk_rq_map_user_iov to support the bounce buffer.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch exports the per-gendisk command filter to user space through
sysfs, so it can be changed by the system administrator.
All users of the old cmd filter have been converted to use the new one.
Original patch from Peter Jones.
Signed-off-by: Adel Gadllah <adel.gadllah@gmail.com>
Signed-off-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some block devices support verifying the integrity of requests by way
of checksums or other protection information that is submitted along
with the I/O.
This patch implements support for generating and verifying integrity
metadata, as well as correctly merging, splitting and cloning bios and
requests that have this extra information attached.
See Documentation/block/data-integrity.txt for more information.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This allows a user to annotate the blk trace stream: writing a suitable
message to {/sys/kernel/debug}/block/<dsf>/msg will have it propagated
into the trace stream.
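For example, a tracing tool could drop a marker into the stream along these
lines (the device name and message are placeholders):
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
	const char msg[] = "fio run 3 starting";
	int fd = open("/sys/kernel/debug/block/sda/msg", O_WRONLY);
	if (fd < 0)
		return 1;
	if (write(fd, msg, strlen(msg)) < 0)
		return 1;
	close(fd);
	return 0;
}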
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that blktrace has the ability to carry arbitrary messages in
its stream, use that for some CFQ logging.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we have multiple tasks freeing io contexts when as-iosched
is being unloaded, we could complete() ioc_gone twice. Fix that by
protecting ioc_gone complete() and clearing with a spinlock for
just that purpose. Doesn't matter from a performance perspective,
since it'll only enter that path when ioc_gone != NULL (when as-iosched
is being rmmod'ed).
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we have multiple tasks freeing cfq_io_contexts when cfq-iosched
is being unloaded, we could complete() ioc_gone twice. Fix that by
protecting ioc_gone complete() and clearing with a spinlock for
just that purpose. Doesn't matter from a performance perspective,
since it'll only enter that path when ioc_gone != NULL (when cfq-iosched
is being rmmod'ed).
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
AS scheduler alternates between issuing read and write batches. It does
the batch switch only after all requests from the previous batch are
completed.
When switching to a write batch, if there is an on-going read request,
it waits for its completion and indicates its intention of switching by
setting ad->changed_batch and the new direction but does not update the
batch_expire_time for the new write batch which it does in the case of
no previous pending requests.
On completion of the read request, it sees that we were waiting for the
switch and schedules work for kblockd right away and resets the
ad->changed_batch flag.
Now when kblockd enters dispatch_request where it is expected to pick
up a write request, it in turn ends the write batch because the
batch_expire_timer was not updated and shows the expire timestamp for
the previous batch.
This results in the write starvation for all the cases where there is
the intention for switching to a write batch, but there is a previous
in-flight read request and the batch gets reverted to a read_batch
right away.
This also holds true in the reverse case (switching from a write batch
to a read batch with an in-flight write request).
I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
SCSI drives where the driver asks for the next request while a current
request is in-flight.
This patch is based off linux-2.6-block git HEAD.
Bug reproduction:
A simple scenario which reproduces this bug is:
- dd if=/dev/hda3 of=/dev/null &
- lilo
The lilo takes forever to complete.
This can also be reproduced fairly easily with the earlier dd and another
test program doing msync(). The example test program below should print
out a message after every iteration but it simply hangs forever. With
this bugfix it makes forward progress.
====
Example test program using msync() (thanks to suleiman AT google DOT com):
#define _GNU_SOURCE		/* for O_NOATIME */
#include <err.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
/* Read the CPU timestamp counter (32-bit x86 inline asm). */
static inline uint64_t
rdtsc(void)
{
	uint64_t tsc;
	__asm __volatile("rdtsc" : "=A" (tsc));
	return (tsc);
}
int
main(int argc, char **argv)
{
	struct stat st;
	uint64_t e, s, t;
	char *p;
	long i;
	int fd;
	if (argc < 2) {
		printf("Usage: %s <file>\n", argv[0]);
		return (1);
	}
	if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
		err(1, "open");
	if (fstat(fd, &st) < 0)
		err(1, "fstat");
	p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	t = 0;
	for (i = 0; i < 1000; i++) {
		/* Dirty the page, flush it with msync(), then time how
		 * long the very next write to the same page takes. */
		*p = 0;
		msync(p, 4096, MS_SYNC);
		s = rdtsc();
		*p = 0;
		__asm __volatile(""::: "memory");
		e = rdtsc();
		if (argc > 2)
			printf("%ld: %" PRIu64 " cycles %jd %jd\n",
			    i, e - s, (intmax_t)s, (intmax_t)e);
		t += e - s;
	}
	printf("average time: %" PRIu64 " cycles\n", t / 1000);
	return (0);
}
Cc: <stable@kernel.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As we may run relay_reserve from interrupt context we must always disable
IRQs. This is because a call to relay_reserve may expose previously written
data to user space.
Updated new message code and an old but related comment.
Signed-off-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 30f2f0eb4b ("block: do_mounts -
accept root=<non-existant partition>") extended blk_lookup_devt() to be
able to look up partitions that had not yet been registered, but in the
process made the assumption that the '&block_class.devices' list only
contains disk devices and that you can do 'dev_to_disk(dev)' on them.
That isn't actually true. The block_class device list also contains the
partitions we've discovered so far, and you can't just do a
'dev_to_disk()' on those.
So make sure to only work on devices that block/genhd.c has registered
itself, something we can test by checking the 'dev->type' member. This
makes the loop in blk_lookup_devt() match the other such loops in this
file.
[ We may want to do an alternate version that knows to handle _either_
whole-disk devices or partitions, but for now this is the minimal fix
for a series of crashes reported by Mariusz Kozlowski in
http://lkml.org/lkml/2008/5/25/25
and Ingo in
http://lkml.org/lkml/2008/6/9/39 ]
Reported-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@suse.de>
Cc: Joao Luis Meloni Assirati <assirati@nonada.if.usp.br>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cfq_cic_lookup() needs to properly protect ioc->ioc_data before
dereferencing it and also exclude updaters of ioc->ioc_data as well.
Also add a number of comments documenting why the existing RCU usage
is OK.
Thanks a lot to "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> for
review and comments!
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently it uses a single static char array, but that risks
being corrupted when multiple users issue message notes at the
same time. Make the buffers dynamically allocated when the trace
is setup and make them per-cpu instead.
The default max message size of 1k is also very large, the
interface is mainly for small text notes. So shrink it to 128 bytes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Allows messages to be inserted into blktrace streams.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
saves 8 bytes of padding & increases objects/slab from 30 to 32 on my
AMD64 config
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In function get_request_wait, the second call to get_request could be
moved to the end of the while loop, because if the first call to
get_request fails, the second call will fail without sleep.
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As git-grep shows, open_softirq() is always called with the last argument
being NULL
block/blk-core.c: open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
kernel/hrtimer.c: open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
kernel/rcuclassic.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/rcupreempt.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
kernel/softirq.c: open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
kernel/softirq.c: open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
This observation has already been made by Matthew Wilcox in June 2002
(http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)
"I notice that none of the current softirq routines use the data element
passed to them."
and the situation hasn't changed since then. So it appears we can safely
remove that extra argument to save 128 (54) bytes of kernel data (text).
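With the unused argument gone, the prototype and a typical registration
become, sketched from the call sites above:
#include <linux/interrupt.h>
void open_softirq(int nr, void (*action)(struct softirq_action *));
/* e.g. in block/blk-core.c: */
open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);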
Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.
For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock now use
q->__queue_lock. So always initialise that lock when the queue is allocated.
With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.
Thanks to Dan Williams for review and fixing the silly bugs.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some devices, like md, may create partitions only at first access,
so allow root= to be set to a valid non-existent partition of an
existing disk. This applies only to non-initramfs root mounting.
This fixes a regression from 2.6.24 which did allow this to happen and
broke some users machines :(
Acked-by: Neil Brown <neilb@suse.de>
Tested-by: Joao Luis Meloni Assirati <assirati@nonada.if.usp.br>
Cc: stable <stable@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
bdevname() fills the buffer that it is given as a parameter, so calling
strcpy() or snprintf() on the returned value is redundant (and probably not
guaranteed to work - I don't think strcpy and snprintf support overlapping
buffers.)
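The intended usage, for reference (a sketch; the caller supplies the buffer
and just uses the returned pointer):
#include <linux/fs.h>
#include <linux/kernel.h>
/* bdevname() fills 'b' and returns it, so there is no need to strcpy()
 * or snprintf() the result anywhere else. */
static void report_bdev(struct block_device *bdev)
{
	char b[BDEVNAME_SIZE];
	printk(KERN_INFO "resuming I/O on %s\n", bdevname(bdev, b));
}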
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_part() is fairly expensive, as it O(N) loops over partitions
to find the right one. In lots of normal IO paths we end up looking
up the partition twice, to make matters even worse. Change the
stat add code to accept a passed in partition instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We currently set all processes to the best-effort scheduling class,
regardless of what CPU scheduling class they belong to. Improve that
so that we correctly track idle and rt scheduling classes as well.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Original patch from Mikulas Patocka <mpatocka@redhat.com>
Mike Anderson was doing an OLTP benchmark on a computer with 48 physical
disks mapped to one logical device via device mapper.
He found that there was a slowdown on request_queue->lock in function
generic_unplug_device. The slowdown is caused by the fact that when some
code calls unplug on the device mapper, device mapper calls unplug on all
physical disks. These unplug calls take the lock, find that the queue is
already unplugged, release the lock and exit.
With the below patch, performance of the benchmark was increased by 18%
(the whole OLTP application, not just block layer microbenchmarks).
So I'm submitting this patch for upstream. I think the patch is correct,
because when more threads call simultaneously plug and unplug, it is
unspecified, if the queue is or isn't plugged (so the patch can't make
this worse). And the caller that plugged the queue should unplug it
anyway (if it doesn't, there's a 3ms timeout).
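The change amounts to an unlocked check before taking the lock, roughly (a
paraphrase, not the exact hunk):
#include <linux/blkdev.h>
void generic_unplug_device(struct request_queue *q)
{
	/* Cheap, unlocked test first: if the queue isn't plugged there is
	 * nothing to do, so don't touch q->queue_lock at all. */
	if (blk_queue_plugged(q)) {
		spin_lock_irq(q->queue_lock);
		__generic_unplug_device(q);
		spin_unlock_irq(q->queue_lock);
	}
}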
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
put_io_context() drops the RCU read lock before calling into cfq_dtor(),
however we need to hold off freeing there before grabbing and
dereferencing the first object on the list.
So extend the rcu_read_lock() scope to cover the calling of cfq_dtor(),
and optimize cfq_free_io_context() to use a new variant for
call_for_each_cic() that assumes the RCU read lock is already held.
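In outline, the tail of put_io_context() then looks something like this
(simplified; cfq_dtor() and iocontext_cachep are local to blk-ioc.c):
#include <linux/iocontext.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
/* Hold the RCU read lock across cfq_dtor() so the first cfq_io_context
 * on the ioc's list cannot be freed while we dereference it. */
static void drop_io_context(struct io_context *ioc)
{
	rcu_read_lock();
	cfq_dtor(ioc);
	rcu_read_unlock();
	kmem_cache_free(iocontext_cachep, ioc);
}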
Hit in the wild by Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For most initialization purposes, calling blk_queue_init_tags() without
the queue lock held is OK. Only if called for resizing an existing map
must the lock be held. Ditto for tag cleanup, the maps are reference
counted.
So switch the general queue flag setting to the unlocked variant, but
retain the locked variant for resizing.
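In other words, something like this (a sketch using the queue_flag_*
helpers):
#include <linux/blkdev.h>
static void mark_queue_tagged_init(struct request_queue *q)
{
	/* init path: no concurrent users yet, queue lock not required */
	queue_flag_set_unlocked(QUEUE_FLAG_QUEUED, q);
}
static void mark_queue_tagged_resize(struct request_queue *q)
{
	/* resize path: must be called with q->queue_lock held */
	queue_flag_set(QUEUE_FLAG_QUEUED, q);
}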
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Concurrency isn't a big deal here since we have requests in flight
at this point, but do the locked variant to set a better example.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6:
[SCSI] aic94xx: fix section mismatch
[SCSI] u14-34f: Fix 32bit only problem
[SCSI] dpt_i2o: sysfs code
[SCSI] dpt_i2o: 64 bit support
[SCSI] dpt_i2o: move from virt_to_bus/bus_to_virt to dma_alloc_coherent
[SCSI] dpt_i2o: use standard __init / __exit code
[SCSI] megaraid_sas: fix suspend/resume sections
[SCSI] aacraid: Add Power Management support
[SCSI] aacraid: Fix jbod operations scan issues
[SCSI] aacraid: Fix warning about macro side-effects
[SCSI] add support for variable length extended commands
[SCSI] Let scsi_cmnd->cmnd use request->cmd buffer
[SCSI] bsg: add large command support
[SCSI] aacraid: Fix down_interruptible() to check the return value correctly
[SCSI] megaraid_sas; Update the Version and Changelog
[SCSI] ibmvscsi: Handle non SCSI error status
[SCSI] bug fix for free list handling
[SCSI] ipr: Rename ipr's state scsi host attribute to prevent collisions
[SCSI] megaraid_mbox: fix Dell CERC firmware problem
Add support for variable-length, extended, and vendor specific
CDBs to scsi-ml. It is now possible for initiators and ULD's
to issue these types of commands. LLDs need not change much.
All they need is to raise the .max_cmd_len to the longest command
they support (see iscsi patch).
- clean-up some code paths that did not expect commands to be
larger than 16, and change cmd_len members' type to short as
char is not enough.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This enables bsg to handle the request length larger than BLK_MAX_CDB
(mainly for the variable length CDB format).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.
In particular this properly exposes the read-ahead window for all relevant
users and /sys/block/<block>/queue/read_ahead_kb should be deprecated.
With patient help from Kay Sievers and Greg KH
[mszeredi@suse.cz]
- split off NFS and FUSE changes into separate patches
- document new sysfs attributes under Documentation/ABI
- do bdi_class_init as a core_initcall, otherwise the "default" BDI
won't be initialized
- remove bdi_init_fmt macro, it's not used very much
[akpm@linux-foundation.org: fix ia64 warning]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Greg KH <greg@kroah.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The block I/O + elevator + I/O scheduler code spend a lot of time trying
to merge I/Os -- rightfully so under "normal" circumstances. However,
if one were to know that the incoming I/O stream was /very/ random in
nature, the cycles are wasted.
This patch adds a per-request_queue tunable that (when set) disables
merge attempts (beyond the simple one-hit cache check), thus freeing up
a non-trivial amount of CPU cycles.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch changes rq->cmd from the static array to a pointer to
support large commands.
We rarely handle large commands. So for optimization, a struct request
still has a static array for a command. rq_init sets rq->cmd pointer
to the static array.
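Schematically (the name of the inline array, __cmd, is an assumption here):
/* excerpt, not a complete definition */
struct request {
	/* ... */
	unsigned char __cmd[BLK_MAX_CDB];	/* inline storage, common case */
	unsigned char *cmd;	/* usually points at __cmd; may point at a
				 * larger, separately allocated buffer */
	/* ... */
};
/* and in rq_init(): */
rq->cmd = rq->__cmd;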
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This is a preparation for changing rq->cmd from the static array to a
pointer.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This renames rq_init() to blk_rq_init() and exports it. Any path that hands
the request to the block layer needs to call it to initialize the
request.
This is a preparation for large command support, which needs to
initialize the request in a proper way (that is, just doing a memset()
will not work).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_get_request initializes rq->cmd (rq_init does) so the users don't
need to do that.
The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd,
as a preparation for large command support, which changes rq->cmd from
the static array to a pointer. sizeof(rq->cmd) will not make sense and
&rq->cmd won't work.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes the following build error with UML and gcc 4.3:
<-- snip -->
...
CC block/blk-barrier.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c: In function ‘blk_do_ordered’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:57: sorry, unimplemented: inlining failed in call to ‘blk_ordered_cur_seq’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:252: sorry, unimplemented: called from here
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:57: sorry, unimplemented: inlining failed in call to ‘blk_ordered_cur_seq’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:253: sorry, unimplemented: called from here
make[2]: *** [block/blk-barrier.o] Error 1
<-- snip -->
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes the following build error with UML and gcc 4.3:
<-- snip -->
...
CC block/elevator.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c: In function ‘elv_merge’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:103: sorry, unimplemented: called from here
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:495: sorry, unimplemented: called from here
make[2]: *** [block/elevator.o] Error 1
make[1]: *** [block] Error 2
<-- snip -->
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds bio_copy_kern similar to
bio_copy_user. blk_rq_map_kern uses bio_copy_kern instead of
bio_map_kern if necessary.
bio_copy_kern uses temporary pages and the bi_end_io callback frees
these pages. bio_copy_kern saves the original kernel buffer at
bio->bi_private; it doesn't use something like struct bio_map_data to
store the information about the caller.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This requires moving rq_init() from get_request() to blk_alloc_request().
The upside is that we can now require an rq_init() from any path that
wishes to hand the request to the block layer.
rq_init() will be exported for the code that uses struct request
without blk_get_request.
This is a preparation for large command support, which needs to
initialize struct request in a proper way (that is, just doing a
memset() will not work).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds release callback support, which is called when a bsg
device goes away. bsg_register_queue() takes a pointer to a callback
function. This feature is useful for stuff like sas_host that can't
use the release callback in struct device.
If a caller doesn't need bsg's release callback, it can call
bsg_register_queue() with NULL pointer (e.g. scsi devices can use
release callback in struct device so they don't need bsg's callback).
With this patch, bsg uses kref for refcounts on bsg devices instead of
get/put_device in fops->open/release. bsg calls put_device and the
caller's release callback (if it was registered) in kref_put's
release.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
* 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block:
block: fix blk_register_queue() return value
block: fix memory hotplug and bouncing in block layer
block: replace remaining __FUNCTION__ occurrences
Kconfig: clean up block/Kconfig help descriptions
cciss: fix warning oops on rmmod of driver
cciss: Fix race between disk-adding code and interrupt handler
block: move the padding adjustment to blk_rq_map_sg
block: add bio_copy_user_iov support to blk_rq_map_user_iov
block: convert bio_copy_user to bio_copy_user_iov
loop: manage partitions in disk image
cdrom: use kmalloced buffers instead of buffers on stack
cdrom: make unregister_cdrom() return void
cdrom: use list_head for cdrom_device_info list
cdrom: protect cdrom_device_info list by mutex
cdrom: cleanup hardcoded error-code
cdrom: remove ifdef CONFIG_SYSCTL
blk_register_queue() returns -ENXIO when queue->request_fn is NULL. But there
are some block drivers that call blk_register_queue() via add_disk() with
queue->request_fn == NULL. (For example, brd, loop)
Although no one checks the return value of blk_register_queue(), this patch
makes it return 0 instead of -ENXIO when queue->request_fn is NULL.
This patch also adds a warning when blk_register_queue() and
blk_unregister_queue() are called with queue == NULL, rather than silently
ignoring the invalid usage.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Modify the help descriptions of block/Kconfig for clarity, accuracy and
consistency.
Refactor the BLOCK description a bit. The wording "This permits ... to be
removed" isn't quite right; the block layer is removed when the option is
disabled, whereas most descriptions talk about what happens when the option is
enabled. Reformat the list of what is affected by disabling the block layer.
Add more examples of large block devices to LBD and strive for technical
accuracy; block devices of size _exactly_ 2TB require CONFIG_LBD, not only
"bigger than 2TB". Also try to say (perhaps not very clearly) that the config
option is only needed when you want to have individual block devices of size
>= 2TB, for example if you had 3 x 1TB disks in your computer you'd have a
total storage size of 3TB but you wouldn't need the option unless you want to
aggregate those disks into a RAID or LVM.
Improve terminology and grammar on BLK_DEV_IO_TRACE.
I also added the boilerplate "If unsure, say N" to most options.
Precisely say "2TB and larger" for LSF.
Indent the help text for BLK_DEV_BSG by 2 spaces in accordance with the
standard.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
that req->data_len (the true data length) is equal to sum(bio). It
broke the scsi command completion code.
commit e97a294ef6 was introduced to fix
the above issue. However, the partial completion code doesn't work
with it. The commit is also a layer violation (scsi mid-layer should
not know about the block layer's padding).
This patch moves the padding adjustment to blk_rq_map_sg (suggested by
James). The padding works like the drain buffer. This patch breaks the
rule that req->data_len is equal to sum(sg), however, the drain buffer
already broke it. So this patch just restores the rule that
req->data_len is equal to sum(bio) without breaking anything new.
Now when a low level driver needs padding, blk_rq_map_user and
blk_rq_map_user_iov guarantee there's enough room for padding.
blk_rq_map_sg can safely extend the last entry of a scatter list.
blk_rq_map_sg must extend the last entry of a scatter list only for a
request that got through bio_copy_user_iov. This patches introduces
new REQ_COPY_USER flag.
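Roughly, the sg-extension step in blk_rq_map_sg() then looks like this (a
paraphrase; only REQ_COPY_USER requests are guaranteed to have the room):
	/* 'sg' points at the last entry of the just-built table */
	if ((rq->cmd_flags & REQ_COPY_USER) &&
	    (rq->data_len & q->dma_pad_mask)) {
		unsigned int pad_len =
			(q->dma_pad_mask & ~rq->data_len) + 1;
		sg->length += pad_len;
		rq->extra_len += pad_len;
	}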
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With this patch, blk_rq_map_user_iov uses bio_copy_user_iov when a low
level driver needs padding or a buffer in sg_iovec isn't aligned. That
is, it uses temporary kernel buffers instead of mapping user pages
directly.
When a LLD needs padding, later blk_rq_map_sg needs to extend the last
entry of a scatter list. bio_copy_user_iov guarantees that there is
enough space for padding by using temporary kernel buffers instead of
user pages.
blk_rq_map_user_iov needs buffers in sg_iovec to be aligned. The
comment in blk_rq_map_user_iov indicates that drivers/scsi/sg.c also
needs buffers in sg_iovec to be aligned. Actually, drivers/scsi/sg.c
works with unaligned buffers in sg_iovec (it always uses temporary
kernel buffers).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's big, but there doesn't seem to be a way to split it up smaller...
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (137 commits)
[SCSI] iscsi: bidi support for iscsi_tcp
[SCSI] iscsi: bidi support at the generic libiscsi level
[SCSI] iscsi: extended cdb support
[SCSI] zfcp: Fix error handling for blocked unit for send FCP command
[SCSI] zfcp: Remove zfcp_erp_wait from slave destory handler to fix deadlock
[SCSI] zfcp: fix 31 bit compile warnings
[SCSI] bsg: no need to set BSG_F_BLOCK bit in bsg_complete_all_commands
[SCSI] bsg: remove minor in struct bsg_device
[SCSI] bsg: use better helper list functions
[SCSI] bsg: replace kobject_get with blk_get_queue
[SCSI] bsg: takes a ref to struct device in fops->open
[SCSI] qla1280: remove version check
[SCSI] libsas: fix endianness bug in sas_ata
[SCSI] zfcp: fix compiler warning caused by poking inside new semaphore (linux-next)
[SCSI] aacraid: Do not describe check_reset parameter with its value
[SCSI] aacraid: Fix down_interruptible() to check the return value
[SCSI] sun3_scsi_vme: add MODULE_LICENSE
[SCSI] st: rename flush_write_buffer()
[SCSI] tgt: use KMEM_CACHE macro
[SCSI] initio: fix big endian problems for auto request sense
...
Before bsg_complete_all_commands is called, BSG_F_BLOCK bit is always
set.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The minor in struct bsg_device is used as an identifier to find the
corresponding struct bsg_device_class. However, the request_queue can be
used as the identifier instead, so the minor in struct bsg_device is
unnecessary.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This replaces hlist_for_each and list_entry with hlist_for_each_entry
and list_first_entry respectively.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Both take a ref to a queue, but blk_get_queue checks QUEUE_FLAG_DEAD
and is the more appropriate interface here.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
bsg_register_queue() takes a ref to struct device that a caller
passes. For example, bsg takes a ref to the sdev_gendev for scsi
devices. However, bsg doesn't increase the refcount in fops->open. So
while an application opens a bsg device, the scsi device that the bsg
device holds can go away (bsg also takes a ref to a queue, but it
doesn't prevent the device from going away).
With this patch, bsg increases the refcount of struct device in
fops->open and decreases it in fops->release.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
hdparm explicitly marks HDIO_[UNREGISTER,SCAN]_HWIF ioctls as DANGEROUS
and given the number of bugs we can assume that there are no real users:
* DMA has no chance of working because DMA resources are released by
ide_unregister() and they are never allocated again.
* Since ide_init_hwif_ports() is used for ->io_ports[] setup the ioctls
don't work for almost all hosts with "non-standard" (== non ISA-like)
layout of IDE taskfile registers (there is a lot of such host drivers).
* ide_port_init_devices() is not called when probing IDE devices so:
- drive->autotune is never set and IDE host/devices are not programmed
for the correct PIO/DMA transfer modes (=> possible data corruption)
- host specific I/O 32-bit and IRQ unmasking settings are not applied
(=> possible data corruption)
- host specific ->port_init_devs method is not called (=> no luck with
ht6560b, qd65xx and opti621 host drivers)
* ->rw_disk method is not preserved (=> no HPT3xxN chipsets support).
* ->serialized flag is not preserved (=> possible data corruption when
using icside, aec62xx (ATP850UF chipset), cmd640, cs5530, hpt366
(HPT3xxN chipsets), rz1000, sc1200, dtc2278 and ht6560b host drivers).
* ->ack_intr method is not preserved (=> needed by ide-cris, buddha,
gayle and macide host drivers).
* ->sata_scr[] and sata_misc[] is cleared by ide_unregister() and it
isn't initialized again (SiI3112 support needs them).
* To issue an ioctl() there need to be at least one IDE device present
in the system.
* ->cable_detect method is not preserved + it is not called when probing
IDE devices so cable detection is broken (however since DMA support is
also broken it doesn't really matter ;-).
* Some objects which may have already been freed in ide_unregister()
are restored by ide_hwif_restore() (i.e. ->hwgroup).
* ide_register_hw() may unregister unrelated IDE ports if free ide_hwifs[]
slot cannot be found.
* When IDE host drivers are modular unregistered port may be re-used by
different host driver that owned it first causing subtle bugs.
Since we now have a proper warm-plug support remove these ioctls,
then remove no longer needed:
- ide_register_hw() and ide_hwif_restore() functions
- 'init_default' and 'restore' arguments of ide_unregister()
- zeroing of hwif->{dma,extra}_* fields in ide_unregister()
As an added bonus IDE core code size shrinks by ~3kB (x86-32).
v2:
* fix ide_unregister() arguments in cleanup_module() (Andrew Morton).
v3:
* fix ide_unregister() arguments in palm_bk3710.c.
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
When switching scheduler from cfq, cfq_exit_queue() does not clear
ioc->ioc_data, leaving a dangling pointer that can deceive the following
lookups when the iosched is switched back to cfq. The pattern that can
trigger that is the following:
- elevator switch from cfq to something else;
- module unloading, with elv_unregister() that calls cfq_free_io_context()
on ioc freeing the cic (via the .trim op);
- module gets reloaded and the elevator switches back to cfq;
- reallocation of a cic at the same address as before (with a valid key).
To fix it just assign NULL to ioc_data in __cfq_exit_single_io_context(),
that is called from the regular exit path and from the elevator switching
code. The only path that frees a cic and is not covered is the error handling
one, but cic's freed in this way are never cached in ioc_data.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
SLAB_DESTROY_BY_RCU is not a direct substitute for normal call_rcu()
freeing, since it only defers page freeing but NOT object freeing. So change
cfq to do the freeing on its own.
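The replacement is plain call_rcu() on a per-cic rcu_head, along these lines
(names sketched; cfq_ioc_pool is cfq's cic slab cache):
#include <linux/rcupdate.h>
#include <linux/slab.h>
static void cfq_cic_free_rcu(struct rcu_head *head)
{
	struct cfq_io_context *cic =
		container_of(head, struct cfq_io_context, rcu_head);
	/* The object itself is freed after a grace period, not merely the
	 * backing page as with SLAB_DESTROY_BY_RCU. */
	kmem_cache_free(cfq_ioc_pool, cic);
}
static void cfq_cic_free(struct cfq_io_context *cic)
{
	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}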
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Looking a bit closer into this regression the reason this can't be
right is that dma_addr common default is BLK_BOUNCE_HIGH and most
machines have less than 4G. So if you do:
if (b_pfn <= (min_t(u64, 0xffffffff, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
dma = 1
that will translate to:
if (BLK_BOUNCE_HIGH <= BLK_BOUNCE_HIGH)
dma = 1
So for 99% of hardware this will trigger unnecessary GFP_DMA
allocations and isa pooling operations.
Also note how the 32bit code still does b_pfn < blk_max_low_pfn.
I guess this is what you were looking after. I didn't verify but as
far as I can tell, this will stop the regression with isa dma
operations at boot for 99% of blkdev/memory combinations out there and
I guess this fixes the setups with >4G of ram and 32bit pci cards as
well (this also retains symmetry with the 32bit code).
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fixes:
block/genhd.c:361: warning: ignoring return value of ‘class_register’, declared with attribute warn_unused_result
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is important to e.g. dm, which tries to decide whether to stop using
barriers or not.
Tested as working by Anders Henke <anders.henke@1und1.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Introduced between 2.6.25-rc2 and -rc3
block/blk-map.c:154:14: warning: symbol 'bio' shadows an earlier one
block/blk-map.c:110:13: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Introduced between 2.6.25-rc2 and -rc3
block/blk-settings.c:319:12: warning: function 'blk_queue_dma_drain' with external linkage has definition
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes the unused export of blk_rq_map_user_iov.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes the unused exports of blk_{get,put}_queue.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch contains the following cleanups:
- make the needlessly global struct disk_type static
- #if 0 the unused genhd_media_change_notify()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds a proper prototype for blk_dev_init() in block/blk.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Every file should include the headers containing the externs for its
global functions (in this case for __blk_queue_free_tags()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For some non-x86 systems with 4GB of memory or more,
we need to increase the range of addresses that can be
used for direct DMA in the 64-bit kernel.
Signed-off-by: Yang Shi <yang.shi@windriver.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Block layer alignment was used for two different purposes - memory
alignment and padding. This causes problems in lower layers because
drivers which only require memory alignment end up with an adjusted
rq->data_len. Separate out padding such that padding occurs iff
driver explicitly requests it.
Tomo: restore the code to update bio in blk_rq_map_user
introduced by the commit 40b01b9bbd
according to padding alignment.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The meaning of rq->data_len was changed to the length of an allocated
buffer from the true data length. It breaks SG_IO friends and
bsg. This patch restores the meaning of rq->data_len to the true data
length and adds rq->extra_len to store an extended length (due to
drain buffer and padding).
This patch also removes the code to update bio in blk_rq_map_user
introduced by the commit 40b01b9bbd.
The commit adjusts bio according to memory alignment
(queue_dma_alignment). However, memory alignment is NOT padding
alignment. This adjustment also breaks SG_IO friends and bsg. Padding
alignment needs to be fixed in a proper way (by a separate patch).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
Clear drain buffer before chaining if the command in question is a
write.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Draining shouldn't be done for commands where overflow may indicate
data integrity issues. Add dma_drain_needed callback to
request_queue. The drain buffer is appended iff this function returns
non-zero.
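For illustration, a driver-side hookup might look roughly like this (a sketch; my_dma_drain_needed(), my_cmd_needs_drain(), drain_buf and drain_size are illustrative names, not from the patch):
static int my_dma_drain_needed(struct request *rq)
{
	/* e.g. only drain PC requests whose opcode is known to overflow */
	return blk_pc_request(rq) && my_cmd_needs_drain(rq->cmd[0]);
}

/* registered from slave_configure()-style setup code: */
blk_queue_dma_drain(q, my_dma_drain_needed, drain_buf, drain_size);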
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With padding and draining moved into it, block layer now may extend
requests as directed by queue parameters, so now a request has two
sizes - the original request size and the extended size which matches
the size of area pointed to by bios and later by sgs. The latter size
is what lower layers are primarily interested in when allocating,
filling up DMA tables and setting up the controller.
Both padding and draining extend the data area to accommodate
controller characteristics. As any controller which speaks SCSI can
handle underflows, feeding a larger data area is safe.
So, this patch makes the primary data length field, request->data_len,
indicate the size of the full data area and adds a separate length field,
request->raw_data_len, for the unmodified request size. The latter is
used for reporting to the higher layer (userland) and wherever the
original request size should be fed to the controller or device.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DMA start address and transfer size alignment for PC requests are
achieved using bio_copy_user() instead of bio_map_user(). This works
because bio_copy_user() always uses full pages and block DMA alignment
isn't allowed to go over PAGE_SIZE.
However, the implementation didn't update the last bio of the request
to make this padding visible to lower layers. This patch makes
blk_rq_map_user() extend the last bio such that it includes the
padding area and the size of area pointed to by the request is
properly aligned.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently we fail if someone requests a valid io scheduler, but it's
modular and not currently loaded. That can happen from a driver init
asking for a different scheduler, or online switching through sysfs
as requested by a user.
This patch makes elevator_get() call request_module() to attempt to load
the appropriate module, instead of requiring that this be done manually.
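Roughly, the lookup then falls back to request_module() and retries (a sketch loosely following elevator.c of this era; details such as locking may differ):
static struct elevator_type *elevator_get(const char *name)
{
	struct elevator_type *e;

	spin_lock(&elv_list_lock);
	e = elevator_find(name);
	if (!e) {
		spin_unlock(&elv_list_lock);
		request_module("%s-iosched", name);	/* try to load the module */
		spin_lock(&elv_list_lock);
		e = elevator_find(name);
	}
	if (e && !try_module_get(e->elevator_owner))
		e = NULL;
	spin_unlock(&elv_list_lock);

	return e;
}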
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's cumbersome to browse a radix tree from start to finish, especially
since we modify keys when a process exits. So add a hlist for the single
purpose of browsing over all known cfq_io_contexts, used for exit,
io prio change, etc.
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=9948
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Rearrange fields in cache order and initialize some fields that
we didn't previously init. Remove init of ->completion_data, it's
part of a union with ->hash. Luckily clearing the rb node is the same
as setting it to null!
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It blindly copies everything in the io_context, including the lock.
That doesn't work so well for either lock ordering or lockdep.
There seems to be zero point in swapping io contexts on a request-to-request
merge, so the best course of action is to just remove it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since it's acquired from irq context, all locking must be of the
irq safe variant. Most are already inside the queue lock (which
already disables interrupts), but the io scheduler rmmod path
always has irqs enabled and the put_io_context() path may legally
be called with irqs enabled (even if it isn't usually). So fixup
those two.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This fixes a problem in SCSI where we use the (previously
uninitialised) cmd_type via blk_pc_request() to set up the transfer in
scsi_init_sgtable().
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
If the two requests belong to the same io context, we will attempt
to lock the same lock twice. But swapping contexts is pointless in
that case, so just check for rioc == nioc before doing the double
lock and copy.
Tested-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently you must be root to set idle io prio class on a process. This
is due to the fact that the idle class is implemented as a true idle
class, meaning that it will not make progress if someone else is
requesting disk access. Unfortunately this means that it opens DOS
opportunities by locking down file system resources, hence it is root
only at the moment.
This patch relaxes the idle class a little, by removing the truly idle
part (which entails a grace period with an associated timer). The
modifications make the idle class as close to zero impact as can be done
while still guaranteeing progress. This means we can relax the root-only
criterion as well.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These DMA drain buffer implementations in drivers are pretty horrible
to do in terms of manipulating the scatterlist. Plus they're being
done at least in drivers/ide and drivers/ata, so we now have code
duplication.
The one use case for this, as I understand it, is AHCI controllers doing
PIO mode to mmc devices but translating this to DMA at the controller
level.
So, what about adding a callback to the block layer that permits the
adding of the drain buffer for the problem devices. The idea is that
you'd do this in slave_configure after you find one of these devices.
The beauty of doing it in the block layer is that it quietly adds the
drain buffer to the end of the sg list, so it automatically gets mapped
(and unmapped) without anything unusual having to be done to the
scatterlist in driver/scsi or drivers/ata and without any alteration to
the transfer length.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The io context sharing introduced a per-ioc spinlock, that would protect
the cfq io context lookup. That is a regression from the original, since
we never needed any locking there because the ioc/cic were process private.
The cic lookup is changed from an rbtree construct to a radix tree, which
lets us use RCU to make the reader side lockless. That is the performance
critical path, modifying the radix tree is only done on process creation
(when that process first does IO, actually) and on process exit (if that
process has done IO).
As it so happens, radix trees are also much faster for this type of
lookup where the key is a pointer. It's a very sparse tree.
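The lockless lookup then boils down to something like this (a sketch only; the field name ioc->radix_root and the use of the cfqd pointer as key are assumptions based on the description above):
static struct cfq_io_context *cic_lookup(struct io_context *ioc, void *key)
{
	struct cfq_io_context *cic;

	rcu_read_lock();
	cic = radix_tree_lookup(&ioc->radix_root, (unsigned long)key);
	rcu_read_unlock();

	return cic;
}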
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch merges complete_request() into end_that_request_last()
for cleanup.
complete_request() was introduced by an earlier part of this patch-set,
so as not to break the existing users of end_that_request_last().
Since all users are converted to blk_end_request interfaces and
end_that_request_last() is no longer exported, the code can be
merged to end_that_request_last().
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch converts 'uptodate' arguments of no longer exported
interfaces, end_that_request_first/last, to 'error', and removes
internal conversions for it in blk_end_request interfaces.
Also, this patch removes no longer needed end_io_error().
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes the following functions:
o end_that_request_first()
o end_that_request_chunk()
and stops exporting the functions below:
o end_that_request_last()
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds a variant of the interface, blk_end_bidi_request(),
which completes a bidi request.
A bidi request must be completed as a whole, both rq and rq->next_rq
at once. So the interface has 2 arguments for completion size.
As for ->end_io, only rq->end_io is called (rq->next_rq->end_io is not
called). So if special completion handling is needed, the handler
must be set to rq->end_io.
And the handler must take care of freeing next_rq too, since
the interface doesn't take care of it if rq->end_io is not NULL.
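A driver completing a bidi pair would then do something like the following (a sketch; my_bidi_end_io, out_bytes and in_bytes are illustrative stand-ins for the driver's handler and the rq / rq->next_rq completion sizes):
rq->end_io = my_bidi_end_io;	/* must also free rq->next_rq, as noted above */
blk_end_bidi_request(rq, error, out_bytes, in_bytes);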
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds a variant of the interface, blk_end_request_callback(),
which has a driver callback feature.
Drivers may need to do special work between end_that_request_first()
and end_that_request_last().
For such drivers, blk_end_request_callback() allows them to pass
a callback function which is called between end_that_request_first()
and end_that_request_last().
This interface is only a fallback for the other blk_end_request interfaces.
Drivers should avoid such tricky behaviors and use the other interfaces
as much as possible.
Currently, only one driver, ide-cd, needs this interface.
So this interface should/will be removed, after the driver removes
such tricky behaviors.
o ide-cd (cdrom_newpc_intr())
In PIO mode, cdrom_newpc_intr() needs to defer end_that_request_last()
until the device clears DRQ_STAT and raises an interrupt after
end_that_request_first().
So end_that_request_first() and end_that_request_last() are called
separately in cdrom_newpc_intr().
This means blk_end_request_callback() has to return without
completing the request even if there is no leftover in the request.
To satisfy this requirement, the callback function has a return value
so that drivers can tell blk_end_request_callback() to return
without completing the request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch converts core parts of block layer to use blk_end_request
interfaces. Related 'uptodate' arguments are converted to 'error'.
'dequeue' argument was originally introduced for end_dequeued_request(),
where no attempt should be made to dequeue the request as it's already
dequeued.
However, it's not necessary as it can be checked with
list_empty(&rq->queuelist).
(A dequeued request has an empty list and a queued request doesn't.)
And it has been done in blk_end_request interfaces.
As a result of this patch, end_queued_request() and
end_dequeued_request() become identical. A future patch will merge
and rename them and change users of those functions.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds/exports functions to get the size of a request in bytes.
They are useful because the blk_end_request interfaces take bytes
as the completed I/O size instead of sectors.
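For example (a sketch assuming the added helper is blk_rq_bytes(), returning the remaining size of the whole request in bytes):
/* complete whatever is left of the request, in bytes rather than sectors */
__blk_end_request(rq, error, blk_rq_bytes(rq));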
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds 2 new interfaces for request completion:
o blk_end_request() : called without queue lock
o __blk_end_request() : called with queue lock held
blk_end_request takes 'error' as an argument instead of 'uptodate',
which current end_that_request_* take.
The meanings of the values are below, and the value is used when the bio
is completed.
0 : success
< 0 : error
Some device drivers call some generic functions below between
end_that_request_{first/chunk} and end_that_request_last().
o add_disk_randomness()
o blk_queue_end_tag()
o blkdev_dequeue_request()
These are called in the blk_end_request interfaces as a part of
generic request completion.
So all device drivers end up calling the above functions.
To decide whether to call blkdev_dequeue_request(), blk_end_request
uses list_empty(&rq->queuelist) (blk_queued_rq() macro is added for it).
So drivers must re-initialize it (using list_init() or similar) before
calling blk_end_request if they use it for their own specific purpose.
(Currently, there is no driver which completes a request without
re-initializing the queuelist after using it, so rq->queuelist
can be used for the purpose above.)
"Normal" drivers can be converted to use blk_end_request()
in a standard way shown below.
a) end_that_request_{chunk/first}
spin_lock_irqsave()
(add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
end_that_request_last()
spin_unlock_irqrestore()
=> blk_end_request()
b) spin_lock_irqsave()
end_that_request_{chunk/first}
(add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
end_that_request_last()
spin_unlock_irqrestore()
=> spin_lock_irqsave()
__blk_end_request()
spin_unlock_irqsave()
c) spin_lock_irqsave()
(add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
end_that_request_last()
spin_unlock_irqrestore()
=> blk_end_request() or spin_lock_irqsave()
__blk_end_request()
spin_unlock_irqrestore()
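Concretely, pattern a) in a hypothetical driver's completion path shrinks to something like this (a sketch; nr_bytes and the old calls shown in the comment are illustrative):
/* old:
 *	end_that_request_first(rq, uptodate, rq->hard_nr_sectors);
 *	spin_lock_irqsave(q->queue_lock, flags);
 *	blkdev_dequeue_request(rq);
 *	end_that_request_last(rq, uptodate);
 *	spin_unlock_irqrestore(q->queue_lock, flags);
 */

/* new: one call, the queue lock is taken internally */
if (blk_end_request(rq, error, nr_bytes))
	return;		/* request not fully completed yet */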
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since the SCSI layer uses the request queues from the block layer, blktrace can
also be used to trace the requests to all SCSI devices (like SCSI tape drives),
not only disks. The only missing part is the ioctl interface to start and stop
tracing.
This patch adds the SETUP, START, STOP and TEARDOWN ioctls from blktrace to the
sg device files. With this change, blktrace can be used for SCSI devices like
for disks, e.g.: blktrace -d /dev/sg1 -o - | blkparse -i -
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (200 commits)
[SCSI] usbstorage: use last_sector_bug flag universally
[SCSI] libsas: abstract STP task status into a function
[SCSI] ultrastor: clean up inline asm warnings
[SCSI] aic7xxx: fix firmware build
[SCSI] aacraid: fib context lock for management ioctls
[SCSI] ch: remove forward declarations
[SCSI] ch: fix device minor number management bug
[SCSI] ch: handle class_device_create failure properly
[SCSI] NCR5380: fix section mismatch
[SCSI] sg: fix /proc/scsi/sg/devices when no SCSI devices
[SCSI] IB/iSER: add logical unit reset support
[SCSI] don't use __GFP_DMA for sense buffers if not required
[SCSI] use dynamically allocated sense buffer
[SCSI] scsi.h: add macro for enclosure bit of inquiry data
[SCSI] sd: add fix for devices with last sector access problems
[SCSI] fix pcmcia compile problem
[SCSI] aacraid: add Voodoo Lite class of cards.
[SCSI] aacraid: add new driver features flags
[SCSI] qla2xxx: Update version number to 8.02.00-k7.
[SCSI] qla2xxx: Issue correct MBC_INITIALIZE_FIRMWARE command.
...
Now that the old kobject_init() function is gone, rename
kobject_init_ng() to kobject_init() to clean up the namespace.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that the old kobject_add() function is gone, rename kobject_add_ng()
to kobject_add() to clean up the namespace.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one
directory. For compatibility /sys/block is created and contains symlinks
to the disks.
/sys/class/block
|-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
|-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1
|-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10
|-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5
|-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6
|-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7
|-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8
|-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9
`-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0
/sys/block/
|-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
`-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Dynamically create the kset instead of declaring it statically. We also
rename block_subsys to block_kset to catch all users of this symbol
with a build error instead of an easy-to-ignore build warning.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We don't need a "default" ktype for a kset. We should set this
explicitly every time for each kset. This change is needed so that we
can make ksets dynamic, and cleans up one of the odd, undocumented
assumptions that the kset/kobject/ktype model has.
This patch is based on a lot of help from Kay Sievers.
Nasty bug in the block code was found by Dave Young
<hidave.darkstar@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The purpose of this is to allow stacked alignment settings, with the
ultimate queue alignment being set to the largest alignment requirement
in the stack.
The reason for this is so that the SCSI mid-layer can relax the default
alignment requirements (which are basically causing a lot of superfluous
copying to go on in the SG_IO interface) while allowing transports,
devices or HBAs to add stricter limits if they need them.
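Presumably the stacking interface just keeps the largest mask seen so far, along these lines (a sketch only; the name blk_queue_update_dma_alignment is an assumption, as the message does not name the function):
void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
{
	BUG_ON(mask > PAGE_SIZE);

	if (mask > q->dma_alignment)
		q->dma_alignment = mask;
}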
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Currently in BSG, errors returned in req->errors aren't passed back to
the calling programme (either via SG_IO or via read/write). Fix this,
while preserving the SCSI convention of returning status in
req->errors.
Now update libsas to return errors correctly instead of to ignore
them.
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
David Dillow reported broken blktrace timestamps. The reason
is cpu_clock() which is not a global time source.
Fix blktrace timestamps by using ktime_get() like the networking
code does for packet timestamps. This also removes a whole lot
of complexity from blktrace.c and shrinks the code by 500 bytes:
text data bss dec hex filename
2888 124 44 3056 bf0 blktrace.o.before
2390 116 44 2550 9f6 blktrace.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
elv_register() always returns 0, and there isn't anything it does where
it should return an error (the only error condition is so grave that
it's handled with a BUG_ON).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
New write batches currently start from where the last one completed.
We have no idea where the head is after switching batches, so this
makes little sense. Instead, start the next batch from the request
with the earliest deadline in the hope that we avoid a deadline
expiry later on.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Two comments refer to deadlines applying to reads only. This is
not the case.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Use msecs_to_jiffies() and jiffies_to_msecs() in scsi_ioctl().
Sometimes callers use very large values for e.g. a vendor specific media
clear command, and the calculation can overflow.
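That is, something like the following instead of open-coded HZ arithmetic (a sketch of the shape of the change, not the exact hunks):
rq->timeout = msecs_to_jiffies(hdr->timeout);
/* ... and back the other way when filling in the result header: */
hdr->duration = jiffies_to_msecs(jiffies - rq->start_time);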
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This was a temporary debugging thing for sg chaining testing, revert
it now as it has served its purpose.
This reverts commit 563063a808.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fix a memory leak in alloc_disk_node(). Don't forget to free 'dkstats' when the allocation of 'part' failed.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the blktrace program segfaults, it will not be able
to call BLKTRACETEARDOWN. Now if we run blktrace
again, that would result in a failure to create the
block/<device> debugfs directory. This will result
in blk_remove_root() being called, which will set
blk_tree_root to NULL. But the debugfs block dir
still exists because it contains a subdirectory.
Now if we try to fix it using BLKTRACETEARDOWN
it won't work because blk_tree_root is NULL.
Fix the same.
Tested as below
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace -d /dev/hdc
Segmentation fault
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace -d /dev/hdc
BLKTRACESETUP: No such file or directory
Failed to start trace on /dev/hdc
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace -k /dev/hdc
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace -d /dev/hdc
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.
Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Credit goes to juergen.kadidlo@exasol.com for diagnosing this issue
and supplying the initial patch.
blk_queue_invalidate_tags() must use the proper requeueing paths instead
of open coding the re-add of the request, otherwise we bug out in rq
accounting. Just switch to using blk_requeue_request(), that takes care
of end-tag handling as well and also adds the blktrace REQUEUE notify
event that is also appropriate here.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In theory, if the queue was idle long enough, cfq_idle_class_timer may have
a false (and very long) timeout because jiffies can wrap into the past wrt
->last_end_request.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After the fresh boot:
ionice -c3 -p $$
echo cfq >> /sys/block/XXX/queue/scheduler
dd if=/dev/XXX of=/dev/null bs=512 count=1
Now dd hangs in D state and the queue is completely stalled for approximately
INITIAL_JIFFIES + CFQ_IDLE_GRACE jiffies. This is because cfq_init_queue()
forgets to initialize cfq_data->last_end_request.
(I guess this patch is not complete, overflow is still possible)
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Spotted by Nick <gentuu@gmail.com>, hopefully can explain the second trace in
http://bugzilla.kernel.org/show_bug.cgi?id=9180.
If ->async_idle_cfqq != NULL cfq_put_async_queues() puts it IOPRIO_BE_NR times
in a loop. Fix this.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
[BLOCK] Don't allow empty barriers to be passed down to queues that don't grok them
dm: bounce_pfn limit added
Deadline iosched: Fix batching fairness
Deadline iosched: Reset batch for ordered requests
Deadline iosched: Factor out finding latter reques
After switching data directions, deadline always starts the next batch
from the lowest-sector request. This gives excessive deadline expiries
and large latency and throughput disparity between high- and low-sector
requests; an order of magnitude in some tests.
This patch changes the batching behaviour so new batches start from the
request whose expiry is earliest.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The deadline I/O scheduler does not reset the batch count when starting
a new batch at a higher-sectored request. This means the second and
subsequent batch in the same data direction will never exceed a single
request in size whenever higher-sectored requests are pending.
This patch gives new batches in the same data direction as old ones
their full quota of requests by resetting the batch count.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Factor finding the next request in sector-sorted order into
a function deadline_latter_request.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.
So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The conversion of handlers to compat_blkdev_ioctl accidentally
disabled handling of most ioctl numbers on block devices because
of a typo. Fix the one line to enable it all again.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
For the locking to work, only the tag map and tag bit map may be shared
(incidentally, I was just explaining this to Nick yesterday, but I
apparently didn't review the code well enough myself). But we also share
the busy list! The busy_list must be queue private, or we need a
block_queue_tag covering lock as well.
So we have to move the busy_list to the queue. This'll work fine, and
it'll actually also fix a problem with blk_queue_invalidate_tags() which
will invalidate tags across all shared queues. This is a bit confusing,
the low level driver should call it for each queue separately since
otherwise you cannot kill tags on just a single queue for eg a hard
drive that stops responding. Since the function has no callers
currently, it's not an issue.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_get_queue()->cfq_find_alloc_queue() can fail, check the returned value.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Note that this isn't a bug at the moment, since the regular IO path
does not call this path without __GFP_WAIT set. However, it could be a
future bug, so I've applied it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_sync_queue() cancels the timer, but forgets to cancel the work.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The nr_sector argument of drive_stat_acct() is not used anymore since the read and write sectors statistics are now updated in end_that_request_first(). This patch removes the useless argument.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.
Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.
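Illustratively, driver code goes from three assignments per entry to one call (the open-coded "before" shape is shown for contrast and is not a hunk from the patch):
/* before (old open-coded style) */
sg->page = page;
sg->offset = offset;
sg->length = len;

/* after */
sg_set_page(sg, page, len, offset);

/* page-only variant for the two special cases */
sg_assign_page(sg, page);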
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since blk_rq_map_sg() sets the termination bit at the end of the sg
table, we could see it prematurely on the next mapping unless we
force drivers to do a full sg_init_table() prior to each mapping. So
force clear the termination bit to avoid having to put that clear in
the driver for every mapping.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than half of all the explicit pid usage.
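For example (an illustrative printk, not a hunk from the patch):
printk(KERN_INFO "%s: process %s (pid %d)\n",
       __func__, current->comm, task_pid_nr(current));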
[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-api docbook contents problems.
docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memset() of the sg entry was originally removed, because it could
overwrite a chain pointer. But it's quite OK to memset() it when we know
it's a valid entry, since it can't contain a chain pointer.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove the size limit max_sectors_kb imposed on max_readahead_kb.
The size restriction is unreasonable, especially when max_sectors_kb cannot
grow larger than max_hw_sectors_kb, which can be rather small for some disk
drives.
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly
from startup and CPU HOTPLUG functions.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Expose this setting for now, so that users can play with enabling
large commands without defaulting it to on globally. This is a debug
patch, it will be dropped for the final versions.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This implements functionality to pass down or insert a barrier
in a queue, without having data attached to it. The ->prepare_flush_fn()
infrastructure from data barriers is reused to provide this
functionality.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
End of device check is done twice in __generic_make_request() and it's
fully inlined each time. Factor out bio_check_eod().
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We can use this helper in the elevator core for BLKPREP_KILL, and it'll
also be useful for the empty barrier patch.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
struct file_operations is generally const (to avoid false sharing and get compile time errors on accidental writing to this shared structure); bsg recently added one of these without the const keyword. Patch below marks it const....
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (75 commits)
PM: merge device power-management source files
sysfs: add copyrights
kobject: update the copyrights
kset: add some kerneldoc to help describe what these strange things are
Driver core: rename ktype_edd and ktype_efivar
Driver core: rename ktype_driver
Driver core: rename ktype_device
Driver core: rename ktype_class
driver core: remove subsystem_init()
sysfs: move sysfs file poll implementation to sysfs_open_dirent
sysfs: implement sysfs_open_dirent
sysfs: move sysfs_dirent->s_children into sysfs_dirent->s_dir
sysfs: make sysfs_root a regular directory dirent
sysfs: open code sysfs_attach_dentry()
sysfs: make s_elem an anonymous union
sysfs: make bin attr open get active reference of parent too
sysfs: kill unnecessary NULL pointer check in sysfs_release()
sysfs: kill unnecessary sysfs_get() in open paths
sysfs: reposition sysfs_dirent->s_mode.
sysfs: kill sysfs_update_file()
...
struct cdev does not need the kobject name to be set, as it is never
used. This patch fixes up the few places it is set.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
A number of different drivers incorrectly access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.
This changes the uevent buffer functions to use a struct instead of a
long list of parameters. It does no longer require the caller to do the
proper buffer termination and size accounting, which is currently wrong
in some places. It fixes a known bug where parts of the uevent
environment are overwritten because of wrong index calculations.
Many thanks to Mathieu Desnoyers for finding bugs and improving the
error handling.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The floppy ioctls are used by multiple drivers, so they should be
handled in a shared location. Also, add minor cleanups.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These are shared by all cd-rom drivers and should have common
handlers. Do slight cosmetic cleanups in the process.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
BLKPG is common to all block devices, so it should be handled
by common code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These are common to multiple block drivers, so they should
be handled by the block layer.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_trace_setup is broken on x86_64 compat systems,
this makes the code work correctly on all 64 bit architectures
in compat mode.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Handle those blockdev ioctl calls that are compatible
directly from the compat_blkdev_ioctl() function, instead
of having to go through the compat_ioctl hash lookup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make compat_blkdev_ioctl and blkdev_ioctl reflect the respective
native versions. This is somewhat more efficient and makes it easier
to keep the two in sync.
Also get rid of the bogus handling for broken_blkgetsize and the
duplicate entry for BLKRASET.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As bi_end_io is only called once when the request is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The only caller of bio_endio that does not pass the full bi_size
is end_that_request_first. Also, no ->bi_end_io method is really
interested in bi_size being decremented.
So move the decrement and related code into ll_rw_blk and merge it
with order_bio_endio to form req_bio_endio which does endio functionality
specific to request completion.
As some ->bi_end_io methods do check bi_size of 0, we set it thus for
now, but that will go in the next patch.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./block/ll_rw_blk.c | 42 +++++++++++++++++++++++++++---------------
./fs/bio.c | 23 +++++++++++------------
2 files changed, 38 insertions(+), 27 deletions(-)
diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The entire function of flush_dry_bio_endio is to undo the effects
of bio_endio (when called on a barrier request). So remove the
function and the call to bio_endio.
This allows us to remove "bi_size" from "struct request_queue".
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./block/ll_rw_blk.c | 39 ++-------------------------------------
./include/linux/blkdev.h | 1 -
2 files changed, 2 insertions(+), 38 deletions(-)
diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_cpu_notifier is marked as __devinitdata, but __devinitdata need not
be __init even if HOTPLUG_CPU=n, which wastes space. It should be marked
__cpuinitdata, and the callback itself as __cpuinit.
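The declaration then ends up along these lines (a sketch; the callback name blk_cpu_notify is illustrative):
static int __cpuinit blk_cpu_notify(struct notifier_block *self,
				    unsigned long action, void *hcpu);

static struct notifier_block blk_cpu_notifier __cpuinitdata = {
	.notifier_call	= blk_cpu_notify,
};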
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These have very similar functions and should share code where
possible.
Signed-off-by: Neil Brown <neilb@suse.de>
diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_bio_prep is exported for use in exactly
one place. That place can benefit from using
the new blk_rq_append_bio instead.
So
- change dm-emc to call blk_rq_append_bio
- stop exporting blk_rq_bio_prep, and
- initialise rq_disk in blk_rq_bio_prep,
as dm-emc needs it.
Signed-off-by: Neil Brown <neilb@suse.de>
diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ll_back_merge_fn is currently exported to SCSI where it is used,
together with blk_rq_bio_prep, in exactly the same way these
functions are used in __blk_rq_map_user.
So move the common code into a new function (blk_rq_append_bio), and
don't export ll_back_merge_fn any longer.
Signed-off-by: Neil Brown <neilb@suse.de>
diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Every usage of rq_for_each_bio wraps a usage of
bio_for_each_segment, so these can be combined into
rq_for_each_segment.
We define "struct req_iterator" to hold the 'bio' and 'index' that
are needed for the double iteration.
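Usage then looks like this (a sketch of the combined iterator; do_something() is an illustrative placeholder):
struct req_iterator iter;
struct bio_vec *bvec;

rq_for_each_segment(bvec, rq, iter) {
	/* handle one bio_vec of the request at a time */
	char *p = page_address(bvec->bv_page) + bvec->bv_offset;
	do_something(p, bvec->bv_len);
}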
Signed-off-by: Neil Brown <neilb@suse.de>
Various compile fixes by me...
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_recalc_rq_segments calls blk_recount_segments on each bio,
then does some extra calculations to handle segments that overlap
two bios.
If we merge the code from blk_recount_segments into
blk_recalc_rq_segments, we can process the whole request one bio_vec
at a time, and not need the messy cross-bio calculations.
Then blk_recount_segments can be implemented by calling
blk_recalc_rq_segments, passing it a simple on-stack request which
stores just the bio.
Signed-off-by: Neil Brown <neilb@suse.de>
diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Should add some comments for the tag barriers (they won't be so important
if we can switch over to the explicit _lock bitops, but for now we should
make it clear).
Jens' original patch said a barrier after the test_and_clear_bit was also
required. I can't see why (and it would prevent the use of the _lock bitop).
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a race condition in blk_queue_end_tag() for shared tag maps,
users include stex (promise supertrak thingy) and qla2xxx. The former
at least has reported bugs in this area, not sure why we haven't seen
any for the latter. It could be because the window is narrow and that
other conditions in the qla2xxx code hide this. It's a real bug,
though, as the stex smp users can attest.
We need to ensure two things - the tag bit clearing needs to happen
AFTER we cleared the tag pointer, as the tag bit clearing/setting is
what protects this map. Secondly, we need to ensure that the visibility
of the tag pointer and tag bit clear are ordered properly.
[ I removed the SMP barriers - "test_and_clear_bit()" already implies
all the required barriers. -- Linus ]
Also see http://bugzilla.kernel.org/show_bug.cgi?id=7842
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch provides more information concerning REMAP operations on block
IOs. The additional information provides clearer details at the user level,
and supports post-processing analysis in btt.
o Adds in partition remaps on the same device.
o Fixed up the remap information in DM to be in the right order
o Sent up mapped-from and mapped-to device information
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This updates sg_io_v4 structure (based on Doug's RFC, release 1.3).
The major changes are:
- add dout_resid field
- increase tag size to 64 bits to comply with SAM-4 and SRP
- add dout_iovec_count and din_iovec_count
dout_iovec_count and din_iovec_count aren't supported now. I'm not
sure whether they will be supported or not but they were added for the
possible future changes.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: (28 commits)
[SCSI] mpt fusion: Changes in mptctl.c for logging support
[SCSI] mpt fusion: Changes in mptfc.c mptlan.c mptsas.c and mptspi.c for logging support
[SCSI] mpt fusion: Changes in mptscsih.c for logging support
[SCSI] mpt fusion: Changes in mptbase.c for logging support
[SCSI] mpt fusion: logging support in Kconfig, Makefile, mptbase.h and addition of mptdebug.h
[SCSI] libsas: Fix potential NULL dereference in sas_smp_get_phy_events()
[SCSI] bsg: Fix build for CONFIG_BLOCK=n
[SCSI] aacraid: fix Sunrise Lake reset handling
[SCSI] aacraid: add SCSI SYNCHONIZE_CACHE range checking
[SCSI] add easyRAID to the no report luns blacklist
[SCSI] advansys: lindent and other large, uninteresting changes
[SCSI] aic79xx, aic7xxx: Fix incorrect width setting
[SCSI] qla2xxx: fix to honor ignored parameters in sysfs attributes
[SCSI] aacraid: draw line in sand, sundry cleanup and version update
[SCSI] iscsi_tcp: Turn off bounce buffers
[SCSI] libiscsi: fix cmd seqeunce number checking
[SCSI] iscsi_tcp, ib_iser Enable module refcounting for iscsi host template
[SCSI] libiscsi: make sure session is not blocked when removing host
[SCSI] libsas: Remove PCI dependencies
[SCSI] simscsi: convert to use the data buffer accessors
...
BLK_DEV_BSG was added outside of the if BLOCK check, which allows it to
be enabled when CONFIG_BLOCK=n. This leads to many screenlengths of
errors, starting with a parse error on the request_queue_t definition.
Obviously this wasn't intended for CONFIG_BLOCK=n usage, so just move the
option back into the block.
Caught with a randconfig on sh.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
use cpu_clock() instead of sched_clock(). (the latter is not a proper
clock-source)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
BLK_DEV_BSG was added outside of the if BLOCK check, which allows it to
be enabled when CONFIG_BLOCK=n. This leads to many screenlengths of
errors, starting with a parse error on the request_queue_t definition.
Obviously this wasn't intended for CONFIG_BLOCK=n usage, so just move the
option back into the block.
Caught with a randconfig on sh.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
--
block/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweep of
the kernel, get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- kill uhdr in bsg_command structure
- it's not necessary to put SG v4 stuff to block/scsi_ioctl.c
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
This replaces the current linear search for a unique minor number with
lib/idr.c.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
blk_fill_sghdr_rq, blk_unmap_sghdr_rq, and blk_complete_sghdr_rq were
exported for bsg, however bsg was changed to support only sg v4.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
scsi_sysfs_add_sdev ignores the bsg_register_queue failure, so
bsg_unregister_queue must check whether the queue has a bsg device.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Currently, bsg doesn't make class backlinks (a process whereby you'd get
a link to bsg in the device directory in the same way you get one for
sg). This is because the bsg device is uninitialised, so the class
device has nothing it can attach to. The fix is to make the bsg device
point to the cdevice of the entity creating the bsg, necessitating
changing the bsg_register_queue() prototype into a form that takes the
generic device.
Acked-by: FUJITA Tomonori <tomof@acm.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
unfortunately, if IS_ERR(class_dev) is true, that means class_dev isn't
null and the check in the error leg is pointless ... it's also asking
for trouble to request unregistration of a device we haven't actually
created (although it works currently). Fix by using explicit gotos and
unregisters.
Acked-by: FUJITA Tomonori <tomof@acm.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
There are some leftover bits from the task cooperator patch, that was
yanked out again. While it will get reintroduced, no point in having
this write-only stuff in the tree. So yank it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we have two processes with different ioprio_class, but the same
ioprio_data, their async requests will fall into the same queue. I guess
such behavior is not expected, because it's not right to put real-time
requests and best-effort requests in the same queue.
The attached patch fixes the problem by introducing additional *cfqq
fields on cfqd, pointing to per-(class,priority) async queues.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
This patch moves the bsg registration into SCSI so that bsg no longer
has a dependency on the scsi_interface_register API.
This can be viewed as a temporary expedient until we can get universal
bsg binding sorted out properly. Also use the sdev bus_id as the
generic bsg name (to avoid clashes with the queue name).
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).
Here is a short excerpt of the semantic patch performing
this transformation:
@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@
x =
- kmalloc
+ kzalloc
(E1,E2)
... when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);
@@
expression E1,E2,E3;
@@
- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)
[akpm@linux-foundation.org: get kcalloc args the right way around]
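In plain C terms, the transformation is (illustrative):
/* before */
x = kmalloc(sizeof(*x), GFP_KERNEL);
if (!x)
	return -ENOMEM;
memset(x, 0, sizeof(*x));

/* after */
x = kzalloc(sizeof(*x), GFP_KERNEL);
if (!x)
	return -ENOMEM;

/* and kmalloc(n * size, ...) + memset becomes kcalloc(n, size, ...) */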
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block:
bsg: fix missing space in version print
Don't define empty struct bsg_class_device if !CONFIG_BLK_DEV_BSG
bsg: Kconfig updates
bsg: minor cleanup
bsg: device hash table cleanup
bsg: fix initialization error handling bugs
bsg: mark FUJITA Tomonori as bsg maintainer
bsg: convert to dynamic major
bsg: address various review comments
With the WARN_ON in place and all callers of unregister_blkdev() fixed, we
can now make unregister_blkdev() return void.
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When unregister_blkdev() has failed, something wrong happened. This patch
adds WARN_ON to notify of such badness.
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
variant in the past. But with __GFP_ZERO it is possible now to do zeroing
while allocating.
Use __GFP_ZERO to remove the explicit clearing of memory via memset wherever
we can.
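For instance (illustrative):
/* before */
p = kmalloc_node(size, GFP_KERNEL, node);
if (p)
	memset(p, 0, size);

/* after */
p = kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, node);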
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tomo introduced a bug in his commit, removing the space between
"driver" and "version" in the init printk.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- add the detailed explanation.
- remove 'default y'.
- make 'EXPERIMENTAL' keyword visible to the user in menu.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This fixes the following bugs and cleans up the initialization code:
- cdev_del is missing.
- unregister_chrdev_region should be used instead of unregister_chrdev.
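A sketch of the corrected init/exit pairing (names like bsg_fops and BSG_MAX_DEVS are illustrative, not taken from the patch):
static dev_t bsg_devt;
static struct cdev bsg_cdev;

static int __init bsg_init(void)
{
	int ret;

	ret = alloc_chrdev_region(&bsg_devt, 0, BSG_MAX_DEVS, "bsg");
	if (ret)
		return ret;

	cdev_init(&bsg_cdev, &bsg_fops);
	ret = cdev_add(&bsg_cdev, bsg_devt, BSG_MAX_DEVS);
	if (ret) {
		unregister_chrdev_region(bsg_devt, BSG_MAX_DEVS);
		return ret;
	}
	return 0;
}

static void __exit bsg_exit(void)
{
	cdev_del(&bsg_cdev);					/* was missing */
	unregister_chrdev_region(bsg_devt, BSG_MAX_DEVS);	/* not unregister_chrdev() */
}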
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
240 was hardcoded; that was clearly a dumb mistake. Convert bsg
to use alloc_chrdev_region() to retrieve a dynamic major.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This addresses most of the comments made by Andrew. The two remaining
are conversion to idr, and dynamic major.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The SCSI code can be compiled modular, but BLK_DEV_BSG currently cannot,
and depends on the SCSI layer. So make sure that it depends on the SCSI
layer being compiled in, not just available as a module.
Noticed by Jeff Garzik and S.Çağlar Onur.
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We had a merge issue with the "dentry" field going away from the
kobject, and being replaced by a sysfs_dirent field (named "sd")
instead. That broke the BSG compile.
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This updates bsg entry in Kconfig:
- bsg supports sg v4
- bsg depends on SCSI
- it might be better to mark it experimental for a while
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>