bitmap_copy_from_slot reads the bitmap from the slot mentioned.
It then copies the set bits to the node local bitmap.
This is helper function for the resync operation on node failure.
bitmap_set_memory_bits() currently assumes it is only run at startup and that
they bitmap is currently empty. So if it finds that a region is already
marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
to always set the NEEDED_MASK bit if 'needed' is set.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
On-disk format:
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
joining DLM lockspace.
Since the first time, the first bitmap is read. After the call
to the cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read
the bitmap lock (when bitmap lock is introduced in later patches)
Questions:
1. cluster name is repeated in all bitmap supers. Is that okay?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
DLM offers callbacks when a node fails and the lock remastery
is performed:
1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot
recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
md_cluster_info stores the cluster information in the MD device.
The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)
The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.
We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.
Signed-off-by: NeilBrown <neilb@suse.de>
commit 8eb23b9f35
sched: Debug nested sleeps
causes false-positive warnings in RAID5 code.
This annotation removes them and adds a comment
explaining why there is no real problem.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If two threads call bitmap_unplug at the same time, then
one might schedule all the writes, and the other might
decide that it doesn't need to wait. But really it does.
It rarely hurts to wait when it isn't absolutely necessary,
and the current code doesn't really focus on 'absolutely necessary'
anyway. So just wait always.
This can potentially lead to data corruption if a crash happens
at an awkward time and data was written before the bitmap was
updated. It is very unlikely, but this should go to -stable
just to be safe. Appropriate for any -stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org (please delay until 3.18 is released)
file_page_index(store, 0) is *always* 0.
This is because the bitmap sb, at 256 bytes, is *always* less than
one page.
So subtracting it has no effect and the code should be removed.
Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
md bitmap code currently tries to use i_writecount to stop any other
process from writing to out bitmap file. But that is really an abuse
and has bit-rotted so locking is all wrong.
So discard that - root should be allowed to shoot self in foot.
Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different array, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write. Support for ->bmap is already checked
elsewhere).
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.
This patch performs the following renames.
* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above
Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.
This patch is strictly rename only and doesn't introduce any
functional difference.
- mic / gpio renames were missing. Spotted by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention. For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.
This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns(). This makes confusions a lot less
likely.
There are other interfaces which take @ns before @name. They'll be
updated by following patches.
This patch doesn't introduce any functional changes.
v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
error on module builds. Reported by build test robot. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The value returned by test_and_set_bit_le() drivers/md/bitmap.c is not used.
So just use set_bit_le(). The same goes for test_and_clear_bit_le().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
This patch replaces list_for_each_continue_rcu() with
list_for_each_entry_continue_rcu() to save a few lines
of code and allow removing list_for_each_continue_rcu().
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.
So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that bitmaps can grow and shrink it is best if we record
how much space is available. This means that when
we reduce the size of the bitmap we won't "lose" the space
for late when we might want to increase the size of the bitmap
again.
Signed-off-by: NeilBrown <neilb@suse.de>
As a reshape may change the sync_size and/or chunk_size, we need
to update these whenever we write out the bitmap superblock.
Signed-off-by: NeilBrown <neilb@suse.de>
This function will allocate the new data structures and copy
bits across from old to new, allowing for the possibility that the
chunksize has changed.
Use the same function for performing the initial allocation
of the structures. This improves test coverage.
When bitmap_resize is used to resize an existing bitmap, it
only copies '1' bits in, not '0' bits.
So when allocating the bitmap, ensure everything is initialised
to ZERO.
Signed-off-by: NeilBrown <neilb@suse.de>
The new "struct bitmap_counts" contains all the fields that are
related to counting the number of active writes in each bitmap chunk.
Having this separate will make it easier to change the chunksize
or overall size of a bitmap atomically.
Signed-off-by: NeilBrown <neilb@suse.de>
Using e.g. set_bit instead of __set_bit and using test_and_clear_bit
allow us to remove some locking and contract other locked ranges.
It is rare that we set or clear a lot of these bits, so gain should
outweigh any cost.
Signed-off-by: NeilBrown <neilb@suse.de>
There functions really do one thing together: release the
'bitmap_storage'. So make them just one function.
Since we removed the locking (previous patch), we don't need to zero
any fields before freeing them, so it all becomes a bit simpler.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no real value in freeing things the moment there is an error.
It is just as good to free the bitmap file and pages when the bitmap
is explicitly removed (and replaced?) or at shutdown.
With this gone, the bitmap will only disappear when the array is
quiescent, so we can remove some locking.
As the 'filemap' doesn't disappear now, include extra checks before
trying to write any of it out.
Also remove the check for "has it disappeared" in
bitmap_daemon_write().
Signed-off-by: NeilBrown <neilb@suse.de>
All of these sites can only be called from process context with
irqs enabled, so using irqsave/irqrestore just adds noise.
Remove it.
Signed-off-by: NeilBrown <neilb@suse.de>
We currently use '&' and '|' which isn't the norm in the kernel
and doesn't allow easy atomicity.
So change to bit numbers and {set,clear,test}_bit.
This allows us to remove a spinlock/unlock (which was dubious anyway)
and some other simplifications.
Signed-off-by: NeilBrown <neilb@suse.de>
Just do single-bit manipulations on bitmap->flags and copy whole
value between that and sb->state.
This will allow next patch which changes how bit manipulations are
performed on bitmap->flags.
This does result in BITMAP_STALE not being set in sb by
bitmap_read_sb, however as the setting is determined by other
information in the 'sb' we do not lose information this way.
Normally, bitmap_load will be called shortly which will clear
BITMAP_STALE anyway.
Signed-off-by: NeilBrown <neilb@suse.de>
This function isn't really needed. It sets or clears a flag in both
bitmap->flags and sb->state.
However both times it is called, bitmap_update_sb is called soon
afterwards which copies bitmap->flags to sb->state.
So just make changes to bitmap->flags, and open-code those rather than
hiding in a function.
Signed-off-by: NeilBrown <neilb@suse.de>
This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
Most often we have the page number, not the page. And that is what
the *_page_attr() functions really want. So change the arguments to
take that number.
Signed-off-by: NeilBrown <neilb@suse.de>
Instead of allocating pages in read_sb_page, read_page and
bitmap_read_sb, allocate them all in bitmap_init_from disk.
Also replace the hack of calling "attach_page_buffers(page, NULL)" to
ensure that free_buffer() won't complain, by putting a test for
PagePrivate in free_buffer().
Signed-off-by: NeilBrown <neilb@suse.de>
An md bitmap comprises two parts
- internal counting of active writes per 'chunk'.
- external storage of whether there are any active writes on
each chunk
The second requires the first, but the first doesn't require the
second.
Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.
So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.
A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.
We don't actually activate these bitmaps yet - that will come
in a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.
So create an attribute to store this information and set appropriate
defaults.
It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.
Signed-off-by: NeilBrown <neilb@suse.de>
There are two different 'pending' concepts in the handling of the
write intent bitmap.
Firstly, a 'page' from the bitmap (which container PAGE_SIZE*8 bits)
may have changes (bits cleared) that should be written in due course.
There is no hurry for these and the page will transition from
PENDING to NEEDWRITE and will then be written, though if it ever
becomes DIRTY it will be written much sooner and PENDING will be
cleared.
Secondly, a page of counters - which contains PAGE_SIZE/2 counters, one
for each bit, can usefully have a 'pending' flag which indicates if
any of the counters are low (2 or 1) and ready to be processed by
bitmap_daemon_work(). If this flag is clear we can skip the whole
page.
These two concepts are currently combined in the bitmap-file flag.
This causes a tighter connection between the counters and the bitmap
file than I would like - as I want to add some flexibility to the
bitmap file.
So introduce a new flag with the page-of-counters, and rewrite
bitmap_daemon_work() so that it handles the two different 'pending'
concepts separately.
This also allows us to clear BITMAP_PAGE_PENDING when we write out
a dirty page, which may occasionally reduce the number of times we
write a page.
Signed-off-by: NeilBrown <neilb@suse.de>
commit 61a0d80c "md/bitmap: discard CHUNK_BLOCK_SHIFT macro"
replaced CHUNK_BLOCK_RATIO() by the same text that was
replacing CHUNK_BLOCK_SHIFT() - which is clearly wrong.
The result is that 'chunks' is often too small by 1,
which can sometimes result in a crash (not sure how).
So use the correct replacement, and get rid of CHUNK_BLOCK_RATIO
which is no longe used.
Reported-by: Karl Newman <siliconfiend@gmail.com>
Tested-by: Karl Newman <siliconfiend@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a bitmap is added while the array is active, it is possible
for bitmap_daemon_work to run while the bitmap is being
initialised.
This is particularly a problem if bitmap_daemon_work sees
bitmap->filemap as non-NULL before it has been filled in properly.
So hold bitmap_info.mutex while filling in ->filemap
to prevent problems.
This patch is suitable for any -stable kernel, though it might not
apply cleanly before about 3.1.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
bitmap_new_disk_sb() would still create V3 bitmap superblock
with host-endian layout.
Perhaps I'm confused, but shouldn't bitmap_new_disk_sb() be
creating a V4 bitmap superblock instead, that is portable,
as per comment in bitmap.h?
Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly tidying up code in preparation for some bigger changes
next time.
A few bug fixes tagged for -stable.
Main functionality change is that some RAID10 arrays can now
grow to use extra space that may have been made available on the
individual devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT2bLBjnsnt1WYoG5AQKN3xAAv1UlR5Kem5WN7Ex4lmR9xj3lr9dbURYT
TtvrUuCy3pYYWdTuijb+IBqkbODF0kPDHIhUiBx9fXUfMavkp/b9heXS/vJ3pcH4
1j99NUbOGL/AylD1TPRV9TQxGTKhEjK3n26bY0t/amLc92bWJaytMO1B9cz38LN+
qx6ufpIepz4DPXXtPYpnkBR4cZ6L4/ZXQvjf5BqG6WfKwc+0Nyncg8ipYEqhBWy7
R7ztF5yPo0yl96Wopa2KG91OroWflmyZo1DNYcbUbKtbNGGtYC92GFadOH+wNupM
FnmXv10ivfVGU5w4SpshAwOg+4OSUqmWNsBxUhpYbf8ChbN+lOl0VZdH6UBxo19D
3SqZWT/yz4I4HYd5rtr35MXFdOeBNM++CHQs4F68BLA0B6OcHfWsA9bvly2tnBVx
iEBFPd277qWztUr8m6yz7AFf/0dgyXuIhuB3d7IkVrG5yG3FX6hPi2T0FSA33qMx
Lwi5w6O4DREg5tG09xEYEnXgXe+PnB8HsKb1U/m76XMQ0UScvX6dLA6934Vg+DCv
xf+AYqob0Tc/Op5I7h2PbVXq7DciNXwlX1WvM0m+TEaV+3fl1FB0VsCcANAV6JVn
uRLmvtePQRt0hxAog2p7OsumVnxMhbuEo5h8rJMKWM7IbhueKNoz+gBwpcFLzBmY
ygWc4peLQpE=
=MGuM
-----END PGP SIGNATURE-----
Merge tag 'md-3.4' of git://neil.brown.name/md
Pull md updates for 3.4 from Neil Brown:
"Mostly tidying up code in preparation for some bigger changes next
time.
A few bug fixes tagged for -stable.
Main functionality change is that some RAID10 arrays can now grow to
use extra space that may have been made available on the individual
devices."
Fixed up trivial conflicts with the k[un]map_atomic() cleanups in
drivers/md/bitmap.c.
* tag 'md-3.4' of git://neil.brown.name/md: (22 commits)
md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
md: fix clearing of the 'changed' flags for the bad blocks list.
md/bitmap: discard CHUNK_BLOCK_SHIFT macro
md/bitmap: remove unnecessary indirection when allocating.
md/bitmap: remove some pointless locking.
md/bitmap: change a 'goto' to a normal 'if' construct.
md/bitmap: move printing of bitmap status to bitmap.c
md/bitmap: remove some unused noise from bitmap.h
md/raid10 - support resizing some RAID10 arrays.
md/raid1: handle merge_bvec_fn in member devices.
md/raid10: handle merge_bvec_fn in member devices.
md: add proper merge_bvec handling to RAID0 and Linear.
md: tidy up rdev_for_each usage.
md/raid1,raid10: avoid deadlock during resync/recovery.
md/bitmap: ensure to load bitmap when creating via sysfs.
md: don't set md arrays to readonly on shutdown.
md: allow re-add to failed arrays.
md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
md: Use existed macros instead of numbers
md/raid5: removed unused 'added_devices' variable.
...
Be redefining ->chunkshift as the shift from sectors to chunks rather
than bytes to chunks, we can just use "bitmap->chunkshift" which is
shorter than the macro call, and less indirect.
Signed-off-by: NeilBrown <neilb@suse.de>
These funcitons don't add anything useful except possibly the trace
points, and I don't think they are worth the extra indirection.
So remove them.
Signed-off-by: NeilBrown <neilb@suse.de>
There is nothing gained by holding a lock while we check if a pointer
is NULL or not. If there could be a race, then it could become NULL
immediately after the unlock - but there is no race here.
So just remove the locking.
Signed-off-by: NeilBrown <neilb@suse.de>
The use of a goto makes the control flow more obscure here.
So make it a normal:
if (x) {
Y;
}
No functional change.
Signed-off-by: NeilBrown <neilb@suse.de>