tweaks.
- Add journal support to the DM raid target to close the 'write hole' on
raid 4/5/6.
- Fix dm-cache corruption, due to rounding bug, when cache exceeds 2TB.
- Add 'metadata2' feature to dm-cache to separate the dirty bitset out
from other cache metadata. This improves speed of shutting down
a large cache device (which implies writing out dirty bits).
- Fix a memory leak during dm-stats data structure destruction.
- Fix a DM multipath round-robin path selector performance regression
that was caused by less precise balancing across all paths.
- Lastly, introduce a DM core fix for a long-standing DM snapshot
deadlock that is rooted in the complexity of the device stack used in
conjunction with block core maintaining bios on current->bio_list to
manage recursion in generic_make_request(). A more comprehensive fix
to block core (and its hook in the cpu scheduler) would be wonderful
but this DM-specific fix is pragmatic considering how difficult it has
been to make progress on a generic fix.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJYrJJeAAoJEMUj8QotnQNaDskIAIJeMX3Dc8Skt00tZ6vEj3p6
9juDpOrBKH3RYdqPmrYy9lVhhpFs6OoDfTQZaW/SmjDjHboJ3skKMjO+/NWav4nN
39LoDfxLbDi06fC7Y4H7FHUPjb5sKSzw4W5IttFEKmHOwkz+iwVFL1R0dihBqv7G
Lq0Ta6xffW8jHrzpmmSDY1I6FSmZ9LlHPCL00qQ5Z7WkMS5oDk0GzZoLFasdNfvm
fP9N13+uel2/R7hclpxE6J+IZPN5ARG3HAQ5POS+2gMlIzaH4AlMh7yf5q0sSGwq
uQsmdps8c+LOtAakOzVScykEZvwBh+ci8VqE1X1zol+fl8ijeWqgWtz4XXYECC0=
=saD8
-----END PGP SIGNATURE-----
Merge tag 'dm-4.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Fix dm-raid transient device failure processing and other smaller
tweaks.
- Add journal support to the DM raid target to close the 'write hole'
on raid 4/5/6.
- Fix dm-cache corruption, due to rounding bug, when cache exceeds 2TB.
- Add 'metadata2' feature to dm-cache to separate the dirty bitset out
from other cache metadata. This improves speed of shutting down a
large cache device (which implies writing out dirty bits).
- Fix a memory leak during dm-stats data structure destruction.
- Fix a DM multipath round-robin path selector performance regression
that was caused by less precise balancing across all paths.
- Lastly, introduce a DM core fix for a long-standing DM snapshot
deadlock that is rooted in the complexity of the device stack used in
conjunction with block core maintaining bios on current->bio_list to
manage recursion in generic_make_request(). A more comprehensive fix
to block core (and its hook in the cpu scheduler) would be wonderful
but this DM-specific fix is pragmatic considering how difficult it
has been to make progress on a generic fix.
* tag 'dm-4.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
dm: flush queued bios when process blocks to avoid deadlock
dm round robin: revert "use percpu 'repeat_count' and 'current_path'"
dm stats: fix a leaked s->histogram_boundaries array
dm space map metadata: constify dm_space_map structures
dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()
dm persistent data: add cursor skip functions to the cursor APIs
dm cache metadata: use dm_bitset_new() to create the dirty bitset in format 2
dm bitset: add dm_bitset_new()
dm cache metadata: name the cache block that couldn't be loaded
dm cache metadata: add "metadata2" feature
dm cache metadata: use bitset cursor api to load discard bitset
dm bitset: introduce cursor api
dm btree: use GFP_NOFS in dm_btree_del()
dm space map common: memcpy the disk root to ensure it's arch aligned
dm block manager: add unlikely() annotations on dm_bufio error paths
dm cache: fix corruption seen when using cache > 2TB
dm raid: cleanup awkward branching in raid_message() option processing
dm raid: use mddev rather than rdev->mddev
dm raid: use read_disk_sb() throughout
dm raid: add raid4/5/6 journaling support
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3
J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm
eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h
r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK
bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl
DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz
vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv
Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn
o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ
cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE
YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN
7ut0IYYEPaYUX5xFn1K6
=z7AL
-----END PGP SIGNATURE-----
Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
- blk-mq scheduling framework from me and Omar, with a port of the
deadline scheduler for this framework. A port of BFQ from Paolo is in
the works, and should be ready for 4.12.
- Various fixups and improvements to the above scheduling framework
from Omar, Paolo, Bart, me, others.
- Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
This allows us to export more information that helps debug hangs or
performance issues, without cluttering or abusing the sysfs API.
- Fixes for the sbitmap code, the scalable bitmap code that was
migrated from blk-mq, from Omar.
- Removal of the BLOCK_PC support in struct request, and refactoring of
carrying SCSI payloads in the block layer. This cleans up the code
nicely, and enables us to kill the SCSI specific parts of struct
request, shrinking it down nicely. From Christoph mainly, with help
from Hannes.
- Support for ranged discard requests and discard merging, also from
Christoph.
- Support for OPAL in the block layer, and for NVMe as well. Mainly
from Scott Bauer, with fixes/updates from various others folks.
- Error code fixup for gdrom from Christophe.
- cciss pci irq allocation cleanup from Christoph.
- Making the cdrom device operations read only, from Kees Cook.
- Fixes for duplicate bdi registrations and bdi/queue life time
problems from Jan and Dan.
- Set of fixes and updates for lightnvm, from Matias and Javier.
- A few fixes for nbd from Josef, using idr to name devices and a
workqueue deadlock fix on receive. Also marks Josef as the current
maintainer of nbd.
- Fix from Josef, overwriting queue settings when the number of
hardware queues is updated for a blk-mq device.
- NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
aborted, if we didn't end up aborting it.
- SG gap merging fix from Ming Lei for block.
- Loop fix also from Ming, fixing a race and crash between setting loop
status and IO.
- Two block race fixes from Tahsin, fixing request list iteration and
fixing a race between device registration and udev device add
notifiations.
- Double free fix from cgroup writeback, from Tejun.
- Another double free fix in blkcg, from Hou Tao.
- Partition overflow fix for EFI from Alden Tondettar.
* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
nvme: Check for Security send/recv support before issuing commands.
block/sed-opal: allocate struct opal_dev dynamically
block/sed-opal: tone down not supported warnings
block: don't defer flushes on blk-mq + scheduling
blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
blk-mq: don't special case flush inserts for blk-mq-sched
blk-mq-sched: don't add flushes to the head of requeue queue
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
block: do not allow updates through sysfs until registration completes
lightnvm: set default lun range when no luns are specified
lightnvm: fix off-by-one error on target initialization
Maintainers: Modify SED list from nvme to block
Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
uapi: sed-opal fix IOW for activate lsp to use correct struct
cdrom: Make device operations read-only
elevator: fix loading wrong elevator type for blk-mq devices
cciss: switch to pci_irq_alloc_vectors
block/loop: fix race between I/O and set_status
blk-mq-sched: don't hold queue_lock when calling exit_icq
block: set make_request_fn manually in blk_mq_update_nr_hw_queues
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks
by redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory. However other deadlocks (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).
** the related dm-snapshot deadlock is detailed here:
https://www.redhat.com/archives/dm-devel/2016-July/msg00065.html
Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule() call. Consequently,
when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueus and they can
complete without waiting for the mutex to be available.
The structure blk_plug contains an entry cb_list and this list can contain
arbitrary callback functions that are called when the process blocks.
To implement this fix DM (ab)uses the onstack plug's cb_list interface
to get its flush_current_bio_list() called at schedule() time.
This fixes the snapshot deadlock - if the map method blocks,
flush_current_bio_list() will be called and it redirects bios waiting
on current->bio_list to appropriate workqueues.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The sloppy nature of lockless access to percpu pointers
(s->current_path) in rr_select_path(), from multiple threads, is
causing some paths to used more than others -- which results in less
IO performance being observed.
Revert these upstream commits to restore truly symmetric round-robin
IO submission in DM multipath:
b0b477c dm round robin: use percpu 'repeat_count' and 'current_path'
802934b dm round robin: do not use this_cpu_ptr() without having preemption disabled
There is no benefit to all this complexity if repeat_count = 1 (which is
the recommended default).
Cc: stable@vger.kernel.org # 4.6+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Declare dm_space_map structures as const as they are only passed as an
argument to the function memcpy. This argument is of type const void *,
so dm_space_map structures having this property can be declared as
const.
File size before:
text data bss dec hex filename
4889 240 0 5129 1409 dm-space-map-metadata.o
File size after:
text data bss dec hex filename
5139 0 0 5139 1413 dm-space-map-metadata.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Improves __load_mapping_v1() and __load_mapping_v2() DMERR messages to
explicitly name the cache block number whose mapping couldn't be
loaded.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If "metadata2" is provided as a table argument when creating/loading a
cache target a more compact metadata format, with separate dirty bits,
is used. "metadata2" improves speed of shutting down a cache target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_btree_del() is called from an ioctl so don't recurse into FS.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The metadata_space_map_root passed to sm_ll_open_metadata() may or may
not be arch aligned, use memcpy to ensure it is. This is not a fast
path so the extra memcpy doesn't hurt us.
Long-term it'd be better to use the kernel's alignment infrastructure to
remove the memcpy()s that are littered across persistent-data (btree,
array, space-maps, etc).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A rounding bug due to compiler generated temporary being 32bit was found
in remap_to_cache(). A localized cast in remap_to_cache() fixes the
corruption but this preferred fix (changing from uint32_t to sector_t)
eliminates potential for future rounding errors elsewhere.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
.. at least for unprivileged users. Before we called into the SCSI
ioctl code to allow excemptions for a few SCSI passthrough ioctls,
but this is pretty unsafe and except for this call dm knows nothing
about SCSI ioctls.
As the SCSI ioctl code is now optional, we really don't want to
drag it in for DM, and the exception is not very useful anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
The lockdep splat below hints at a bug in RCU usage in dm-crypt that
was introduced with commit c538f6ec9f ("dm crypt: add ability to use
keys from the kernel key retention service"). The kernel keyring
function user_key_payload() is in fact a wrapper for
rcu_dereference_protected() which must not be called with only
rcu_read_lock() section mark.
Unfortunately the kernel keyring subsystem doesn't currently provide
an interface that allows the use of an RCU read-side section. So for
now we must drop RCU in favour of rwsem until a proper function is
made available in the kernel keyring subsystem.
===============================
[ INFO: suspicious RCU usage. ]
4.10.0-rc5 #2 Not tainted
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by cryptsetup/6464:
#0: (&md->type_lock){+.+.+.}, at: [<ffffffffa02472a2>] dm_lock_md_type+0x12/0x20 [dm_mod]
#1: (rcu_read_lock){......}, at: [<ffffffffa02822f8>] crypt_set_key+0x1d8/0x4b0 [dm_crypt]
stack backtrace:
CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Call Trace:
dump_stack+0x67/0x92
lockdep_rcu_suspicious+0xc5/0x100
crypt_set_key+0x351/0x4b0 [dm_crypt]
? crypt_set_key+0x1d8/0x4b0 [dm_crypt]
crypt_ctr+0x341/0xa53 [dm_crypt]
dm_table_add_target+0x147/0x330 [dm_mod]
table_load+0x111/0x350 [dm_mod]
? retrieve_status+0x1c0/0x1c0 [dm_mod]
ctl_ioctl+0x1f5/0x510 [dm_mod]
dm_ctl_ioctl+0xe/0x20 [dm_mod]
do_vfs_ioctl+0x8e/0x690
? ____fput+0x9/0x10
? task_work_run+0x7e/0xa0
? trace_hardirqs_on_caller+0x122/0x1b0
SyS_ioctl+0x3c/0x70
entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7f392c9a4ec7
RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7
RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005
RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0
R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08
R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b
Fixes: c538f6ec9f ("dm crypt: add ability to use keys from the kernel key retention service")
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
being passed from dm_old_request_fn() that races with DM device
destruction.
Reported-by: artem@flashgrid.io
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
DM already calls blk_mq_alloc_request on the request_queue of the
underlying device if it is a blk-mq device. But now that we allow drivers
to allocate additional data and initialize it ahead of time we need to do
the same for all drivers. Doing so and using the new cmd_size
infrastructure in the block layer greatly simplifies the dm-rq and mpath
code, and should also make arbitrary combinations of SQ and MQ devices
with SQ or MQ device mapper tables easily possible as a further step.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
DM tries to copy a few fields around for BLOCK_PC requests, but given
that no dm-target ever wires up scsi_cmd_ioctl BLOCK_PC can't actually
be sent to dm.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Return an errno value instead of the passed in queue so that the callers
don't have to keep track of two queues, and move the assignment of the
request_fn and lock to the caller as passing them as argument doesn't
simplify anything. While we're at it also remove two pointless NULL
assignments, given that the request structure is zeroed on allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
No need for the local variables, the bio is still live and we can just
assign the bits we want directly. Make me wonder why we can't assign
all the bio flags to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This centralizes the checks for bios that needs to be go into the flush
state machine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For consistency, call read_disk_sb() from
attempt_restore_of_faulty_devices() instead
of calling sync_page_io() directly.
Explicitly set device to faulty on superblock read error.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add md raid4/5/6 journaling support (upstream commit bac624f3f8 started
the implementation) which closes the write hole (i.e. non-atomic updates
to stripes) using a dedicated journal device.
Background:
raid4/5/6 stripes hold N data payloads per stripe plus one parity raid4/5
or two raid6 P/Q syndrome payloads in an in-memory stripe cache.
Parity or P/Q syndromes used to recover any data payloads in case of a disk
failure are calculated from the N data payloads and need to be updated on the
different component devices of the raid device. Those are non-atomic,
persistent updates. Hence a crash can cause failure to update all stripe
payloads persistently and thus cause data loss during stripe recovery.
This problem gets addressed by writing whole stripe cache entries (together with
journal metadata) to a persistent journal entry on a dedicated journal device.
Only if that journal entry is written successfully, the stripe cache entry is
updated on the component devices of the raid device (i.e. writethrough type).
In case of a crash, the entry can be recovered from the journal and be written
again thus ensuring consistent stripe payload suitable to data recovery.
Future dependencies:
once writeback caching being worked on to compensate for the throughput
implictions involved with writethrough overhead is supported with journaling
in upstream, an additional patch based on this one will support it in dm-raid.
Journal resilience related remarks:
because stripes are recovered from the journal in case of a crash, the
journal device better be resilient. Resilience becomes mandatory with
future writeback support, because loosing the working set in the log
means data loss as oposed to writethrough, were the loss of the
journal device 'only' reintroduces the write hole.
Fix comment on data offsets in parse_dev_params() and initialize
new_data_offset as well.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
During raid set resize checks and setting up the recovery offset in case a raid
set grows, calculated rd->md.dev_sectors is compared to rs->dev[0].rdev.sectors.
Device 0 may not be defined in case userspace passes in '- -' for it
(lvm2 doesn't do that so far), thus it's device sectors can't be taken
authoritatively in this comparison and another valid device must be used
to retrieve the device size.
Use mddev->dev_sectors in checking for ongoing recovery for the same reason.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This fix addresses the following 3 failure scenarios:
1) If a (transiently) inaccessible metadata device is being passed into the
constructor (e.g. a device tuple '254:4 254:5'), it is processed as if
'- -' was given. This erroneously results in a status table line containing
'- -', which mistakenly differs from what has been passed in. As a result,
userspace libdevmapper puts the device tuple seperate from the RAID device
thus not processing the dependencies properly.
2) False health status char 'A' instead of 'D' is emitted on the status
status info line for the meta/data device tuple in this metadata device
failure case.
3) If the metadata device is accessible when passed into the constructor
but the data device (partially) isn't, that leg may be set faulty by the
raid personality on access to the (partially) unavailable leg. Restore
tried in a second raid device resume on such failed leg (status char 'D')
fails after the (partial) leg returned.
Fixes for aforementioned failure scenarios:
- don't release passed in devices in the constructor thus allowing the
status table line to e.g. contain '254:4 254:5' rather than '- -'
- emit device status char 'D' rather than 'A' for the device tuple
with the failed metadata device on the status info line
- when attempting to restore faulty devices in a second resume, allow the
device hot remove function to succeed by setting the device to not in-sync
In case userspace intentionally passes '- -' into the constructor to avoid that
device tuple (e.g. to split off a raid1 leg temporarily for later re-addition),
the status table line will correctly show '- -' and the status info line will
provide a '-' device health character for the non-defined device tuple.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.
In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
will submit async job r5c_disable_writeback_async to disable
writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
degraded mode;
3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
in write-through mode.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
For safer operation, all arrays start in write-through mode, which has been
better tested and is more mature. And actually the write-through/write-mode
isn't persistent after array restarted, so we always start array in
write-through mode. However, if recovery found data-only stripes before the
shutdown (from previous write-back mode), it is not safe to start the array in
write-through mode, as write-through mode can not handle stripes with data in
write-back cache. To solve this problem, we flush all data-only stripes in
r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
empty cache in write-through mode.
This logic is implemented in r5c_recovery_flush_data_only_stripes():
1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes get flushed (reuse wait_for_quiescent)
5. disable write back cache
The wait in 4 will be waked up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.
It is safe to wake up mddev->thread here because all the resource
required for the thread has been initialized.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
With write back cache, we use orig_page to do prexor. This patch
makes sure we read data into orig_page for it.
Flag R5_OrigPageUPTDODATE is added to show whether orig_page
has the latest data from raid disk.
We introduce a helper function uptodate_for_rmw() to simplify
the a couple conditions in handle_stripe_dirtying().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This is a nasty interface and setting the state of a foreign task must
not be done. As of the following commit:
be628be095 ("bcache: Make gc wakeup sane, remove set_task_state()")
... everyone in the kernel calls set_task_state() with current, allowing
the helper to be removed.
However, as the comment indicates, it is still around for those archs
where computing current is more expensive than using a pointer, at least
in theory. An important arch that is affected is arm64, however this has
been addressed now [1] and performance is up to par making no difference
with either calls.
Of all the callers, if any, it's the locking bits that would care most
about this -- ie: we end up passing a tsk pointer to a lot of the lock
slowpath, and setting ->state on that. The following numbers are based
on two tests: a custom ad-hoc microbenchmark that just measures
latencies (for ~65 million calls) between get_task_state() vs
get_current_state().
Secondly for a higher overview, an unlink microbenchmark was used,
which pounds on a single file with open, close,unlink combos with
increasing thread counts (up to 4x ncpus). While the workload is quite
unrealistic, it does contend a lot on the inode mutex or now rwsem.
[1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com
== 1. x86-64 ==
Avg runtime set_task_state(): 601 msecs
Avg runtime set_current_state(): 552 msecs
vanilla dirty
Hmean unlink1-processes-2 36089.26 ( 0.00%) 38977.33 ( 8.00%)
Hmean unlink1-processes-5 28555.01 ( 0.00%) 29832.55 ( 4.28%)
Hmean unlink1-processes-8 37323.75 ( 0.00%) 44974.57 ( 20.50%)
Hmean unlink1-processes-12 43571.88 ( 0.00%) 44283.01 ( 1.63%)
Hmean unlink1-processes-21 34431.52 ( 0.00%) 38284.45 ( 11.19%)
Hmean unlink1-processes-30 34813.26 ( 0.00%) 37975.17 ( 9.08%)
Hmean unlink1-processes-48 37048.90 ( 0.00%) 39862.78 ( 7.59%)
Hmean unlink1-processes-79 35630.01 ( 0.00%) 36855.30 ( 3.44%)
Hmean unlink1-processes-110 36115.85 ( 0.00%) 39843.91 ( 10.32%)
Hmean unlink1-processes-141 32546.96 ( 0.00%) 35418.52 ( 8.82%)
Hmean unlink1-processes-172 34674.79 ( 0.00%) 36899.21 ( 6.42%)
Hmean unlink1-processes-203 37303.11 ( 0.00%) 36393.04 ( -2.44%)
Hmean unlink1-processes-224 35712.13 ( 0.00%) 36685.96 ( 2.73%)
== 2. ppc64le ==
Avg runtime set_task_state(): 938 msecs
Avg runtime set_current_state: 940 msecs
vanilla dirty
Hmean unlink1-processes-2 19269.19 ( 0.00%) 30704.50 ( 59.35%)
Hmean unlink1-processes-5 20106.15 ( 0.00%) 21804.15 ( 8.45%)
Hmean unlink1-processes-8 17496.97 ( 0.00%) 17243.28 ( -1.45%)
Hmean unlink1-processes-12 14224.15 ( 0.00%) 17240.21 ( 21.20%)
Hmean unlink1-processes-21 14155.66 ( 0.00%) 15681.23 ( 10.78%)
Hmean unlink1-processes-30 14450.70 ( 0.00%) 15995.83 ( 10.69%)
Hmean unlink1-processes-48 16945.57 ( 0.00%) 16370.42 ( -3.39%)
Hmean unlink1-processes-79 15788.39 ( 0.00%) 14639.27 ( -7.28%)
Hmean unlink1-processes-110 14268.48 ( 0.00%) 14377.40 ( 0.76%)
Hmean unlink1-processes-141 14023.65 ( 0.00%) 16271.69 ( 16.03%)
Hmean unlink1-processes-172 13417.62 ( 0.00%) 16067.55 ( 19.75%)
Hmean unlink1-processes-203 15293.08 ( 0.00%) 15440.40 ( 0.96%)
Hmean unlink1-processes-234 13719.32 ( 0.00%) 16190.74 ( 18.01%)
Hmean unlink1-processes-265 16400.97 ( 0.00%) 16115.22 ( -1.74%)
Hmean unlink1-processes-296 14388.60 ( 0.00%) 16216.13 ( 12.70%)
Hmean unlink1-processes-320 15771.85 ( 0.00%) 15905.96 ( 0.85%)
x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
and ppc64 (with paca) show similar improvements in the unlink microbenches.
The small delta for ppc64 (2ms), does not represent the gains on the unlink
runs. In the case of x86, there was a decent amount of variation in the
latency runs, but always within a 20 to 50ms increase), ppc was more constant.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This fixes a build error on certain architectures, such as ppc64.
Fixes: 6995f0b247e("md: takeover should clear unrelated bits")
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Trivial fix to spelling mistake "recoverying" to "recovering" in
pr_dbg message.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
r5l_load_log() calls functions that requires a proper conf->log,
for example, r5c_is_writeback(). Therefore, we should set
conf->log before calling r5l_load_log(). If r5l_load_log() fails,
conf->log is set back to NULL.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We only need to update sh->log_start at the end of recovery,
which is r5c_recovery_rewrite_data_only_stripes(), so it is not
necessary to set it before that. In this patch, log_start is
removed from r5c_recovery_alloc_stripe().
After updating all sh->log_start, rewrite_data_only_stripes()
also updates log->next_checkpoints to the last sh->log_start.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>