When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with
kasan, we got failure in write_page.
The reason for the failure is that md_bitmap_destroy is called before
destroying the thread and the thread may be waiting in the function
write_page for the bio to complete. When the thread finishes waiting, it
executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which
triggers the kasan warning.
Note that the commit 48df498daf that caused this bug claims that it is
neede for md-cluster, you should check md-cluster and possibly find
another bugfix for it.
BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod]
Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539
CPU: 10 PID: 5539 Comm: mdX_raid1 Not tainted 5.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0x45/0x57a
? __lock_text_start+0x18/0x18
? write_page+0x18d/0x680 [md_mod]
kasan_report+0xa8/0xe0
? write_page+0x18d/0x680 [md_mod]
kasan_check_range+0x13f/0x180
write_page+0x18d/0x680 [md_mod]
? super_sync+0x4d5/0x560 [dm_raid]
? md_bitmap_file_kick+0xa0/0xa0 [md_mod]
? rs_set_dev_and_array_sectors+0x2e0/0x2e0 [dm_raid]
? mutex_trylock+0x120/0x120
? preempt_count_add+0x6b/0xc0
? preempt_count_sub+0xf/0xc0
md_update_sb+0x707/0xe40 [md_mod]
md_reap_sync_thread+0x1b2/0x4a0 [md_mod]
md_check_recovery+0x533/0x960 [md_mod]
raid1d+0xc8/0x2a20 [raid1]
? var_wake_function+0xe0/0xe0
? psi_group_change+0x411/0x500
? preempt_count_sub+0xf/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? raid1_end_read_request+0x2a0/0x2a0 [raid1]
? preempt_count_sub+0xf/0xc0
? _raw_spin_unlock_irqrestore+0x19/0x40
? del_timer_sync+0xa9/0x100
? try_to_del_timer_sync+0xc0/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? __list_del_entry_valid+0x68/0xa0
? finish_wait+0xa3/0x100
md_thread+0x161/0x260 [md_mod]
? unregister_md_personality+0xa0/0xa0 [md_mod]
? _raw_spin_lock_irqsave+0x78/0xc0
? prepare_to_wait_event+0x2c0/0x2c0
? unregister_md_personality+0xa0/0xa0 [md_mod]
kthread+0x148/0x180
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 5522:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x80/0xa0
md_bitmap_create+0xa8/0xe80 [md_mod]
md_run+0x777/0x1300 [md_mod]
raid_ctr+0x249c/0x4a30 [dm_raid]
dm_table_add_target+0x2b0/0x620 [dm_mod]
table_load+0x1c8/0x400 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 5680:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x40
kasan_set_free_info+0x20/0x40
__kasan_slab_free+0xf7/0x140
kfree+0x80/0x240
md_bitmap_free+0x1c3/0x280 [md_mod]
__md_stop+0x21/0x120 [md_mod]
md_stop+0x9/0x40 [md_mod]
raid_dtr+0x1b/0x40 [dm_raid]
dm_table_destroy+0x98/0x1e0 [dm_mod]
__dm_destroy+0x199/0x360 [dm_mod]
dev_remove+0x10c/0x160 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 48df498daf ("md: move bitmap_destroy to the beginning of __md_stop")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Two callers of md_alloc want to use the newly allocated devices, so
return it instead of letting them find it cumbersomely after the
allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
autorun_devices should not be limited to the controls for the legacy
probe on open, so just call md_alloc directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eliminate the following coccicheck warning:
./drivers/md/md.c:8208:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After merging the block tree, today's linux-next build (x86_64
allmodconfig) failed like this:
drivers/md/md.c:717:22: error: 'mddev_find' defined but not used [-Werror=unused-function]
717 | static struct mddev *mddev_find(dev_t unit)
| ^~~~~~~~~~
cc1: all warnings being treated as errors
Caused by commit
4500d5c17910 ("md: simplify md_open")
Make mddev_find() available only for non-modular builds.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220721131132.070be166@canb.auug.org.au
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'first' will always be greater than or equal to 0, it is unnecessary to
repeat the 0 check, clean it up.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that devices are on the all_mddevs list until the gendisk is freed,
there can't be any duplicates. Remove the global list lookup and just
grab a reference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This ensures device names don't get prematurely reused. Instead add a
deleted flag to skip already deleted devices in mddev_get and other
places that only want to see live mddevs.
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock and delete the now unused for_each_mddev
macro.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a plain list_for_each that only grabs a mddev reference in
the case where the thread sleeps and restarts the list iteration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This splits the code into nicely readable chunks and also avoids
the refcount inc/dec manipulations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The md_free name is rather misleading, so pick a better one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that all private data is only freed once all accesses are done.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Error handling in md_alloc is a mess. Untangle it to just free the mddev
directly before add_disk is called and thus the gendisk is globally
visible. After that clear the hold flag and let the mddev_put take care
of cleaning up the mddev through the usual mechanisms.
Fixes: 5e55e2f5fc ("[PATCH] md: convert compile time warnings into runtime warnings")
Fixes: 9be68dd7ac ("md: add error handling support for add_disk()")
Fixes: 7ad1069166 ("md: properly unwind when failing to add the kobject in md_alloc")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once a kobject is initialized, the containing object should not be
directly freed. So delay initialization until it is added. Also
remove the kobject_del call as the last put will remove the kobject as
well. The explicitly delete isn't needed here, and dropping it will
simplify further fixes.
With this md_free now does not need to check that ->gendisk is non-NULL
as it is always set by the time that kobject_init is called on
mddev->kobj.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
raid5_get_active_stripe() can sleep in various situations and it
is called by make_stripe_request() while inside the
prepare_to_wait()/finish_wait() section. Nested waits like this are
not supported.
This was noticed while making other changes that add different sleeps
to raid5_get_active_stripe() that caused a WARNING with
CONFIG_DEBUG_ATOMIC_SLEEP.
No ill effects have been noticed with the code as is, but theoretically
a nested and here could cause a dead lock so it should be fixed.
To fix this, convert the prepare_to_wait() call to use wake_woken()
which supports nested sleeps.
Link: https://lwn.net/Articles/628628/
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For unaligned IO that have nearly maximum sectors, the number of stripes
will end up being one greater than the size of the bitmap. When this
happens, the last stripe in the IO will not be processed as it should
be, resulting in data corruption.
However, this is not normally seen when the backing block devices have
4K physical block sizes since the block layer will split the request
before that happens.
To fix this increase the bitmap size by one bit and ensure the full
number of stripes are checked when calling find_first_bit().
Reported-by: David Sloan <David.Sloan@eideticom.com>
Fixes: 7e55c60acf ("md/raid5: Pivot raid5_make_request()")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The "Asynchronous device registration (EXPERIMENTAL)" Kconfig option is
for 2+ years, it is used when registration takes too much time for
massive amount of cached data, to avoid udev task timeout during boot
time.
Many users and products enable this Kconfig option for quite long time
(e.g. SUSE Linux) and it works as expected and no issue reported.
It is time to remove the "EXPERIMENTAL" tag from this Kconfig item.
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220719042724.8498-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are 2 spelling mistakes in comments. Fix it.
Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer defaults the maximum segments to 128, which means
requests tend to get split around the 512KB depending on how many
pages can be merged. There's no such restriction in the raid5 code
so increase the limit to USHRT_MAX so that larger requests can be
sent as one.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a debug print for raid5_make_request() so that each request is
printed and add the logical sector number to the debug print in
__add_stripe_bio().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page in the
disk.
This causes a great deal of contention on the hash_lock and extra
work seeing each stripe must be found once for every data disk.
The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.
To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.
Finding the next sector to be done is then just a call to
find_first_bit() so that sectors that have been done can simply be
skipped.
One minor downside is that the maximum sectors for a request must be
limited so that the bitmap can be appropriately sized on the stack.
This limit is arbitrarily chosen to be 256 stripe pages which works out
to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually
restrict the maximum request further seeing the default block queue
settings are used which restricts the number of segments to 128 (which
results in request sizes that are approximately 512KB).
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subesquent
patch.
To do this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.
This will be useful in the next patch so that overlap can be checked for
multiple disks before adding any
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.
Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.
The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The for loop with retry label can be more cleanly expressed as a while
loop by moving the logical_sector increment into the success path.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that prepare_to_wait() isn't in the way, move read_sequcount_begin()
into make_stripe_request().
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
prepare_to_wait() can be reasonably called after schedule instead of
setting a flag and preparing in the next loop iteration.
This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that the order
matters here. On the first iteration of the loop prepare_to_wait() is
already called first.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out the inner loop of raid5_make_request() into it's own helper
called make_stripe_request().
The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.
A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.
No functional changes intended. This will be cleaned up further in
subsequent patches to untangle the gen_lock and do_prepare logic
further.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both uses of find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.
The call to add_stripe_bio() in retry_aligned_read() is for read
and batching only applies to write. So it's impossible for batching
to happen at that call site.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().
This makes the code easier to read and reduces the indentation level.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a few uses of an ugly ternary operator in raid5_make_request()
to check if a sector is a head of a reshape sector.
Factor this out into a simple helper called ahead_of_reshape().
No functional changes intended.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The check in raid5_make_request differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug
as the check is fuzzy anyway (as reshape may move on between the first
check and the subsequent check). However, make it consistent so it can
be cleaned up in a subsequent patch.
The condition which causes the schedule is:
!(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
logical_sector >= conf->reshape_progress) &&
(mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
logical_sector >= conf->reshape_safe)
The condition that causes the early bailout is made to match this.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the bug which commit 8b48ec23cc ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related with action_store path, other
callers which reap sync_thread didn't need to be changed.
Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous
bug with belows.
1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure position
not changed accidentally by others.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.
Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.
Signed-off-by: Chris Webb <chris@arachsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)
When the background mdadm process hangs it, is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".
To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:
mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
File exists
This is because the background mdadm process, which is responsible for
cleaning up this backup file gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file if a reshape is
going and, if it is not, it waits for an event on the file or times
out in 5 seconds. On faster machines, the reshape may complete before
the 5 seconds times out, and thus the background mdadm process loops
waiting for a reshape to start that has already occurred.
mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.
The reason mdstat reports the reshape has started is due to an issue
in status_resync(). recovery_active is subtracted from curr_resync which
will result in a value of zero for the first chunk of reshaped data, and
the resulting read will report no reshape in progress.
To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.
The only functional change is a couple places use the wrong comparison
operator that implied 3 is another special value. They are all
fixed to imply that 3 or greater is an active resync.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the
slot returned and looked up to be marked with __rcu. Otherwise
sparse warnings are generated:
drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
assignment (different address spaces)
drivers/md/raid5-cache.c:2939:23: expected void **pslot
drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu **
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A NULL pointer dereferlence on conf->log is seen randomly with
the mdadm test 21raid5cache. Kasan reporst:
BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086
Call Trace:
dump_stack_lvl+0x5a/0x74
kasan_report.cold+0x5f/0x1a9
__asan_load8+0x69/0x90
r5l_reclaimable_space+0xf5/0x140
r5l_do_reclaim+0xf4/0x5e0
r5l_reclaim_thread+0x69/0x3b0
md_thread+0x1a2/0x2c0
kthread+0x177/0x1b0
ret_from_fork+0x22/0x30
This is caused by conf->log being cleared in r5l_exit_log() before
stopping the reclaim thread.
To fix this, clear conf->log after the reclaim_thread is unregistered
and after flushing disable_writeback_work.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().
It is known that the IO will be suspended before the log is freed and
r5l_log_exit() is called with the mddev_lock() held.
This should mean that conf->log can not be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mddev->lock spinlock doesn't protect against the removal of
conf->log in r5l_exit_log() so conf->log may be freed before it
is used.
To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The raid5-cache code relies on there being no IO in flight when
log_exit() is called. There are two places where this is not
guaranteed so add mddev_suspend() and mddev_resume() calls to these
sites.
The site in raid5_change_consistency_policy() is in the error path,
and another similar call site already has suspend/resume calls just
below it; so it should be equally safe to make that change here.
There is one remaining site in raid5_remove_disk() that we call log_exit()
without suspending the array. Unfortunately, as the comment stated, we
cannot call mddev_suspend from raid5d.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ppl_handle_flush_request() takes an struct r5log argument but doesn't
use it. It has no buisiness taking this argument as it is only used
by raid5-cache and has no way to derference it anyway. Remove
the argument.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
extern is not necessary and recommended against when defining prototype
functions in headers. checkpatch.pl complains about these. So remove
them.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
being split acorss files.
- Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling.
- Optimize DM core's more common bio splitting by eliminating the use
of bio cloning with bio_split+bio_chain. Shift that cloning cost to
the relatively unlikely dm_io requeue case that only occurs during
error handling. Introduces dm_io_rewind() that will clone a bio that
reflects the subset of the original bio that must be requeued.
- Remove DM core's dm_table_get_num_targets() wrapper and audit all
dm_table_get_target() callers.
- Fix potential for OOM with DM writecache target by setting a default
MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory,
whichever is smaller).
- Fix DM writecache target's stats that are reported through
DM-specific table info.
- Fix use-after-free crash in dm_sm_register_threshold_callback().
- Refine DM core's Persistent Reservation handling in preparation for
broader work Mike Christie is doing to add compatibility with
Microsoft Windows Failover Cluster.
- Fix various KASAN reported bugs in the DM raid target.
- Fix DM raid target crash due to md_handle_request() bio splitting
that recurses to block core without properly initializing the bio's
bi_dev.
- Fix some code comment typos and fix some Documentation formatting.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmLoAAUACgkQxSPxCi2d
A1rFUAf/RnLkzNPS1QJ1uVbiiW64zQUD2o5U8kAliOZxoXQ45U+CgL22a5VaT2R+
rI9+YSg/VX9YkdJwB6I+y7VRWjUCO3NfYOfQUX5NS8GfcPQQn2Zp+jy3t8VnjjXN
5+8Ylu5tX1+QSQiBB9JvIgiK71rSorT6H+KF6bZypL7te5kj39hDaRbmh40VOMOY
QWfs8DmyJft9IHPG7Mku/rFJbdVJph3f31IoqUOoGTOKoDPDUDezhCVHi2vpFcV+
S+iCKkeXmP6eUdCnNzBreYPTcP9OtCvkZEPhFg4lh+jyhjyFq1o0B8G5cRo5jvgr
GcN1GUT040sSqNXFquA4UU6RGvaxlg==
=gBRJ
-----END PGP SIGNATURE-----
Merge tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Refactor DM core's mempool allocation so that it clearer by not being
split acorss files.
- Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling.
- Optimize DM core's more common bio splitting by eliminating the use
of bio cloning with bio_split+bio_chain. Shift that cloning cost to
the relatively unlikely dm_io requeue case that only occurs during
error handling. Introduces dm_io_rewind() that will clone a bio that
reflects the subset of the original bio that must be requeued.
- Remove DM core's dm_table_get_num_targets() wrapper and audit all
dm_table_get_target() callers.
- Fix potential for OOM with DM writecache target by setting a default
MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory,
whichever is smaller).
- Fix DM writecache target's stats that are reported through
DM-specific table info.
- Fix use-after-free crash in dm_sm_register_threshold_callback().
- Refine DM core's Persistent Reservation handling in preparation for
broader work Mike Christie is doing to add compatibility with
Microsoft Windows Failover Cluster.
- Fix various KASAN reported bugs in the DM raid target.
- Fix DM raid target crash due to md_handle_request() bio splitting
that recurses to block core without properly initializing the bio's
bi_dev.
- Fix some code comment typos and fix some Documentation formatting.
* tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
dm: fix dm-raid crash if md_handle_request() splits bio
dm raid: fix address sanitizer warning in raid_resume
dm raid: fix address sanitizer warning in raid_status
dm: Start pr_preempt from the same starting path
dm: Fix PR release handling for non All Registrants
dm: Start pr_reserve from the same starting path
dm: Allow dm_call_pr to be used for path searches
dm: return early from dm_pr_call() if DM device is suspended
dm thin: fix use-after-free crash in dm_sm_register_threshold_callback
dm writecache: count number of blocks discarded, not number of discard bios
dm writecache: count number of blocks written, not number of write bios
dm writecache: count number of blocks read, not number of read bios
dm writecache: return void from functions
dm kcopyd: use __GFP_HIGHMEM when allocating pages
dm writecache: set a default MAX_WRITEBACK_JOBS
Documentation: dm writecache: Render status list as list
Documentation: dm writecache: add blank line before optional parameters
dm snapshot: fix typo in snapshot_map() comment
dm raid: remove redundant "the" in parse_raid_params() comment
dm cache: fix typo in 2 comment blocks
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
4N1HEozvIw==
=celk
-----END PGP SIGNATURE-----
Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queues always picks the right queue
(Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset aligment for O_DIRECT
(Keith)
- Add new ublk driver, userspace block driver using io_uring for
communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices.
This turns the usual two step process into something that is simpler
to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
Ming, Sebastian, Yang, Ying)
* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
ublk_drv: fix double shift bug
ublk_drv: make sure that correct flags(features) returned to userspace
ublk_drv: fix error handling of ublk_add_dev
ublk_drv: fix lockdep warning
block: remove __blk_get_queue
block: call blk_mq_exit_queue from disk_release for never added disks
blk-mq: fix error handling in __blk_mq_alloc_disk
ublk: defer disk allocation
ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
ublk: cleanup ublk_ctrl_uring_cmd
ublk: simplify ublk_ch_open and ublk_ch_release
ublk: remove the empty open and release block device operations
ublk: remove UBLK_IO_F_PREFLUSH
ublk: add a MAINTAINERS entry
block: don't allow the same type rq_qos add more than once
mmc: fix disk/queue leak in case of adding disk failure
ublk_drv: fix an IS_ERR() vs NULL check
ublk: remove UBLK_IO_F_INTEGRITY
ublk_drv: remove unneeded semicolon
...
Commit ca522482e3 ("dm: pass NULL bdev to bio_alloc_clone")
introduced the optimization to _not_ perform bio_associate_blkg()'s
relatively costly work when DM core clones its bio. But in doing so it
exposed the possibility for DM's cloned bio to alter DM target
behavior (e.g. crash) if a target were to issue IO without first
calling bio_set_dev().
The DM raid target can trigger an MD crash due to its need to split
the DM bio that is passed to md_handle_request(). The split will
recurse to submit_bio_noacct() using a bio with an uninitialized
->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to
dereference a NULL blkg_to_tg(bio->bi_blkg).
Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that
will make alloc_tio() call bio_set_dev() on behalf of the target.
dm-raid is the only target that requires this flag. bio_set_dev()
initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg,
before passing the bio to md_handle_request().
Long-term fix would be to audit and refactor MD code to rely on DM to
split its bio, using dm_accept_partial_bio(), but there are MD raid
personalities (e.g. raid1 and raid10) whose implementation are tightly
coupled to handling the bio splitting inline.
Fixes: ca522482e3 ("dm: pass NULL bdev to bio_alloc_clone")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>