Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to
byte-by-byte access.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove this extra BUG_ON() that calls node_check() -- which avoids extra crc checking.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Both sm_disk_new_block and sm_disk_commit are needlessly calling
sm_disk_get_nr_free(). Looks like old queries used for some
debugging.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NULL pointer dereference was observed in super_written() when it tries
to access the mddev structure.
[The below stack trace is from an older kernel, but the problem described
in this patch applies to the mainline kernel.]
[ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
[ 1194.488000] RIP: 0010:super_written+0x29/0xe1
[ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
[ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
[ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
[ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
[ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
[ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1194.588117] FS: 0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
[ 1194.604264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
[ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1194.663316] PKRU: 55555554
[ 1194.674090] Call Trace:
[ 1194.683735] <IRQ>
[ 1194.692948] bio_endio+0xae/0x135
[ 1194.703580] blk_update_request+0xad/0x2fa
[ 1194.714990] blk_update_bidi_request+0x20/0x72
[ 1194.726578] __blk_end_bidi_request+0x2c/0x4d
[ 1194.738373] __blk_end_request_all+0x31/0x49
[ 1194.749344] blk_flush_complete_seq+0x377/0x383
[ 1194.761550] flush_end_io+0x1dd/0x2a7
[ 1194.772910] blk_finish_request+0x9f/0x13c
[ 1194.784544] scsi_end_request+0x180/0x25c
[ 1194.796149] scsi_io_completion+0xc8/0x610
[ 1194.807503] scsi_finish_command+0xdc/0x125
[ 1194.818897] scsi_softirq_done+0x81/0xde
[ 1194.830062] blk_done_softirq+0xa4/0xcc
[ 1194.841008] __do_softirq+0xd9/0x29f
[ 1194.851257] irq_exit+0xe6/0xeb
[ 1194.861290] do_IRQ+0x59/0xe3
[ 1194.871060] common_interrupt+0x1c6/0x382
[ 1194.881988] </IRQ>
[ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
[ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
[ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
[ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
[ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
[ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
[ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
[ 1194.988607] ? cpuidle_enter_state+0xcc/0x2a5
[ 1194.999077] cpuidle_enter+0x17/0x19
[ 1195.008395] call_cpuidle+0x23/0x3a
[ 1195.017718] do_idle+0x172/0x1d5
[ 1195.026358] cpu_startup_entry+0x73/0x75
[ 1195.035769] start_secondary+0x1b9/0x20b
[ 1195.044894] secondary_startup_64+0xa5/0xa5
[ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
[ 1195.096354] CR2: 00000000000002b8
bio in the above stack is a bitmap write whose completion is invoked after
the tear down sequence sets the mddev structure to NULL in rdev.
During tear down, there is an attempt to flush the bitmap writes, but for
external bitmaps, there is no explicit wait for all the bitmap writes to
complete. For instance, md_bitmap_flush() is called to flush the bitmap
writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush()
could generate new bitmap writes for which there is no explicit wait to
complete those writes. The call to md_bitmap_update_sb() will return
simply for external bitmaps and the follow-up call to md_update_sb() is
conditional and may not get called for external bitmaps. This results in a
kernel panic when the completion routine, super_written() is called which
tries to reference mddev in the rdev that has been set to
NULL(in unbind_rdev_from_array() by tear down sequence).
The solution is to call md_super_wait() for external bitmaps after the
last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there
are no pending bitmap writes before proceeding with the tear down.
Cc: stable@vger.kernel.org
Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
Reviewed-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Instead of returning an existing mddev, just for it to be discarded
later directly return -EEXIST. Rename the function to mddev_alloc now
that it doesn't find an existing mddev.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Allocate the new mddev first speculatively, which greatly simplifies
the code flow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Split out a self contained helper to find a free minor for the md
"unit" number.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
commit df7b59ba92 ("dm verity: fix FEC for RS roots unaligned to
block size") introduced the possibility for misaligned roots IO
relative to the underlying device's logical block size. E.g. Android's
default RS roots=2 results in dm_bufio->block_size=1024, which causes
the following EIO if the logical block size of the device is 4096,
given v->data_dev_block_bits=12:
E sd 0 : 0:0:0: [sda] tag#30 request not aligned to the logical block size
E blk_update_request: I/O error, dev sda, sector 10368424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
E device-mapper: verity-fec: 254:8: FEC 9244672: parity read failed (block 18056): -5
Fix this by onlu using f->roots for dm_bufio blocksize IFF it is
aligned to v->data_dev_block_bits.
Fixes: df7b59ba92 ("dm verity: fix FEC for RS roots unaligned to block size")
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The patch "bcache: remove PTR_CACHE" introduces a compiling failure in
debug.c with following error message,
In file included from drivers/md/bcache/bcache.h:182:0,
from drivers/md/bcache/debug.c:9:
drivers/md/bcache/debug.c: In function 'bch_btree_verify':
drivers/md/bcache/debug.c:53:19: error: 'c' undeclared (first use in
this function)
bio_set_dev(bio, c->cache->bdev);
^
This patch fixes the regression by replacing c->cache->bdev by b->c->
cache->bdev.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210411134316.80274-8-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cast multiple variables to (int64_t) in order to give the compiler
complete information about the proper arithmetic to use. Notice that
these variables are being used in contexts that expect expressions of
type int64_t (64 bit, signed). And currently, such expressions are
being evaluated using 32-bit arithmetic.
Fixes: d0cf9503e9 ("octeontx2-pf: ethtool fec mode support")
Addresses-Coverity-ID: 1501724 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1501725 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1501726 ("Unintentional integer overflow")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
building with 'make W=1' shows a harmless warning for each user of the
EBUG_ON() macro:
drivers/md/bcache/bset.c: In function 'bch_btree_sort_partial':
drivers/md/bcache/util.h:30:55: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
30 | #define EBUG_ON(cond) do { if (cond); } while (0)
| ^
drivers/md/bcache/bset.c:1312:9: note: in expansion of macro 'EBUG_ON'
1312 | EBUG_ON(oldsize >= 0 && bch_count_data(b) != oldsize);
| ^~~~~~~
Reword the macro slightly to avoid the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes the following sparse warnings:
drivers/md/bcache/features.c:22:16: warning: Using plain integer as NULL
pointer
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the PTR_CACHE inline and replace it with a direct dereference
of c->cache.
(Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bch_cached_dev_run(), free(env[1])|free(env[2])|free(buf)
show up three times. This patch introduce out tag in
which free(env[1])|free(env[2])|free(buf) are only called
one time. If we need to call free() when errors occur,
we can set error code to ret, and then goto out tag directly.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.
Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with find or allocating a mddev.
This turns out to fix this bug reported by Zhao Heming.
----------------------------- snip ------------------------------
commit d3374825ce ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.
*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt
*** script ***
about trigger 1 time with 10 tests
`1 node1="15sp3-mdcluster1"
2 node2="15sp3-mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..100}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
`
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.
*** stack ***
`ID: 2831 TASK: ffff8dd7223b5040 CPU: 0 COMMAND: "mdadm"
#0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
#1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
#2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
#3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
#4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
#5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
#6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
#7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
#8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
#9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c
or:
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
`
*** rootcause ***
"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.
The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.
`<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening
<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
`
In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.
In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.
------------------------------ snip ------------------------------
Cc: stable@vger.kernel.org
Fixes: d3374825ce ("md: make devices disappear when they are no longer needed.")
Reported-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".
Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
commit d3374825ce ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.
This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.
This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find & mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).
For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.
*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt
*** script ***
about trigger every time with below script
```
1 node1="mdcluster1"
2 node2="mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..10}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
```
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.
*** stack ***
```
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
*** rootcause ***
"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.
The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.
```
<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening
<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
```
In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.
In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Add a new flag "reset_recalculate" that will restart recalculating
from the beginning of the device. It can be used if we want to change
the hash function. Example:
dmsetup remove_all
rmmod brd
set -e
modprobe brd rd_size=1048576
dmsetup create in --table '0 2000000 integrity /dev/ram0 0 16 J 2 internal_hash:sha256 recalculate'
sleep 10
dmsetup status
dmsetup remove in
dmsetup create in --table '0 2000000 integrity /dev/ram0 0 16 J 2 internal_hash:sha3-256 reset_recalculate'
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since commit ff9ea32381 ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since commit ff9ea32381 ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
These are only used by DM core, DM target modules should only use
dm_{get,put}_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If we set non-empty param->name or param->uuid on the DM_LIST_DEVICES_CMD
ioctl, the set values are considered filter prefixes. The ioctl will only
return entries with matching name or uuid prefix.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When LVM needs to find a device with a particular UUID it needs to ask for
UUID for each device. This patch returns UUID directly in the list of
devices, so that LVM doesn't have to query all the devices with an ioctl.
The UUID is returned if the flag DM_UUID_FLAG is set in the parameters.
Returning UUID is done in backward-compatible way. There's one unused
32-bit word value after the event number. This patch sets the bit
DM_NAME_LIST_FLAG_HAS_UUID if UUID is present and
DM_NAME_LIST_FLAG_DOESNT_HAVE_UUID if it isn't (if none of these bits is
set, then we have an old kernel that doesn't support returning UUIDs). The
UUID is stored after this word. The 'next' value is updated to point after
the UUID, so that old version of libdevmapper will skip the UUID without
attempting to interpret it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For high numbers of DM devices the 64-entry hash table has non-trivial
overhead. Fix this by replacing the hash table with a red-black tree.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If more than one one handling mode is requested during DM verity table
load, the last requested mode will be used.
Change this to impose more strict checking so that the table load will
fail if more than one error handling mode is requested.
Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove useless "while" loop. If the condition ci.sector_count && !error is
true, we go to a branch that ends with "break". If this condition is
false, the "while" loop will not be executed again. So, the loop can't be
executed more than once.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Zero-length and one-element arrays are deprecated, see
Documentation/process/deprecated.rst
Flexible-array members should be used instead.
Generated by: scripts/coccinelle/misc/flexible_array.cocci
CC: Denis Efremov <efremov@linux.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If there are not any dm devices, we need to zero the "dev" argument in
the first structure dm_name_list. However, this can cause out of
bounds write, because the "needed" variable is zero and len may be
less than eight.
Fix this bug by reporting DM_BUFFER_FULL_FLAG if the result buffer is
too small to hold the "nl->dev" value.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.
So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
like: unused devices: <none>
Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
For far layout, the discard region is not continuous on disks. So it needs
far copies r10bio to cover all regions. It needs a way to know all r10bios
have finish or not. Similar with raid10_sync_request, only the first r10bio
master_bio records the discard bio. Other r10bios master_bio record the
first r10bio. The first r10bio can finish after other r10bios finish and
then return the discard bio.
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Now the discard request is split by chunk size. So it takes a long time
to finish mkfs on disks which support discard function. This patch improve
handling raid10 discard request. It uses the similar way with patch
29efc390b (md/md0: optimize raid0 discard handling).
But it's a little complex than raid0. Because raid10 has different layout.
If raid10 is offset layout and the discard request is smaller than stripe
size. There are some holes when we submit discard bio to underlayer disks.
For example: five disks (disk1 - disk5)
D01 D02 D03 D04 D05
D05 D01 D02 D03 D04
D06 D07 D08 D09 D10
D10 D06 D07 D08 D09
The discard bio just wants to discard from D03 to D10. For disk3, there is
a hole between D03 and D08. For disk4, there is a hole between D04 and D09.
D03 is a chunk, raid10_write_request can handle one chunk perfectly. So
the part that is not aligned with stripe size is still handled by
raid10_write_request.
If reshape is running when discard bio comes and the discard bio spans the
reshape position, raid10_write_request is responsible to handle this
discard bio.
I did a test with this patch set.
Without patch:
time mkfs.xfs /dev/md0
real4m39.775s
user0m0.000s
sys0m0.298s
With patch:
time mkfs.xfs /dev/md0
real0m0.105s
user0m0.000s
sys0m0.007s
nvme3n1 259:1 0 477G 0 disk
└─nvme3n1p1 259:10 0 50G 0 part
nvme4n1 259:2 0 477G 0 disk
└─nvme4n1p1 259:11 0 50G 0 part
nvme5n1 259:6 0 477G 0 disk
└─nvme5n1p1 259:12 0 50G 0 part
nvme2n1 259:9 0 477G 0 disk
└─nvme2n1p1 259:15 0 50G 0 part
nvme0n1 259:13 0 477G 0 disk
└─nvme0n1p1 259:14 0 50G 0 part
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The following patch will reuse these logics, so pull the same codes into
one function.
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Now it allocs r10bio->devs[conf->copies]. Discard bio needs to submit
to all member disks and it needs to use r10bio. So extend to
r10bio->devs[geo.raid_disks].
Reviewed-by: Coly Li <colyli@suse.de>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
When a DM device is first created it doesn't yet have an established
capacity, therefore the use of set_capacity_and_notify() should be
conditional given the potential for needless pr_info "detected
capacity change" noise even if capacity is 0.
One could argue that the pr_info() in set_capacity_and_notify() is
misplaced, but that position is not held uniformly.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: f64d9b2eac ("dm: use set_capacity_and_notify")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 24f6b6036c ("dm table: fix zoned iterate_devices based device
capability checks") triggered dm table load failure when dm-zoned device
is set up for zoned block devices and a regular device for cache.
The commit inverted logic of two callback functions for iterate_devices:
device_is_zoned_model() and device_matches_zone_sectors(). The logic of
device_is_zoned_model() was inverted then all destination devices of all
targets in dm table are required to have the expected zoned model. This
is fine for dm-linear, dm-flakey and dm-crypt on zoned block devices
since each target has only one destination device. However, this results
in failure for dm-zoned with regular cache device since that target has
both regular block device and zoned block devices.
As for device_matches_zone_sectors(), the commit inverted the logic to
require all zoned block devices in each target have the specified
zone_sectors. This check also fails for regular block device which does
not have zones.
To avoid the check failures, fix the zone model check and the zone
sectors check. For zone model check, introduce the new feature flag
DM_TARGET_MIXED_ZONED_MODEL, and set it to dm-zoned target. When the
target has this flag, allow it to have destination devices with any
zoned model. For zone sectors check, skip the check if the destination
device is not a zoned block device. Also add comments and improve an
error message to clarify expectations to the two checks.
Fixes: 24f6b6036c ("dm table: fix zoned iterate_devices based device capability checks")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Three optional parameters must be accepted at once in a DM verity table, e.g.:
(verity_error_handling_mode) (ignore_zero_block) (check_at_most_once)
Fix this to be possible by incrementing DM_VERITY_OPTS_MAX.
Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
Fixes: 843f38d382 ("dm verity: add 'check_at_most_once' option to only validate hashes once")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBLzKsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi0ID/9djN1db0OrAjQgWdOQsKwzcPG4fmVRHJAu
Zi8SPRj0ByonWGaPWjiSi297/j00dfYFFIXaB1Pfo4j0wX0IK8bJINl0G8SN6Dag
WYBBrT/5rCQgD8fjQ1XhuzuqLwxwcZfYXAnCAlqABG18nPk532D4dX2CMEasl8F7
XWTTj5PqHDN4bCcriH1GEA5S+2nmoz5YXjNZEDcY3/pQMdyb8Jo9mRfZubkrnRxK
c9fz2LjUz0IRaSb+9PILY5qDLOSIh+vHOIk/3BKW9DoqU/S3kTTr4twqnOclfVPH
VgJM9b+sHveVCztCJ9bnNGkW7HWjUQa8gb/B40NBxKEhw7w/HCjykhhxd+QTUQTM
GJVMRGYWhzuUEuU1M1hArPua0GLmPKSvC0CRgbKRmgPNjshTquZPJnBBFwv2wZKQ
GkrwktdK9ihE1ya4gu20MupST3PIpT3jtc6NAizr6DCy0wJ0Z1X5KYnFdbtS79No
I9qPC8lu3AcZq6NXdBfTO9ngIdiUwi9AfSYj7koS/4dmnVccVJmaj0/NNmVp2Ro3
HtaObanBnTi9v8YHl8WgX6lq5RjuQ204fXmd0No4mHFvgxsl7YaX+JBts7S3A2Nf
PoQLqmulcLmzT3EVuEg279aXw2rbnyWHARbF/5/tIr4JcugtLJhwFnBA5YgFreq9
lSbqgoKSHw==
=qHyO
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Mostly just random fixes all over the map.
The only odd-one-out change is finally getting the rename of
BIO_MAX_PAGES to BIO_MAX_VECS done. This should've been done with the
multipage bvec change, but it's been left.
Do it now to avoid hassles around changes piling up for the next merge
window.
Summary:
- NVMe pull request:
- one more quirk (Dmitry Monakhov)
- fix max_zone_append_sectors initialization (Chaitanya Kulkarni)
- nvme-fc reset/create race fix (James Smart)
- fix status code on aborts/resets (Hannes Reinecke)
- fix the CSS check for ZNS namespaces (Chaitanya Kulkarni)
- fix a use after free in a debug printk in nvme-rdma (Lv Yunlong)
- Follow-up NVMe error fix for NULL 'id' (Christoph)
- Fixup for the bd_size_lock being IRQ safe, now that the offending
driver has been dropped (Damien).
- rsxx probe failure error return (Jia-Ju)
- umem probe failure error return (Wei)
- s390/dasd unbind fixes (Stefan)
- blk-cgroup stats summing fix (Xunlei)
- zone reset handling fix (Damien)
- Rename BIO_MAX_PAGES to BIO_MAX_VECS (Christoph)
- Suppress uevent trigger for hidden devices (Daniel)
- Fix handling of discard on busy device (Jan)
- Fix stale cache issue with zone reset (Shin'ichiro)"
* tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block:
nvme: fix the nsid value to print in nvme_validate_or_alloc_ns
block: Discard page cache of zone reset target range
block: Suppress uevent for hidden device when removed
block: rename BIO_MAX_PAGES to BIO_MAX_VECS
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done
nvme-core: check ctrl css before setting up zns
nvme-fc: fix racing controller reset and create association
nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange()
nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
nvme: simplify error logic in nvme_validate_ns()
nvme: set max_zone_append_sectors nvme_revalidate_zones
block: rsxx: fix error return code of rsxx_pci_probe()
block: Fix REQ_OP_ZONE_RESET_ALL handling
umem: fix error return code in mm_pci_probe()
blk-cgroup: Fix the recursive blkg rwstat
s390/dasd: fix hanging IO request during DASD driver unbind
s390/dasd: fix hanging DASD driver unbind
block: Try to handle busy underlying device on discard
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reed-Solomon roots that are unaligned to block size.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmBCgmkTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWlF9B/481eKdOYK5RLj6LneVf10niUACEN0G
tp8CeKh2wEcTjX+9pWxvigAE7FZnvD0sCts3eRCd8egkdV3L9uHyaCHU9V8iGP8L
dSjHXjbwbOumww5FY1ddx2ZiRImcU7YpEj5TrUZi2TyoAB2jOblDgaUt8jWsbppa
8miPUZi0Kp56Z5EsFJcT8dYbIlLfUpD/XfZ0hjqMoc9XZOeoGvYqpYX1pZgIfj2s
F23Dru858zqjv7OLmIgjzSgS0dZSwdLMUpKInaW1AipvNQpWSuh2DQdP7Ail3+yy
gUirpBjztoYCfqwpQTJpGnnb47tECYUXef6dI7xR5bXnyaX8CG5bRvVV
=a00f
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Fix DM verity target's optional Forward Error Correction (FEC) for
Reed-Solomon roots that are unaligned to block size"
* tag 'for-5.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity: fix FEC for RS roots unaligned to block size
dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size
Optional Forward Error Correction (FEC) code in dm-verity uses
Reed-Solomon code and should support roots from 2 to 24.
The error correction parity bytes (of roots lengths per RS block) are
stored on a separate device in sequence without any padding.
Currently, to access FEC device, the dm-verity-fec code uses dm-bufio
client with block size set to verity data block (usually 4096 or 512
bytes).
Because this block size is not divisible by some (most!) of the roots
supported lengths, data repair cannot work for partially stored parity
bytes.
This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
where we can be sure that the full parity data is always available.
(There cannot be partial FEC blocks because parity must cover whole
sectors.)
Because the optional FEC starting offset could be unaligned to this
new block size, we have to use dm_bufio_set_sector_offset() to
configure it.
The problem is easily reproduced using veritysetup, e.g. for roots=13:
# create verity device with RS FEC
dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash
# create an erasure that should be always repairable with this roots setting
dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none
# try to read it through dm-verity
veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
# wait for possible recursive recovery in kernel
udevadm settle
veritysetup close test
With this fix, errors are properly repaired.
device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
...
Without it, FEC code usually ends on unrecoverable failure in RS decoder:
device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
...
This problem is present in all kernels since the FEC code's
introduction (kernel 4.5).
It is thought that this problem is not visible in Android ecosystem
because it always uses a default RS roots=2.
Depends-on: a14e5ec66a ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_bufio_get_device_size returns the device size in blocks. Before
returning the value, we must subtract the nubmer of starting
sectors. The number of starting sectors may not be divisible by block
size.
Note that currently, no target is using dm_bufio_set_sector_offset and
dm_bufio_get_device_size simultaneously, so this change has no effect.
However, an upcoming dm-verity-fec fix needs this change.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA6njIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprolD/9zWti9LsZvA7yE+PhVwrwF3CsNzLfQlClw
99HaA7HxtAc/VLJrnD/SubhCAPdBC5B2xPv6faajdwF2iUR3Rr1Uc93CQ3uP2KKq
kvm6ALTpzPTMI6YSABhY74sg9BkkoDbMo54JQYVQPleiE+5eDLbuFZck6ObfUHyY
a4aaImlndWp/t14GzrClL4hucF+5KJy846P+QCVclkh0yl8xSsqZ5LIFU7tu3iQb
HpZ5HKLT/2ma/EOr3wknnsIe97AUZQU0q5aMparhYlm+qR511eop3QXx850FL/oC
tEGceKLij6qazmkiocKVzML8Fs+Y9/a4vCMjLCScWJmzDlmKdlH2uudeahN6b9Hm
15qRQHOjl1Hc2bdr5ZVn87nq9RWhSm18C+SRMwOKHCOnEhwxqM3RjRfAgj4BJ6QB
PFbFqdY+8Y1YLPFmn9hph72ePaEcN4L2IXW6TI/WX8mot8ODAnkq9Hr38dKwzO+i
0mon6DVyJKKho6XwvVu5IYurkR2beQprjeVUxwZjjT6DxUgsc+J6itK5LDHFSkeZ
qZlXn5Di8MkiXg0DFJYDQiFXnO0Z5GlRWOGPVfBaOr3x+1dqzDdHGw4oz1oGqvnr
GNNYCsYIpDGm7eauX5lqL5MUFpjqRCceXy5JSHPhnWWw617nYkr4H9jdsV9HiTX1
tQFx05QW3w==
=ccMs
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A few stragglers (and one due to me missing it originally), and fixes
for changes in this merge window mostly. In particular:
- blktrace cleanups (Chaitanya, Greg)
- Kill dead blk_pm_* functions (Bart)
- Fixes for the bio alloc changes (Christoph)
- Fix for the partition changes (Christoph, Ming)
- Fix for turning off iopoll with polled IO inflight (Jeffle)
- nbd disconnect fix (Josef)
- loop fsync error fix (Mauricio)
- kyber update depth fix (Yang)
- max_sectors alignment fix (Mikulas)
- Add bio_max_segs helper (Matthew)"
* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
block: Add bio_max_segs
blktrace: fix documentation for blk_fill_rw()
block: memory allocations in bounce_clone_bio must not fail
block: remove the gfp_mask argument to bounce_clone_bio
block: fix bounce_clone_bio for passthrough bios
block-crypto-fallback: use a bio_set for splitting bios
block: fix logging on capacity change
blk-settings: align max_sectors on "logical_block_size" boundary
block: reopen the device in blkdev_reread_part
block: don't skip empty device in in disk_uevent
blktrace: remove debugfs file dentries from struct blk_trace
nbd: handle device refs for DESTROY_ON_DISCONNECT properly
kyber: introduce kyber_depth_updated()
loop: fix I/O error on fsync() in detached loop devices
block: fix potential IO hang when turning off io_poll
block: get rid of the trace rq insert wrapper
blktrace: fix blk_rq_merge documentation
blktrace: fix blk_rq_issue documentation
blktrace: add blk_fill_rwbs documentation comment
block: remove superfluous param in blk_fill_rwbs()
...
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
internal_hash and journal_mac capabilities.
- Various DM writecache fixes to address performance, fix table output
to match what was provided at table creation, fix writing beyond end
of device when shrinking underlying data device, and a couple other
small cleanups.
- Add DM crypt support for using trusted keys.
- Fix deadlock when swapping to DM crypt device by throttling number
of in-flight REQ_SWAP bios. Implemented in DM core so that other
bio-based targets can opt-in by setting ti->limit_swap_bios.
- Fix various inverted logic bugs in the .iterate_devices callout
functions that are used to assess if specific feature or capability
is supported across all devices being combined/stacked by DM.
- Fix DM era target bugs that exposed users to lost writes or memory
leaks.
- Add DM core support for passing through inline crypto support of
underlying devices. Includes block/keyslot-manager changes that
enable extending this support to DM.
- Various small fixes and cleanups (spelling fixes, front padding
calculation cleanup, cleanup conditional zoned support in targets,
etc).
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmAqxggTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWjVOCACkZKleQhsCEYHNtjZ40Du+4PPBvESA
O+ScdUCeik4YUXvQtlFRPcYxxOH0zL0CUivLnNlsKzGTTgulw5azgFNuUTzIhH5y
a86Q+DReigPegzVCCOenInU18pYa03rLtYOAb6SK49IqVeMWMFSJVBv73HWS7OFV
slMlsQCN46YgbviYsGUXk5+uKMET4ijJZVW+8zSYg0GsWLHdgQtBkEoojO1n9H2B
jio2Nvhto0bJ4dV482lmd3G+LABmaBbLs0Xx/a7iHVigkIYZz4BHwDYNz/EQnNEi
dYlOrSL9a6ur+DFR6vxShzG40LbK7KVr8jHiXyKv2WZA7FMK0l4fyEFV
=E+n3
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Fix DM integrity's HMAC support to provide enhanced security of
internal_hash and journal_mac capabilities.
- Various DM writecache fixes to address performance, fix table output
to match what was provided at table creation, fix writing beyond end
of device when shrinking underlying data device, and a couple other
small cleanups.
- Add DM crypt support for using trusted keys.
- Fix deadlock when swapping to DM crypt device by throttling number of
in-flight REQ_SWAP bios. Implemented in DM core so that other
bio-based targets can opt-in by setting ti->limit_swap_bios.
- Fix various inverted logic bugs in the .iterate_devices callout
functions that are used to assess if specific feature or capability
is supported across all devices being combined/stacked by DM.
- Fix DM era target bugs that exposed users to lost writes or memory
leaks.
- Add DM core support for passing through inline crypto support of
underlying devices. Includes block/keyslot-manager changes that
enable extending this support to DM.
- Various small fixes and cleanups (spelling fixes, front padding
calculation cleanup, cleanup conditional zoned support in targets,
etc).
* tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (31 commits)
dm: fix deadlock when swapping to encrypted device
dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED
dm: set DM_TARGET_PASSES_CRYPTO feature for some targets
dm: support key eviction from keyslot managers of underlying devices
dm: add support for passing through inline crypto support
block/keyslot-manager: Introduce functions for device mapper support
block/keyslot-manager: Introduce passthrough keyslot manager
dm era: only resize metadata in preresume
dm era: Use correct value size in equality function of writeset tree
dm era: Fix bitset memory leaks
dm era: Verify the data block size hasn't changed
dm era: Reinitialize bitset cache before digesting a new writeset
dm era: Update in-core bitset after committing the metadata
dm era: Recover committed writeset after crash
dm writecache: use bdev_nr_sectors() instead of open-coded equivalent
dm writecache: fix writing beyond end of underlying device when shrinking
dm table: remove needless request_queue NULL pointer checks
dm table: fix zoned iterate_devices based device capability checks
dm table: fix DAX iterate_devices based device capability checks
dm table: fix iterate_devices based device capability checks
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtnZ8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoOhD/9TZJN6mgvlmO2zuNlZwko0jD+HUYNRHdfa
UiZhKs55ShlT/Wd8MMLcmMU2/+iztq4c/ZLK9eS7NHgKTu3GbgsICEZK+XLSTVJh
gCwwEnY2dnBAIwBWxLeDG04DQvcwnOhVN1OSmhFbXG3dpElXSyfEjx00niGtl0tE
5YtvmStpqkC0tHlxq8CMyyfaL1ODGmBK2uhDeQCO12SXIgKondJUaI3/H1l1dC5t
+yg4PsSLqezo0oWmqdTEE7lcEJs4GK1ZOhIBLtWe6tl/zaVD6DuzJL83pChm8vF+
qV4LCpJL0wUL7IG601AemFcUmEg34oC0FD6GYYhXxVOrlk43V6AfkycZ2rljNRop
/+Ff+CmXfWPAwSfJi/vDlveCvgyAJNMpv4/GwynUM1563v3TYy0YjT6Jlz6M3TUn
pS0MW7iUHj3t36U3JvcYnITqTPSfTMtYMOsWDx+V4+E9iGsQF1d7KZlCPMvjWAI2
c3QuWjitXd10BZ2qUvSzAg6piv1taBKJxg1PsGlu707mHZp5J6VPAkYQn4rFgjua
uCBbmRQCDF/wJ02IBmBUiMPP64UGbkhr+O3MILPqki967BdDDrLzjTs4e5zbMeu/
qB9XRr1Yd95GCyS8I42OCC906NXrJ2R2E8dtV1XoASWGusL/wFZrLd1th8Uq8ibb
Os+G4t1Qug==
=vx6U
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
- Remove the skd driver. It's been EOL for a long time (Damien)
- NVMe pull requests
- fix multipath handling of ->queue_rq errors (Chao Leng)
- nvmet cleanups (Chaitanya Kulkarni)
- add a quirk for buggy Amazon controller (Filippo Sironi)
- avoid devm allocations in nvme-hwmon that don't interact well
with fabrics (Hannes Reinecke)
- sysfs cleanups (Jiapeng Chong)
- fix nr_zones for multipath (Keith Busch)
- nvme-tcp crash fix for no-data commands (Sagi Grimberg)
- nvmet-tcp fixes (Sagi Grimberg)
- add a missing __rcu annotation (Christoph)
- failed reconnect fixes (Chao Leng)
- various tracing improvements (Michal Krakowiak, Johannes
Thumshirn)
- switch the nvmet-fc assoc_list to use RCU protection (Leonid
Ravich)
- resync the status codes with the latest spec (Max Gurtovoy)
- minor nvme-tcp improvements (Sagi Grimberg)
- various cleanups (Rikard Falkeborn, Minwoo Im, Chaitanya
Kulkarni, Israel Rukshin)
- Floppy O_NDELAY fix (Denis)
- MD pull request
- raid5 chunk_sectors fix (Guoqing)
- Use lore links (Kees)
- Use DEFINE_SHOW_ATTRIBUTE for nbd (Liao)
- loop lock scaling (Pavel)
- mtip32xx PCI fixes (Bjorn)
- bcache fixes (Kai, Dongdong)
- Misc fixes (Tian, Yang, Guoqing, Joe, Andy)
* tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block: (64 commits)
lightnvm: pblk: Replace guid_copy() with export_guid()/import_guid()
lightnvm: fix unnecessary NULL check warnings
nvme-tcp: fix crash triggered with a dataless request submission
block: Replace lkml.org links with lore
nbd: Convert to DEFINE_SHOW_ATTRIBUTE
nvme: add 48-bit DMA address quirk for Amazon NVMe controllers
nvme-hwmon: rework to avoid devm allocation
nvmet: remove else at the end of the function
nvmet: add nvmet_req_subsys() helper
nvmet: use min of device_path and disk len
nvmet: use invalid cmd opcode helper
nvmet: use invalid cmd opcode helper
nvmet: add helper to report invalid opcode
nvmet: remove extra variable in id-ns handler
nvmet: make nvmet_find_namespace() req based
nvmet: return uniform error for invalid ns
nvmet: set status to 0 in case for invalid nsid
nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queues
nvme-multipath: set nr_zones for zoned namespaces
nvmet-tcp: fix potential race of tcp socket closing accept_work
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
WC22RGC2MA==
=os1E
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:
- Two series of BFQ improvements (Paolo, Jan, Jia)
- Block iov_iter improvements (Pavel)
- bsg error path fix (Pan)
- blk-mq scheduler improvements (Jan)
- -EBUSY discard fix (Jan)
- bvec allocation improvements (Ming, Christoph)
- bio allocation and init improvements (Christoph)
- Store bdev pointer in bio instead of gendisk + partno (Christoph)
- Block trace point cleanups (Christoph)
- hard read-only vs read-only split (Christoph)
- Block based swap cleanups (Christoph)
- Zoned write granularity support (Damien)
- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"
* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...
The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory that holds
encrypted data. These excessive allocations exhaust all the memory and the
result is either deadlock or OOM trigger.
This patch limits the number of in-flight swap bios, so that the memory
consumed by dm-crypt is limited. The limit is enforced if the target set
the "limit_swap_bios" variable and if the bio has REQ_SWAP set.
Non-swap bios are not affected becuase taking the semaphore would cause
performance degradation.
This is similar to request-based drivers - they will also block when the
number of requests is over the limit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Allow removal of CONFIG_BLK_DEV_ZONED conditionals in target_type
definition of various targets.
Suggested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-linear and dm-flakey obviously can pass through inline crypto support.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Now that device mapper supports inline encryption, add the ability to
evict keys from all underlying devices. When an upper layer requests
a key eviction, we simply iterate through all underlying devices
and evict that key from each device.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Update the device-mapper core to support exposing the inline crypto
support of the underlying device(s) through the device-mapper device.
This works by creating a "passthrough keyslot manager" for the dm
device, which declares support for encryption settings which all
underlying devices support. When a supported setting is used, the bio
cloning code handles cloning the crypto context to the bios for all the
underlying devices. When an unsupported setting is used, the blk-crypto
fallback is used as usual.
Crypto support on each underlying device is ignored unless the
corresponding dm target opts into exposing it. This is needed because
for inline crypto to semantically operate on the original bio, the data
must not be transformed by the dm target. Thus, targets like dm-linear
can expose crypto support of the underlying device, but targets like
dm-crypt can't. (dm-crypt could use inline crypto itself, though.)
A DM device's table can only be changed if the "new" inline encryption
capabilities are a (*not* necessarily strict) superset of the "old" inline
encryption capabilities. Attempts to make changes to the table that result
in some inline encryption capability becoming no longer supported will be
rejected.
For the sake of clarity, key eviction from underlying devices will be
handled in a future patch.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
(inactive) table that will only become active upon resume. That is why
resize should always be done in terms of resume. Otherwise a load (ctr)
whose inactive table never becomes active will incorrectly resize the
metadata.
Also, perform the resize directly in preresume, instead of using the
worker to do it.
The worker might run other metadata operations, e.g., it could start
digestion, before resizing the metadata. These operations will end up
using the old size.
This could lead to errors, like:
device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
device-mapper: era: process_old_eras: digest step failed, stopping digestion
The reason of the above error is that the worker started the digestion
of the archived writeset using the old, larger size.
As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix the writeset tree equality test function to use the right value size
when comparing two btree values.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Deallocate the memory allocated for the in-core bitsets when destroying
the target and in error paths.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.
The digestion code uses a dm_disk_bitset object to access the archived
writesets. This structure includes a one word (64-bit) cache to reduce
the number of array lookups.
This structure is initialized only once, in metadata_digest_start(),
when we kick off digestion.
But, when we insert a new writeset into the writeset tree, before the
digestion of the previous writeset is done, or equivalently when there
are multiple writesets in the writeset tree to digest, then all these
writesets are digested using the same cache and the cache is not
re-initialized when moving from one writeset to the next.
For devices with more than 64 blocks, i.e., the size of the cache, the
cache is indirectly invalidated when we move to a next set of blocks, so
we avoid the bug.
But for devices with at most 64 blocks we end up using the same cached
data for digesting all archived writesets, i.e., the cache is loaded
when digesting the first writeset and it never gets reloaded, until the
digestion is done.
As a result, the writeset of the first era to digest is used as the
writeset of all the following archived eras, leading to lost writes.
Fix this by reinitializing the dm_disk_bitset structure, and thus
invalidating the cache, every time the digestion code starts digesting a
new writeset.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.
Consider the following sequence of events:
1. We write to a block that has not been yet written in the current era
2. era_map() checks the in-core bitmap for the current era and sees
that the block is not marked as written.
3. The write is deferred for submission after the metadata have been
updated and committed.
4. The worker thread processes the deferred write
(process_deferred_bios()) and marks the block as written in the
in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
(5) finishes.
When the system recovers and we query the dm-era target for the list of
written blocks it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.
The issue is that era_map() decides whether to defer or not a write
based on non committed information. The root cause of the bug is that we
update the in-core bitmap, **before** committing the metadata.
Fix this by updating the in-core bitmap **after** successfully
committing the metadata.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.
dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.
This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.
There are three issues that cause the committed writeset to get lost:
1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
the current era might not have been archived, due to a system crash.
To fix this:
1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
before starting a new era.
Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use semicolons and braces.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is potentially long running and not latency sensitive, let's get
it out of the way of other latency sensitive events.
As observed in the previous commit, the `system_wq` comes easily
congested by bcache, and this fixes a few more stalls I was observing
every once in a while.
Let's not make this `WQ_MEM_RECLAIM` as it showed to reduce performance
of boot and file system operations in my tests. Also, without
`WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches the
previous behavior as `system_wq` also does no memory reclaim:
> // workqueue.c:
> system_wq = alloc_workqueue("events", 0, 0);
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before killing `btree_io_wq`, the queue was allocated using
`create_singlethread_workqueue()` which has `WQ_MEM_RECLAIM`. After
killing it, it no longer had this property but `system_wq` is not
single threaded.
Let's combine both worlds and make it multi threaded but able to
reclaim memory.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 56b30770b2.
With the btree using the `system_wq`, I seem to see a lot more desktop
latency than I should.
After some more investigation, it looks like the original assumption
of 56b3077 no longer is true, and bcache has a very high potential of
congesting the `system_wq`. In turn, this introduces laggy desktop
performance, IO stalls (at least with btrfs), and input events may be
delayed.
So let's revert this. It's important to note that the semantics of
using `system_wq` previously mean that `btree_io_wq` should be created
before and destroyed after other bcache wqs to keep the same
assumptions.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Should be `register_device_async`.
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current way to calculate the writeback rate only considered the
dirty sectors, this usually works fine when the fragmentation
is not high, but it will give us unreasonable small rate when
we are under a situation that very few dirty sectors consumed
a lot dirty buckets. In some case, the dirty bucekts can reached
to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) not even
reached the writeback_percent, the writeback rate will still
be the minimum value (4k), thus it will cause all the writes to be
stucked in a non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness,
the first stage starts when dirty buckets percent reach above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to writeback the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
the second stage tries to writeback the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third
stage tries to writeback the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) millisecond.
the initial rate at each stage can be controlled by 3 configurable
parameters writeback_rate_fp_term_{low|mid|high}, they are by default
1, 10, 1000, the hint of IO throughput that these values are trying
to achieve is described by above paragraph, the reason that
I choose those value as default is based on the testing and the
production data, below is some details:
A. When it comes to the low stage, there is still a bit far from the 70
threshold, so we only want to give it a little bit push by setting the
term to 1, it means the initial rate will be 170 if the fragment is 6,
it is calculated by bucket_size/fragment, this rate is very small,
but still much reasonable than the minimum 8.
For a production bcache with unheavy workload, if the cache device
is bigger than 1 TB, it may take hours to consume 1% buckets,
so it is very possible to reclaim enough dirty buckets in this stage,
thus to avoid entering the next stage.
B. If the dirty buckets ratio didn't turn around during the first stage,
it comes to the mid stage, then it is necessary for mid stage
to be more aggressive than low stage, so i choose the initial rate
to be 10 times more than low stage, that means 1700 as the initial
rate if the fragment is 6. This is some normal rate
we usually see for a normal workload when writeback happens
because of writeback_percent.
C. If the dirty buckets ratio didn't turn around during the low and mid
stages, it comes to the third stage, and it is the last chance that
we can turn around to avoid the horrible cutoff writeback sync issue,
then we choose 100 times more aggressive than the mid stage, that
means 170000 as the initial rate if the fragment is 6. This is also
inferred from a production bcache, I've got one week's writeback rate
data from a production bcache which has quite heavy workloads,
again, the writeback is triggered by the writeback percent,
the highest rate area is around 100000 to 240000, so I believe this
kind aggressiveness at this stage is reasonable for production.
And it should be mostly enough because the hint is trying to reclaim
1000 bucket per second, and from that heavy production env,
it is consuming 50 bucket per second on average in one week's data.
Option writeback_consider_fragment is to control whether we want
this feature to be on or off, it's on by default.
Lastly, below is the performance data for all the testing result,
including the data from production env:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do not attempt to write any data beyond the end of the underlying data
device while shrinking it.
The DM writecache device must be suspended when the underlying data
device is shrunk.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since commit ff9ea32381 ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix dm_table_supports_zoned_model() and invert logic of both
iterate_devices_callout_fn so that all devices' zoned capabilities are
properly checked.
Add one more parameter to dm_table_any_dev_attr(), which is actually
used as the @data parameter of iterate_devices_callout_fn, so that
dm_table_matches_zone_sectors() can be replaced by
dm_table_any_dev_attr().
Fixes: dd88d313be ("dm table: add zoned block devices validation")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix dm_table_supports_dax() and invert logic of both
iterate_devices_callout_fn so that all devices' DAX capabilities are
properly checked.
Fixes: 545ed20e6d ("dm: add infrastructure for DAX support")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.
For some target type (e.g. dm-stripe), one call of iterate_devices() may
iterate multiple underlying devices internally, in which case a non-zero
return code returned by iterate_devices_callout_fn will stop the iteration
in advance. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.
Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.
Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.
Fixes: c3c4555edd ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c9668f ("dm table: propagate non rotational flag")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
LVM doesn't like it when the target returns different values from what
was set in the constructor. Fix dm-writecache so that the returned
table values are exactly the same as requested values.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A bio allocated by bio_alloc_bioset comes pre-zeroed, no need to
clear random fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, raid5 calculates dev_sectors from chunk_sectors without
proper cast, which is problematic.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Commit 27f5411a71 ("dm crypt: support using encrypted keys") extended
dm-crypt to allow use of "encrypted" keys along with "user" and "logon".
Along the same lines, teach dm-crypt to support "trusted" keys as well.
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
IS_ENABLED(CONFIG_ENCRYPTED_KEYS) is true whether the option is built-in
or a module, so use it instead of #if defined checking for each
separately.
The other #if was to avoid a static function defined, but unused
warning. As we now always build the callsite when the function
is defined, we can remove that first #if guard.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Acked-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove NULL checks before vfree() to fix these warnings:
./drivers/md/dm-writecache.c:2008:2-7: WARNING: NULL check before some
freeing functions is not needed.
./drivers/md/dm-writecache.c:2024:2-7: WARNING: NULL check before some
freeing functions is not needed.
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix a thinko in ssd_commit_superblock. region.count is in sectors, not
bytes. This bug doesn't corrupt data, but it causes performance
degradation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: dc8a01ae1d ("dm writecache: optimize superblock write")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The "fix_hmac" argument improves security of internal_hash and
journal_mac:
- the section number is mixed to the mac, so that an attacker can't
copy sectors from one journal section to another journal section
- the superblock is protected by journal_mac
- a 16-byte salt stored in the superblock is mixed to the mac, so
that the attacker can't detect that two disks have the same hmac
key and also to disallow the attacker to move sectors from one
disk to another
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Daniel Glockner <dg@emlix.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> # ReST fix
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
shadow_root() truncates 64-bit dm_block_t into 32-bit int. This is
not an issue in practice, since dm metadata as of v5.11 can only hold at
most 4161600 blocks (255 index entries * ~16k metadata blocks).
Nevertheless, this can confuse users debugging some specific data
corruption scenarios. Also, DM_SM_METADATA_MAX_BLOCKS may be bumped in
the future, or persistent-data may find its use in other places.
Therefore, switch the return type of shadow_root from int to dm_block_t.
Signed-off-by: Jinoh Kang <jinoh.kang.kr@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add two helper macros calculating the offset of bio in struct dm_io and
struct dm_target_io respectively.
Besides, simplify the front padding calculation in
dm_alloc_md_mempools().
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is a spelling mistake in a dm_integrity_io_error error
message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
See Documentation/core-api/printk-formats.rst.
commit cbacb5ab0a ("docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi]")
Standard integer promotion is already done and %hx and %hhx is useless
so do not encourage the use of %hh[xudi] or %h[xudi].
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make the read-only check in restart_array identical to the other two
read-only checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->meta_bdev is optional and not set for most arrays. Add a
rdev_read_only helper that calls bdev_read_only for both devices
in a safe way.
Fixes: 6f0d9689b6 ("block: remove the NULL bdev check in bdev_read_only")
Reported-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAUXQsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppO4EAClcqoneAuhT4UvRVNxblXPhPaoC69aNgXd
s+34uQSCqeWrWIAokfKp8bh3kyRqe00591auA7DwtwNqGpWuIECX8o9QvROEkuxv
0o4JFGMTHOJKP1W79Oy3RpF5oee6rMMOQN7EFL272p2xd8NRCP33c4fKvJRz+DDE
0kCcZhVjca0nZ+9OJC+WAlV+dit3azCAKSp7cItJsdOgZL74ZcGECm0pA8RpStyi
tQrUr2yiHLkm1lcOYfid0fG2/5a4vAGZQav+EshOWYw9UGeMquq/aqPuZZtEUjKe
oEECACfJ9cWErsi1CirIk5j5RKHOHmFSG3kRAmyvFB4f3YDGYxerI7eodWjNA0d5
38wW96sWuV4l0ShPmD3jGWIDTTcDZh4nEImCObf5YJFbr2fQXofWVWseIyo0zG8Y
zDa1N/M7XgkrScX8OF33NC1uv/oExhHA7jXuQN6mRBESYjcCrH2Lf6mXAA2C8u4T
z1RaG7ckRXGSbV3ol1ROrHj0RTXQ3zeIHj3yMRU8TKH0z6s+ob46D2PZCLi6cLvI
IuELhzKsS1EzMSVsYk9/AegynWFjVCRJoVUVxTsrxfGEF7attwmur3lOAjbZwSWb
jXlRbrkgBL1Pwbjg8AODEoq0jJgVM/S/3fG2rpcYLwwYC+FQ73/K+URmEuMsqkFC
GrYllTSMFg==
=hb7W
-----END PGP SIGNATURE-----
Merge tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"All over the place fixes for this release:
- blk-cgroup iteration teardown resched fix (Baolin)
- NVMe pull request from Christoph:
- add another Write Zeroes quirk (Chaitanya Kulkarni)
- handle a no path available corner case (Daniel Wagner)
- use the proper RCU aware list_add helper (Chao Leng)
- bcache regression fix (Coly)
- bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12,
but for now, we'll make it IRQ safe (Damien)
- null_blk zoned init fix (Damien)
- add_partition() error handling fix (Dinghao)
- s390 dasd kobject fix (Jan)
- nbd fix for freezing queue while adding connections (Josef)
- tag queueing regression fix (Ming)
- revert of a patch that inadvertently meant that we regressed write
performance on raid (Maxim)"
* tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
null_blk: cleanup zoned mode initialization
nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
nvme-multipath: Early exit if no path is available
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
block: fix bd_size_lock use
blk-cgroup: Use cond_resched() when destroy blkgs
Revert "block: simplify set_init_blocksize" to regain lost performance
nbd: freeze the queue while we're adding connections
s390/dasd: Fix inconsistent kobject removal
block: Fix an error handling in add_partition
blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
For super block version < BCACHE_SB_VERSION_CDEV_WITH_FEATURES, it
doesn't make sense to check the feature sets. This patch checks
super block version in bch_has_feature_* routines, if the version
doesn't have feature sets yet, returns 0 (false) to the caller.
Fixes: 5342fd4255 ("bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET")
Fixes: ffa4703275 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Cc: stable@vger.kernel.org # 5.9+
Reported-and-tested-by: Bockholdt Arne <a.bockholdt@precitec-optronik.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor raid5_read_one_chunk so that all simple checks are done
before allocating the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
md_bio_alloc_sync is never called with a NULL mddev, and ->sync_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use an on-stack bio and biovec for the single page synchronous I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_alloc_mddev is never called with a NULL mddev, and ->bio_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use blkdev_issue_flush instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in allocating memory for a synchronous flush.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Always use the bio_set_dev helper to assign ->bi_bdev to make sure
other state related to the device is uptodate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This bioset is just for allocating bio only from bio_next_split, and it
needn't bvecs, so remove the flag.
Cc: linux-bcache@vger.kernel.org
Cc: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>