Commit Graph

6720 Commits

Author SHA1 Message Date
Mikulas Patocka
611c3e168b dm writecache: add optional "metadata_only" parameter
Add a "metadata_only" parameter that when present: only metadata is
promoted to the cache. This option improves performance for heavier
REQ_META workloads (e.g. device-mapper-test-suite's "git clone and
checkout" benchmark improves from 341s to 312s).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:21 -04:00
Mikulas Patocka
867de40c4c dm writecache: write at least 4k when committing
SSDs perform badly with sub-4k writes (because they perfrorm
read-modify-write internally), so make sure writecache writes at least
4k when committing.

Fixes: 991bd8d7bc ("dm writecache: commit just one block, not a full page")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-21 16:15:10 -04:00
Mikulas Patocka
ee55b92a73 dm writecache: flush origin device when writing and cache is full
Commit d53f1fafec ("dm writecache: do
direct write if the cache is full") changed dm-writecache, so that it
writes directly to the origin device if the cache is full.
Unfortunately, it doesn't forward flush requests to the origin device,
so that there is a bug where flushes are being ignored.

Fix this by adding missing flush forwarding.

For PMEM mode, we fix this bug by disabling direct writes to the origin
device, because it performs better.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: d53f1fafec ("dm writecache: do direct write if the cache is full")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-16 12:57:14 -04:00
Mikulas Patocka
293128b1ef dm writecache: have ssd writeback wait if the kcopyd workqueue is busy
Make dm-writecache wait if the kcopyd workqueue is busy (as will
happen if waiting for page allocation or inside submit_bio).

This change improves performance of "mkfs.ext2" by approximately 20%
on one testbed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-15 15:42:03 -04:00
Baokun Li
8c77f1cb84 dm writecache: use list_move instead of list_del/list_add in writecache_writeback()
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-14 11:49:04 -04:00
Mikulas Patocka
991bd8d7bc dm writecache: commit just one block, not a full page
Some architectures have pages larger than 4k and committing a full
page causes needless overhead.

Fix this by writing a single block when committing the superblock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-14 11:45:30 -04:00
Mikulas Patocka
620cbe40ed dm writecache: remove unused gfp_t argument from wc_add_block()
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-14 11:44:06 -04:00
Damien Le Moal
f34ee1dce6 dm crypt: Fix zoned block device support
Zone append BIOs (REQ_OP_ZONE_APPEND) always specify the start sector
of the zone to be written instead of the actual sector location to
write. The write location is determined by the device and returned to
the host upon completion of the operation. This interface, while simple
and efficient for writing into sequential zones of a zoned block
device, is incompatible with the use of sector values to calculate a
cypher block IV. All data written in a zone end up using the same IV
values corresponding to the first sectors of the zone, but read
operation will specify any sector within the zone resulting in an IV
mismatch between encryption and decryption.

To solve this problem, report to DM core that zone append operations are
not supported. This result in the zone append operations being emulated
using regular write operations.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:38 -04:00
Damien Le Moal
bb37d77239 dm: introduce zone append emulation
For zoned targets that cannot support zone append operations, implement
an emulation using regular write operations. If the original BIO
submitted by the user is a zone append operation, change its clone into
a regular write operation directed at the target zone write pointer
position.

To do so, an array of write pointer offsets (write pointer position
relative to the start of a zone) is added to struct mapped_device. All
operations that modify a sequential zone write pointer (writes, zone
reset, zone finish and zone append) are intersepted in __map_bio() and
processed using the new functions dm_zone_map_bio().

Detection of the target ability to natively support zone append
operations is done from dm_table_set_restrictions() by calling the
function dm_set_zones_restrictions(). A target that does not support
zone append operation, either by explicitly declaring it using the new
struct dm_target field zone_append_not_supported, or because the device
table contains a non-zoned device, has its mapped device marked with the
new flag DMF_ZONE_APPEND_EMULATED. The helper function
dm_emulate_zone_append() is introduced to test a mapped device for this
new flag.

Atomicity of the zones write pointer tracking and updates is done using
a zone write locking mechanism based on a bitmap. This is similar to
the block layer method but based on BIOs rather than struct request.
A zone write lock is taken in dm_zone_map_bio() for any clone BIO with
an operation type that changes the BIO target zone write pointer
position. The zone write lock is released if the clone BIO is failed
before submission or when dm_zone_endio() is called when the clone BIO
completes.

The zone write lock bitmap of the mapped device, together with a bitmap
indicating zone types (conv_zones_bitmap) and the write pointer offset
array (zwp_offset) are allocated and initialized with a full device zone
report in dm_set_zones_restrictions() using the function
dm_revalidate_zones().

For failed operations that may have modified a zone write pointer, the
zone write pointer offset is marked as invalid in dm_zone_endio().
Zones with an invalid write pointer offset are checked and the write
pointer updated using an internal report zone operation when the
faulty zone is accessed again by the user.

All functions added for this emulation have a minimal overhead for
zoned targets natively supporting zone append operations. Regular
device targets are also not affected. The added code also does not
impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all
dm zone related functions.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:37 -04:00
Damien Le Moal
e2118b3c3d dm: rearrange core declarations for extended use from dm-zone.c
Move the definitions of struct dm_target_io, struct dm_io and the bits
of the flags field of struct mapped_device from dm.c to dm-core.h to
make them usable from dm-zone.c. For the same reason, declare
dec_pending() in dm-core.h after renaming it to dm_io_dec_pending().
And for symmetry of the function names, introduce the inline helper
dm_io_inc_pending() instead of directly using atomic_inc() calls.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:37 -04:00
Damien Le Moal
bf14e2b250 dm: Forbid requeue of writes to zones
A target map method requesting the requeue of a bio with
DM_MAPIO_REQUEUE or completing it with DM_ENDIO_REQUEUE can cause
unaligned write errors if the bio is a write operation targeting a
sequential zone. If a zoned target request such a requeue, warn about
it and kill the IO.

The function dm_is_zone_write() is introduced to detect write operations
to zoned targets.

This change does not affect the target drivers supporting zoned devices
and exposing a zoned device, namely dm-crypt, dm-linear and dm-flakey as
none of these targets ever request a requeue.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:33 -04:00
Damien Le Moal
912e887505 dm: Introduce dm_report_zones()
To simplify the implementation of the report_zones operation of a zoned
target, introduce the function dm_report_zones() to set a target
mapping start sector in struct dm_report_zones_args and call
blkdev_report_zones(). This new function is exported and the report
zones callback function dm_report_zones_cb() is not.

dm-linear, dm-flakey and dm-crypt are modified to use dm_report_zones().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:32 -04:00
Damien Le Moal
7fc1872848 dm: move zone related code to dm-zone.c
Move core and table code used for zoned targets and conditionally
defined with #ifdef CONFIG_BLK_DEV_ZONED to the new file dm-zone.c.
This file is conditionally compiled depending on CONFIG_BLK_DEV_ZONED.
The small helper dm_set_zones_restrictions() is introduced to
initialize a mapped device request queue zone attributes in
dm_table_set_restrictions().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:31 -04:00
Damien Le Moal
dd73c320ec dm: cleanup device_area_is_invalid()
In device_area_is_invalid(), use bdev_is_zoned() instead of open
coding the test on the zoned model returned by bdev_zoned_model().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:30 -04:00
Damien Le Moal
6842d264aa dm: Fix dm_accept_partial_bio() relative to zone management commands
Fix dm_accept_partial_bio() to actually check that zone management
commands are not passed as explained in the function documentation
comment. Also, since a zone append operation cannot be split, add
REQ_OP_ZONE_APPEND as a forbidden command.

White lines are added around the group of BUG_ON() calls to make the
code more legible.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:29 -04:00
Damien Le Moal
bab6849942 dm zoned: check zone capacity
The dm-zoned target cannot support zoned block devices with zones that
have a capacity smaller than the zone size (e.g. NVMe zoned namespaces)
due to the current chunk zone mapping implementation as it is assumed
that zones and chunks have the same size with all blocks usable.
If a zoned drive is found to have zones with a capacity different from
the zone size, fail the target initialization.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: stable@vger.kernel.org # v5.9+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:28 -04:00
Rikard Falkeborn
ccde2cbfa3 dm table: Constify static struct blk_ksm_ll_ops
The only usage of dm_ksm_ll_ops is to make a copy of it to the ksm_ll_ops
field in the blk_keyslot_manager struct. Make it const to allow the
compiler to put it in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:28 -04:00
Mikulas Patocka
af4f6cabcc dm writecache: interrupt writeback if suspended
If the DM device is suspended, interrupt the writeback sequence so
that there is no excessive suspend delay.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:27 -04:00
Mikulas Patocka
ee50cc19d8 dm writecache: don't split bios when overwriting contiguous cache content
If dm-writecache overwrites existing cached data, it splits the
incoming bio into many block-sized bios. The I/O scheduler does merge
these bios into one large request but this needless splitting and
merging causes performance degradation.

Fix this by avoiding bio splitting if the cache target area that is
being overwritten is contiguous.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:26 -04:00
Mikulas Patocka
6bcd658f2a dm kcopyd: avoid spin_lock_irqsave from process context
The functions "pop", "push_head", "do_work" can only be called from
process context. Therefore, replace spin_lock_irq{save,restore} with
spin_{lock,unlock}_irq.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:25 -04:00
Mikulas Patocka
db2351eb22 dm kcopyd: avoid useless atomic operations
The functions set_bit and clear_bit are atomic. We don't need
atomicity when making flags for dm-kcopyd. So, change them to direct
manipulation of the flags.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:24 -04:00
Joe Thornber
6b06dd5a97 dm space map disk: cache a small number of index entries
The disk space map stores it's index entries in a btree, these are
accessed very frequently, so having a few cached makes a big difference
to performance.

With this change provisioning a new block takes roughly 20% less cpu.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:23 -04:00
Joe Thornber
be500ed721 dm space maps: improve performance with inc/dec on ranges of blocks
When we break sharing on btree nodes we typically need to increment
the reference counts to every value held in the node.  This can
cause a lot of repeated calls to the space maps.  Fix this by changing
the interface to the space map inc/dec methods to take ranges of
adjacent blocks to be operated on.

For installations that are using a lot of snapshots this will reduce
cpu overhead of fundamental operations such as provisioning a new block,
or deleting a snapshot, by as much as 10 times.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:22 -04:00
Joe Thornber
5faafc77f7 dm space maps: don't reset space map allocation cursor when committing
Current commit code resets the place where the search for free blocks
will begin back to the start of the metadata device.  There are a couple
of repercussions to this:

- The first allocation after the commit is likely to take longer than
  normal as it searches for a free block in an area that is likely to
  have very few free blocks (if any).

- Any free blocks it finds will have been recently freed.  Reusing them
  means we have fewer old copies of the metadata to aid recovery from
  hardware error.

Fix these issues by leaving the cursor alone, only resetting when the
search hits the end of the metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:21 -04:00
Joe Thornber
4eafdb1515 dm btree: improve btree residency
This commit improves the residency of btrees built in the metadata for
dm-thin and dm-cache.

When inserting a new entry into a full btree node the current code
splits the node into two.  This can result in very many half full nodes,
particularly if the insertions are occurring in an ascending order (as
happens in dm-thin with large writes).

With this commit, when we insert into a full node we first try and move
some entries to a neighbouring node that has space, failing that it
tries to split two neighbouring nodes into three.

Results are given below.  'Residency' is how full nodes are on average
as a percentage.  Average instruction counts for the operations
are given to show the extra processing has little overhead.

                         +--------------------------+--------------------------+
                         |         Before           |         After            |
+------------+-----------+-----------+--------------+-----------+--------------+
|    Test    |   Phase   | Residency | Instructions | Residency | Instructions |
+------------+-----------+-----------+--------------+-----------+--------------+
| Ascending  | insert    |        50 |         1876 |        96 |         1930 |
|            | overwrite |        50 |         1789 |        96 |         1746 |
|            | lookup    |        50 |          778 |        96 |          778 |
| Descending | insert    |        50 |         3024 |        96 |         3181 |
|            | overwrite |        50 |         1789 |        96 |         1746 |
|            | lookup    |        50 |          778 |        96 |          778 |
| Random     | insert    |        68 |         3800 |        84 |         3736 |
|            | overwrite |        68 |         4254 |        84 |         3911 |
|            | lookup    |        68 |          779 |        84 |          779 |
| Runs       | insert    |        63 |         2546 |        82 |         2815 |
|            | overwrite |        63 |         2013 |        82 |         1986 |
|            | lookup    |        63 |          778 |        82 |          779 |
+------------+-----------+-----------+--------------+-----------+--------------+

   Ascending - keys are inserted in ascending order.
   Descending - keys are inserted in descending order.
   Random - keys are inserted in random order.
   Runs - keys are split into ascending runs of ~20 length.  Then
          the runs are shuffled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # contains_key() fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:20 -04:00
Mikulas Patocka
7e768532b2 dm snapshot: properly fix a crash when an origin has no snapshots
If an origin target has no snapshots, o->split_boundary is set to 0.
This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split().

Fix this by initializing chunk_size, and in turn split_boundary, to
rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits
into "unsigned" type.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25 16:19:58 -04:00
Mikulas Patocka
f16dba5dc6 dm snapshot: revert "fix a crash when an origin has no snapshots"
Commit 7ee06ddc40 ("dm snapshot: fix a
crash when an origin has no snapshots") introduced a regression in
snapshot merging - causing the lvm2 test lvcreate-cache-snapshot.sh
got stuck in an infinite loop.

Even though commit 7ee06ddc40 was marked
for stable@ the stable team was notified to _not_ backport it.

Fixes: 7ee06ddc40 ("dm snapshot: fix a crash when an origin has no snapshots")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25 16:16:47 -04:00
John Keeping
0c1f3193b1 dm verity: fix require_signatures module_param permissions
The third parameter of module_param() is permissions for the sysfs node
but it looks like it is being used as the initial value of the parameter
here.  In fact, false here equates to omitting the file from sysfs and
does not affect the value of require_signatures.

Making the parameter writable is not simple because going from
false->true is fine but it should not be possible to remove the
requirement to verify a signature.  But it can be useful to inspect the
value of this parameter from userspace, so change the permissions to
make a read-only file in sysfs.

Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25 16:14:05 -04:00
Mikulas Patocka
bc8f3d4647 dm integrity: fix sparse warnings
Use the types __le* instead of __u* to fix sparse warnings.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13 14:53:49 -04:00
Mikulas Patocka
dbae70d452 dm integrity: revert to not using discard filler when recalulating
Revert the commit 7a5b96b478 ("dm integrity:
use discard support when recalculating").

There's a bug that when we write some data beyond the current recalculate
boundary, the checksum will be rewritten with the discard filler later.
And the data will no longer have integrity protection. There's no easy
fix for this case.

Also, another problematic case is if dm-integrity is used to detect
bitrot (random device errors, bit flips, etc); dm-integrity should
detect that even for unused sectors. With commit 7a5b96b478 it can
happen that such change is undetected (because discard filler is not a
valid checksum).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13 14:53:48 -04:00
Mikulas Patocka
c699a0db2d dm snapshot: fix crash with transient storage and zero chunk size
The following commands will crash the kernel:

modprobe brd rd_size=1048576
dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"

The reason is that when we test for zero chunk size, we jump to the label
bad_read_metadata without setting the "r" variable. The function
snapshot_ctr destroys all the structures and then exits with "r == 0". The
kernel then crashes because it falsely believes that snapshot_ctr
succeeded.

In order to fix the bug, we set the variable "r" to -EINVAL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13 14:42:52 -04:00
Mikulas Patocka
7ee06ddc40 dm snapshot: fix a crash when an origin has no snapshots
If an origin target has no snapshots, o->split_boundary is set to 0.
This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split().

Fix this by initializing chunk_size, and in turn split_boundary, to
rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits
into "unsigned" type.

Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-10 11:11:36 -04:00
Linus Torvalds
a48b0872e6 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "This is everything else from -mm for this merge window.

  90 patches.

  Subsystems affected by this patch series: mm (cleanups and slub),
  alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat,
  checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov,
  panic, delayacct, gdb, resource, selftests, async, initramfs, ipc,
  drivers/char, and spelling"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits)
  mm: fix typos in comments
  mm: fix typos in comments
  treewide: remove editor modelines and cruft
  ipc/sem.c: spelling fix
  fs: fat: fix spelling typo of values
  kernel/sys.c: fix typo
  kernel/up.c: fix typo
  kernel/user_namespace.c: fix typos
  kernel/umh.c: fix some spelling mistakes
  include/linux/pgtable.h: few spelling fixes
  mm/slab.c: fix spelling mistake "disired" -> "desired"
  scripts/spelling.txt: add "overflw"
  scripts/spelling.txt: Add "diabled" typo
  scripts/spelling.txt: add "overlfow"
  arm: print alloc free paths for address in registers
  mm/vmalloc: remove vwrite()
  mm: remove xlate_dev_kmem_ptr()
  drivers/char: remove /dev/kmem for good
  mm: fix some typos and code style problems
  ipc/sem.c: mundane typo fixes
  ...
2021-05-07 00:34:51 -07:00
Matthew Wilcox (Oracle)
4ee60ec156 include: remove pagemap.h from blkdev.h
My UEK-derived config has 1030 files depending on pagemap.h before this
change.  Afterwards, just 326 files need to be rebuilt when I touch
pagemap.h.  I think blkdev.h is probably included too widely, but
untangling that dependency is harder and this solves my problem.  x86
allmodconfig builds, but there may be implicit include problems on other
architectures.

Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>		[nvdimm]
Acked-by: Jens Axboe <axboe@kernel.dk>				[block]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>				[bcache]
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>	[scsi]
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-06 19:24:11 -07:00
Linus Torvalds
7af81cd0c4 - Improve scalability of DM's device hash by switching to rbtree
- Extend DM ioctl's DM_LIST_DEVICES_CMD handling to include UUID and
   allow filtering based on name or UUID prefix.
 
 - Various small fixes for typos, warnings, unused function, or
   needlessly exported interfaces.
 
 - Remove needless request_queue NULL pointer checks in DM thin and
   cache targets.
 
 - Remove unnecessary loop in DM core's __split_and_process_bio().
 
 - Remove DM core's dm_vcalloc() and just use kvcalloc or
   kvmalloc_array instead (depending whether zeroing is useful).
 
 - Fix request-based DM's double free of blk_mq_tag_set in device
   remove after table load fails.
 
 - Improve DM persistent data performance on non-x86 by fixing packed
   structs to have a stated alignment. Also remove needless extra work
   from redundant calls to sm_disk_get_nr_free() and a paranoid BUG_ON()
   that caused duplicate checksum calculation.
 
 - Fix missing goto in DM integrity's bitmap_flush_interval error
   handling.
 
 - Add "reset_recalculate" feature flag to DM integrity.
 
 - Improve DM integrity by leveraging discard support to avoid needless
   re-writing of metadata and also use discard support to improve
   hash recalculation.
 
 - Fix race with DM raid target's reshape and MD raid4/5/6 resync that
   resulted in inconsistant reshape state during table reloads.
 
 - Update DM raid target to temove unnecessary discard limits for raid0
   and raid10 now that MD has optimized discard handling for both raid
   levels.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmCMVxkTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWjD7B/4mucuval0w8OBl7MPE5mR/tPmU/Avr
 iFmXQjofjXCXGdCk7XqOqKZGKlm/jCTQkhbZPEo5PTCZPf6iGyeoSFOC0xck9HUI
 9EJNrnx6L0ch3OS5IREgc3oO7vXFnmnLvmo27t7yUaqWqdMBIZ/rEA6Ro7oq4pXc
 fo/Yqka2iuS0X2RKhAbci2KAOWdeX400nqH7bHaR5VgOE3JRJtsEWRTmkngB4bzM
 Bsz2zzaSnbFa/Nlg2cH69kw2NJpZsAoKmoQ0OJt7QQs+GYLvCbioQnlxLPSFhlKX
 4mSPPPKy3TxaAqsHXt+gCG6Vs8VvvgK0iHIa9e0ifIy3wvLoosLpIXhc
 =k1ND
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Improve scalability of DM's device hash by switching to rbtree

 - Extend DM ioctl's DM_LIST_DEVICES_CMD handling to include UUID and
   allow filtering based on name or UUID prefix.

 - Various small fixes for typos, warnings, unused function, or
   needlessly exported interfaces.

 - Remove needless request_queue NULL pointer checks in DM thin and
   cache targets.

 - Remove unnecessary loop in DM core's __split_and_process_bio().

 - Remove DM core's dm_vcalloc() and just use kvcalloc or kvmalloc_array
   instead (depending whether zeroing is useful).

 - Fix request-based DM's double free of blk_mq_tag_set in device remove
   after table load fails.

 - Improve DM persistent data performance on non-x86 by fixing packed
   structs to have a stated alignment. Also remove needless extra work
   from redundant calls to sm_disk_get_nr_free() and a paranoid BUG_ON()
   that caused duplicate checksum calculation.

 - Fix missing goto in DM integrity's bitmap_flush_interval error
   handling.

 - Add "reset_recalculate" feature flag to DM integrity.

 - Improve DM integrity by leveraging discard support to avoid needless
   re-writing of metadata and also use discard support to improve hash
   recalculation.

 - Fix race with DM raid target's reshape and MD raid4/5/6 resync that
   resulted in inconsistant reshape state during table reloads.

 - Update DM raid target to temove unnecessary discard limits for raid0
   and raid10 now that MD has optimized discard handling for both raid
   levels.

* tag 'for-5.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
  dm raid: remove unnecessary discard limits for raid0 and raid10
  dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails
  dm integrity: use discard support when recalculating
  dm integrity: increase RECALC_SECTORS to improve recalculate speed
  dm integrity: don't re-write metadata if discarding same blocks
  dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences
  dm raid: fix fall-through warning in rs_check_takeover() for Clang
  dm clone metadata: remove unused function
  dm integrity: fix missing goto in bitmap_flush_interval error handling
  dm: replace dm_vcalloc()
  dm space map common: fix division bug in sm_ll_find_free_block()
  dm persistent data: packed struct should have an aligned() attribute too
  dm btree spine: remove paranoid node_check call in node_prep_for_write()
  dm space map disk: remove redundant calls to sm_disk_get_nr_free()
  dm integrity: add the "reset_recalculate" feature flag
  dm persistent data: remove unused return from exit_shadow_spine()
  dm cache: remove needless request_queue NULL pointer checks
  dm thin: remove needless request_queue NULL pointer check
  dm: unexport dm_{get,put}_table_device
  dm ebs: fix a few typos
  ...
2021-05-01 11:34:03 -07:00
Mike Snitzer
ca4a4e9a55 dm raid: remove unnecessary discard limits for raid0 and raid10
Commit 29efc390b9 ("md/md0: optimize raid0 discard handling") and
commit d30588b273 ("md/raid10: improve raid10 discard request")
remove MD raid0's and raid10's inability to properly handle large
discards. So eliminate associated constraints from dm-raid's support.

Depends-on: 29efc390b9 ("md/md0: optimize raid0 discard handling")
Depends-on: d30588b273 ("md/raid10: improve raid10 discard request")
Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30 14:38:37 -04:00
Benjamin Block
8e947c8f4a dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails
When loading a device-mapper table for a request-based mapped device,
and the allocation/initialization of the blk_mq_tag_set for the device
fails, a following device remove will cause a double free.

E.g. (dmesg):
  device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device
  device-mapper: ioctl: unable to set up device queue for new table.
  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address: 0305e098835de000 TEID: 0305e098835de803
  Fault in home space mode while using kernel ASCE.
  AS:000000025efe0007 R3:0000000000000024
  Oops: 0038 ilc:3 [#1] SMP
  Modules linked in: ... lots of modules ...
  Supported: Yes, External
  CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G        W      X    5.3.18-53-default #1 SLE15-SP3
  Hardware name: IBM 8561 T01 7I2 (LPAR)
  Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000
             000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000
             000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640
             00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8
  Krnl Code: 000000025e368eb8: c4180041e100       lgrl    %r1,25eba50b8
             000000025e368ebe: ecba06b93a55       risbg   %r11,%r10,6,185,58
            #000000025e368ec4: e3b010000008       ag      %r11,0(%r1)
            >000000025e368eca: e310b0080004       lg      %r1,8(%r11)
             000000025e368ed0: a7110001           tmll    %r1,1
             000000025e368ed4: a7740129           brc     7,25e369126
             000000025e368ed8: e320b0080004       lg      %r2,8(%r11)
             000000025e368ede: b904001b           lgr     %r1,%r11
  Call Trace:
   [<000000025e368eca>] kfree+0x42/0x330
   [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8
   [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod]
   [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod]
   [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod]
   [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod]
   [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod]
   [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod]
   [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0
   [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40
   [<000000025e8c15ac>] system_call+0xd8/0x2c8
  Last Breaking-Event-Address:
   [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8
  Kernel panic - not syncing: Fatal exception: panic_on_oops

When allocation/initialization of the blk_mq_tag_set fails in
dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer
is not reset to NULL; so when dev_remove() later gets into
dm_mq_cleanup_mapped_device() it sees the pointer and tries to
uninitialize and free it again.

Fix this by setting the pointer to NULL in dm_mq_init_request_queue()
error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device().

Cc: <stable@vger.kernel.org> # 4.6+
Fixes: 1c357a1e86 ("dm: allocate blk_mq_tag_set rather than embed in mapped_device")
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30 14:19:08 -04:00
Mikulas Patocka
7a5b96b478 dm integrity: use discard support when recalculating
If we have discard support we don't have to recalculate hash - we can
just fill the metadata with the discard pattern.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30 14:02:06 -04:00
Mikulas Patocka
b1a2b93320 dm integrity: increase RECALC_SECTORS to improve recalculate speed
Increase RECALC_SECTORS because it improves recalculate speed slightly
(from 390kiB/s to 410kiB/s).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30 14:02:05 -04:00
Mikulas Patocka
a9c0fda4c0 dm integrity: don't re-write metadata if discarding same blocks
If we discard already discarded blocks we do not need to write discard
pattern to the metadata, because it is already there.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30 14:01:39 -04:00
Linus Torvalds
fc05860628 for-5.13/drivers-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJYcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpieWD/92qbtWl/z+9oCY212xV+YMoMqj/vGROX+U
 9i/FQJ3AIC/AUoNjZeW3NIbiaNqde5mrLlUSCHgn6RLsHK7p0GQJ4ohpbIGFG5+i
 2+Efm+vjlCxLVGrkeZEwMtsht7w/NbOYDr1Rgv9b4lQ6iWI11Mg8E337Whl1me1k
 h6bEXaioK9yqxYtsLgcn9I1qQ2p7gok0HX7zFU/XxEUZylqH6E4vQhj2+NL8UUqE
 7siFHADZE99Z7LXtOkl8YyOlGU52RCUzqDHWydvkipKjgYBi95HLXGT64Z+WCEvz
 HI54oVDRWr+uWdqDFfy+ncHm8pNeP0GV9JPhDz4ELRTSndoxB2il7wRLvp6wxV9d
 8Y4j7vb30i+8GGbM0c79dnlG76D9r5ivbTKixcXFKB128NusQR6JymIv1pKlSKhk
 H871/iOarrepAAUwVR5CtldDDJCy/q1Hks+7UXbaM3F9iNitxsJNZryQq9xdTu/N
 ThFOTz+VECG4RJLxIwmsWGiLgwr52/ybAl2MBcn+s7uC4jM/TFKpdQBfQnOAiINb
 MLlfuYRRSMg1Osb2fYZneR2ifmSNOMRdDJb+tsZGz4xWmZcj0uL4QgqcsOvuiOEQ
 veF/Ky50qw57hWtiEhvqa7/WIxzNF3G3wejqqA8hpT9Qifu0QawYTnXGUttYNBB1
 mO9R3/ccaw==
 =c0x4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - MD changes via Song:
        - raid5 POWER fix
        - raid1 failure fix
        - UAF fix for md cluster
        - mddev_find_or_alloc() clean up
        - Fix NULL pointer deref with external bitmap
        - Performance improvement for raid10 discard requests
        - Fix missing information of /proc/mdstat

 - rsxx const qualifier removal (Arnd)

 - Expose allocated brd pages (Calvin)

 - rnbd via Gioh Kim:
        - Change maintainer
        - Change domain address of maintainers' email
        - Add polling IO mode and document update
        - Fix memory leak and some bug detected by static code analysis
          tools
        - Code refactoring

 - Series of floppy cleanups/fixes (Denis)

 - s390 dasd fixes (Julian)

 - kerneldoc fixes (Lee)

 - null_blk double free (Lv)

 - null_blk virtual boundary addition (Max)

 - Remove xsysace driver (Michal)

 - umem driver removal (Davidlohr)

 - ataflop fixes (Dan)

 - Revalidate disk removal (Christoph)

 - Bounce buffer cleanups (Christoph)

 - Mark lightnvm as deprecated (Christoph)

 - mtip32xx init cleanups (Shixin)

 - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang)

* tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits)
  async_xor: increase src_offs when dropping destination page
  drivers/block/null_blk/main: Fix a double free in null_init.
  md/raid1: properly indicate failure when ending a failed write request
  md-cluster: fix use-after-free issue when removing rdev
  nvme: introduce generic per-namespace chardev
  nvme: cleanup nvme_configure_apst
  nvme: do not try to reconfigure APST when the controller is not live
  nvme: add 'kato' sysfs attribute
  nvme: sanitize KATO setting
  nvmet: avoid queuing keep-alive timer if it is disabled
  brd: expose number of allocated pages in debugfs
  ataflop: fix off by one in ataflop_probe()
  ataflop: potential out of bounds in do_format()
  drbd: Fix fall-through warnings for Clang
  block/rnbd: Use strscpy instead of strlcpy
  block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name
  block/rnbd-clt: Remove max_segment_size
  block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes
  block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev
  Documentation/ABI/rnbd-clt: Add description for nr_poll_queues
  ...
2021-04-28 14:39:37 -07:00
Linus Torvalds
57fa2369ab CFI on arm64 series for v5.13-rc1
- Clean up list_sort prototypes (Sami Tolvanen)
 
 - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHCR8ACgkQiXL039xt
 wCZyFQ//fnUZaXR2K354zDyW6CJljMf+d94RF6rH+J6eMTH2/HXa5v0iJokwABLf
 ussP6qF4k5wtmI22Gm9A5Zc3e4iiry5pC0jOdk0mk4gzWwFN9MdgNxJZIGA3xqhS
 bsBK4AGrVKjtZl48G1/ZxJuNDeJhVp6GNK2n6/Gl4rZF6R7D/Upz0XelyJRdDpcM
 HIGma7jZl6xfGU0mdWCzpOGK1zdMca1WVs7A4YuurSbLn5PZJrcNVWLouDqt/Si2
 AduSri1gyPClicgvqWjMOzhUpuw/nJtBLRl1x1EsWk/KSZ1/uNVjlewfzdN4fZrr
 zbtFr2gLubYLK6JOX7/LqoHlOTgE3tYLL+WIVN75DsOGZBKgHhmebTmWLyqzV0SL
 oqcyM5d3ucC6msdtAK5Fv4MSp8rpjqlK1Ha4SGRT6kC2wut7AhZ3KD7eyRIz8mV9
 Sa9mhignGFJnTEUp+LSbYdrAudgSKxB40WyXPmswAXX4VJFRD4ONrrcAON/SzkUT
 Hw/JdFRCKkJjgwNQjIQoZcUNMTbFz2PlNIEnjJWm38YImQKQlCb2mXaZKCwBkf45
 aheCZk17eKoxTCXFMd+KxlyNEtS2yBfq/PpZgvw7GW/pfFbWUg1+2O41LnihIe5v
 zu0hN1wNCQqgfxiMZqX1OTb9C/2vybzGsXILt+9nppjZ8EBU7iU=
 =wU6U
 -----END PGP SIGNATURE-----

Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull CFI on arm64 support from Kees Cook:
 "This builds on last cycle's LTO work, and allows the arm64 kernels to
  be built with Clang's Control Flow Integrity feature. This feature has
  happily lived in Android kernels for almost 3 years[1], so I'm excited
  to have it ready for upstream.

  The wide diffstat is mainly due to the treewide fixing of mismatched
  list_sort prototypes. Other things in core kernel are to address
  various CFI corner cases. The largest code portion is the CFI runtime
  implementation itself (which will be shared by all architectures
  implementing support for CFI). The arm64 pieces are Acked by arm64
  maintainers rather than coming through the arm64 tree since carrying
  this tree over there was going to be awkward.

  CFI support for x86 is still under development, but is pretty close.
  There are a handful of corner cases on x86 that need some improvements
  to Clang and objtool, but otherwise works well.

  Summary:

   - Clean up list_sort prototypes (Sami Tolvanen)

   - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"

* tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arm64: allow CONFIG_CFI_CLANG to be selected
  KVM: arm64: Disable CFI for nVHE
  arm64: ftrace: use function_nocfi for ftrace_call
  arm64: add __nocfi to __apply_alternatives
  arm64: add __nocfi to functions that jump to a physical address
  arm64: use function_nocfi with __pa_symbol
  arm64: implement function_nocfi
  psci: use function_nocfi for cpu_resume
  lkdtm: use function_nocfi
  treewide: Change list_sort to use const pointers
  bpf: disable CFI in dispatcher functions
  kallsyms: strip ThinLTO hashes from static functions
  kthread: use WARN_ON_FUNCTION_MISMATCH
  workqueue: use WARN_ON_FUNCTION_MISMATCH
  module: ensure __cfi_check alignment
  mm: add generic function_nocfi macro
  cfi: add __cficanonical
  add support for Clang CFI
2021-04-27 10:16:46 -07:00
Paul Clements
2417b9869b md/raid1: properly indicate failure when ending a failed write request
This patch addresses a data corruption bug in raid1 arrays using bitmaps.
Without this fix, the bitmap bits for the failed I/O end up being cleared.

Since we are in the failure leg of raid1_end_write_request, the request
either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded).

Fixes: eeba6809d8 ("md/raid1: end bio when the device faulty")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Paul Clements <paul.clements@us.sios.com>
Signed-off-by: Song Liu <song@kernel.org>
2021-04-23 09:40:17 -07:00
Heming Zhao
f7c7a2f9a2 md-cluster: fix use-after-free issue when removing rdev
md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
    echo ==== $i `date` ====;

    mdadm -Ss && ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb

    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
       /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"

    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 raid1d+0x5c/0xd40 [raid1]
 ? finish_task_switch+0x75/0x2a0
 ? lock_timer_base+0x67/0x80
 ? try_to_del_timer_sync+0x4d/0x80
 ? del_timer_sync+0x41/0x50
 ? schedule_timeout+0x254/0x2d0
 ? md_start_sync+0xe0/0xe0 [md_mod]
 ? md_thread+0x127/0x160 [md_mod]
 md_thread+0x127/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 kthread+0x10d/0x130
 ? kthread_park+0xa0/0xa0
 ret_from_fork+0x1f/0x40
```

Fixes: dbb64f8635 ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254fa7 ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: Gang He <ghe@suse.com>
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
2021-04-23 09:39:04 -07:00
Heinz Mauelshagen
f99a8e4373 dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences
If fast table reloads occur during an ongoing reshape of raid4/5/6
devices the target may race reading a superblock vs the the MD resync
thread; causing an inconclusive reshape state to be read in its
constructor.

lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause
BUG_ON() to trigger in md_run(), e.g.:
"kernel BUG at drivers/md/raid5.c:7567!".

Scenario triggering the bug:

1. the MD sync thread calls end_reshape() from raid5_sync_request()
   when done reshaping. However end_reshape() _only_ updates the
   reshape position to MaxSector keeping the changed layout
   configuration though (i.e. any delta disks, chunk sector or RAID
   algorithm changes). That inconclusive configuration is stored in
   the superblock.

2. dm-raid constructs a mapping, loading named inconsistent superblock
   as of step 1 before step 3 is able to finish resetting the reshape
   state completely, and calls md_run() which leads to mentioned bug
   in raid5.c.

3. the MD RAID personality's finish_reshape() is called; which resets
   the reshape information on chunk sectors, delta disks, etc. This
   explains why the bug is rarely seen on multi-core machines, as MD's
   finish_reshape() superblock update races with the dm-raid
   constructor's superblock load in step 2.

Fix identifies inconclusive superblock content in the dm-raid
constructor and resets it before calling md_run(), factoring out
identifying checks into rs_is_layout_change() to share in existing
rs_reshape_requested() and new rs_reset_inclonclusive_reshape(). Also
enhance a comment and remove an empty line.

Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-21 18:39:03 -04:00
Gustavo A. R. Silva
be962b2f07 dm raid: fix fall-through warning in rs_check_takeover() for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-20 18:10:04 -04:00
Jiapeng Chong
87d5742b73 dm clone metadata: remove unused function
Fix the following clang warning:

drivers/md/dm-clone-metadata.c:279:19: warning: unused function
'superblock_write_lock' [-Wunused-function].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 13:20:31 -04:00
Tian Tao
17e9e134a8 dm integrity: fix missing goto in bitmap_flush_interval error handling
Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 13:17:10 -04:00
Matthew Wilcox (Oracle)
7a35693adc dm: replace dm_vcalloc()
Use kvcalloc or kvmalloc_array instead (depending whether zeroing is
useful).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 13:13:26 -04:00
Joe Thornber
5208692e80 dm space map common: fix division bug in sm_ll_find_free_block()
This division bug meant the search for free metadata space could skip
the final allocation bitmap's worth of entries. Fix affects DM thinp,
cache and era targets.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Tested-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 12:48:13 -04:00