Commit Graph

1014083 Commits

Author SHA1 Message Date
Mikulas Patocka
95b88f4d71 dm writecache: pause writeback if cache full and origin being written directly
Implementation reuses dm_io_tracker, that until now was only used by
dm-cache, to track if any writes were issued directly to the origin
(due to cache being full) within the last second. If so writeback is
paused for a second.

This change improves performance for when the cache is full and IO is
issued directly to the origin device (rather than through the cache).

Depends-on: d53f1fafec ("dm writecache: do direct write if the cache is full")
Suggested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 16:04:01 -04:00
Mike Snitzer
dc4fa29fe4 dm io tracker: factor out IO tracker
Allow other code to use dm_io_tracker.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:28:59 -04:00
Hou Tao
b6e58b5466 dm btree remove: assign new_root only when removal succeeds
remove_raw() in dm_btree_remove() may fail due to IO read error
(e.g. read the content of origin block fails during shadowing),
and the value of shadow_spine::root is uninitialized, but
the uninitialized value is still assign to new_root in the
end of dm_btree_remove().

For dm-thin, the value of pmd->details_root or pmd->root will become
an uninitialized value, so if trying to read details_info tree again
out-of-bound memory may occur as showed below:

  general protection fault, probably for non-canonical address 0x3fdcb14c8d7520
  CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6
  Hardware name: QEMU Standard PC
  RIP: 0010:metadata_ll_load_ie+0x14/0x30
  Call Trace:
   sm_metadata_count_is_more_than_one+0xb9/0xe0
   dm_tm_shadow_block+0x52/0x1c0
   shadow_step+0x59/0xf0
   remove_raw+0xb2/0x170
   dm_btree_remove+0xf4/0x1c0
   dm_pool_delete_thin_device+0xc3/0x140
   pool_message+0x218/0x2b0
   target_message+0x251/0x290
   ctl_ioctl+0x1c4/0x4d0
   dm_ctl_ioctl+0xe/0x20
   __x64_sys_ioctl+0x7b/0xb0
   do_syscall_64+0x40/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixing it by only assign new_root when removal succeeds

Signed-off-by: Hou Tao <houtao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:24 -04:00
Damien Le Moal
28436ba34b dm zone: fix dm_revalidate_zones() memory allocation
Make sure that the zone write pointer offset array is allocated with a
vmalloc in dm_zone_revalidate_cb() by passing GFP_KERNEL gfp flag to
kvcalloc(). However, since we do not want to trigger IOs while
revalidating zones, change dm_revalidate_zones() to have the zone scan
done in GFP_NOIO context using memalloc_noio_save/restore calls.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: bb37d77239 ("dm: introduce zone append emulation")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:23 -04:00
Colin Ian King
326dbde2e0 dm ps io affinity: remove redundant continue statement
The continue statement at the end of a for-loop has no effect,
remove it.

Addresses-Coverity: ("Continue has no effect")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:22 -04:00
Mikulas Patocka
611c3e168b dm writecache: add optional "metadata_only" parameter
Add a "metadata_only" parameter that when present: only metadata is
promoted to the cache. This option improves performance for heavier
REQ_META workloads (e.g. device-mapper-test-suite's "git clone and
checkout" benchmark improves from 341s to 312s).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:21 -04:00
Mike Snitzer
cd039afa0a dm writecache: add "cleaner" and "max_age" to Documentation
Backfill missing Documentation.

Fixes: 93de44eb3f ("dm writecache: implement the "cleaner" policy")
Fixes: 3923d4854e ("dm writecache: implement gradual cleanup")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:19 -04:00
Mikulas Patocka
867de40c4c dm writecache: write at least 4k when committing
SSDs perform badly with sub-4k writes (because they perfrorm
read-modify-write internally), so make sure writecache writes at least
4k when committing.

Fixes: 991bd8d7bc ("dm writecache: commit just one block, not a full page")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-21 16:15:10 -04:00
Mikulas Patocka
ee55b92a73 dm writecache: flush origin device when writing and cache is full
Commit d53f1fafec ("dm writecache: do
direct write if the cache is full") changed dm-writecache, so that it
writes directly to the origin device if the cache is full.
Unfortunately, it doesn't forward flush requests to the origin device,
so that there is a bug where flushes are being ignored.

Fix this by adding missing flush forwarding.

For PMEM mode, we fix this bug by disabling direct writes to the origin
device, because it performs better.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: d53f1fafec ("dm writecache: do direct write if the cache is full")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-16 12:57:14 -04:00
Mikulas Patocka
293128b1ef dm writecache: have ssd writeback wait if the kcopyd workqueue is busy
Make dm-writecache wait if the kcopyd workqueue is busy (as will
happen if waiting for page allocation or inside submit_bio).

This change improves performance of "mkfs.ext2" by approximately 20%
on one testbed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-15 15:42:03 -04:00
Baokun Li
8c77f1cb84 dm writecache: use list_move instead of list_del/list_add in writecache_writeback()
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-14 11:49:04 -04:00
Mikulas Patocka
991bd8d7bc dm writecache: commit just one block, not a full page
Some architectures have pages larger than 4k and committing a full
page causes needless overhead.

Fix this by writing a single block when committing the superblock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-14 11:45:30 -04:00
Mikulas Patocka
620cbe40ed dm writecache: remove unused gfp_t argument from wc_add_block()
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-14 11:44:06 -04:00
Damien Le Moal
f34ee1dce6 dm crypt: Fix zoned block device support
Zone append BIOs (REQ_OP_ZONE_APPEND) always specify the start sector
of the zone to be written instead of the actual sector location to
write. The write location is determined by the device and returned to
the host upon completion of the operation. This interface, while simple
and efficient for writing into sequential zones of a zoned block
device, is incompatible with the use of sector values to calculate a
cypher block IV. All data written in a zone end up using the same IV
values corresponding to the first sectors of the zone, but read
operation will specify any sector within the zone resulting in an IV
mismatch between encryption and decryption.

To solve this problem, report to DM core that zone append operations are
not supported. This result in the zone append operations being emulated
using regular write operations.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:38 -04:00
Damien Le Moal
bb37d77239 dm: introduce zone append emulation
For zoned targets that cannot support zone append operations, implement
an emulation using regular write operations. If the original BIO
submitted by the user is a zone append operation, change its clone into
a regular write operation directed at the target zone write pointer
position.

To do so, an array of write pointer offsets (write pointer position
relative to the start of a zone) is added to struct mapped_device. All
operations that modify a sequential zone write pointer (writes, zone
reset, zone finish and zone append) are intersepted in __map_bio() and
processed using the new functions dm_zone_map_bio().

Detection of the target ability to natively support zone append
operations is done from dm_table_set_restrictions() by calling the
function dm_set_zones_restrictions(). A target that does not support
zone append operation, either by explicitly declaring it using the new
struct dm_target field zone_append_not_supported, or because the device
table contains a non-zoned device, has its mapped device marked with the
new flag DMF_ZONE_APPEND_EMULATED. The helper function
dm_emulate_zone_append() is introduced to test a mapped device for this
new flag.

Atomicity of the zones write pointer tracking and updates is done using
a zone write locking mechanism based on a bitmap. This is similar to
the block layer method but based on BIOs rather than struct request.
A zone write lock is taken in dm_zone_map_bio() for any clone BIO with
an operation type that changes the BIO target zone write pointer
position. The zone write lock is released if the clone BIO is failed
before submission or when dm_zone_endio() is called when the clone BIO
completes.

The zone write lock bitmap of the mapped device, together with a bitmap
indicating zone types (conv_zones_bitmap) and the write pointer offset
array (zwp_offset) are allocated and initialized with a full device zone
report in dm_set_zones_restrictions() using the function
dm_revalidate_zones().

For failed operations that may have modified a zone write pointer, the
zone write pointer offset is marked as invalid in dm_zone_endio().
Zones with an invalid write pointer offset are checked and the write
pointer updated using an internal report zone operation when the
faulty zone is accessed again by the user.

All functions added for this emulation have a minimal overhead for
zoned targets natively supporting zone append operations. Regular
device targets are also not affected. The added code also does not
impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all
dm zone related functions.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:37 -04:00
Damien Le Moal
e2118b3c3d dm: rearrange core declarations for extended use from dm-zone.c
Move the definitions of struct dm_target_io, struct dm_io and the bits
of the flags field of struct mapped_device from dm.c to dm-core.h to
make them usable from dm-zone.c. For the same reason, declare
dec_pending() in dm-core.h after renaming it to dm_io_dec_pending().
And for symmetry of the function names, introduce the inline helper
dm_io_inc_pending() instead of directly using atomic_inc() calls.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:37 -04:00
Damien Le Moal
9ffbbb435d block: introduce BIO_ZONE_WRITE_LOCKED bio flag
Introduce the BIO flag BIO_ZONE_WRITE_LOCKED to indicate that a BIO owns
the write lock of the zone it is targeting. This is the counterpart of
the struct request flag RQF_ZONE_WRITE_LOCKED.

This new BIO flag is reserved for now for zone write locking control
for device mapper targets exposing a zoned block device. Since in this
case, the lock flag must not be propagated to the struct request that
will be used to process the BIO, a BIO private flag is used rather than
changing the RQF_ZONE_WRITE_LOCKED request flag into a common REQ_XXX
flag that could be used for both BIO and request. This avoids conflicts
down the stack with the block IO scheduler zone write locking
(in mq-deadline).

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:36 -04:00
Damien Le Moal
d0ea6bde14 block: introduce bio zone helpers
Introduce the helper functions bio_zone_no() and bio_zone_is_seq().
Both are the BIO counterparts of the request helpers blk_rq_zone_no()
and blk_rq_zone_is_seq(), respectively returning the number of the
target zone of a bio and true if the BIO target zone is sequential.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:35 -04:00
Damien Le Moal
1ee533eca7 block: improve handling of all zones reset operation
SCSI, ZNS and null_blk zoned devices support resetting all zones using
a single command (REQ_OP_ZONE_RESET_ALL), as indicated using the device
request queue flag QUEUE_FLAG_ZONE_RESETALL. This flag is not set for
device mapper targets creating zoned devices. In this case, a user
request for resetting all zones of a device is processed in
blkdev_zone_mgmt() by issuing a REQ_OP_ZONE_RESET operation for each
zone of the device. This leads to different behaviors of the
BLKRESETZONE ioctl() depending on the target device support for the
reset all operation. E.g.

blkzone reset /dev/sdX

will reset all zones of a SCSI device using a single command that will
ignore conventional, read-only or offline zones.

But a dm-linear device including conventional, read-only or offline
zones cannot be reset in the same manner as some of the single zone
reset operations issued by blkdev_zone_mgmt() will fail. E.g.:

blkzone reset /dev/dm-Y
blkzone: /dev/dm-0: BLKRESETZONE ioctl failed: Remote I/O error

To simplify applications and tools development, unify the behavior of
the all-zone reset operation by modifying blkdev_zone_mgmt() to not
issue a zone reset operation for conventional, read-only and offline
zones, thus mimicking what an actual reset-all device command does on a
device supporting REQ_OP_ZONE_RESET_ALL. This emulation is done using
the new function blkdev_zone_reset_all_emulated(). The zones needing a
reset are identified using a bitmap that is initialized using a zone
report. Since empty zones do not need a reset, also ignore these zones.
The function blkdev_zone_reset_all() is introduced for block devices
natively supporting reset all operations. blkdev_zone_mgmt() is modified
to call either function to execute an all zone reset request.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
[hch: split into multiple functions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:34 -04:00
Damien Le Moal
bf14e2b250 dm: Forbid requeue of writes to zones
A target map method requesting the requeue of a bio with
DM_MAPIO_REQUEUE or completing it with DM_ENDIO_REQUEUE can cause
unaligned write errors if the bio is a write operation targeting a
sequential zone. If a zoned target request such a requeue, warn about
it and kill the IO.

The function dm_is_zone_write() is introduced to detect write operations
to zoned targets.

This change does not affect the target drivers supporting zoned devices
and exposing a zoned device, namely dm-crypt, dm-linear and dm-flakey as
none of these targets ever request a requeue.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:33 -04:00
Damien Le Moal
912e887505 dm: Introduce dm_report_zones()
To simplify the implementation of the report_zones operation of a zoned
target, introduce the function dm_report_zones() to set a target
mapping start sector in struct dm_report_zones_args and call
blkdev_report_zones(). This new function is exported and the report
zones callback function dm_report_zones_cb() is not.

dm-linear, dm-flakey and dm-crypt are modified to use dm_report_zones().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:32 -04:00
Damien Le Moal
7fc1872848 dm: move zone related code to dm-zone.c
Move core and table code used for zoned targets and conditionally
defined with #ifdef CONFIG_BLK_DEV_ZONED to the new file dm-zone.c.
This file is conditionally compiled depending on CONFIG_BLK_DEV_ZONED.
The small helper dm_set_zones_restrictions() is introduced to
initialize a mapped device request queue zone attributes in
dm_table_set_restrictions().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:31 -04:00
Damien Le Moal
dd73c320ec dm: cleanup device_area_is_invalid()
In device_area_is_invalid(), use bdev_is_zoned() instead of open
coding the test on the zoned model returned by bdev_zoned_model().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:30 -04:00
Damien Le Moal
6842d264aa dm: Fix dm_accept_partial_bio() relative to zone management commands
Fix dm_accept_partial_bio() to actually check that zone management
commands are not passed as explained in the function documentation
comment. Also, since a zone append operation cannot be split, add
REQ_OP_ZONE_APPEND as a forbidden command.

White lines are added around the group of BUG_ON() calls to make the
code more legible.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:29 -04:00
Damien Le Moal
bab6849942 dm zoned: check zone capacity
The dm-zoned target cannot support zoned block devices with zones that
have a capacity smaller than the zone size (e.g. NVMe zoned namespaces)
due to the current chunk zone mapping implementation as it is assumed
that zones and chunks have the same size with all blocks usable.
If a zoned drive is found to have zones with a capacity different from
the zone size, fail the target initialization.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: stable@vger.kernel.org # v5.9+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:28 -04:00
Rikard Falkeborn
ccde2cbfa3 dm table: Constify static struct blk_ksm_ll_ops
The only usage of dm_ksm_ll_ops is to make a copy of it to the ksm_ll_ops
field in the blk_keyslot_manager struct. Make it const to allow the
compiler to put it in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:28 -04:00
Mikulas Patocka
af4f6cabcc dm writecache: interrupt writeback if suspended
If the DM device is suspended, interrupt the writeback sequence so
that there is no excessive suspend delay.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:27 -04:00
Mikulas Patocka
ee50cc19d8 dm writecache: don't split bios when overwriting contiguous cache content
If dm-writecache overwrites existing cached data, it splits the
incoming bio into many block-sized bios. The I/O scheduler does merge
these bios into one large request but this needless splitting and
merging causes performance degradation.

Fix this by avoiding bio splitting if the cache target area that is
being overwritten is contiguous.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:26 -04:00
Mikulas Patocka
6bcd658f2a dm kcopyd: avoid spin_lock_irqsave from process context
The functions "pop", "push_head", "do_work" can only be called from
process context. Therefore, replace spin_lock_irq{save,restore} with
spin_{lock,unlock}_irq.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:25 -04:00
Mikulas Patocka
db2351eb22 dm kcopyd: avoid useless atomic operations
The functions set_bit and clear_bit are atomic. We don't need
atomicity when making flags for dm-kcopyd. So, change them to direct
manipulation of the flags.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:24 -04:00
Joe Thornber
6b06dd5a97 dm space map disk: cache a small number of index entries
The disk space map stores it's index entries in a btree, these are
accessed very frequently, so having a few cached makes a big difference
to performance.

With this change provisioning a new block takes roughly 20% less cpu.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:23 -04:00
Joe Thornber
be500ed721 dm space maps: improve performance with inc/dec on ranges of blocks
When we break sharing on btree nodes we typically need to increment
the reference counts to every value held in the node.  This can
cause a lot of repeated calls to the space maps.  Fix this by changing
the interface to the space map inc/dec methods to take ranges of
adjacent blocks to be operated on.

For installations that are using a lot of snapshots this will reduce
cpu overhead of fundamental operations such as provisioning a new block,
or deleting a snapshot, by as much as 10 times.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:22 -04:00
Joe Thornber
5faafc77f7 dm space maps: don't reset space map allocation cursor when committing
Current commit code resets the place where the search for free blocks
will begin back to the start of the metadata device.  There are a couple
of repercussions to this:

- The first allocation after the commit is likely to take longer than
  normal as it searches for a free block in an area that is likely to
  have very few free blocks (if any).

- Any free blocks it finds will have been recently freed.  Reusing them
  means we have fewer old copies of the metadata to aid recovery from
  hardware error.

Fix these issues by leaving the cursor alone, only resetting when the
search hits the end of the metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:21 -04:00
Joe Thornber
4eafdb1515 dm btree: improve btree residency
This commit improves the residency of btrees built in the metadata for
dm-thin and dm-cache.

When inserting a new entry into a full btree node the current code
splits the node into two.  This can result in very many half full nodes,
particularly if the insertions are occurring in an ascending order (as
happens in dm-thin with large writes).

With this commit, when we insert into a full node we first try and move
some entries to a neighbouring node that has space, failing that it
tries to split two neighbouring nodes into three.

Results are given below.  'Residency' is how full nodes are on average
as a percentage.  Average instruction counts for the operations
are given to show the extra processing has little overhead.

                         +--------------------------+--------------------------+
                         |         Before           |         After            |
+------------+-----------+-----------+--------------+-----------+--------------+
|    Test    |   Phase   | Residency | Instructions | Residency | Instructions |
+------------+-----------+-----------+--------------+-----------+--------------+
| Ascending  | insert    |        50 |         1876 |        96 |         1930 |
|            | overwrite |        50 |         1789 |        96 |         1746 |
|            | lookup    |        50 |          778 |        96 |          778 |
| Descending | insert    |        50 |         3024 |        96 |         3181 |
|            | overwrite |        50 |         1789 |        96 |         1746 |
|            | lookup    |        50 |          778 |        96 |          778 |
| Random     | insert    |        68 |         3800 |        84 |         3736 |
|            | overwrite |        68 |         4254 |        84 |         3911 |
|            | lookup    |        68 |          779 |        84 |          779 |
| Runs       | insert    |        63 |         2546 |        82 |         2815 |
|            | overwrite |        63 |         2013 |        82 |         1986 |
|            | lookup    |        63 |          778 |        82 |          779 |
+------------+-----------+-----------+--------------+-----------+--------------+

   Ascending - keys are inserted in ascending order.
   Descending - keys are inserted in descending order.
   Random - keys are inserted in random order.
   Runs - keys are split into ascending runs of ~20 length.  Then
          the runs are shuffled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # contains_key() fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:20 -04:00
Mikulas Patocka
7e768532b2 dm snapshot: properly fix a crash when an origin has no snapshots
If an origin target has no snapshots, o->split_boundary is set to 0.
This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split().

Fix this by initializing chunk_size, and in turn split_boundary, to
rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits
into "unsigned" type.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25 16:19:58 -04:00
Mikulas Patocka
f16dba5dc6 dm snapshot: revert "fix a crash when an origin has no snapshots"
Commit 7ee06ddc40 ("dm snapshot: fix a
crash when an origin has no snapshots") introduced a regression in
snapshot merging - causing the lvm2 test lvcreate-cache-snapshot.sh
got stuck in an infinite loop.

Even though commit 7ee06ddc40 was marked
for stable@ the stable team was notified to _not_ backport it.

Fixes: 7ee06ddc40 ("dm snapshot: fix a crash when an origin has no snapshots")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25 16:16:47 -04:00
John Keeping
0c1f3193b1 dm verity: fix require_signatures module_param permissions
The third parameter of module_param() is permissions for the sysfs node
but it looks like it is being used as the initial value of the parameter
here.  In fact, false here equates to omitting the file from sysfs and
does not affect the value of require_signatures.

Making the parameter writable is not simple because going from
false->true is fine but it should not be possible to remove the
requirement to verify a signature.  But it can be useful to inspect the
value of this parameter from userspace, so change the permissions to
make a read-only file in sysfs.

Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25 16:14:05 -04:00
Linus Torvalds
c4681547bc Linux 5.13-rc3 2021-05-23 11:42:48 -10:00
Linus Torvalds
6ebb6814a1 Two perf fixes:
- Do not check the LBR_TOS MSR when setting up unrelated LBR MSRs as this
    can cause malfunction when TOS is not supported.
 
  - Allocate the LBR XSAVE buffers along with the DS buffers upfront because
    allocating them when adding an event can deadlock.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCqWZQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZe6D/9vKZRfmbdvOwfE0Z7dBil2dyBoqYHw
 yTfnGs0pNpmj2mY7ZPig3gmXnryt4cwMcTtVJZFdmYJBBS28tqGeJb7eMIxtSqAu
 aBChANXVOnbAS8vXe555nS/BVxEtfkIBgOLyYlD0hE4PIZpBbqTzhxj4daLTk4bg
 h4hyIzDMBs+4tLupVhOahg6ZaHd97S61e7gBR9I5D7tHAUMo8Ea0ChQD5p2kDYI5
 leKroBnjUFMP0o2DYPUR+6Zma50JQ29QK1u34q4mx0IhIdgqtNs8eZE/kYUu2FSZ
 Nf4GG2ALzxWjXniqRPrXbkP/ScrwVdhg7ULOG34aNT1Jx04KaMSkav93gbOdozV9
 IYvf99zwptD3g9W4yDveWjGrhEoxiARI4VUQJBnIbQP9rZ2YIRwOHIXwux1AxWQh
 ok6CP94m16AWp0wbm9HRJS80MUv1zHPK5oW3B1vkchxHihkV4XdoP6vny03Z2FcC
 elMgFquDxP4l5i3ao0Ryn4YaCAFc3MiWvw3x4YXpv9dCE6Al+iKdclkYPJgrYWYE
 wUN8jdODfIyBIOI94GnwYl5rHIoS2fH8LSpP36gIX7DMKPh5y7184jUQWjHTN6PT
 /D94us3m8Cvq+sfZrhGDHPY4BEaYeczTo6vbVoN4iBthElLYRkHp7lc7gcDpsU8B
 dzHcwyMFQAnKLQ==
 =7WGT
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "Two perf fixes:

   - Do not check the LBR_TOS MSR when setting up unrelated LBR MSRs as
     this can cause malfunction when TOS is not supported

   - Allocate the LBR XSAVE buffers along with the DS buffers upfront
     because allocating them when adding an event can deadlock"

* tag 'perf-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic context
  perf/x86: Avoid touching LBR_TOS MSR for Arch LBR
2021-05-23 06:32:40 -10:00
Linus Torvalds
0898678c74 Two locking fixes:
- Invoke the lockdep tracepoints in the correct place so the ordering
    is correct again.
 
  - Don't leave the mutex WAITER bit stale when the last waiter is dropping
    out early due to a signal as that forces all subsequent lock operations
    needlessly into the slowpath until it's cleaned up again.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCqVmQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUqXEACR3LsrJ+VfkktVlZK8DWygRcjLFvvo
 mV620OEdJcCOwy/Qs3qKkyIoMiba7ASrbIWoZa28+tbZZ/iovXotkRH5rh1MhDoU
 3S6QlpPeg7shyN5iDG0JlvDTVPQs6g4oC+8bAQmJIuUeQ7hPh72O49vIDSEF6mzG
 6j8C0l5tYvmojgJKY6PJYWSZ6MNVv/gCUWWwRdmShSYmdNR3W/GaN6jTFI6qVitS
 a3NE5ksVr1LC5Ro5QraVdmif/XlUxZ8UaEN6VyaXjBuOBO2UxUevm61khv0X5fpS
 IHpcDjZukgSwccXSzd9bttWJ5EKqLDC+nfFeOdJg2GFXfRZd+uGwVV3IN2U8r7fj
 pP9Wcy5dDJrFF7dVYnDU7y7IP2ZOwDoh98mQkVt90SV4zp2HcZnl3x5iqvxQrND3
 r3c88myDOZBCCroRIMxxlNpYWOozlVYtHi/mmFj3x97YoPQYwpuMunz+/i8b5j6B
 UvtM2VsevyiGZd9pzSZ/dl3Tf19VXrtY60Sc8qG6LdTukOldLBq6J9fOcUI2fHCZ
 kXiS+utT1nIWyvwRgoMcFOTOTgfzdDKRYkPu7pMVcNoRB91KgTmozGVCT4uIN3dF
 kHpm+FyGLgKDdL8AB7VTWSSTFgb2quZBeLGSr4OnVVSTJlQ3xfpKD5vtKBysYhf7
 6My7E5pCZhgr9Q==
 =7vqW
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two locking fixes:

   - Invoke the lockdep tracepoints in the correct place so the ordering
     is correct again

   - Don't leave the mutex WAITER bit stale when the last waiter is
     dropping out early due to a signal as that forces all subsequent
     lock operations needlessly into the slowpath until it's cleaned up
     again"

* tag 'locking-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signal
  locking/lockdep: Correct calling tracepoints
2021-05-23 06:30:08 -10:00
Linus Torvalds
f73d2a4293 A few fixes for irqchip drivers:
- Allocate interrupt descriptors correctly on Mainstone PXA when
    SPARSE_IRQ is enabled; otherwise the interrupt association fails.
 
  - Make the APPLE AIC chip driver depend on APPLE.
 
  - Remove redundant error output on devm_ioremap_resource() failure.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCqVSUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZYLD/9jdS5AF6o81I3098RLocyjWSCMOCHt
 agSjNGqNtbYEwCSrn9oEFLMiqxh5O5nJasO7TRXeTW5ESMX8QtI4ToRBo9PCey9c
 E34HMIH2w9tfJjpb/7M07OmGn7IjsOkRO7R9UPAjvLqpUmU5IJ0jBO2lCHpkpUwB
 OPiQQC+EjzCRYLL135hmio6PMbaveF6WUaBau5RD3lkJxzdiUSNDsDvIoKhppGnM
 WE9r0FkfAW48GYGIebqjwXCfK4ZQXN3a5IxIES03U/QlVOyaCaocALYyCsMB0Avk
 vmLu1plV+D3ux9LVPy7FOALjM6X7woI65vavSD5zyixRnQSCMF2A7tsx3AFUGxl/
 mxu4h6pkSVHDvlSI5x9GSNM+vVHUMf6CPQne9a7Q7j0JKIQf/eYm4gZX9LsPf9h8
 EFg7O5PHYZk/d6IGx4VBL8lD9a642LgJdmmbFDcbwDKNrBMjbcYGJOuvSXDivVCY
 2AMZKK/sdOwjlip18yIUSS6u2F7JRVjwZADFKL8FQDQZnpXSgzZQUdRW1eFBkh18
 +nqsA5yvn9hYpmMi29+HnLoo/CjVxVoU1X7lhiqlhoa7TdA8ucGTVfTVQatIzriE
 EsoiGAdMF68SEuxBf1i5hky7HI+2DDnChjWLU/o9nM/fr5mOgGs0MV53Dah+Aomw
 IPj1xBbm8VWgzg==
 =XjHi
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A few fixes for irqchip drivers:

   - Allocate interrupt descriptors correctly on Mainstone PXA when
     SPARSE_IRQ is enabled; otherwise the interrupt association fails

   - Make the APPLE AIC chip driver depend on APPLE

   - Remove redundant error output on devm_ioremap_resource() failure"

* tag 'irq-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: Remove redundant error printing
  irqchip/apple-aic: APPLE_AIC should depend on ARCH_APPLE
  ARM: PXA: Fix cplds irqdesc allocation when using legacy mode
2021-05-23 06:28:20 -10:00
Linus Torvalds
7de7ac8d60 - Fix how SEV handles MMIO accesses by forwarding potential page faults instead
of killing the machine and by using the accessors with the exact functionality
 needed when accessing memory.
 
 - Fix a confusion with Clang LTO compiler switches passed to the it
 
 - Handle the case gracefully when VMGEXIT has been executed in userspace
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCqKdwACgkQEsHwGGHe
 VUrnfBAAitJ9ytn5PzrLhg9cKt+BRVg8QQExWUYqOrSDXHus5+X/21YKey7BBhIj
 rMJSHi7qytO5rrfj5nw3dIH30hnat8nn5GWcNMG0hi1ptep+GP0xMG1nGw7INJDW
 85FpQI9jpO+vz0AcoZYAtSOWbwonVqbhjdHGzDhIi2e0Qt+1uKbjsT+iPxANBpyB
 fyEU3biPyWfKY4JSr1n0EHBywR329IW5I+yZInb2SBEU42V4vDBGFCXgdS8eFGo5
 KPz/bikERC/gZuDIRXDP6riKIpy1yCO1JZb0EgukwDddbzNz/ox7dX9JL+dEeRzl
 0zr28cJSoZgYQjdi3LU412CMVa8eYw7Ca0/mbhADdZK6Wd7xUNEiUR7FFoBA2Jxp
 +oYzYe4KvlsaFQyPrt8mfJDA36r+FZcqr3WJF+LYmPbRi+cbNDbKSoeDqShAh+Fq
 uUVNloWiOltsRuCS5/du8qzhmJLdIH1uFqtYK37PGLzAHz+KJ9SAdLWaYaLx4GFd
 rrFuCnk5DmoDf3I5lQvIzIEmYysEQOloGgDR6dDaPFRymOgor7BsCdR+dtxVQ6P6
 SMSUzyJLq4tC4dzT5PxWfZDlO+wIxu5QAOhu95oWIdZbsaoABZYCuLf7T7XQr9PA
 DLil4v4i7/FGpDBh+2s3V5hTXHKATuI7SGXnMNfx1eLurChg07k=
 =51BK
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix how SEV handles MMIO accesses by forwarding potential page faults
   instead of killing the machine and by using the accessors with the
   exact functionality needed when accessing memory.

 - Fix a confusion with Clang LTO compiler switches passed to the it

 - Handle the case gracefully when VMGEXIT has been executed in
   userspace

* tag 'x86_urgent_for_v5.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Use __put_user()/__get_user() for data accesses
  x86/sev-es: Forward page-faults which happen during emulation
  x86/sev-es: Don't return NULL from sev_es_get_ghcb()
  x86/build: Fix location of '-plugin-opt=' flags
  x86/sev-es: Invalidate the GHCB after completing VMGEXIT
  x86/sev-es: Move sev_es_put_ghcb() in prep for follow on patch
2021-05-23 06:12:25 -10:00
Linus Torvalds
28ceac6959 powerpc fixes for 5.13 #4
Fix breakage of strace (and other ptracers etc.) when using the new scv ABI (Power9 or
 later with glibc >= 2.33).
 
 Fix early_ioremap() on 64-bit, which broke booting on some machines.
 
 Thanks to: Dmitry V. Levin, Nicholas Piggin, Alexey Kardashevskiy, Christophe Leroy.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmCqKaoTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgER4D/9Nqbw1u16uoBrIyHaI4Q6UasXIcktc
 ghFs0tOKNawNUyJUcl8/utH8ilpUTOnZPLeYWX9wP/KZFzHhEoWTmUZI5wcX+hkO
 V0ZabIsJ9+mKZXffSqBliehRQpqQAS5vlpJOWN0WFUx2Jaqv+QAfGLuPMAvvpqx1
 5yis2wVyC0ooo03TiaD2SjK2axzDa3Z+QOwcbAFYrb9/c2THU5J4y3+JeicHIZqi
 pySwBE5INa25zjqgDxw6ONMNpdflQvB4i06rnGlkTnUbqtUW4oGVyE3cLTwkcL+j
 zz6jN27jP0am6pM3+1JTIJcvyUETheMYmL5MPa7yzQqngD4egdNMl62p0WYLIgYo
 LRvPpkF0mfgt9RdIbvCo5+dhni0FcCdqTJcCfmUG6ndQ9vCYFCtCvnRrl/9iqqLJ
 B38Kjaad2T7oFmLBRKOHYVf5p77g1i37xiMcHu0m2Emrbi5ftenLnlOQ9Xk/xW/v
 cp7e0o/D3PJjqy9EsZ+o0DiZq1AZe0dg8nKCVIXXF6UaLNb2copP0ylplBF7aefs
 PW3Fkbq4zjRxE5UYBaz9BZmijtxH9IKywkaCS1/K+EgGjfhIP+XsmH0+qdd1JDqW
 M47B8Bl8ucdOA9eD48GeOY9KBSbvR5sK83NibGAEMRfyNSDZPE7Z3OzI9goeWfCG
 R6LDOridKGOuNQ==
 =qeQq
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix breakage of strace (and other ptracers etc.) when using the new
   scv ABI (Power9 or later with glibc >= 2.33).

 - Fix early_ioremap() on 64-bit, which broke booting on some machines.

Thanks to Dmitry V. Levin, Nicholas Piggin, Alexey Kardashevskiy, and
Christophe Leroy.

* tag 'powerpc-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s/syscall: Fix ptrace syscall info with scv syscalls
  powerpc/64s/syscall: Use pt_regs.trap to distinguish syscall ABI difference between sc and scv syscalls
  powerpc: Fix early setup to make early_ioremap() work
2021-05-23 06:07:33 -10:00
Linus Torvalds
4d7620341e Kbuild fixes for v5.13
- Fix short log indentation for tools builds
 
  - Fix dummy-tools to adjust to the latest stackprotector check
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmCpy2QVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGs8gQAKxU1ZhSOU8c3I1PWVMJUr4pYJ++
 5qG1mKMBsVWrnU4eSsCehybj347sK5se+uvaYtyGLvllq8maWNnKnqrfVXyNEdhj
 z8Ckpw0zmgLMYlTzUAEubMTOKG3hwwEKIIO6NnlpSfJC0oseI0iUI1JixvsgcZ7w
 cP4aj/197RnFaWET4prI0rz7urulsI994aebqc/4DHWwoz4pnqmJH3X50rtudGA7
 IIxCPZ5oh0yXvJVianV2H7sCX8gGS4Hn0T19pI5GkCzRnfdBuztta2hky3ND9amv
 3xWe1zMRXsYUf5vX4VWJGPdCL0ZsRnsQOiyRpD7DAwGU0kDvLJO8Dhhg8awIirtt
 /ZOToxS50O72eOP/UMbunxSrUUxnIipGIBXBPDOcbcMPZ3JklVTox9NefuEXeDO3
 ncPqmGi9SgaB55j/VzoQ1Pd0d1yL2Ze7IJsiRajPD4kJc6YU7cv6g+E5I0ZiE+Jo
 vv6YiAtcvMIUNhSRKs4avE2WmoCd5H4eSIiiGo2cX2T5d2mqeFj0VZwSyoB53Ab0
 WfjUrnxn9VsKEPzDI5aG9qjiR/VufsbtAs6zUPyVBR0XCG6jJp91n0GPjY9VuGoj
 6dUXauc0q1rFkybLneBMhaORSBu5tCNSj9PFmBXTdphbb+D1eInyMu6f6FaVXG4p
 G3gVaAniJxur8ta4
 =/1KA
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix short log indentation for tools builds

 - Fix dummy-tools to adjust to the latest stackprotector check

* tag 'kbuild-fixes-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: dummy-tools: adjust to stricter stackprotector check
  scripts/jobserver-exec: Fix a typo ("envirnoment")
  tools build: Fix quiet cmd indentation
2021-05-22 19:53:56 -10:00
Linus Torvalds
34c5c89890 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "10 patches.

  Subsystems affected by this patch series: mm (pagealloc, gup, kasan,
  and userfaultfd), ipc, selftests, watchdog, bitmap, procfs, and lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  userfaultfd: hugetlbfs: fix new flag usage in error path
  lib: kunit: suppress a compilation warning of frame size
  proc: remove Alexey from MAINTAINERS
  linux/bits.h: fix compilation error with GENMASK
  watchdog: reliable handling of timestamps
  kasan: slab: always reset the tag in get_freepointer_safe()
  tools/testing/selftests/exec: fix link error
  ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry
  Revert "mm/gup: check page posion status for coredump."
  mm/shuffle: fix section mismatch warning
2021-05-22 15:20:20 -10:00
Mike Kravetz
e32905e573 userfaultfd: hugetlbfs: fix new flag usage in error path
In commit d6995da311 ("hugetlb: use page.private for hugetlb specific
page flags") the use of PagePrivate to indicate a reservation count
should be restored at free time was changed to the hugetlb specific flag
HPageRestoreReserve.  Changes to a userfaultfd error path as well as a
VM_BUG_ON() in remove_inode_hugepages() were overlooked.

Users could see incorrect hugetlb reserve counts if they experience an
error with a UFFDIO_COPY operation.  Specifically, this would be the
result of an unlikely copy_huge_page_from_user error.  There is not an
increased chance of hitting the VM_BUG_ON.

Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com
Fixes: d6995da311 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasry.mina@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00
Zhen Lei
1b6d63938a lib: kunit: suppress a compilation warning of frame size
lib/bitfield_kunit.c: In function `test_bitfields_constants':
lib/bitfield_kunit.c:93:1: warning: the frame size of 7456 bytes is larger than 2048 bytes [-Wframe-larger-than=]
 }
 ^

As the description of BITFIELD_KUNIT in lib/Kconfig.debug, it "Only useful
for kernel devs running the KUnit test harness, and not intended for
inclusion into a production build".  Therefore, it is not worth modifying
variable 'test_bitfields_constants' to clear this warning.  Just suppress
it.

Link: https://lkml.kernel.org/r/20210518094533.7652-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Vitor Massaru Iha <vitor@massaru.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00
Alexey Dobriyan
43b2ec977c proc: remove Alexey from MAINTAINERS
People Cc me and I don't have time.

Link: https://lkml.kernel.org/r/YKarMxHJBIhMHQIh@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00
Rikard Falkeborn
f747e6667e linux/bits.h: fix compilation error with GENMASK
GENMASK() has an input check which uses __builtin_choose_expr() to
enable a compile time sanity check of its inputs if they are known at
compile time.

However, it turns out that __builtin_constant_p() does not always return
a compile time constant [0].  It was thought this problem was fixed with
gcc 4.9 [1], but apparently this is not the case [2].

Switch to use __is_constexpr() instead which always returns a compile time
constant, regardless of its inputs.

Link: https://lore.kernel.org/lkml/42b4342b-aefc-a16a-0d43-9f9c0d63ba7a@rasmusvillemoes.dk [0]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19449 [1]
Link: https://lore.kernel.org/lkml/1ac7bbc2-45d9-26ed-0b33-bf382b8d858b@I-love.SAKURA.ne.jp [2]
Link: https://lkml.kernel.org/r/20210511203716.117010-1-rikard.falkeborn@gmail.com
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00
Petr Mladek
0f90b88dbc watchdog: reliable handling of timestamps
Commit 9bf3bc949f ("watchdog: cleanup handling of false positives")
tried to handle a virtual host stopped by the host a more
straightforward and cleaner way.

But it introduced a risk of false softlockup reports.  The virtual host
might be stopped at any time, for example between
kvm_check_and_clear_guest_paused() and is_softlockup().  As a result,
is_softlockup() might read the updated jiffies and detects a softlockup.

A solution might be to put back kvm_check_and_clear_guest_paused() after
is_softlockup() and detect it.  But it would put back the cycle that
complicates the logic.

In fact, the handling of all the timestamps is not reliable.  The code
does not guarantee when and how many times the timestamps are read.  For
example, "period_ts" might be touched anytime also from NMI and re-read in
is_softlockup().  It works just by chance.

Fix all the problems by making the code even more explicit.

1. Make sure that "now" and "period_ts" timestamps are read only once.
   They might be changed at anytime by NMI or when the virtual guest is
   stopped by the host.  Note that "now" timestamp does this implicitly
   because "jiffies" is marked volatile.

2. "now" time must be read first.  The state of "period_ts" will
   decide whether it will be used or the period will get restarted.

3. kvm_check_and_clear_guest_paused() must be called before reading
   "period_ts".  It touches the variable when the guest was stopped.

As a result, "now" timestamp is used only when the watchdog was not
touched and the guest not stopped in the meantime.  "period_ts" is
restarted in all other situations.

Link: https://lkml.kernel.org/r/YKT55gw+RZfyoFf7@alley
Fixes: 9bf3bc949f ("watchdog: cleanup handling of false positives")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00