linux

mirror of https://github.com/torvalds/linux.git synced 2024-10-28 15:51:43 +00:00

Author	SHA1	Message	Date
Helen Koike	0f41fcf788	dm ioctl: fix hang in early create error condition The dm_early_create() function (which deals with "dm-mod.create=" kernel command line option) calls dm_hash_insert(), which takes an extra reference to the md object. In case of failure, this reference wasn't being released, causing dm_destroy() to hang, thus hanging the whole boot process. Fix this by calling __hash_remove() in the error path. Fixes: `6bbc923dfc` ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org Signed-off-by: Helen Koike <helen.koike@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-16 09:52:06 -04:00
Mike Snitzer	05d6909ea9	dm integrity: whitespace, coding style and dead code cleanup Just some things that stood out like a sore thumb. Also, converted some printk(KERN_CRIT, ...) to DMCRIT(...) Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-09 16:00:31 -04:00
Mikulas Patocka	482714932e	dm integrity: implement synchronous mode for reboot handling Unfortunately, there may be bios coming even after the reboot notifier was called. We don't want these bios to make the bitmap dirty again. To address this, implement a synchronous mode - when a bio is about to be terminated, we clean the bitmap and terminate the bio after the clean operation succeeds. This obviously slows down bio processing, but it makes sure that when all bios are finished, the bitmap will be clean. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-08 13:41:59 -04:00
Mikulas Patocka	1f5a77591b	dm integrity: handle machine reboot in bitmap mode When in bitmap mode the bitmap must be cleared when rebooting. This commit adds the reboot hook. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-08 13:41:58 -04:00
Mikulas Patocka	468dfca38b	dm integrity: add a bitmap mode Introduce an alternate mode of operation where dm-integrity uses a bitmap instead of a journal. If a bit in the bitmap is 1, the corresponding region's data and integrity tags are not synchronized - if the machine crashes, the unsynchronized regions will be recalculated. The bitmap mode is faster than the journal mode, because we don't have to write the data twice, but it is also less reliable, because if data corruption happens when the machine crashes, it may not be detected. Benchmark results for an SSD connected to a SATA300 port, when doing large linear writes with dd: buffered I/O: raw device throughput - 245MB/s dm-integrity with journaling - 120MB/s dm-integrity with bitmap - 238MB/s direct I/O with 1MB block size: raw device throughput - 248MB/s dm-integrity with journaling - 123MB/s dm-integrity with bitmap - 223MB/s For more info see dm-integrity in Documentation/device-mapper/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-08 13:41:58 -04:00
Mikulas Patocka	8b3bbd490d	dm integrity: introduce a function add_new_range_and_wait() Introduce a function add_new_range_and_wait() in order to avoid repetitive code. It will be used in the following commit. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-08 13:40:28 -04:00
Mikulas Patocka	4f43446ddf	dm integrity: allow large ranges to be described Change n_sectors data type from unsigned to sector_t. Following commits will need to lock large ranges. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:12 -04:00
Mikulas Patocka	d5027e0345	dm integrity: pass size to dm_integrity_alloc_page_list() Pass size to dm_integrity_alloc_page_list(). This is needed so following commits can pass a size that is different from ic->journal_pages. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:12 -04:00
Mikulas Patocka	981e8a980d	dm integrity: introduce rw_journal_sectors() Introduce a function rw_journal_sectors() that takes sector and length as its arguments instead of a section and the number of sections. This function will be used in further patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:11 -04:00
Mikulas Patocka	88ad5d1eb1	dm integrity: update documentation Update documentation with the "meta_device" parameter and flags. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:10 -04:00
Mikulas Patocka	893e3c395b	dm integrity: don't report unused options If we are not journaling, don't report journaling options in the table status. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:09 -04:00
Mikulas Patocka	97abfde17a	dm integrity: don't check null pointer before kvfree and vfree The functions kfree, vfree and kvfree do nothing if we pass a NULL pointer to them. So we don't need to test the pointer for NULL. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:08 -04:00
Mikulas Patocka	30bba430dd	dm integrity: correctly calculate the size of metadata area When we use separate devices for data and metadata, dm-integrity would incorrectly calculate the size of the metadata device as if it had 512-byte block size - and it would refuse activation with larger block size and smaller metadata device. Fix this so that it takes actual block size into account, which fixes the following reported issue: https://gitlab.com/cryptsetup/cryptsetup/issues/450 Fixes: `356d9d52e1` ("dm integrity: allow separate metadata device") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:08 -04:00
YueHaibing	9ccce5a0fb	dm dust: Make dm_dust_init and dm_dust_exit static Fix sparse warnings: drivers/md/dm-dust.c:495:12: warning: symbol 'dm_dust_init' was not declared. Should it be static? drivers/md/dm-dust.c:505:13: warning: symbol 'dm_dust_exit' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:07 -04:00
Colin Ian King	cacddeab56	dm dust: remove redundant unsigned comparison to less than zero Variable block is an unsigned long long hence the less than zero comparison is always false, hence it is redundant and can be removed. Addresses-Coverity: ("Unsigned compared against 0") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-05-07 16:05:06 -04:00
Martin Wilck	940bc47178	dm mpath: always free attached_handler_name in parse_path() Commit `b592211c33` ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer") fixed a memory leak for the case where setup_scsi_dh() returns failure. But setup_scsi_dh() may return success and not "use" attached_handler_name if the retain_attached_hwhandler flag is not set on the map. As setup_scsi_dh() properly "steals" the pointer by nullifying it, freeing it unconditionally in parse_path() is safe. Fixes: `b592211c33` ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer") Cc: stable@vger.kernel.org Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-30 16:51:30 -04:00
Helen Koike	8e890c1ab1	dm init: fix max devices/targets checks dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets, and not DM_MAX_{DEVICES,TARGETS} - 1. Fix the checks and also fix the error message when the number of devices is surpassed. Fixes: `6bbc923dfc` ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org Signed-off-by: Helen Koike <helen.koike@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-30 16:51:23 -04:00
Bryan Gurney	e4f3fabd67	dm: add dust target Add the dm-dust target, which simulates the behavior of bad sectors at arbitrary locations, and the ability to enable the emulation of the read failures at an arbitrary time. This target behaves similarly to a linear target. At a given time, the user can send a message to the target to start failing read requests on specific blocks. When the failure behavior is enabled, reads of blocks configured "bad" will fail with EIO. Writes of blocks configured "bad" will result in the following: 1. Remove the block from the "bad block list". 2. Successfully complete the write. After this point, the block will successfully contain the written data, and will service reads and writes normally. This emulates the behavior of a "remapped sector" on a hard disk drive. dm-dust provides logging of which blocks have been added or removed to the "bad block list", as well as logging when a block has been removed from the bad block list. These messages can be used alongside the messages from the driver using a dm-dust device to analyze the driver's behavior when a read fails at a given time. (This logging can be reduced via a "quiet" mode, if desired.) NOTE: If the block size is larger than 512 bytes, only the first sector of each "dust block" is detected. Placing a limiting layer above a dust target, to limit the minimum I/O size to the dust block size, will ensure proper emulation of the given large block size. Signed-off-by: Bryan Gurney <bgurney@redhat.com> Co-developed-by: Joe Shimkus <jshimkus@redhat.com> Co-developed-by: John Dorminy <jdorminy@redhat.com> Co-developed-by: John Pittman <jpittman@redhat.com> Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-30 16:37:19 -04:00
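The read and write rules described in the dm dust commit above can be modelled in a few lines. This is a minimal user-space sketch in Python, not the kernel implementation: the class `DustModel` and its method names are hypothetical, and it only models what the commit message states (reads of "bad" blocks fail with EIO once failure is enabled; a write removes the block from the bad block list and succeeds). Whether the list is updated on writes while failure is disabled is not stated in the message, so the sketch assumes it always is.

```python
import errno


class DustModel:
    """Toy model of the dm-dust rules from the commit message (hypothetical API)."""

    def __init__(self):
        self.bad_blocks = set()   # the "bad block list"
        self.fail_reads = False   # failure emulation starts disabled
        self.data = {}            # block number -> last written payload

    def add_bad_block(self, block):
        self.bad_blocks.add(block)

    def enable(self):
        self.fail_reads = True

    def read(self, block):
        # Reads of "bad" blocks fail with EIO only while emulation is enabled.
        if self.fail_reads and block in self.bad_blocks:
            raise OSError(errno.EIO, "simulated bad sector")
        return self.data.get(block, b"")

    def write(self, block, payload):
        # Assumption: every write "remaps" the sector, regardless of enable state.
        self.bad_blocks.discard(block)
        self.data[block] = payload
```

After `enable()`, a read of a listed block raises `OSError` with `errno.EIO`; one write to that block makes later reads succeed again, which mirrors the "remapped sector" behaviour the commit describes.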
Mikulas Patocka	f8011d3344	dm writecache: avoid unnecessary lookups in writecache_find_entry() This is a small optimization in writecache_find_entry(). If we go past the condition "if (unlikely(!node))", we can be certain that there is no entry in the tree that has the block equal to the "block" variable. Consequently, we can return the next entry directly, we don't need to go to the second part of the function that finds the entry with lowest or highest seq number that matches the "block" variable. Also, add some whitespace and cleanup needless braces. Suggested-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-26 11:48:03 -04:00
Huaisheng Ye	08a8e80462	dm writecache: remove unused member page_offset in writeback_struct The structure member page_offset in writeback_struct has never actually been used. Remove it. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-26 11:32:50 -04:00
Mikulas Patocka	81bc6d150a	dm delay: fix a crash when invalid device is specified When the target line contains an invalid device, delay_ctr() will call delay_dtr() with NULL workqueue. Attempting to destroy the NULL workqueue causes a crash. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-26 11:29:32 -04:00
Peng Wang	514cf4f881	dm: only initialize md->dax_dev if CONFIG_DAX_DRIVER is enabled md->dax_dev defaults to NULL and there is no need to initialize it if CONFIG_DAX_DRIVER is disabled. Signed-off-by: Peng Wang <rocking@whu.edu.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-26 11:28:17 -04:00
Yufen Yu	5de719e3d0	dm mpath: fix missing call of path selector type->end_io After commit `396eaf21ee` ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"), map_request() will requeue the tio when the issued clone request returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. Thus, if the device driver status is error, a tio may be requeued multiple times until the return value is not DM_MAPIO_REQUEUE. That means type->start_io may be called multiple times, while type->end_io is only called when the IO completes. In fact, even without commit `396eaf21ee`, setup_clone() failure can also cause tio requeue and an associated missed call to type->end_io. The service-time path selector selects a path based on in_flight_size, which is increased by st_start_io() and decreased by st_end_io(). Missed calls to st_end_io() can lead to an in_flight_size count error and will cause the selector to make the wrong choice. In addition, the queue-length path selector will also be affected. To fix the problem, call type->end_io in ->release_clone_rq before tio requeue. map_info is passed to ->release_clone_rq() for the map_request() error paths that result in requeue. Fixes: `396eaf21ee` ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Cc: stable@vger.kernl.org Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-25 15:38:52 -04:00
Mike Snitzer	873f258bec	dm thin metadata: do not write metadata if no changes occurred Otherwise, just activating a thin-pool and thin device and then deactivating them will cause the thin-pool metadata to be changed (e.g. superblock written) -- even without any metadata being changed. Add 'in_service' flag to struct dm_pool_metadata and set it in pmd_write_lock() because all on-disk metadata changes must take a write lock of pmd->root_lock. Once 'in_service' is set it is never cleared. __commit_transaction() will return 0 if 'in_service' is not set. dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that it isn't the sole reason for putting a thin-pool in service. Also fix dm_pool_commit_metadata() to open the next transaction if the return from __commit_transaction() is 0. Not seeing why the early return ever made sense for a return of 0 given that dm-io's async_io(), as used by bufio, always returns 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:34 -04:00
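The accounting problem in the dm mpath commit above (start_io called on every requeue, end_io only on completion) can be illustrated with a small model. This is a hedged Python sketch, not kernel code: `PathModel`, `select_path` and `map_request` here are hypothetical stand-ins, and the selector is simplified to "least in-flight bytes", whereas the real service-time selector weighs more than that.

```python
class PathModel:
    """One path with the in_flight_size counter described in the commit."""

    def __init__(self, name):
        self.name = name
        self.in_flight_size = 0

    def start_io(self, nr_bytes):
        self.in_flight_size += nr_bytes

    def end_io(self, nr_bytes):
        self.in_flight_size -= nr_bytes


def select_path(paths):
    # Simplified selector: prefer the path with the least in-flight data.
    return min(paths, key=lambda p: p.in_flight_size)


def map_request(path, nr_bytes, requeues, call_end_io_on_requeue):
    """Issue one request that is requeued `requeues` times before completing."""
    for _ in range(requeues):
        path.start_io(nr_bytes)
        if call_end_io_on_requeue:
            path.end_io(nr_bytes)   # the fix: balance start_io before requeue
    path.start_io(nr_bytes)         # final, successful attempt
    path.end_io(nr_bytes)           # normal completion
```

Without the balancing call, a path that saw requeues keeps a non-zero `in_flight_size` even when idle, so the simplified selector steers new IO away from it.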
Mike Snitzer	6a1b1ddc6a	dm thin metadata: add wrappers for managing write locking of metadata No functional change, but this prepares to hook off of pmd_write_lock() with additional functionality (as provided in next commit). Suggested-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:34 -04:00
Mike Snitzer	a1ed4d9e93	dm thin metadata: check __commit_transaction()'s return Fix __reserve_metadata_snap() to return early if __commit_transaction() fails. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:33 -04:00
Mike Snitzer	c6e086e0c9	dm space map common: zero entire ll_disk Otherwise, memory that is allocated (and potentially not previously zeroed) will get written to disk as part of the space maps. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:32 -04:00
Huaisheng Ye	84420b1e5d	dm writecache: add unlikely for returned value of rb_next/prev In functions writecache_discard() and writecache_find_entry() there is a high probability that the pointer of structure rb_node won't equal NULL. Add unlikely() to the NULL checks of the node pointer. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:31 -04:00
Huaisheng Ye	09f2d65630	dm writecache: remove needless dereferences in __writecache_writeback_pmem() bio is already available so there is no need to access it in terms of the wb pointer. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:31 -04:00
Nikos Tsironis	3f1637f210	dm snapshot: Use fine-grained locking scheme Substitute the global locking scheme with a fine grained one, employing the read-write semaphore and the scalable exception tables with per-bucket locks introduced by the previous two commits. Summarizing, we now use a read-write semaphore to protect the mostly read fields of the snapshot structure, e.g., valid, active, etc., and per-bucket bit spinlocks to protect accesses to the complete and pending exception tables. Finally, we use an extra spinlock (pe_allocation_lock) to serialize the allocation of new exceptions by the exception store. This allocation is really fast, so the extra spinlock doesn't hurt the performance. This scheme allows dm-snapshot to scale better, resulting in increased IOPS and reduced latency. Following are some benchmark results using the null_blk device:

modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
  queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1

* Benchmark fio_origin_randwrite_throughput_N, from the device mapper test suite [1] (direct IO, random 4K writes to origin device, IO engine libaio):

+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
| 1            | 57708       | 66421      |
| 2            | 63415       | 77589      |
| 4            | 67276       | 98839      |
| 8            | 60564       | 109258     |
+--------------+-------------+------------+

* Benchmark fio_origin_randwrite_latency_N, from the device mapper test suite [1] (direct IO, random 4K writes to origin device, IO engine psync):

+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
| 1            | 16.25                 | 13.27                |
| 2            | 31.65                 | 25.08                |
| 4            | 55.28                 | 41.08                |
| 8            | 121.47                | 74.44                |
+--------------+-----------------------+----------------------+

* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper test suite [1] (direct IO, random 4K writes to snapshot device, IO engine libaio):

+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
| 1            | 72593       | 84938      |
| 2            | 97379       | 134973     |
| 4            | 90610       | 143077     |
| 8            | 90537       | 180085     |
+--------------+-------------+------------+

* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper test suite [1] (direct IO, random 4K writes to snapshot device, IO engine psync):

+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
| 1            | 12.53                 | 10.6                 |
| 2            | 19.78                 | 14.89                |
| 4            | 40.37                 | 23.47                |
| 8            | 89.32                 | 48.48                |
+--------------+-----------------------+----------------------+

[1] https://github.com/jthornber/device-mapper-test-suite Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:30 -04:00
Nikos Tsironis	f79ae415b6	dm snapshot: Make exception tables scalable Use list_bl to implement the exception hash tables' buckets. This change permits concurrent access, to distinct buckets, by multiple threads. Also, implement helper functions to lock and unlock the exception tables based on the chunk number of the exception at hand. We retain the global locking, by means of down_write(), which is replaced by the next commit. Still, we must acquire the per-bucket spinlocks when accessing the hash tables, since list_bl does not allow modification on unlocked lists. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:29 -04:00
Nikos Tsironis	4ad8d880b6	dm snapshot: Replace mutex with rw semaphore dm-snapshot uses a single mutex to serialize every access to the snapshot state. This includes all accesses to the complete and pending exception tables, which occur at every origin write, every snapshot read/write and every exception completion. The lock statistics indicate that this mutex is a bottleneck (average wait time ~480 usecs for 8 processes doing random 4K writes to the origin device) preventing dm-snapshot from scaling as the number of threads doing IO increases. The major contention points are __origin_write()/snapshot_map() and pending_complete(), i.e., the submission and completion of pending exceptions. Replace this mutex with a rw semaphore. We essentially revert commit `ae1093be5a` ("dm snapshot: use mutex instead of rw_semaphore") and together with the next two patches we substitute the single mutex with a fine-grained locking scheme, where we use a read-write semaphore to protect the mostly read fields of the snapshot structure, e.g., valid, active, etc., and per-bucket bit spinlocks to protect accesses to the complete and pending exception tables. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:28 -04:00
Nikos Tsironis	65fc7c3704	dm snapshot: Don't sleep holding the snapshot lock When completing a pending exception, pending_complete() waits for all conflicting reads to drain, before inserting the final, completed exception. Conflicting reads are snapshot reads redirected to the origin, because the relevant chunk is not remapped to the COW device the moment we receive the read. The completed exception must be inserted into the exception table after all conflicting reads drain to ensure snapshot reads don't return corrupted data. This is required because inserting the completed exception into the exception table signals that the relevant chunk is remapped and both origin writes and snapshot merging will now overwrite the chunk in origin. This wait is done holding the snapshot lock to ensure that pending_complete() doesn't starve if new snapshot reads keep coming for this chunk. In preparation for the next commit, where we use a spinlock instead of a mutex to protect the exception tables, we remove the need for holding the lock while waiting for conflicting reads to drain. We achieve this in two steps: 1. pending_complete() inserts the completed exception before waiting for conflicting reads to drain and removes the pending exception after all conflicting reads drain. This ensures that new snapshot reads will be redirected to the COW device, instead of the origin, and thus pending_complete() will not starve. Moreover, we use the existence of both a completed and a pending exception to signify that the COW is done but there are conflicting reads in flight. 2. In __origin_write() we check first if there is a pending exception and then if there is a completed exception. If there is a pending exception any submitted BIO is delayed on the pe->origin_bios list and DM_MAPIO_SUBMITTED is returned. 
This ensures that neither writes to the origin nor snapshot merging can overwrite the origin chunk, until all conflicting reads drain, and thus snapshot reads will not return corrupted data. Summarizing, we now have the following possible combinations of pending and completed exceptions for a chunk, along with their meaning: A. No exceptions exist: The chunk has not been remapped yet. B. Only a pending exception exists: The chunk is currently being copied to the COW device. C. Both a pending and a completed exception exist: COW for this chunk has completed but there are snapshot reads in flight which had been redirected to the origin before the chunk was remapped. D. Only the completed exception exists: COW has been completed and there are no conflicting reads in flight. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:27 -04:00
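The four combinations A to D listed in the commit above form a small state table. As an illustration only, here is a Python sketch that maps the presence of a pending and a completed exception to those cases; the names `ChunkState`, `chunk_state` and `origin_write_is_delayed` are hypothetical and do not exist in dm-snapshot.

```python
from enum import Enum


class ChunkState(Enum):
    NOT_REMAPPED = "A"       # no exceptions exist
    COPY_IN_PROGRESS = "B"   # only a pending exception exists
    DRAINING_READS = "C"     # both a pending and a completed exception exist
    REMAPPED = "D"           # only the completed exception exists


def chunk_state(has_pending, has_completed):
    """Map the (pending, completed) pair to cases A-D of the commit message."""
    if has_pending and has_completed:
        return ChunkState.DRAINING_READS
    if has_pending:
        return ChunkState.COPY_IN_PROGRESS
    if has_completed:
        return ChunkState.REMAPPED
    return ChunkState.NOT_REMAPPED


def origin_write_is_delayed(state):
    # Step 2 of the commit message: the pending exception is checked first,
    # so an origin write is held back in both case B and case C.
    return state in (ChunkState.COPY_IN_PROGRESS, ChunkState.DRAINING_READS)
```

The point of case C is visible in `origin_write_is_delayed`: even though the copy has finished, origin writes still wait until the conflicting reads have drained and the pending exception is removed.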
Nikos Tsironis	34191ae816	list_bl: Add hlist_bl_add_before/behind helpers Add hlist_bl_add_before/behind helpers to add an element before/after an existing element in a bl_list. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:27 -04:00
Nikos Tsironis	ae325dcd19	list: Don't use WRITE_ONCE() in hlist_add_behind() Commit `1c97be677f` ("list: Use WRITE_ONCE() when adding to lists and hlists") introduced the use of WRITE_ONCE() to atomically write the list head's ->next pointer. hlist_add_behind() doesn't touch the hlist head's ->first pointer so there is no reason to use WRITE_ONCE() in this case. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:26 -04:00
Nikos Tsironis	e28adc3bf3	dm cache metadata: Fix loading discard bitset Add missing dm_bitset_cursor_next() to properly advance the bitset cursor. Otherwise, the discarded state of all blocks is set according to the discarded state of the first block. Fixes: `ae4a46a1f6` ("dm cache metadata: use bitset cursor api to load discard bitset") Cc: stable@vger.kernel.org Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:18:25 -04:00
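The effect of the missing cursor advance in the commit above is easy to reproduce in a toy model. The following Python sketch uses a hypothetical `BitsetCursor` class as a stand-in for the bitset cursor API (it is not the kernel interface) and shows that, without the advance, every block inherits the state of the first one.

```python
class BitsetCursor:
    """Hypothetical stand-in for a bitset cursor: a position over a list of bits."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def value(self):
        return self.bits[self.pos]

    def advance(self):
        self.pos += 1


def load_discards(bits, advance_cursor):
    """Read one discard flag per block, optionally advancing the cursor."""
    loaded = []
    if not bits:
        return loaded
    cursor = BitsetCursor(bits)
    for block in range(len(bits)):
        loaded.append(cursor.value())
        # Do not step past the last bit.
        if advance_cursor and block + 1 < len(bits):
            cursor.advance()
    return loaded
```

With `advance_cursor=False` the result is the first bit repeated for every block, which matches the symptom described in the commit message.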
Damien Le Moal	7aedf75ff7	dm zoned: Fix zone report handling The function blkdev_report_zones() returns success even if no zone information is reported (empty report). Empty zone reports can only happen if the report start sector passed exceeds the device capacity. The conditions for this to happen are either a bug in the caller code, or, a change in the device that forced the low level driver to change the device capacity to a value that is lower than the report start sector. This situation includes a failed disk revalidation resulting in the disk capacity being changed to 0. If this change happens while dm-zoned is in its initialization phase executing dmz_init_zones(), this function may enter an infinite loop and hang the system. To avoid this, add a check to disallow empty zone reports and bail out early. Also fix the function dmz_update_zone() to make sure that the report for the requested zone was correctly obtained. Fixes: `3b1a94c88b` ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Shaun Tancheff <shaun@tancheff.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:17:58 -04:00
Dan Carpenter	a3839bc635	dm zoned: Silence a static checker warning My static checker complains about this line from dmz_get_zoned_device() aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1); The problem is that "aligned_capacity" and "dev->capacity" are sector_t type (which is a u64 under most configs) but blk_queue_zone_sectors(q) returns a u32 so the higher 32 bits in aligned_capacity are cleared to zero. This patch adds a cast to address the issue. Fixes: `114e025968` ("dm zoned: ignore last smaller runt zone") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:16:01 -04:00
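The truncation described in the commit above comes from taking the bitwise complement in 32 bits and only then widening to 64 bits. Python integers are unbounded, so the sketch below models the two widths with explicit masks; the function names are hypothetical, and it assumes the zone size is a power of two so that `zone_sectors - 1` is a valid alignment mask.

```python
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF


def aligned_capacity_u32_mask(capacity, zone_sectors):
    # Models the original expression: the complement is computed in 32 bits
    # and zero-extended, so the upper 32 bits of the mask are all zero.
    mask = ~(zone_sectors - 1) & U32_MASK
    return capacity & mask


def aligned_capacity_u64_mask(capacity, zone_sectors):
    # Models the fix: widen to 64 bits first, then complement.
    mask = ~(zone_sectors - 1) & U64_MASK
    return capacity & mask
```

For a capacity below 2**32 sectors the two functions agree; above that, the 32-bit mask silently drops the high bits of the capacity, while the 64-bit mask only rounds down to a zone boundary.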
Christoph Hellwig	c13b5487d9	dm crypt: fix endianness annotations around org_sector_of_dmreq The sector used here is a little endian value, so use the right type for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-04-18 16:16:01 -04:00
Linus Torvalds	dc4060a5dc	Linux 5.1-rc5	2019-04-14 15:17:41 -07:00
Linus Torvalds	6b3a707736	Merge branch 'page-refs' (page ref overflow) Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit	2019-04-14 15:09:40 -07:00
Matthew Wilcox	15fab63e1e	fs: prevent page refcount overflow in pipe_buf_get Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-04-14 10:00:04 -07:00
Linus Torvalds	8fde12ca79	mm: prevent get_user_pages() from overflowing page refcount If the page refcount wraps around past zero, it will be freed while there are still four billion references to it. One of the possible avenues for an attacker to try to make this happen is by doing direct IO on a page multiple times. This patch makes get_user_pages() refuse to take a new page reference if there are already more than two billion references to the page. Reported-by: Jann Horn <jannh@google.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-04-14 10:00:04 -07:00
Linus Torvalds	88b1a17dfc	mm: add 'try_get_page()' helper function This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it). Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count. The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened. NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count). Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-04-14 10:00:04 -07:00
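The "only increment if the count is safe" rule from the commit above can be modelled outside the kernel. This Python sketch is an illustration under stated assumptions: `PageModel` is a hypothetical class, the counter is modelled as a raw 32-bit value, and "safe" is taken to mean that its signed 32-bit interpretation is positive, following the message's remark that the check looks at the sign of the count.

```python
def as_s32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & (1 << 31) else value


class PageModel:
    """Toy page with a 32-bit reference counter (hypothetical, not struct page)."""

    def __init__(self, refcount=1):
        self.refcount = refcount & 0xFFFFFFFF

    def get_page(self):
        # Unconditional increment; wraps modulo 2**32.
        self.refcount = (self.refcount + 1) & 0xFFFFFFFF

    def try_get_page(self):
        # Refuse when the signed view of the count is zero or negative.
        if as_s32(self.refcount) <= 0:
            return False
        self.get_page()
        return True
```

Because the refusal threshold sits at half the counter's range, a racy read of the count still leaves roughly two billion increments of slack before a real wrap to zero, which is the margin the commit message refers to.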
Linus Torvalds	f958d7b528	mm: make page ref count overflow check tighter and more explicit We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count. That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON). Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory. Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-04-14 10:00:04 -07:00
Linus Torvalds	4443f8e6ac	for-linus-20190412 Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Set of fixes that should go into this round. This pull is larger than I'd like at this time, but there's really no specific reason for that. Some are fixes for issues that went into this merge window, others are not. 
Anyway, this contains: - Hardware queue limiting for virtio-blk/scsi (Dongli) - Multi-page bvec fixes for lightnvm pblk - Multi-bio dio error fix (Jason) - Remove the cache hint from the io_uring tool side, since we didn't move forward with that (me) - Make io_uring SETUP_SQPOLL root restricted (me) - Fix leak of page in error handling for pc requests (Jérôme) - Fix BFQ regression introduced in this merge window (Paolo) - Fix break logic for bio segment iteration (Ming) - Fix NVMe cancel request error handling (Ming) - NVMe pull request with two fixes (Christoph): - fix the initial CSN for nvme-fc (James) - handle log page offsets properly in the target (Keith)" * tag 'for-linus-20190412' of git://git.kernel.dk/linux-block: block: fix the return errno for direct IO nvmet: fix discover log page when offsets are used nvme-fc: correct csn initialization and increments on error block: do not leak memory in bio_copy_user_iov() lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs nvme: cancel request synchronously blk-mq: introduce blk_mq_complete_request_sync() scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids virtio-blk: limit number of hw queues by nr_cpu_ids block, bfq: fix use after free in bfq_bfqq_expire io_uring: restrict IORING_SETUP_SQPOLL to root tools/io_uring: remove IOCQE_FLAG_CACHEHIT block: don't use for-inside-for in bio_for_each_segment_all	2019-04-13 16:23:16 -07:00
Linus Torvalds	b60bc0665e	NFS client bugfixes for Linux 5.1 Highlights include: Stable fixes: - Fix a deadlock in close() due to incorrect draining of RDMA queues Bugfixes: - Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" as it is causing stack overflows - Fix a regression where NFSv4 getacl and fs_locations stopped working - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family. - Fix xfstests failures due to incorrect copy_file_range() return values Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fix: - Fix a deadlock in close() due to incorrect draining of RDMA queues Bugfixes: - Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" as it is causing stack overflows - Fix a regression where NFSv4 getacl and fs_locations stopped working - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family. 
- Fix xfstests failures due to incorrect copy_file_range() return values" * tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" NFSv4.1 fix incorrect return value in copy_file_range xprtrdma: Fix helper that drains the transport NFS: Fix handling of reply page vector NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.	2019-04-13 14:47:06 -07:00
Linus Torvalds	87af0c3813	SCSI fixes on 20190413 One obvious fix for a csiostor data corruption on error bug. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "One obvious fix for a csiostor data corruption on error bug" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: csiostor: fix missing data copy in csio_scsi_err_handler()	2019-04-13 14:37:49 -07:00
Linus Torvalds	09bad0df39	Here's more than a handful of clk driver fixes for changes that came in during the merge window: - Fix the AT91 sama5d2 programmable clk prescaler formula - A bunch of Amlogic meson clk driver fixes for the VPU clks - A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc clks as critical only when really needed - Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate implementation - Use the right structure to test for a frequency table in i.MX's PLL_1416x driver Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "Here's more than a handful of clk driver fixes for changes that came in during the merge window: - Fix the AT91 sama5d2 programmable clk prescaler formula - A bunch of Amlogic meson clk driver fixes for the VPU clks - A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc clks as critical only when really needed - Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate implementation - Use the right structure to test for a frequency table in i.MX's PLL_1416x driver" * tag 
'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: imx: Fix PLL_1416X not rounding rates clk: mediatek: fix clk-gate flag setting platform/x86: pmc_atom: Drop __initconst on dmi table clk: x86: Add system specific quirk to mark clocks as critical clk: meson: vid-pll-div: remove warning and return 0 on invalid config clk: meson: pll: fix rounding and setting a rate that matches precisely clk: meson-g12a: fix VPU clock parents clk: meson: g12a: fix VPU clock muxes mask clk: meson-gxbb: round the vdec dividers to closest clk: at91: fix programmable clock for sama5d2	2019-04-13 14:33:56 -07:00
Linus Torvalds	a3b8424862	pci-v5.1-fixes-2 Merge tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - Add a DMA alias quirk for another Marvell SATA device (Andre Przywara) - Fix a pciehp regression that broke safe removal of devices (Sergey Miroshnichenko) * tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: pciehp: Ignore Link State Changes after powering off a slot PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller	2019-04-13 14:29:21 -07:00
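
Note on 88b1a17dfc and 8fde12ca79 above: the "take a reference only if the count is safe" idea those messages describe can be sketched in userspace C roughly as follows. This is an illustrative model, not the kernel code: struct demo_page and demo_try_get_page() are hypothetical names, the count is a plain unsigned int rather than an atomic, and the WARN_ON_ONCE() mentioned in the commit message is left out. A count of zero, or one above INT_MAX (which would read as negative if treated as signed), is considered unsafe.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the reference count is modelled. */
struct demo_page {
	unsigned int refcount;
};

/*
 * Sketch of the "try" variant: increment the count only when it is in the
 * safe half of its range. Zero means the caller held no reference to begin
 * with; anything above INT_MAX corresponds to a negative signed count,
 * i.e. the counter has gone past the two-billion mark.
 */
static bool demo_try_get_page(struct demo_page *page)
{
	if (page->refcount == 0 || page->refcount > (unsigned int)INT_MAX)
		return false;	/* refuse: caller must handle the failure */
	page->refcount++;
	return true;
}
```

A caller in this model would check the return value and fail its operation instead of taking the reference, which is the behaviour 8fde12ca79 describes for get_user_pages().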


826136 Commits