When a drive is marked write-mostly it should only be the
target of reads if there is no other option.
This behaviour was broken by
commit 9dedf60313
md/raid1: read balance chooses idlest disk for SSD
which causes a write-mostly device to be *preferred* is some cases.
Restore correct behaviour by checking and setting
best_dist_disk and best_pending_disk rather than best_disk.
We only need to test one of these as they are both changed
from -1 or >=0 at the same time.
As we leave min_pending and best_dist unchanged, any non-write-mostly
device will appear better than the write-mostly device.
Reported-by: Tomáš Hodek <tomas.hodek@volny.cz>
Reported-by: Dark Penguin <darkpenguin@yandex.ru>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: http://marc.info/?l=linux-raid&m=135982797322422
Fixes: 9dedf60313
Cc: stable@vger.kernel.org (3.6+)
Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
as SpareLocal
9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
set choose_first true for cluster read in read balance when the area
is resyncing.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.
If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a RESYNC_START message arrives, the node removes the entry
with the current slot number and adds the range to the
suspend_list.
Simlarly, when a RESYNC_FINISHED message is received, node clears
entry with respect to the bitmap number.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a resync is initiated, RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using
events recorded in the devices. If it is old as compared to the mddev
mark it is faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
- request to send a message
- make changes to superblock
- send messages telling everyone that the superblock has changed
- other nodes all read the superblock
- other nodes all ack the messages
- updating node release the "I'm sending a message" resource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
The sending part is split in two functions to make sure
atomicity of the operations, such as the MD superblock update.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
1. receive status
sender receiver receiver
ACK:CR ACK:CR ACK:CR
2. sender get EX of TOKEN
sender get EX of MESSAGE
sender receiver receiver
TOKEN:EX ACK:CR ACK:CR
MESSAGE:EX
ACK:CR
3. sender write LVB.
sender down-convert MESSAGE from EX to CR
sender try to get EX of ACK
[ wait until all receiver has *processed* the MESSAGE ]
[ triggered by bast of ACK ]
receiver get CR of MESSAGE
receiver read LVB
receiver processes the message
[ wait finish ]
receiver release ACK
sender receiver receiver
TOKEN:EX MESSAGE:CR MESSAGE:CR
MESSAGE:CR
ACK:EX
4. sender down-convert ACK from EX to CR
sender release MESSAGE
sender release TOKEN
receiver upconvert to EX of MESSAGE
receiver get CR of ACK
receiver release MESSAGE
sender receiver receiver
ACK:CR ACK:CR ACK:CR
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
The DLM informs us in case of node failure with the DLM slot number.
cluster_info->recovery_map sets the bit corresponding to the slot number
and wakes up the recovery thread.
The recovery thread:
1. Derives the slot number from the recovery_map
2. Locks the bitmap corresponding to the slot
3. Copies the set bits to the node-local bitmap
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
bitmap_copy_from_slot reads the bitmap from the slot mentioned.
It then copies the set bits to the node local bitmap.
This is helper function for the resync operation on node failure.
bitmap_set_memory_bits() currently assumes it is only run at startup and that
they bitmap is currently empty. So if it finds that a region is already
marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
to always set the NEEDED_MASK bit if 'needed' is set.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in it's LVB. When a new
node joins, it reads the LVB of each "online" bitmap.
[TODO] The new node attempts to get the PW lock on other bitmap, if
it is successful, it reads the bitmap and performs the resync (if
required) on it's behalf.
If the node does not get the PW, it requests CR and reads the LVB
for the resync information.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
On-disk format:
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
joining DLM lockspace.
Since the first time, the first bitmap is read. After the call
to the cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read
the bitmap lock (when bitmap lock is introduced in later patches)
Questions:
1. cluster name is repeated in all bitmap supers. Is that okay?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
DLM offers callbacks when a node fails and the lock remastery
is performed:
1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot
recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
md_cluster_info stores the cluster information in the MD device.
The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)
The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
A dlm_lock_resource is a structure which contains all information
required for locking using DLM. The init function allocates the
lock and acquires the lock in NL mode. The unlock function
converts the lock resource to NL mode. This is done to preserve
LVB and for faster processing of locks. The lock resource is
DLM unlocked only in the lockres_free function, which is the end
of life of the lock resource.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
to changes that enable effective use of an unbound workqueue across
all available CPUs. A large battery of tests were performed to
validate these changes, summary of results is available here:
https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
and a small fix to the dm-space-map-disk.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU51MsAAoJEMUj8QotnQNan8wH/3Oa2/ofbX8zNYtdDxFBK7W0
pHwVA2CH9yzDDi90X9GsQODmXerMrYf+SHo/RsQTpSnt5WI/4ZVP/1EoNGu6T2Yr
XiaKc/Jwo+ScxdhoHSNtyGL5vKeipX6clgz1wcd/4/UBcBAr0Vxj25s8Ta7naK2V
hZyWnt3MCGEDQ45NVBmLOU78Cl68LGf8JOUY0l3cd8ehQjGyR9a12GXwRktepslp
hw9xu8uYmE93fD+MIe/fUs7W/WvIVgFSskhswlvD3kAFpEAsMczjLGVSRQlBE90L
OK7v1HKxiIUJa17g3LmiCFa59ZjrnSHGpOOv9IPAFU/LzFIU47lcFwRFjgfB8po=
=8vFi
-----END PGP SIGNATURE-----
Merge tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper changes from Mike Snitzer:
- Significant dm-crypt CPU scalability performance improvements thanks
to changes that enable effective use of an unbound workqueue across
all available CPUs. A large battery of tests were performed to
validate these changes, summary of results is available here:
https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
and a small fix to the dm-space-map-disk.
* tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm snapshot: fix a possible invalid memory access on unload
dm: fix a race condition in dm_get_md
dm crypt: sort writes
dm crypt: add 'submit_from_crypt_cpus' option
dm crypt: offload writes to thread
dm crypt: remove unused io_pool and _crypt_io_pool
dm crypt: avoid deadlock in mempools
dm crypt: don't allocate pages for a partial request
dm crypt: use unbound workqueue for request processing
dm io: reject unsupported DISCARD requests with EOPNOTSUPP
dm mirror: do not degrade the mirror on discard error
dm space map disk: fix sm_disk_count_is_more_than_one()
Pull kconfig updates from Michal Marek:
"Yann E Morin was supposed to take over kconfig maintainership, but
this hasn't happened. So I'm sending a few kconfig patches that I
collected:
- Fix for missing va_end in kconfig
- merge_config.sh displays used if given too few arguments
- s/boolean/bool/ in Kconfig files for consistency, with the plan to
only support bool in the future"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: use va_end to match corresponding va_start
merge_config.sh: Display usage if given too few arguments
kconfig: use bool instead of boolean for type definition attributes
When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero. Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.
pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences s->origin. These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.
This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
yet-another-livelock in raid5, and a problem with write errors
to 4K-block devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVOPorznsnt1WYoG5AQKa5A//RmLurotaJvHt8YA8jzlzo2KY/9jxIv8x
jF0CQ0Kva056GG27dlkMPBoddrVk7/WaCoz/ICBKgfSTCpiEFekNlExLkeLXrgUb
mKljS/HKhP4KQjiy519HiT2v9BRTqZD3l6P2405qVAZGTymwhelAp3y1I5ZIRFOr
k2Uo6PNT/XYvzwWAy5iEBrE7zZl3MZJG5RjW2Wq6JKaWMUQaQERu5OhugWEfCI3U
yVomILlwiYqC85WLgrVob/OqCSoA3UsIZkZKvKJCP8Y4C9QJzquEeZy4z0swgJq0
FMsu2WqmI3/ZNFvmlSozN05CEUMkZF6E8hGJcT+f1Wa1NE40zrzDkVsVvsGMD8v0
Ek7PTiA3X7AhqRl4Lt2Gs8kfDJ9MIRXzvXJv9UUO7PtVQGUhVdqmcqsodyqA7+lw
63hgSpEUtQTkpeWwcYQY6Impr/6jGxHwQKzFlLltaDvmAeOBd6gjAq2bdfPO5tox
FiSoClhor4yTc274+s5ORk9xDh61d7ZbnFJ+QN7YioyaSD2LHYIl9EkVFDe4OMXV
q39VzXt+uEYx6hxfmnpjPyPvPiyuDsnXzoTMcaaGsJL4TOazoIEj5AFGmg5i3/yw
6UVjKxJ43SMM7qg3VbgZpenGJwvRkP5mqQwMNDZICXdWulQTMwM5w3AidNqwYAaU
oEPAMfZNGvE=
=1BWV
-----END PGP SIGNATURE-----
Merge tag 'md/3.20-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"Three bug md fixes for 3.20
yet-another-livelock in raid5, and a problem with write errors to
4K-block devices"
* tag 'md/3.20-fixes' of git://neil.brown.name/md:
md/raid5: Fix livelock when array is both resyncing and degraded.
md/raid10: round up to bdev_logical_block_size in narrow_write_error.
md/raid1: round up to bdev_logical_block_size in narrow_write_error
Commit a7854487cd:
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.
Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.
So change the condition to only force RCW on non-degraded arrays.
Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd
Cc: stable@vger.kernel.org (v3.7+)
Write requests are sorted in a red-black tree structure and are
submitted in the sorted order.
In theory the sorting should be performed by the underlying disk
scheduler, however, in practice the disk scheduler only accepts and
sorts a finite number of requests. To allow the sorting of all
requests, dm-crypt needs to implement its own sorting.
The overhead associated with rbtree-based sorting is considered
negligible so it is not used conditionally. Even on SSD sorting can be
beneficial since in-order request dispatch promotes lower latency IO
completion to the upper layers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make it possible to disable offloading writes by setting the optional
'submit_from_crypt_cpus' table argument.
There are some situations where offloading write bios from the
encryption threads to a single thread degrades performance
significantly.
The default is to offload write bios to the same thread because it
benefits CFQ to have writes submitted using the same IO context.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Submitting write bios directly in the encryption thread caused serious
performance degradation. On a multiprocessor machine, encryption requests
finish in a different order than they were submitted. Consequently, write
requests would be submitted in a different order and it could cause severe
performance degradation.
Move the submission of write requests to a separate thread so that the
requests can be sorted before submitting. But this commit improves
dm-crypt performance even without having dm-crypt perform request
sorting (in particular it enables IO schedulers like CFQ to sort more
effectively).
Note: it is required that a previous commit ("dm crypt: don't allocate
pages for a partial request") be applied before applying this patch.
Otherwise, this commit could introduce a crash.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The previous commit ("dm crypt: don't allocate pages for a partial
request") stopped using the io_pool slab mempool and backing
_crypt_io_pool kmem cache.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix a theoretical deadlock introduced in the previous commit ("dm crypt:
don't allocate pages for a partial request").
The function crypt_alloc_buffer may be called concurrently. If we allocate
from the mempool concurrently, there is a possibility of deadlock. For
example, if we have mempool of 256 pages, two processes, each wanting
256, pages allocate from the mempool concurrently, it may deadlock in a
situation where both processes have allocated 128 pages and the mempool
is exhausted.
To avoid such a scenario we allocate the pages under a mutex. In order
to not degrade performance with excessive locking, we try non-blocking
allocations without a mutex first and if that fails, we fallback to a
blocking allocations with a mutex.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Change crypt_alloc_buffer so that it only ever allocates pages for a
full request. This is a prerequisite for the commit "dm crypt: offload
writes to thread".
This change simplifies the dm-crypt code at the expense of reduced
throughput in low memory conditions (where allocation for a partial
request is most useful).
Note: the next commit ("dm crypt: avoid deadlock in mempools") is needed
to fix a theoretical deadlock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use unbound workqueue by default so that work is automatically balanced
between available CPUs. The original behavior of encrypting using the
same cpu that IO was submitted on can still be enabled by setting the
optional 'same_cpu_crypt' table argument.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This modifies raid1's narrow_write_error to round up block_sectors to the
device's logical block size.
This prevents sd complaining about "Bad block number requested" for non-512-byte
sector disks.
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:
# echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
# lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 4K 1G 0
`-moo (dm-0) 0 4K 1G 0
sdb 0 0B 0B 0
`-moo (dm-0) 0 4K 1G 0
Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.
If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.
To fix this: before entering the loop, check for the combination of
REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up
the mirror device.
This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; upstream is also affected.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
It may be possible that a device claims discard support but it rejects
discards with -EOPNOTSUPP. It happens when using loopback on ext2/ext3
filesystem driven by the ext4 driver. It may also happen if the
underlying devices are moved from one disk on another.
If discard error happens, we reject the bio with -EOPNOTSUPP, but we do
not degrade the array.
This patch fixes failed test shell/lvconvert-repair-transient.sh in the
lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
filesystem and it is being driven by the ext4 driver.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
dm_tm_shadow_block() is the only caller of
dm_sm_count_is_more_than_one() which only ever operates on a metadata
space-map. So in practice, sm_disk_count_is_more_than_one() isn't
actually used (which explains why this bug never amounted to anything).
But fix sm_disk_count_is_more_than_one() to properly set *result and
return 0.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
stacking ontop of blk-mq devices. This blk-mq support changes the
model request-based DM uses for cloning a request to relying on
calling blk_get_request() directly from the underlying blk-mq device.
Early consumer of this code is Intel's emerging NVMe hardware; thanks
to Keith Busch for working on, and pushing for, these changes.
- A few other small fixes and cleanups across other DM targets.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3NRnAAoJEMUj8QotnQNavG0H/3yogMcHvKg9H+w0WmUQdwhN
w99Wj3nkquAw2sm9yahKlAMBNY53iu/LHmC6/PaTpJetgdH7y1foTrRa0qjyeB2D
DgNr8mOzxSxzX6CX9V8JMwqzky9XoG2IOt/7FeQQOpMqp4T1M2zgvbZtpl0lK/f3
lNaNBFpl+47NbGssD/WbtfI4Yy3hX0u406yGmQN5DxRyGTWD2AFqpA76g2mp8vrp
wmw259gPr4oLhj3pDc0GkuiVn59ZR2Zp+2gs0jD5uKlDL84VP/nE+WNB+ny1Mnmt
cOg8Q+W6/OosL66MKBHNsF0QS6DXNo5UvsN9fHGa5IUJw7Tsa11ZEPKHZGEbQw4=
=RiN2
-----END PGP SIGNATURE-----
Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- The most significant change this cycle is request-based DM now
supports stacking ontop of blk-mq devices. This blk-mq support
changes the model request-based DM uses for cloning a request to
relying on calling blk_get_request() directly from the underlying
blk-mq device.
An early consumer of this code is Intel's emerging NVMe hardware;
thanks to Keith Busch for working on, and pushing for, these changes.
- A few other small fixes and cleanups across other DM targets.
* tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
dm snapshot: remove unnecessary NULL checks before vfree() calls
dm mpath: simplify failure path of dm_multipath_init()
dm thin metadata: remove unused dm_pool_get_data_block_size()
dm ioctl: fix stale comment above dm_get_inactive_table()
dm crypt: update url in CONFIG_DM_CRYPT help text
dm bufio: fix time comparison to use time_after_eq()
dm: use time_in_range() and time_after()
dm raid: fix a couple integer overflows
dm table: train hybrid target type detection to select blk-mq if appropriate
dm: allocate requests in target when stacking on blk-mq devices
dm: prepare for allocating blk-mq clone requests in target
dm: submit stacked requests in irq enabled context
dm: split request structure out from dm_rq_target_io structure
dm: remove exports for request-based interfaces without external callers
Pull core block IO changes from Jens Axboe:
"This contains:
- A series from Christoph that cleans up and refactors various parts
of the REQ_BLOCK_PC handling. Contributions in that series from
Dongsu Park and Kent Overstreet as well.
- CFQ:
- A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
- A stable patch fixing a potential crash in CFQ in OOM
situations. From Konstantin Khlebnikov.
- blk-mq:
- Add support for tag allocation policies, from Shaohua. This is
a prep patch enabling libata (and other SCSI parts) to use the
blk-mq tagging, instead of rolling their own.
- Various little tweaks from Keith and Mike, in preparation for
DM blk-mq support.
- Minor little fixes or tweaks from me.
- A double free error fix from Tony Battersby.
- The partition 4k issue fixes from Matthew and Boaz.
- Add support for zero+unprovision for blkdev_issue_zeroout() from
Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
- assorted locking changes so that access to /proc/mdstat
and much of /sys/block/mdXX/md/* is protected by a spinlock
rather than a mutex and will never block indefinitely.
- Make an 'if' condition in RAID5 - which has been implicated
in recent bugs - more readable.
- misc minor fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVNwaHTnsnt1WYoG5AQLssw//TniaqtlsK6RPd4PbWdyKFBfUx6bPurtx
ZRQvw7FcX8sFW/QtxKyJZsOhKcCc1CQEByzZzvuR5SLE8scyw/SxC8eFqMY+vSWF
HmqJtoK3LdxY/PPC2qRYcdzpU4DzYuIu3ChkJvdXJa4mULKJhutR8nbp1k52JXN8
U2KPoE+hScVLKCuhwkvp6h4Fg/Caa9MjSpk3uMEkGDm75Iwl175EIb/fV95pIfqd
cXs2wJBW3TA/YQecMeHC/YmVSIjF8nobCfhFlWfWYg0ySrSh1YSPLjb707Ohs5cF
efEGoIWfm2uKd0XmsKEDclYDTNtwqgO7A3QDuWFP5ab9cTpi0Fuj1CE9LNBCf5Hf
a5mwggBQ2KkpGwgtBuz6MJRaVu1xtOOjyFG8qzzeDOHLKxHRvZgiUnkb5U2I+YxB
iMmyk7ijm9PF5M+wm3AqzFmG8icrAf6ixkSZidQy0PfAIFFFph+jQvdHqiXnLOGD
6S/xpG0avHxdC5pC4R0iRsNMNl3IESqy6qLkTOYjQ+yoZ9A+LY19nsMf9LBnwgdY
hz9WxV+EvFVggVl19U5Bg+3z1EwPt3kNr4Se1bkPI8reHH/adacNmHWBBLQ5KHle
WYdqcNXiTwaLje/7Qw5TKQFSGgLMAPc8ciDA7QOX/ZFoyUmWFn89YAoDdGqvvDd9
g5mQc/Amxt8=
=/FmP
-----END PGP SIGNATURE-----
Merge tag 'md/3.20' of git://neil.brown.name/md
Pull md updates from Neil Brown:
- assorted locking changes so that access to /proc/mdstat
and much of /sys/block/mdXX/md/* is protected by a spinlock
rather than a mutex and will never block indefinitely.
- Make an 'if' condition in RAID5 - which has been implicated
in recent bugs - more readable.
- misc minor fixes
* tag 'md/3.20' of git://neil.brown.name/md: (28 commits)
md/raid10: fix conversion from RAID0 to RAID10
md: wakeup thread upon rdev_dec_pending()
md: make reconfig_mutex optional for writes to md sysfs files.
md: move mddev_lock and related to md.h
md: use mddev->lock to protect updates to resync_{min,max}.
md: minor cleanup in safe_delay_store.
md: move GET_BITMAP_FILE ioctl out from mddev_lock.
md: tidy up set_bitmap_file
md: remove unnecessary 'buf' from get_bitmap_file.
md: remove mddev_lock from rdev_attr_show()
md: remove mddev_lock() from md_attr_show()
md/raid5: use ->lock to protect accessing raid5 sysfs attributes.
md: remove need for mddev_lock() in md_seq_show()
md/bitmap: protect clearing of ->bitmap by mddev->lock
md: protect ->pers changes with mddev->lock
md: level_store: group all important changes into one place.
md: rename ->stop to ->free
md: split detach operation out from ->stop.
md/linear: remove rcu protections in favour of suspend/resume
md: make merge_bvec_fn more robust in face of personality changes.
...
A RAID0 array (like a LINEAR array) does not have a concept
of 'size' being the amount of each device that is in use.
Rather, as much of each device as is available is used.
So the 'size' is set to 0 and ignored.
RAID10 does have this concept and needs it to be set correctly.
So when we convert RAID0 to RAID10 we must determine the
'size' (that being the size of the first 'strip_zone' in the
RAID0), and set it correctly.
Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A DM device must inherit the QUEUE_FLAG_SG_GAPS flags from its
underlying block devices' request queues.
This fixes problems when submitting cloned requests to multipathed
devices requiring virtually contiguous buffers.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle are:
- Documentation updates.
- Miscellaneous fixes.
- Preemptible-RCU fixes, including fixing an old bug in the
interaction of RCU priority boosting and CPU hotplug.
- SRCU updates.
- RCU CPU stall-warning updates.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
rcu: Initialize tiny RCU stall-warning timeouts at boot
rcu: Fix RCU CPU stall detection in tiny implementation
rcu: Add GP-kthread-starvation checks to CPU stall warnings
rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors
rcu: Optionally run grace-period kthreads at real-time priority
ksoftirqd: Use new cond_resched_rcu_qs() function
ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
rcutorture: Add more diagnostics in rcu_barrier() test failure case
torture: Flag console.log file to prevent holdovers from earlier runs
torture: Add "-enable-kvm -soundhw pcspk" to qemu command line
rcutorture: Handle different mpstat versions
rcutorture: Check from beginning to end of grace period
rcu: Remove redundant rcu_batches_completed() declaration
rcutorture: Drop rcu_torture_completed() and friends
rcu: Provide rcu_batches_completed_sched() for TINY_RCU
rcutorture: Use unsigned for Reader Batch computations
rcutorture: Make build-output parsing correctly flag RCU's warnings
rcu: Make _batches_completed() functions return unsigned long
rcutorture: Issue warnings on close calls due to Reader Batch blows
documentation: Fix smp typo in memory-barriers.txt
...
The vfree() function performs input parameter validation.
Thus the NULL pointer test around vfree() calls is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Currently the cleanup of all error cases are open-coded. Introduce a
common exit path and labels.
Signed-off-by: Johannes Thumshirn <morbidrsa@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The thin-pool target doesn't display the data block size as part of
its table status, unlike the dm-cache target, so there is no need for
dm_pool_get_data_block_size().
This was found using cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Update the obsolete url in the CONFIG_DM_CRYPT help text.
Signed-off-by: Loic Pefferkorn <loic@loicp.eu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
To be future-proof and for better readability the time comparison
is modified to use time_after_eq() instead of plain, error-prone math.
Signed-off-by: Asaf Vertz <asaf.vertz@tandemg.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
To be future-proof and for better readability the time comparisons are modified
to use time_in_range() and time_after() instead of plain, error-prone math.
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
My static checker complains that if "num_raid_params" is UINT_MAX then
the "if (num_raid_params + 1 > argc) {" check doesn't work as intended.
The other change is that I moved the "if (argc != (num_raid_devs * 2))"
condition forward a few lines so it was before the call to
context_alloc(). If we had an integer overflow inside that function
then it would lead to an immediate crash.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Otherwise replacing the multipath target with the error target fails:
device-mapper: ioctl: can't change device type after initial table load.
The error target was mistakenly considered to be target type
DM_TYPE_REQUEST_BASED rather than DM_TYPE_MQ_REQUEST_BASED even if the
target it was to replace was of type DM_TYPE_MQ_REQUEST_BASED.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For blk-mq request-based DM the responsibility of allocating a cloned
request is transfered from DM core to the target type. Doing so
enables the cloned request to be allocated from the appropriate
blk-mq request_queue's pool (only the DM target, e.g. multipath, can
know which block device to send a given cloned request to).
Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path. As such, there are now 2
different request-based DM target_type interfaces:
1) the original .map_rq() interface will continue to be used for
non-blk-mq devices -- the preallocated clone request is passed in
from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
blk-mq devices -- blk_get_request() and blk_put_request() are used
respectively from these hooks.
dm_table_set_type() was updated to detect if the request-based target is
being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
DM core disallows switching the DM table's type after it is set. This
means that there is no mixing of non-blk-mq and blk-mq devices within
the same request-based DM table.
[This patch was started by Keith and later heavily modified by Mike]
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For blk-mq request-based DM the responsibility of allocating a cloned
request will be transfered from DM core to the target type.
To prepare for conditionally using this new model the original
request's 'special' now points to the dm_rq_target_io because the
clone is allocated later in the block layer rather than in DM core.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Switch to having request-based DM enqueue all prep'ed requests into work
processed by another thread. This allows request-based DM to invoke
block APIs that assume interrupt enabled context (e.g. blk_get_request)
and is a prerequisite for adding blk-mq support to request-based DM.
The new kernel thread is only initialized for request-based DM devices.
multipath_map() is now always in irq enabled context so change multipath
spinlock (m->lock) locking to always disable interrupts.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Request-based DM support for blk-mq devices requires that
dm_rq_target_io structures not be allocated with an embedded request
structure. The request-based DM target (e.g. dm-multipath) must
allocate the request from the blk-mq devices' request_queue using
blk_get_request().
The unfortunate side-effect of this change is old-style request-based DM
support will no longer use contiguous memory for the dm_rq_target_io and
request structures for each clone.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit febf715 ("block: require blk_rq_prep_clone() be given an
initialized clone request") introduced a regression by calling
blk_rq_init() on the original request rather than the clone
request that is passed to setup_clone().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: febf71588c ("block: require blk_rq_prep_clone() be given an initialized clone request")
Signed-off-by: Jens Axboe <axboe@fb.com>
After each call to rdev_dec_pending() we should wakeup the
md thread if the device is found to be faulty.
Otherwise we'll incur heavy delays on failing devices.
Signed-off-by: Neil Brown <nfbrown@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.
This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).
Signed-off-by: NeilBrown <neilb@suse.de>
The one which is not inline (mddev_unlock) gets EXPORTed.
This makes the locking available to personality modules so that it
doesn't have to be imposed upon them.
Signed-off-by: NeilBrown <neilb@suse.de>
There are interdependencies between these two sysfs attributes
and whether a resync is currently running.
Rather than depending on reconfig_mutex to ensure no races when
testing these interdependencies are met, use the spinlock.
This will allow the mutex to be remove from protecting this
code in a subsequent patch.
Signed-off-by: NeilBrown <neilb@suse.de>
There isn't really much room for races with ->safemode_delay.
But as I am trying to clean up any racy code and will soon
be removing reconfig_mutex protection from most _store()
functions:
- only set mddev->safemode_delay once, to ensure no code
can see an intermediate value
- use safemode_timer to call md_safemode_timeout() rather than
calling it directly, to ensure it never races with itself.
Signed-off-by: NeilBrown <neilb@suse.de>
It makes more sense to report bitmap_info->file, rather than
bitmap->file (the later is only available once the array is
active).
With that change, use mddev->lock to protect bitmap_info being
set to NULL, and we can call get_bitmap_file() without taking
the mutex.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ delay setting mddev->bitmap_info.file until 'f' looks
usable, so we don't have to unset it.
2/ Don't allow bitmap file to be set if bitmap_info.file
is already set.
Signed-off-by: NeilBrown <neilb@suse.de>
'buf' is only used because d_path fills from the end of the
buffer instead of from the start.
We don't need a separate buf to handle that, we just need to use
memmove() to move the string to the start.
Signed-off-by: NeilBrown <neilb@suse.de>
No rdev attributes need locking for 'show', though
state_show() might benefit from ensuring it sees a
consistent set of flags.
None even use rdev->mddev, so testing for it isn't really
needed and it certainly doesn't need to be held constant.
So improve state_show() and remove the locking.
Signed-off-by: NeilBrown <neilb@suse.de>
Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.
We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.
Signed-off-by: NeilBrown <neilb@suse.de>
It is important that mddev->private isn't freed while
a sysfs attribute function is accessing it.
So use mddev->lock to protect the setting of ->private to NULL, and
take that lock when checking ->private for NULL and de-referencing it
in the sysfs access functions.
This only applies to the read ('show') side of access. Write
access will be handled separately.
Signed-off-by: NeilBrown <neilb@suse.de>
The only access in md_seq_show that could suffer from races
not protected by ->lock is walking the rdev list.
This can receive sufficient protection from 'rcu'.
So use rdev_for_each_rcu() and get rid of mddev_lock().
Now reading /proc/mdstat will never block in md_seq_show.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ Another live lock, needs backporting
2/ work-around false positive with new warnings.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVNE+TDnsnt1WYoG5AQIX8w/9EWD9hTM3HEatrdZFfOgQrG4fafpZoOHT
+fxMTyvIbIr7ppL3lZVA6KmyDS15/BIt0JhwMy7pzaPqvxSCK/qqGOdE8h1nVaN1
/TbARZCCOn62PsRxKQDHCsU8lsRt3VNH4fGvm0RBTry/RtvGrxcqIBLGnwWseCQq
SGVj1uKb+PI5FL8c4GvyVCdBD+uO8idpY6D6Rd2WQbuskOPoJhIEZRh0wPHEYvWw
rJ+gzzWkalFOjPgejS54ZrTGxOgvZ0NiAaFuEQaDG2zRc27luDxF/eyCR9G12juC
YH8M2IxNp0i20iaoNp8A+D8ksMbNE3OEFOZx2gtFwItQ3aye455Lv+C0ZnbxlWD/
R+399E0wKtFp8onW+KALoJvgZHjlanj3uIjSPltlCxDDQ3F5Any6h6uGIEOAVYx2
uruUmjp0JsxHio52R1Ai26VT+Ssc49GVEfBwcFej/ZGs8a0XxvYWuk1lllh9AL0w
8THt9yVQMR8NmUYrNnceRK6BJN4PdFHi/jxoLzeQfW2OHpmuug2Q0M/raYZGOIx6
xI92XPIGKN/kzRhBua75KhQkX5HBGJFP0kutIHj58AHacMFbiiJl9lzSIjGOJzjS
sBxyvvnOYUV4QW2Kb3KNfJWu2dDbLx/z4xzzkiG22d+LSW03FaPPnqSXXT59FIhQ
OzNfUxdNLJc=
=qYoP
-----END PGP SIGNATURE-----
Merge tag 'md/3.19-fixes' of git://neil.brown.name/md
Pull two fixes for md from Neil Brown:
- Another live lock, needs backporting
- work-around false positive with new warnings.
* tag 'md/3.19-fixes' of git://neil.brown.name/md:
md/bitmap: fix a might_sleep() warning.
md/raid5: fix another livelock caused by non-aligned writes.
->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.
However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight: level_show and md_seq_show().
So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.
[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
Dan Carpenter <dan.carpenter@oracle.com>]
Signed-off-by: NeilBrown <neilb@suse.de>
Gather all the changes that can happen atomically and might
be relevant to other code into one place. This will
make it easier to refine the locking.
Note that this puts quite a few things between mddev_detach()
and ->free(). Enabling this was the point of some recent patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that the ->stop function only frees the private data,
rename is accordingly.
Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().
Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).
Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NeilBrown <neilb@suse.de>
Each md personality has a 'stop' operation which does two
things:
1/ it finalizes some aspects of the array to ensure nothing
is accessing the ->private data
2/ it frees the ->private data.
All the steps in '1' can apply to all arrays and so can be
performed in common code.
This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.
So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NeilBrown <neilb@suse.de>
The use of 'rcu' to protect accesses to ->private_data so that
the ->private_data could be updated predates the introduction
of mddev_suspend/mddev_resume.
These are a cleaner mechanism for providing stability while
swapping in a new ->private data - it is used by level_store()
to support changing of raid levels.
So get rid of the RCU stuff and just use mddev_suspend, mddev_resume.
As these function call ->quiesce(), we add an empty function for
linear just like for raid0.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.
So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.
Signed-off-by: NeilBrown <neilb@suse.de>
There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.
So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.
When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.
Signed-off-by: NeilBrown <neilb@suse.de>
This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further. So the
name is inappropriate.
Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taking from interrupt or bh context.
So:
-rename write_lock to lock
-document what it protects
-remove _irq ... except in md_flush_request() as there
is no wait_event_lock() (with no _irq). This can be
cleaned up after appropriate changes to wait.h.
Signed-off-by: NeilBrown <neilb@suse.de>
That last condition is unclear and over cautious.
There are two related issues here.
If a partial write is destined for a missing device, then
either RMW or RCW can work. We must read all the available
block. Only then can the missing blocks be calculated, and
then the parity update performed.
If RMW is not an option, then there is a complication even
without partial writes. If we would need to read a missing
device to perform the reconstruction, then we must first read every
block so the missing device data can be computed.
This is the case for RAID6 (Which currently does not support
RMW) and for times when we don't trust the parity (after a crash)
and so are in the process of resyncing it.
So make these two cases more clear and separate, and perform
the relevant tests more thoroughly.
Signed-off-by: NeilBrown <neilb@suse.de>
Both the last two cases are only relevant if something has failed and
something needs to be written (but not over-written), and if it is OK
to pre-read blocks at this point. So factor out those tests and
explain them.
Signed-off-by: NeilBrown <neilb@suse.de>
Some of the conditions in need_this_block have very straight
forward motivation. Separate those out and document them.
Signed-off-by: NeilBrown <neilb@suse.de>
fetch_block() has a very large and hard to read 'if' condition.
Separate it into its own function so that it can be
made more readable.
Signed-off-by: NeilBrown <neilb@suse.de>
67f455486d introduced a call to
md_wakeup_thread() when adding to the delayed_list. However the md
thread is woken up unconditionally just below.
Remove the unnecessary wakeup call.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 8eb23b9f35
sched: Debug nested sleeps
causes false-positive warnings in RAID5 code.
This annotation removes them and adds a comment
explaining why there is no real problem.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.
As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.
This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.
However since commit 67f455486d, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.
So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.
Fixes: 67f455486d
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Prepare to allow blk_rq_prep_clone() to accept clone requests that were
allocated from blk-mq request queues. As such the blk_rq_prep_clone()
caller must first initialize the clone request.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
You can't modify the metadata in these modes. It's better to fail these
messages immediately than let the block-manager deny write locks on
metadata blocks. Otherwise these failed metadata changes will trigger
'needs_check' to get set in the metadata superblock -- requiring repair
using the thin_check utility.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
inactive and active DM tables") mistakenly ignored the use of ERR_PTR
returns. Restore missing IS_ERR checks and ERR_PTR returns where
appropriate.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Commit ffcc393641 ("dm: enhance internal suspend and resume interface")
attempted to handle multiple internal suspends on the same device, but
it did that incorrectly. When these functions are called in this order
on the same device the device is no longer suspended, but it should be:
dm_internal_suspend_noflush
dm_internal_suspend_noflush
dm_internal_resume
Fix this bug by maintaining an 'internal_suspend_count' and resuming
the device when this count drops to zero.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce a new variable to count the number of allocated migration
structures. The existing variable cache->nr_migrations became
overloaded. It was used to:
i) track of the number of migrations in flight for the purposes of
quiescing during suspend.
ii) to estimate the amount of background IO occuring.
Recent discard changes meant that REQ_DISCARD bios are processed with
a migration. Discards are not background IO so nr_migrations was not
incremented. However this could cause quiescing to complete early.
(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.
Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge. This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.
Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- Preemptible-RCU fixes, including fixing an old bug in the
interaction of RCU priority boosting and CPU hotplug.
- SRCU updates.
- RCU CPU stall-warning updates.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Support for keyword 'boolean' will be dropped later on.
No functional change.
Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.
The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.
If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
text data bss dec hex filename
2007 0 0 2007 7d7 kernel/rcu/srcu.o
Size of arch/powerpc/boot/zImage changes from
text data bss dec hex filename
831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after
so the savings are about ~2000 bytes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
In bio-based DM's clone_endio(), when target_type doesn't implement
.end_io (e.g. linear) r will be always be initialized 0. So if a
WRITE SAME bio fails WRITE SAME will not be disabled as intended.
Fix this by initializing r to error, rather than 0, in clone_endio().
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails")
Cc: stable@vger.kernel.org
Commit 80e96c5484 ("dm thin: do not allow thin device activation
while pool is suspended") delayed the initialization of a new thin
device's refcount and completion until after this new thin was added
to the pool's active_thins list and the pool lock is released. This
opens a race with a worker thread that walks the list and calls
thin_get/put, noticing that the refcount goes to 0 and calling
complete, freezing up the system and giving the oops below:
kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90
kernel: Call Trace:
kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
Set the thin device's initial refcount and initialize the completion
before adding it to the pool's active_thins list in thin_ctr().
Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Discard bios and thin device deletion have the potential to release data
blocks. If the thin-pool is in out-of-data-space mode, and blocks were
released, transition the thin-pool back to full write mode.
The correct time to do this is just after the thin-pool metadata commit.
It cannot be done before the commit because the space maps will not
allow immediate reuse of the data blocks in case there's a rollback
following power failure.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
function pointer was incorrectly being set to
process_prepared_discard_passdown rather than process_prepared_discard.
This incorrect function pointer meant the discard was being passed down,
but not effecting the mapping. As such any discard that was issued, in
an attempt to reclaim blocks, would not successfully free data space.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVIzozznsnt1WYoG5AQKEOQ/9G31KYGhe3/Tn+Hhyy7o16+6fjOBSUN2z
6GCGtOuHhom0Q+y4cdeAohw+eW9bJPqc9tRyRrLnVCuAkeymPP1t7fIRLGAuOHq5
QjjX7ZwKK8urmPAFIxRFgJwrJctFEoaCQiZHk4Pcx9YcYm4paJP2liqqQ0khGPcl
KGGjWgD3KMAG0moJzNmLMWBZevG3SLCTirPxOcDJuBkQ7KfP0e4UGVn/UYXdgF+f
73yForADVIE/Jq/CXnsOEZa/nPTM7DIsXZAQqqFCnkUiuCvwhgfkowxhwCExx2d6
w3+I69anD1WII1Z27bNDBgnJGR5xvdiAl6NTJ+GY1uMCo3lnV9OYpQrJxAEJP4dC
nXILm6SVm7WLfLvVY1lJm3wZndIyWNmTcDxODGyMZhAY4RmjDHG+AZ9tcnY1pM+9
rt0jHbmaj/ZZ5jKbGxPKBlQ3/U7EP8d4oiruOaDWCGT4sN+e2dbZVLpaWoNaBFUx
eyUft+DEvY82ryz2Ktn/Exsx1aWk2Cy3cMl0ELq2L/2WAcjhA6gsikNmKG7tfafn
Bd6WIGzAygATaFs4hOtFANyXqWubWi8TgLUHXNYUkJ2JooaMuSIfWrpeMNJyHHU7
5bWTaSZqZ86Ygb6upvJqaA3olePMo1RPBsOev46TnuFT9vtfpzPRy6FlvbXfx8sK
M5YCqNpTDTs=
=Ar2y
-----END PGP SIGNATURE-----
Merge tag 'md/3.19' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"Three fixes for md.
I did have a largish set of locking changes queued, but late testing
showed they weren't quite as stable as I thought and while I fixed
what I found, I decided it safer to delay them a release ...
particularly as I'll be AFK for a few weeks. So expect a larger batch
next time :-)"
* tag 'md/3.19' of git://neil.brown.name/md:
md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
md: fix semicolon.cocci warnings
md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
Pull block layer driver updates from Jens Axboe:
- NVMe updates:
- The blk-mq conversion from Matias (and others)
- A stack of NVMe bug fixes from the nvme tree, mostly from Keith.
- Various bug fixes from me, fixing issues in both the blk-mq
conversion and generic bugs.
- Abort and CPU online fix from Sam.
- Hot add/remove fix from Indraneel.
- A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)
- With the generic IO stat accounting from 3.19/core, converting md,
bcache, and rsxx to use those. From Gu Zheng.
- Boundary check for queue/irq mode for null_blk from Matias. Fixes
cases where invalid values could be given, causing the device to hang.
- The xen blkfront pull request, with two bug fixes from Vitaly.
* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
NVMe: fix race condition in nvme_submit_sync_cmd()
NVMe: fix retry/error logic in nvme_queue_rq()
NVMe: Fix FS mount issue (hot-remove followed by hot-add)
NVMe: fix error return checking from blk_mq_alloc_request()
NVMe: fix freeing of wrong request in abort path
xen/blkfront: remove redundant flush_op
xen/blkfront: improve protection against issuing unsupported REQ_FUA
NVMe: Fix command setup on IO retry
null_blk: boundary check queue_mode and irqmode
block/rsxx: use generic io stats accounting functions to simplify io stat accounting
md: use generic io stats accounting functions to simplify io stat accounting
drbd: use generic io stats accounting functions to simplify io stat accounting
md/bcache: use generic io stats accounting functions to simplify io stat accounting
NVMe: Update module version major number
NVMe: fail pci initialization if the device doesn't have any BARs
NVMe: add ->exit_hctx() hook
NVMe: make setup work for devices that don't do INTx
NVMe: enable IO stats by default
NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
NVMe: replace blk_put_request() with blk_mq_free_request()
...
A recent change to md started the ->sync_thread from a asynchronously
from a work_queue rather than synchronously. This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.
So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.
Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well. To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.
Other places are better served by flushing the relevant workqueue
to ensure that that if the sync thread was starting, it has now
started. This is particularly best if we are about to stop the
sync thread.
Fixes: ac05f25669
Signed-off-by: NeilBrown <neilb@suse.de>
performance requirements that were requested by the Gluster
distributed filesystem. Specifically, dm-thinp now takes care to
aggregate IO that will be issued to the same thinp block before
issuing IO to the underlying devices. This really helps improve
performance on HW RAID6 devices that have a writeback cache because it
avoids RMW in the HW RAID controller.
- Some stable fixes: fix leak in DM bufio if integrity profiles were
enabled, use memzero_explicit in DM crypt to avoid any potential for
information leak, and a DM cache fix to properly mark a cache block
dirty if it was promoted to the cache via the overwrite optimization.
- A few simple DM persistent data library fixes
- DM cache multiqueue policy block promotion improvements.
- DM cache discard improvements that take advantage of range
(multiblock) discard support in the DM bio-prison. This allows for
much more efficient bulk discard processing (e.g. when mkfs.xfs
discards the entire device).
- Some small optimizations in DM core and RCU deference cleanups
- DM core changes to suspend/resume code to introduce the new internal
suspend/resume interface that the DM thin-pool target now uses to
suspend/resume active thin devices when the thin-pool must
suspend/resume. This avoids forcing userspace to track all active
thin volumes in a thin-pool when the thin-pool is suspended for the
purposes of metadata or data space resize.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhcvVAAoJEMUj8QotnQNaB78H+wSA6sDJGOhc6e1KlWoFh4Hx
hTmwm/O8Fxrp9StO3NPlcv9l+l9FX9pGzN/lo3OsxgWMTs/vLTKZ5SAe3/YT3/b9
6SFC7pC70+glakgMhhXWRvoeSEQC1OWb5BuvOF8irl2n+7B9NAn/zHd9pgpmyWHp
nBXK2GJBMzVSiI47NMjo2n6007LgQq0xxSJ9luwdrpwjDqD1d406DrhzbHou5H2Y
b8XJGQzUy0GZCX8ycwPVXo9svp2Bc+XajVcgOj5Qg7s2uV5car8NN7TxhSOKSXn2
VpiSyEa2MLHAbFuWtGs8XO98z/m5JlGf1eIgRO4s7w59URgpzdxOHXLlAoyqIGw=
=opXi
-----END PGP SIGNATURE-----
Merge tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Significant DM thin-provisioning performance improvements to meet
performance requirements that were requested by the Gluster
distributed filesystem.
Specifically, dm-thinp now takes care to aggregate IO that will be
issued to the same thinp block before issuing IO to the underlying
devices. This really helps improve performance on HW RAID6 devices
that have a writeback cache because it avoids RMW in the HW RAID
controller.
- Some stable fixes: fix leak in DM bufio if integrity profiles were
enabled, use memzero_explicit in DM crypt to avoid any potential for
information leak, and a DM cache fix to properly mark a cache block
dirty if it was promoted to the cache via the overwrite optimization.
- A few simple DM persistent data library fixes
- DM cache multiqueue policy block promotion improvements.
- DM cache discard improvements that take advantage of range
(multiblock) discard support in the DM bio-prison. This allows for
much more efficient bulk discard processing (e.g. when mkfs.xfs
discards the entire device).
- Some small optimizations in DM core and RCU deference cleanups
- DM core changes to suspend/resume code to introduce the new internal
suspend/resume interface that the DM thin-pool target now uses to
suspend/resume active thin devices when the thin-pool must
suspend/resume.
This avoids forcing userspace to track all active thin volumes in a
thin-pool when the thin-pool is suspended for the purposes of
metadata or data space resize.
* tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (49 commits)
dm crypt: use memzero_explicit for on-stack buffer
dm space map metadata: fix sm_bootstrap_get_count()
dm space map metadata: fix sm_bootstrap_get_nr_blocks()
dm bufio: fix memleak when using a dm_buffer's inline bio
dm cache: fix spurious cell_defer when dealing with partial block at end of device
dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
dm cache: only use overwrite optimisation for promotion when in writeback mode
dm cache: discard block size must be a multiple of cache block size
dm cache: fix a harmless race when working out if a block is discarded
dm cache: when reloading a discard bitset allow for a different discard block size
dm cache: fix some issues with the new discard range support
dm array: if resizing the array is a noop set the new root to the old one
dm: use rcu_dereference_protected instead of rcu_dereference
dm thin: fix pool_io_hints to avoid looking at max_hw_sectors
dm thin: suspend/resume active thin devices when reloading thin-pool
dm: enhance internal suspend and resume interface
dm thin: do not allow thin device activation while pool is suspended
dm: add presuspend_undo hook to target_type
dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl
dm thin: remove stale 'trim' message in block comment above pool_message
...
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.
Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.
So treat that case just like RAID6, just as we do in
handle_stripe_dirtying. RAID6 always does reconstruct-write.
This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.
Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use memzero_explicit to cleanup sensitive data allocated on stack
to prevent the compiler from optimizing and removing memset() calls.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
This function isn't right and it causes a static checker warning:
drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
error: potentially using uninitialized 'sb_data_size'.
It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.
Fixes: 3241b1d3e0 ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When dm-bufio sets out to use the bio built into a struct dm_buffer to
issue an IO, it needs to call bio_reset after it's done with the bio
so that we can free things attached to the bio such as the integrity
payload. Therefore, inject our own endio callback to take care of
the bio_reset after calling submit_io's end_io callback.
Test case:
1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300
2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device
3. Repeatedly read metadata and watch kmalloc-192 leak!
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
We never bother caching a partial block that is at the back end of the
origin device. No cell ever gets locked, but the calling code was
assuming it was and trying to release it.
Now the code only releases if the cell has been set to a non NULL
value.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
If the incoming bio is a WRITE and completely covers a block then we
don't bother to do any copying for a promotion operation. Once this is
done the cache block and origin block will be different, so we need to
set it to 'dirty'.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Overwrite causes the cache block and origin blocks to diverge, which
is only allowed in writeback mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Otherwise the cache blocks may span two discard blocks, which we don't
handle when doing the discard lookup.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
It is more correct to hold the cell before checking the discard state.
These flags are only used as hints to the policy so this change will
have negligable effect.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The discard block size can change if the origin changes size or if an
old DM cache is upgraded from using a discard block size that was equal
to cache block size.
To fix this an extent of discarded blocks is established for the purpose
of translating the old discard block size to the new in-core discard
block size and set bits. The old (potentially huge) discard bitset is
left ondisk until it is re-written using the new in-core information on
the next successful DM cache shutdown.
Fixes: 7ae34e7778 ("dm cache: improve discard support")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
boundaries
- calculate the cache's discard_nr_blocks relative to the determined
discard_block_size rather than using oblock_to_dblock()
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This could've been quite bad (to return success but not update the new
root to point at the old) but in practice the only known consumer of the
dm array code is the DM cache target. And the DM cache target passes in
the same old root to array_resize() anyway.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
rcu_dereference() should be used in sections protected by rcu_read_lock.
For writers, holding some kind of mutex or lock,
rcu_dereference_protected() is the way to go, adding explicit lockdep
bits.
In __unbind(), we are the last user of this mapped device, so can use
the constant '1' instead of a lockdep_is_held(), not consistent with
other uses of rcu_dereference_protected() which use md->suspend_lock
mutex.
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33423974bf ("dm: Use rcu_dereference() for accessing rcu pointer")
Cc: Pranith Kumar <bobby.prani@gmail.com>
[snitzer: allow lines longer than 80 columns, refine subject]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Simplify the pool_io_hints code that works to establish a max_sectors
value that is a power-of-2 factor of the thin-pool's blocksize. The
biggest associated improvement is that the DM thin-pool is no longer
concerning itself with the data device's max_hw_sectors when adjusting
max_sectors.
This fixes the relative fragility of the original "dm thin: adjust
max_sectors_kb based on thinp blocksize" commit that only became
apparent when testing was performed using a DM thin-pool ontop of a
virtio_blk device. One proposed upstream patch detailed the problems
inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611
So even though virtio_blk incorrectly set its max_hw_sectors it actually
helped make it clear that we need DM thinp to be tolerant of any future
Linux driver that incorrectly sets max_hw_sectors.
We only need to be concerned with modifying the thin-pool device's
max_sectors limit if it is smaller than the thin-pool's blocksize. In
this case the value of max_sectors does become a limiting factor when
upper layers (e.g. filesystems) construct their bios. But if the
hardware can support IOs larger than the thin-pool's blocksize the user
is encouraged to adjust the thin-pool's data device's max_sectors
accordingly -- doing so will enable the thin-pool to inherit the
established user-defined max_sectors.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Before this change it was expected that userspace would first suspend
all active thin devices, reload/resize the thin-pool target, then resume
all active thin devices. Now the thin-pool suspend/resume will trigger
the suspend/resume of all active thins via appropriate calls to
dm_internal_suspend and dm_internal_resume.
Store the mapped_device for each thin device in struct thin_c to make
these calls possible.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
-- dm-stats will continue using these methods to avoid all the extra
suspend/resume logic that is not needed in order to quickly flush IO.
Introduce dm_internal_suspend_noflush() variant that actually calls the
mapped_device's target callbacks -- otherwise target-specific hooks are
avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common
code between dm_internal_{suspend_noflush,resume} and
dm_{suspend,resume} was factored out as __dm_{suspend,resume}.
Update dm_internal_{suspend_noflush,resume} to always take and release
the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be
aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
be cleared. Add lockdep annotation to dm_suspend() and dm_resume().
The existing DM_SUSPEND_FLAG remains unchanged.
DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
cleared by dm_internal_resume().
Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
was already suspended when dm_internal_suspend_noflush() was called --
this can be thought of as a "nested suspend". A "nested suspend" can
occur with legacy userspace dm-thin code that might suspend all active
thin volumes before suspending the pool for resize.
But otherwise, in the normal dm-thin-pool suspend case moving forward:
the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.
Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new
DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
debugging (e.g. 'dmsetup info' will report an internally suspended
device accordingly).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Otherwise IO could be issued to the pool while it is suspended.
Care was taken to properly interlock between the thin and thin-pool
targets when accessing the pool's 'suspended' flag. The thin_ctr will
not add a new thin device to the pool's active_thins list if the pool is
susepended.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The DM thin-pool target now must undo the changes performed during
pool_presuspend() so introduce presuspend_undo hook in target_type.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
This fixes a regression introduced in 3.13.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVGkmSjnsnt1WYoG5AQKu8xAAkRYFm9FDLIU5W4AVnNhtsHfWBtNphi1X
myHi75jRO5XUVVAYZ2x7EsGBvjDC3iOmkB++b0qLJ4MCf3yq07P2Y6osd+5pq1gD
XOzGefzi9kXkF+7KGKNrQY+xN++q5jcqMWtiSa+ef2j5YqGt06tqgvFz9YtBozrF
TEbe73yAVsFdy8XAGi3Z3ceYLaECbTjXRIMhksBqX4YByNVM9N7XT5Gk5L5ykYya
90EV5nDrfQPTicsL5/8Nb9qKczoRT7I6yubNgUpazdd8g3+wWJycew7I+CiVb44I
wGQbh5FaJT9KkTYrfkNOwT6N+fAEj4y9GxVMvSW80tmk9VKpv2MkCGrdwwWxp0/q
XXI3hSIRjBszvkMlLCANg7VFFvNeehVhYrn1ml3fGiZ548STdsCVewP7cOwhuQFp
f3dniAj49zw9GxZNopLkIfI+HmNZCOTf+E5U1nLOKZKOKpsw9ksNJrvZV8ZZxMkK
gZRAJwsd64Mob2ClRII9ZKzdRwygN1pDdtS5pa+rvzdRQplE4Flg4Ipv9w+5lsQh
346ijrxim11NpO/nRV0pXDNDudMzpF0cJvzxMo5uTTsX+eLUBbsdm/qmb2rEAxM7
JDdW8b7Vluz8fxq7+0Lc1O31CcEGJlBACtdRAXWIAhLZwIaps8+tn+yAjyMEb73H
jBJ9UAfmdCU=
=Fs49
-----END PGP SIGNATURE-----
Merge tag 'md/3.18-fix' of git://neil.brown.name/md
Pull md bugfix from Neil Brown:
"One fix for md for 3.18.
This fixes a regression introduced in 3.13"
* tag 'md/3.18-fix' of git://neil.brown.name/md:
md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN
md_check_recovery will skip any recovery and also clear
MD_RECOVERY_NEEDED if MD_RECOVERY_FROZEN is set.
So when we clear _FROZEN, we must set _NEEDED and ensure that
md_check_recovery gets run.
Otherwise we could miss out on something that is needed.
In particular, this can make it impossible to remove a
failed device from an array is the 'recovery-needed' processing
didn't happen.
Suitable for stable kernels since 3.13.
Cc: stable@vger.kernel.org (3.13+)
Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Fixes: 30b8feb730
Signed-off-by: NeilBrown <neilb@suse.de>
. stable fix for a dm-cache related bug in dm-btree walking code that
results from using very large fast device (e.g. 4T) with a very small
cache blocksize (e.g. 32K) -- this is a very uncommon configuration
. a couple fixes for dm-raid (one for stable and the other addresses a
crash in 3.18-rc1 code)
. stable fix for dm-thinp that addresses a very rare dm-bufio bug having
to do with memory reclaimation (via shrinker) when using dm-thinp
ontop of loopback devices
. fix a leak in dm-stripe target constructor's error path
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUY/p7AAoJEMUj8QotnQNaxPEIAJsmJC5ujQAIdm5yUxsOWruU
Y/36HbPvlmV8fgWqGyjaubBrzqgWry/yW/u/Sv9+9rE3Zh6JSVLVrCA6uZZ3Yr+j
HKYEPjm/O0zVJepfEDKtjG6dxeaql47+luwU1iP1bAYeZE3zmKn1oFT2GW5gTbxO
2n3MiN/dyX8v0cTw6r0O69luIAu93CSY0XDk+1ynfKlKKVmgcAUPvKuobF+yHXoF
Rd7KTqFoK6HgRhdUHvUQnCGDandZ9MHjt3oW9p3dv3ezvW1cNUARoVHMRGG6Awfu
WZkQ/VORDeaJT+bhjGfPIla1HbgxEKJrgzTUlpj+P6K2uPK2f6ECEyBpDLWKy9g=
=lkSu
-----END PGP SIGNATURE-----
Merge tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- stable fix for dm-thin that avoids normal IO racing with discard
- stable fix for a dm-cache related bug in dm-btree walking code that
results from using very large fast device (eg 4T) with a very small
cache blocksize (eg 32K) -- this is a very uncommon configuration
- a couple fixes for dm-raid (one for stable and the other addresses a
crash in 3.18-rc1 code)
- stable fix for dm-thinp that addresses a very rare dm-bufio bug
having to do with memory reclaimation (via shrinker) when using
dm-thinp ontop of loopback devices
- fix a leak in dm-stripe target constructor's error path
* tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm btree: fix a recursion depth bug in btree walking code
dm thin: grab a virtual cell before looking up the mapping
dm raid: fix inaccessible superblocks causing oops in configure_discard_support
dm raid: ensure superblock's size matches device's logical block size
dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks
dm stripe: fix potential for leak in stripe_ctr error path
As long as struct thin_c is in the list, anyone can grab a reference of
it. Consequently, we must wait for the reference count to drop to zero
*after* we remove the structure from the list, not before.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Loading and saving millions of block mappings takes time. We may as
well explain what's going on, and encourage people to use a larger
cache block size.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support. This also improves
discard performance considerly because larger discards are issued to the
dm-cache device. The discard blocksize was always intended to be
greater than the cache blocksize. But until now it wasn't implemented
safely.
Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset. Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size. For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset! Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit d132cc6d9e because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize. Further dm-cache discard changes will make this
possible.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit 64ab346a36 because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize. Further dm-cache discard changes will make this
possible.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ranges will be placed in the same cell if they overlap.
Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Before, if the user wanted sequential IO to be promoted to the cache
they'd have to set sequential_threshold to some nebulous large value.
Now, the user may easily disable sequential IO detection (and sequential
IO's implicit bypass of the cache) by setting sequential_threshold to 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rather than maintaining a separate promote_threshold variable that we
periodically update we now use the hit count of the oldest clean
block. Also add a fudge factor to discourage demoting dirty blocks.
With some tests this has a sizeable difference, because the old code
was too eager to demote blocks. For example, device-mapper-test-suite's
git_extract_cache_quick test goes from taking 190 seconds, to 142
(linear on spindle takes 250).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When creating new devices dm_sync_table() calls
synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be
flushed. This causes a latency overhead that is especially noticeable
when creating lots of devices.
And all of this is pointless as there are no old maps to be
disconnected, and hence no stale pointers which would need to be
cleared up.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Annotate the map field with __rcu since this is a rcu pointer which is checked
by sparse.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference()
while accessing it.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells(). This significantly improves the ondisk
layout on rotational storage, whereby improving read performance.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This use of direct submission in process_prepared_mapping() reduces
latency for submitting bios in a cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
But this direct submission exposes the potential for a race between
releasing a cell and incrementing deferred set. Fix this by introducing
dm_cell_visit_release() and refactoring inc_remap_and_issue_cell()
accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This avoids dropping the cell, so increases the probability that other
bios will collect within the cell, rather than being passed individually
to the worker.
Also add required process_cell and process_discard_cell error handling
wrappers and set associated pool-mode function pointers accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When processing a discard bio, if the block is already quiesced do the
discard immediately rather than adding the mapping to a list for the
next iteration of the worker thread.
Discarding a fully provisioned 100G thin volume with 64k block size goes
from 860s to 95s with this change.
Clearly there's something wrong with the worker architecture, more
investigation needed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce thin_merge so that any additional constraints from the data
volume may be taken into account when determing the maximum number of
sectors that can be issued relative to the specified logical offset.
This is particularly important if/when the data volume is layered ontop
of a more sophisticated device (e.g. dm-raid or some other DM target).
Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
These code changes do not introduce a functional change.
But bio_add_page() will never attempt to build up a bio larger than
queue_max_sectors(). Similarly, bio_get_nr_vecs() is also bound by
queue_max_sectors(). Therefore, there is no point in allowing
dm_merge_bvec() to answer "how many sectors can a bio have at this
offset?" with anything larger than queue_max_sectors(). Using
queue_max_sectors() rather than BIO_MAX_SECTORS serves to more
accurately convey the limits that are being imposed.
Also, use unlikely() to clarify the fact that the defensive code in
dm_merge_bvec() relative to max_size going negative shouldn't ever
happen -- if it does happen there is a bug in the block layer for
requesting larger than dm_merge_bvec()'s initial response for a given
offset. Also, update a comment in dm_merge_bvec() relative to
max_hw_sectors_kb. And fix empty newline whitespace.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Allows for filesystems to submit bios that are a factor of the thinp
blocksize, improving dm-thinp efficiency (particularly when the data
volume is RAID).
Also set io_min to max_sectors_kb if it is a factor of the thinp
blocksize.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Throttle IO based on the time it's taking the worker to do one loop.
There were reports of hung task timeouts occuring and it was observed
that the excessively long avgqu-sz (as reported by iostat) was
contributing to these hung tasks.
Throttling definitely helps dm-thinp perform better under heavy IO load
(without being detremental by being overzealous). It reduces avgqu-sz
drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata
is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2. And
avgqu-sz stays at or below 30K even with dirty_ratio=20,
dirty_background_ratio=10.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Prefetch metadata at the start of the worker thread and then again every
128th bio processed from the deferred list.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce the dm_tm_issue_prefetches interface. If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core. dm_tm_issue_prefetches will request those blocks to be
prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This change is a prerequisite for allowing metadata to be prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Previously it was using a fixed sized hash table. There are times
when very many concurrent cells are held (such as when processing a very
large discard). When this happens the hash table performance becomes
very poor.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
These changes help keep metadata backed by dm-bufio in-core longer which
fixes reports of metadata churn in the face of heavy random IO workloads.
Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
because associated buffers had been idle for some time is unfriendly.
Now, the user may now configure the number of bytes that bufio retains
using the 'retain_bytes' module parameter. The default is 256K.
Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
defaults were quite low so increase them (to 30 and 300 respectively).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Converting over to using an rbtree eliminates a fixed 8MB allocation
from vmalloc space for the hash table.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The walk code was using a 'ro_spine' to hold it's locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain. This is not suitable for the simple recursive walk algorithm,
which retraces its steps.
This code is only used by the persistent array code, which in turn is
only used by dm-cache. In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks. For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.
The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Avoids normal IO racing with discard.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Commit 48cf06bc5f ("dm raid: add discard support for RAID levels 4, 5
and 6") did not properly handle missing metadata device(s). A failing
read of the superblock causes the metadata and data devices to be
removed from the dev array in struct raid_set, setting references to
both devices to NULL. configure_discard_support() nonetheless tries to
access the data dev unconditionally causing an oops.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.
Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.
Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work. Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.
[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
bioset_create_nobvec() interface when creating the DM's bioset
. fix a few bugs in dm-bufio and dm-log-userspace
. add DM core support for a DM multipath use-case that requires loading
DM tables that contain devices that have failed (by allowing active
and inactive DM tables to share dm_devs)
. add discard support to the DM raid target; like MD raid456 the user
must opt-in to raid456 discard support be specifying the
devices_handle_discard_safely=Y module param.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUQWcdAAoJEMUj8QotnQNaHaQH/RD0xf54AnaQ0tEGuNQXFwtx
Gc/3s+VEcKlmTvk9nm2FWNvVagPn8uBQ0O2eid4UJk9AyfPJnPwGUoVqxbKhKK9i
G5/O5s8opLlItk14h/btw/zB8RNC1bg8NGnBrGYDudiwHm+Gv4jlnHErp2JMHv9F
nonb+QoG23wlEJkBafzBNYhthkNDq1ZFrDjhqG7dNySkXh8VZAW8VcZ/ZfskkhOa
C8CDl3TKL1BBJHQKesvqHQbCSqh8Ujzs63bLA3heaSMExkhmUgdfpnbHK4hzPNJP
rtmVEW57mVI+O5Cfva1p9RClT5EjiO+5VufHkpRJSIsfsH5PMaQ7vW8gKmwd5JA=
=z+Yz
-----END PGP SIGNATURE-----
Merge tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper updates from Mike Snitzer:
"I rebased the DM tree ontop of linux-block.git's 'for-3.18/core' at
the beginning of October because DM core now depends on the newly
introduced bioset_create_nobvec() interface.
Summary:
- fix DM's long-standing excessive use of memory by leveraging the
new bioset_create_nobvec() interface when creating the DM's bioset
- fix a few bugs in dm-bufio and dm-log-userspace
- add DM core support for a DM multipath use-case that requires
loading DM tables that contain devices that have failed (by
allowing active and inactive DM tables to share dm_devs)
- add discard support to the DM raid target; like MD raid456 the user
must opt-in to raid456 discard support be specifying the
devices_handle_discard_safely=Y module param"
* tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
dm bufio: when done scanning return from __scan immediately
dm bufio: update last_accessed when relinking a buffer
dm raid: add discard support for RAID levels 4, 5 and 6
dm raid: add discard support for RAID levels 1 and 10
dm: allow active and inactive tables to share dm_devs
dm mpath: stop queueing IO when no valid paths exist
dm: use bioset_create_nobvec()
dm: remove nr_iovecs parameter from alloc_tio()
Pull block layer driver update from Jens Axboe:
"This is the block driver pull request for 3.18. Not a lot in there
this round, and nothing earth shattering.
- A round of drbd fixes from the linbit team, and an improvement in
asender performance.
- Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and
hd from Michael Opdenacker.
- Disable entropy collection from flash devices by default, from Mike
Snitzer.
- A small collection of xen blkfront/back fixes from Roger Pau Monné
and Vitaly Kuznetsov"
* 'for-3.18/drivers' of git://git.kernel.dk/linux-block:
block: disable entropy contributions for nonrot devices
xen, blkfront: factor out flush-related checks from do_blkif_request()
xen-blkback: fix leak on grant map error path
xen/blkback: unmap all persistent grants when frontend gets disconnected
rsxx: Remove deprecated IRQF_DISABLED
block: hd: remove deprecated IRQF_DISABLED
drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks
drbd: compute the end before rb_insert_augmented()
drbd: Add missing newline in resync progress display in /proc/drbd
drbd: reduce lock contention in drbd_worker
drbd: Improve asender performance
drbd: Get rid of the WORK_PENDING macro
drbd: Get rid of the __no_warn and __cond_lock macros
drbd: Avoid inconsistent locking warning
drbd: Remove superfluous newline from "resync_extents" debugfs entry.
drbd: Use consistent names for all the bi_end_io callbacks
drbd: Use better variable names
- a few minor bug fixes
- quite a lot of code tidy-up and simplification
- remove PRINT_RAID_DEBUG ioctl. I'm fairly sure
it is unused, and it isn't particularly useful.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAVD9k1jnsnt1WYoG5AQKCaw/9F7jE9PDYtEfJ8ShEWMM0CWNsCKmgqfpV
i4RaeVKe1IoA5JOurYk+wvdbWSXGfz5XQ9GX8ptRl9X7ZoG4aJ65v9GHpBamPLrc
mB2Lz3zR9AZVYrMDeZym9cSZ6FpZNzzXpJE2O2RslXq3gFI03MObyM8xyeh8ybOD
45nhH+CJ17OFNn5OzzFLhYEAoYDeOS97zAwInWeFlUp14Jl403xnZ3srF2YJ78TR
PjcCpxo1IhGEnYE8rDYqH/UjDPzEfAdYrqM5k3NEnuPiqn+KxYNSsbAQGdeMrGUc
DO0H8dt6U1U2tq/t/qN8n01uQ7AJ3S3JrTsQxSW/UC1SVfgpztK/a78eA/YSy/zs
iZzPP7CpLfF4T945jaQionevZOBFRM+gbrMqeoQTPO2QtfrSGe4Awoht7Z+no3RR
dCX0ScO16kHkcAcSXXGZkGtC1DcteEwUfufSdako12exo1k3efc98DsyMw2VfzOM
EJcQD1JGYVW+czZM58EEue92TT5jvWnhU5s3PEUMTZrDgSWwTVQC3oNCgDGFKI4X
eebpWlG3gEjNrnL5givbBwC2LCfI59R70gpnGhavdKtt9AtpfsMJnj8E3cCqHE9I
xR6YPF161KSmKGOG47RK/VJJnq5SmZbxxeShT101uq3SVeqImit6ql3JfAM9HoMD
RI2iWG9yma4=
=2QEJ
-----END PGP SIGNATURE-----
Merge tag 'md/3.18' of git://neil.brown.name/md
Pull md updates from Neil Brown:
- a few minor bug fixes
- quite a lot of code tidy-up and simplification
- remove PRINT_RAID_DEBUG ioctl. I'm fairly sure it is unused, and it
isn't particularly useful.
* tag 'md/3.18' of git://neil.brown.name/md: (21 commits)
lib/raid6: Add log level to printks
md: move EXPORT_SYMBOL to after function in md.c
md: discard PRINT_RAID_DEBUG ioctl
md: remove MD_BUG()
md: clean up 'exit' labels in md_ioctl().
md: remove unnecessary test for MD_MAJOR in md_ioctl()
md: don't allow "-sync" to be set for device in an active array.
md: remove unwanted white space from md.c
md: don't start resync thread directly from md thread.
md: Just use RCU when checking for overlap between arrays.
md: avoid potential long delay under pers_lock
md: simplify export_array()
md: discard find_rdev_nr in favour of find_rdev_nr_rcu
md: use wait_event() to simplify md_super_wait()
md: be more relaxed about stopping an array which isn't started.
md/raid1: process_checks doesn't use its return value.
md/raid5: fix init_stripe() inconsistencies
md/raid10: another memory leak due to reshape.
md: use set_bit/clear_bit instead of shift/mask for bi_flags changes.
md/raid1: minor typos and reformatting.
...
The shrinker uses gfp flags to indicate what kind of operation can the
driver wait for. If __GFP_IO flag is present, the driver can wait for
block I/O operations, if __GFP_FS flag is present, the driver can wait on
operations involving the filesystem.
dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
device that makes calls into the filesystem. If __GFP_IO is present and
__GFP_FS isn't, dm-bufio could still block on filesystem operations if it
runs on a loop block device.
The change from __GFP_IO to __GFP_FS supposedly fixes one observed (though
unreproducible) deadlock involving dm-bufio and loop device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
using a char array using the SHASH_DESC_ON_STACK macro.
The new code can be compiled with both gcc and clang.
Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: pageexec@freemail.hu
Cc: gmazyland@gmail.com
Cc: "David S. Miller" <davem@davemloft.net>
All the interesting information printed by this ioctl
is provided in /proc/mdstat and/or sysfs.
So it isn't needed and isn't used and would be best if it didn't
exist.
Signed-off-by: NeilBrown <neilb@suse.de>
Most of the places that call this are doing so pointlessly.
A couple of the others a best replaced with WARN_ON().
Signed-off-by: NeilBrown <neilb@suse.de>
unknown ioctls no longer get this deep into md_ioctl since
md_ioctl_valid() was introduced in 3.14.
So remove the test and the misleading comment.
Signed-off-by: NeilBrown <neilb@suse.de>
If an array is active, devices can be marked 'faulty', but simply
removing the 'sync' flag is wrong. That only makes sense
for an array which is not active (and is probably only useful
for testing anyway).
Signed-off-by: NeilBrown <neilb@suse.de>
The main 'md' thread is needed for processing writes, so if it blocks
write requests could be delayed.
Starting a new thread requires some GFP_KERNEL allocations and so can
wait for writes to complete. This can deadlock.
So instead, ask a workqueue to start the sync thread.
There is no particular rush for this to happen, so any work queue
will do.
MD_RECOVERY_RUNNING is used to ensure only one thread is started.
Reported-by: BillStuff <billstuff2001@sbcglobal.net>
Signed-off-by: NeilBrown <neilb@suse.de>
We don't really need the full mddev_lock here, and having to
drop it is messy.
RCU is enough to protect these lists.
Signed-off-by: NeilBrown <neilb@suse.de>
printk may cause long time lapse if value of printk_delay in sysctl is
configured large by user. If register_md_personality takes long time to print in
spinlock pers_lock, we may encounter high CPU usage rate when there are other
pers_lock competitors who may be blocked to spin.
We can avoid this condition by moving printk out of coverage of pers_lock
spinlock.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In general we don't allow an array to be stopped if it is in use.
However if the array hasn't really been started yet, then any
apparent use is an anomily, probably due to 'udev' or similar
having a look to see what is there.
This means that if something goes wrong while assembling an array
it cannot reliably be un-assembled - STOP_ARRAY could fail.
There is no value here, so change do_md_stop() to succeed
despite concurrent opens if the array has not yet been
activated. i.e. if ->pers is NULL.
Reported-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
raid5: fix init_stripe() inconsistencies
1) remove_hash() is not necessary. We will only be called right after
get_free_stripe(). There we have already a call to remove_hash().
2) Tracing prints out the sector of the freed stripe and not the sector
that we want to initialize.
Signed-off-by: NeilBrown <neilb@suse.de>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
Fix a potential struct stripe_c leak that would occur if the
chunk_size exceeded the maximum allowed by dm_set_target_max_io_len
(UINT_MAX). However, in practice there is no possibility of this
occuring given that chunk_size is of type uint32_t. But it is good to
fix this to future-proof in case dm_set_target_max_io_len's
implementation were to change.
Signed-off-by: Pavitra Kumar <pavitrak@nvidia.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If two threads call bitmap_unplug at the same time, then
one might schedule all the writes, and the other might
decide that it doesn't need to wait. But really it does.
It rarely hurts to wait when it isn't absolutely necessary,
and the current code doesn't really focus on 'absolutely necessary'
anyway. So just wait always.
This can potentially lead to data corruption if a crash happens
at an awkward time and data was written before the bitmap was
updated. It is very unlikely, but this should go to -stable
just to be safe. Appropriate for any -stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org (please delay until 3.18 is released)
If cn_add_callback() fails in dm_ulog_tfr_init(), it does not
deallocate prealloced memory but calls cn_del_callback().
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
When __scan frees the required number of buffer entries that the
shrinker requested (nr_to_scan becomes zero) it must return. Before
this fix the __scan code exited only the inner loop and continued in the
outer loop -- which could result in reduced performance due to extra
buffers being freed (e.g. unnecessarily evicted thinp metadata needing
to be synchronously re-read into bufio's cache).
Also, move dm_bufio_cond_resched to __scan's inner loop, so that
iterating the bufio client's lru lists doesn't result in scheduling
latency.
Reported-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
The 'last_accessed' member of the dm_buffer structure was only set when
the the buffer was created. This led to each buffer being discarded
after dm_bufio_max_age time even if it was used recently. In practice
this resulted in all thinp metadata being evicted soon after being read
-- this is particularly problematic for metadata intensive workloads
like multithreaded small random IO.
'last_accessed' is now updated each time the buffer is moved to the head
of the LRU list, so the buffer is now properly discarded if it was not
used in dm_bufio_max_age time.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.2+
In case of RAID levels 4, 5 and 6 we have to verify each RAID members'
ability to zero data on discards to avoid stripe data corruption -- if
discard_zeroes_data is not set for each RAID member discard support must
be disabled. But given the uncertainty of whether or not a RAID member
properly supports zeroing data on discard we require the user to
explicitly allow discard support on RAID levels 4, 5, and 6 by setting
a dm-raid module paramter, e.g.: dm-raid.devices_handle_discard_safely=Y
Otherwise, discards could cause data corruption on RAID4/5/6.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Discard support is not enabled for RAID levels 4, 5, and 6 at this time
due to concerns about unreliable discard_zeroes_data support on some
hardware. Otherwise, discards could cause stripe data corruption
(classic example of bad apples spoiling the bunch).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Until this change, when loading a new DM table, DM core would re-open
all of the devices in the DM table. Now, DM core will avoid redundant
device opens (and closes when destroying the old table) if the old
table already has a device open using the same mode. This is achieved
by managing reference counts on the table_devices that DM core now
stores in the mapped_device structure (rather than in the dm_table
structure). So a mapped_device's active and inactive dm_tables' dm_dev
lists now just point to the dm_devs stored in the mapped_device's
table_devices list.
This improvement in DM core's device reference counting has the
side-effect of fixing a long-standing limitation of the multipath
target: a DM multipath table couldn't include any paths that were unusable
(failed). For example: if all paths have failed and you add a new,
working, path to the table; you can't use it since the table load would
fail due to it still containing failed paths. Now a re-load of a
multipath table can include failed devices and when those devices become
active again they can be used instantly.
The device list code in dm.c isn't a straight copy/paste from the code in
dm-table.c, but it's very close (aside from some variable renames). One
subtle difference is that find_table_device for the tables_devices list
will only match devices with the same name and mode. This is because we
don't want to upgrade a device's mode in the active table when an
inactive table is loaded.
Access to the mapped_device structure's tables_devices list requires a
mutex (tables_devices_lock), so that tables cannot be created and
destroyed concurrently.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
'queue_io' is set so that IO is queued while paths are being
initialized. Clear queue_io in __choose_pgpath if there are no valid
paths, since there are obviously no paths that can be initialized.
Otherwise IOs to the device will back up.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since DM core uses bio_clone_fast() for both bio-based and request-based
DM devices there is no need for DM's bioset to have a bvec mempool.
With this patch, on arch with 4KB page for example, memory usage will be
reduced by 64KB for each bio-based DM device and 1MB for each
request-based DM device.
For example, when you create 10,000 bio-based DM devices and 1,000
request-based DM devices, memory usage of biovec under no load is:
# grep biovec /proc/slabinfo
biovec-256 418068 418068 4096 ...
biovec-128 0 0 2048 ...
biovec-64 0 0 1024 ...
biovec-16 0 0 256 ...
With this patch series applied, the usage becomes:
# grep biovec /proc/slabinfo
biovec-256 116 116 4096 ...
biovec-128 0 0 2048 ...
biovec-64 0 0 1024 ...
biovec-16 0 0 256 ...
So 4096 * (418068 - 116) = 1.6GB of memory is saved in this example.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
alloc_tio() uses bio_alloc_bioset() to allocate a clone-bio for a bio.
alloc_tio() takes the number of bvecs to allocate for the clone-bio.
However, with v3.14's immutable biovec changes DM now uses
__bio_clone_fast() and no longer needs to allocate bvecs.
In practice, the 'nr_iovecs' passed to alloc_tio() is always effectively
0. __clone_and_map_simple_bio() looked like it was passing non-zero
nr_iovecs, but its value was always within the range of inline bvecs and
no allocation actually happened. If allocation happened, the BUG_ON() in
__bio_clone_fast() would've triggered.
Remove the nr_iovecs parameter from alloc_tio() to prevent possible
future bio_alloc_bioset() mis-use of a new bioset interface that will no
longer allow bvecs to be allocated.
Also fix extra whitespace before the __bio_clone_fast() call in
__clone_and_map_simple_bio().
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.
Historically, all block devices have automatically made entropy
contributions. But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
- On SSD disks, the completion times aren't as random as they
are for rotational drives. So it's questionable whether they
should contribute to the random pool in the first place.
- Calling add_disk_randomness() has a lot of overhead.
There are more reliable sources for randomness than non-rotational block
devices. From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint. Some devices in some cases don't do what it
says on the label.
The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero. If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks. If all were zero, this would
be the case. If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.
As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.
As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.
If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
raid456.devices_handle_discard_safely=Y
As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7
Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
particularly, but not only, fixing new "resync" code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAVCIoAznsnt1WYoG5AQJDzRAAtciwFilYXxu8M7fPOQ/HZeoMtNLVX0dK
cvL5yRhNxfoGLIG7TEeb5Wvd8cxHNR5t4x+jGmipJ7cTGE4S6Edgdpy2yhHDFBdo
AyGgYreX441P07cefPUUa9nTVFlqx2TzJa+SR75CmBwbuZpx52kfHK9KMXWljY+Q
Hm60k34tK4zzC5Tm2E7aeegFjUaIAwrpt3TOJlh8E/JiEQDsVz2+o+7RFwPXrXgm
YnxfpaAcw5XcanlUj0q6r6O86hhItO54sBBcTtTNZtD7oZC82/OYj6SxlG0V3D2a
wBFouI518Rf0TmdtG3XgPAfI0eCZyowZtYmpoYX+/8rkGSy2ZmJfxSY2NzmGBmX4
LtH0tYkp2qSu6WCXUMPOLmPRqQuT6iX4ho7KCNMr2n05kHMom/InNUajWUvqPFdE
eBs27u9HngTVCTMpwdCfFV/qWXszEhpp9wyzAv5zRV7gyc3hZM3cQ1iV2GKor8Ka
wSTeDT+gY9J2sCJgqx7li45jpsZPzayupwW+hBvieKeY6/fM1leur4Ji/mcRXytK
YUci6fiy2kwxs1uzFq7Kra3Y5gqGq+S6HCspmZTtstzFxbKcMTmOC1B2ukKDPvGS
HwXnQ6w+fXmF/+fXWD98++ET80rWj6utXBJhSGhkdQcyYRz5DU/2GsLsA4yvho4N
Dbo2gIjTtD8=
=gMsu
-----END PGP SIGNATURE-----
Merge tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md
Pull bugfixes for md/raid1 from Neil Brown:
"It is amazing how much easier it is to find bugs when you know one is
there. Two bug reports resulted in finding 7 bugs!
All are tagged for -stable. Those that can't cause (rare) data
corruption, cause lockups.
Particularly, but not only, fixing new "resync" code"
* tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md:
md/raid1: fix_read_error should act on all non-faulty devices.
md/raid1: count resync requests in nr_pending.
md/raid1: update next_resync under resync_lock.
md/raid1: Don't use next_resync to determine how far resync has progressed
md/raid1: make sure resync waits for conflicting writes to complete.
md/raid1: clean up request counts properly in close_sync()
md/raid1: be more cautious where we read-balance during resync.
md/raid1: intialise start_next_window for READ case to avoid hang
If a devices is being recovered it is not InSync and is not Faulty.
If a read error is experienced on that device, fix_read_error()
will be called, but it ignores non-InSync devices. So it will
neither fix the error nor fail the device.
It is incorrect that fix_read_error() ignores non-InSync devices.
It should only ignore Faulty devices. So fix it.
This became a bug when we allowed reading from a device that was being
recovered. It is suitable for any subsequent -stable kernel.
Fixes: da8840a747
Cc: stable@vger.kernel.org (v3.5+)
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Both normal IO and resync IO can be retried with reschedule_retry()
and so be counted into ->nr_queued, but only normal IO gets counted in
->nr_pending.
Before the recent improvement to RAID1 resync there could only
possibly have been one or the other on the queue. When handling a
read failure it could only be normal IO. So when handle_read_error()
called freeze_array() the fact that freeze_array only compares
->nr_queued against ->nr_pending was safe.
But now that these two types can interleave, we can have both normal
and resync IO requests queued, so we need to count them both in
nr_pending.
This error can lead to freeze_array() hanging if there is a read
error, so it is suitable for -stable.
Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
raise_barrier() uses next_resync as part of its calculations, so it
really should be updated first, instead of afterwards.
next_resync is always used under resync_lock so update it under
resync lock to, just before it is used. That is safest.
This could cause normal IO and resync IO to interact badly so
it suitable for -stable.
Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
next_resync is (approximately) the location for the next resync request.
However it does *not* reliably determine the earliest location
at which resync might be happening.
This is because resync requests can complete out of order, and
we only limit the number of current requests, not the distance
from the earliest pending request to the latest.
mddev->curr_resync_completed is a reliable indicator of the earliest
position at which resync could be happening. It is updated less
frequently, but is actually reliable which is more important.
So use it to determine if a write request is before the region
being resynced and so safe from conflict.
This error can allow resync IO to interfere with normal IO which
could lead to data corruption. Hence: stable.
Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
The resync/recovery process for raid1 was recently changed
so that writes could happen in parallel with resync providing
they were in different regions of the device.
There is a problem though: While a write request will always
wait for conflicting resync to complete, a resync request
will *not* always wait for conflicting writes to complete.
Two changes are needed to fix this:
1/ raise_barrier (which waits until it is safe to do resync)
must wait until current_window_requests is zero
2/ wait_battier (which waits at the start of a new write request)
must update current_window_requests if the request could
possible conflict with a concurrent resync.
As concurrent writes and resync can lead to data loss,
this patch is suitable for -stable.
Fixes: 79ef3a8aa1
Cc: stable@vger.kernel.org (v3.13+)
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If there are outstanding writes when close_sync is called,
the change to ->start_next_window might cause them to
decrement the wrong counter when they complete. Fix this
by merging the two counters into the one that will be decremented.
Having an incorrect value in a counter can cause raise_barrier()
to hangs, so this is suitable for -stable.
Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
commit 79ef3a8aa1 made
it possible for reads to happen concurrently with resync.
This means that we need to be more careful where read_balancing
is allowed during resync - we can no longer be sure that any
resync that has already started will definitely finish.
So keep read_balancing to before recovery_cp, which is conservative
but safe.
This bug makes it possible to read from a device that doesn't
have up-to-date data, so it can cause data corruption.
So it is suitable for any kernel since 3.11.
Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
r1_bio->start_next_window is not initialised in the READ
case, so allow_barrier may incorrectly decrement
conf->current_window_requests
which can cause raise_barrier() to block forever.
Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When a writeback or a promotion of a block is completed, the cell of
that block is removed from the prison, the block is marked as clean, and
the clear_dirty() callback of the cache policy is called.
Unfortunately, performing those actions in this order allows an incoming
new write bio for that block to come in before clearing the dirty status
is completed and therefore possibly causing one of these two scenarios:
Scenario A:
Thread 1 Thread 2
cell_defer() .
- cell removed from prison .
- detained bios queued .
. incoming write bio
. remapped to cache
. set_dirty() called,
. but block already dirty
. => it does nothing
clear_dirty() .
- block marked clean .
- policy clear_dirty() called .
Result: Block is marked clean even though it is actually dirty. No
writeback will occur.
Scenario B:
Thread 1 Thread 2
cell_defer() .
- cell removed from prison .
- detained bios queued .
clear_dirty() .
- block marked clean .
. incoming write bio
. remapped to cache
. set_dirty() called
. - block marked dirty
. - policy set_dirty() called
- policy clear_dirty() called .
Result: Block is properly marked as dirty, but policy thinks it is clean
and therefore never asks us to writeback it.
This case is visible in "dmsetup status" dirty block count (which
normally decreases to 0 on a quiet device).
Fix these issues by calling clear_dirty() before calling cell_defer().
Incoming bios for that block will then be detained in the cell and
released only after clear_dirty() has completed, so the race will not
occur.
Found by inspecting the code after noticing spurious dirty counts
(scenario B).
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The DM crypt target accesses memory beyond allocated space resulting in
a crash on 32 bit x86 systems.
This bug is very old (it dates back to 2.6.25 commit 3a7f6c990a "dm
crypt: use async crypto"). However, this bug was masked by the fact
that kmalloc rounds the size up to the next power of two. This bug
wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio
data"). By switching to using per-bio data there was no longer any
padding beyond the end of a dm-crypt allocated memory block.
To minimize allocation overhead dm-crypt puts several structures into one
block allocated with kmalloc. The block holds struct ablkcipher_request,
cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
struct dm_crypt_request and an initialization vector.
The variable dmreq_start is set to offset of struct dm_crypt_request
within this memory block. dm-crypt allocates the block with this size:
cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.
When accessing the initialization vector, dm-crypt uses the function
iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
+ 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).
dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
structure. However, when dm-crypt accesses the initialization vector, it
takes a pointer to the end of dm_crypt_request, aligns it, and then uses
it as the initialization vector. If the end of dm_crypt_request is not
aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
alignment causes the initialization vector to point beyond the allocated
space.
Fix this bug by calculating the variable iv_size_padding and adding it
to the allocated size.
Also correct the alignment of dm_crypt_request. struct dm_crypt_request
is specific to dm-crypt (it isn't used by the crypto subsystem at all),
so it is aligned on __alignof__(struct dm_crypt_request).
Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is
aligned as if the block was allocated with kmalloc.
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Most places which allocate an r10_bio zero the ->state, some don't.
As the r10_bio comes from a mempool, and the allocation function uses
kzalloc it is often zero anyway. But sometimes it isn't and it is
best to be safe.
I only noticed this because of the bug fixed by an earlier patch
where the r10_bios allocated for a reshape were left around to
be used by a subsequent resync. In that case the R10BIO_IsReshape
flag caused problems.
Signed-off-by: NeilBrown <neilb@suse.de>
When a raid10 commences a resync/recovery/reshape it allocates
some buffer space.
When a resync/recovery completes the buffer space is freed. But not
when the reshape completes.
This can result in a small memory leak.
There is a subtle side-effect of this bug. When a RAID10 is reshaped
to a larger array (more devices), the reshape is immediately followed
by a "resync" of the new space. This "resync" will use the buffer
space which was allocated for "reshape". This can cause problems
including a "BUG" in the SCSI layer. So this is suitable for -stable.
Cc: stable@vger.kernel.org (v3.5+)
Fixes: 3ea7daa5d7
Signed-off-by: NeilBrown <neilb@suse.de>
raid10 reshape clears unwanted bits from a bio->bi_flags using
a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
was added.
Since then it clears that bit but shouldn't. This results in a
memory leak.
So change to used the approved method of clearing unwanted bits.
As this causes a memory leak which can consume all of memory
the fix is suitable for -stable.
Fixes: a38352e0ac
Cc: stable@vger.kernel.org (v3.10+)
Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
Signed-off-by: NeilBrown <neilb@suse.de>
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.
If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.
This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.
Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then. In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().
Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.
In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed. Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.
This bug is unlikely as most writes are page aligned. However as it
can lead to a livelock it is suitable for -stable. It was introduced
in 3.16.
Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f455486d
Signed-off-by: NeilBrown <neilb@suse.de>
allow thin snapshots to be larger than the external origin.
. Add support for quickly loading a repetitive pattern into the
dm-switch target.
. Use per-bio data in the dm-crypt target instead of always using a
mempool for each allocation. Required switching to kmalloc alignment
for the bio slab.
. Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
. Fix the dm-cache and dm-thin targets' export of the minimum_io_size to
match the data block size -- this fixes an issue where mkfs.xfs would
improperly infer raid striping was in place on the underlying storage.
. Small cleanups in dm-io, dm-mpath and dm-cache
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT64yEAAoJEMUj8QotnQNatjQH/2mqm8EtPuZas70zHVDzjMlE
ZyV8xgHpU0MBmiBi+JhUBv9iKX4sVa+C25559WkKtxRVMnZmI1WDry4TagiqrhnK
9o/uvdWigJMR+uwahwe4UErEtKscOQJD30a8taN/suJ6Z2C7XJJRUZPsyL4a3Vov
w+UIi7aYDEGp/2VQ8mvTTxjdF5x5km4wKsjBTs03uTrrkEJ+bIUndl2I1X+X4bsw
kiWYOQwmcnD8GwYkSrthJYLsS3Hjur/J/My7KZwXc00ANLOexqHdKfRDwH8b36+m
olKXv3swCd8vi+jJYEYzuW9213ACsSEGP7h8NFVZ/+2FeDsSzB/C7zjW9okIUIw=
=y/3r
-----END PGP SIGNATURE-----
Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- Allow the thin target to paired with any size external origin; also
allow thin snapshots to be larger than the external origin.
- Add support for quickly loading a repetitive pattern into the
dm-switch target.
- Use per-bio data in the dm-crypt target instead of always using a
mempool for each allocation. Required switching to kmalloc alignment
for the bio slab.
- Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
- Fix the dm-cache and dm-thin targets' export of the minimum_io_size
to match the data block size -- this fixes an issue where mkfs.xfs
would improperly infer raid striping was in place on the underlying
storage.
- Small cleanups in dm-io, dm-mpath and dm-cache
* tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: propagate QUEUE_FLAG_NO_SG_MERGE
dm switch: efficiently support repetitive patterns
dm switch: factor out switch_region_table_read
dm cache: set minimum_io_size to cache's data block size
dm thin: set minimum_io_size to pool's data block size
dm crypt: use per-bio data
block: use kmalloc alignment for bio slab
dm table: make dm_table_supports_discards static
dm cache metadata: use dm-space-map-metadata.h defined size limits
dm cache: fail migrations in the do_worker error path
dm cache: simplify deferred set reference count increments
dm thin: relax external origin size constraints
dm thin: switch to an atomic_t for tracking pending new block preparations
dm mpath: eliminate pg_ready() wrapper
dm io: simplify dec_count and sync_io
Pull block driver changes from Jens Axboe:
"Nothing out of the ordinary here, this pull request contains:
- A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
and Surbhi Palande. No new features, just a lot of fixes.
- The usual round of drbd updates from Andreas Gruenbacher, Lars
Ellenberg, and Philipp Reisner.
- virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
has taken it one step further and added support for actually using
more than one queue.
- Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
compliment the the default behavior of adding to the tail of the
queue. From Douglas Gilbert"
* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
bcache: Drop unneeded blk_sync_queue() calls
bcache: add mutex lock for bch_is_open
bcache: Correct printing of btree_gc_max_duration_ms
bcache: try to set b->parent properly
bcache: fix memory corruption in init error path
bcache: fix crash with incomplete cache set
bcache: Fix more early shutdown bugs
bcache: fix use-after-free in btree_gc_coalesce()
bcache: Fix an infinite loop in journal replay
bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
bcache: bcache_write tracepoint was crashing
bcache: fix typo in bch_bkey_equal_header
bcache: Allocate bounce buffers with GFP_NOWAIT
bcache: Make sure to pass GFP_WAIT to mempool_alloc()
bcache: fix uninterruptible sleep in writeback thread
bcache: wait for buckets when allocating new btree root
bcache: fix crash on shutdown in passthrough mode
bcache: fix lockdep warnings on shutdown
bcache allocator: send discards with correct size
bcache: Fix to remove the rcu_sched stalls.
...
Most interesting is that md devices (major == 9) with
minor numbers of 512 or more will no longer be created
simply by opening a block device file. They can only
be created by writing to
/sys/module/md_mod/parameters/new_array
The 'auto-create-on-open' semantic is cumbersome and we
need to start moving away from it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAU+hOmDnsnt1WYoG5AQJxZxAAjCJAOHiqIBeFg2auJdZF5u5dgqshJQk3
gMpl96Arazf5YXJNTMRoNsfgPg/XjG/9MDS9UoyzgAsQTRi90UYjq+gcSODdHGcF
pVFxdrPf3Ja+cnlniW3aCuosv143c9N06IUDOm5TuY7nioQQA7gVS6HEpctmLbgi
R5Q2jQ5GeSVhWKsowYBdzbso3y/QAH0zS5AS4tW3c+l7YMie3gfBqNHTHiQby9I1
NwEUaWrU13jUHT6M4CWIp+wqgORXVqFNHx1dxYefTEgIdB1Bfo15MgDorG1sReF7
I+H1zvc/QpvEQ/WWJ6Cg5gItVZqof6SPljnQHkt0kqtu6fDJD3wYBDYKE8VgP9k1
9zJUibyOOqDFAxrEkVy6XnQ50bYV2uNPFSKpw3jFLXpapRzL602NRWvgHGLkPEbS
TofriiykOzYGxKkIhIWtiqzOaOhPMo+Z771WGRjoF4ZOAALYRnItvC/FFoRKCHeZ
xTOp2xnKwAX6vHsrIHpJP+0/no2R8kzkwnSFSCgoRdraFGxfqY2N32svZaXFUt0c
FlaAktWPtM+gaVzzJzEnm4kfruvVRvPV9zyVa1ro990E+00X5eGHi7g7mPmTQ33q
gxQK/d/jL22RoOXppfaqj2lApQsQJ6F1rMyuvUwLK+gyFacMQyGtRfHqJo5JZKRT
rBFNoOcGh3A=
=v8Er
-----END PGP SIGNATURE-----
Merge tag 'md/3.17' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"Most interesting is that md devices (major == 9) with minor numbers of
512 or more will no longer be created simply by opening a block device
file. They can only be created by writing to
/sys/module/md_mod/parameters/new_array
The 'auto-create-on-open' semantic is cumbersome and we need to start
moving away from it"
* tag 'md/3.17' of git://neil.brown.name/md:
md: don't allow bitmap file to be added to raid0/linear.
md/raid0: check for bitmap compatability when changing raid levels.
md: Recovery speed is wrong
md: disable probing for md devices 512 and over.
md/raid1,raid10: always abort recover on write error.
Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by
default in blk_mq_init_queue for mq-enabled devices. The effect of
the flag is to bypass the SG segment merging. Instead, the
bio->bi_vcnt is used as the number of hardware segments.
With a device mapper target on top of a device with
QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
than a driver is prepared to handle. I ran into this when backporting
the virtio_blk mq support. It triggerred this BUG_ON, in
virtio_queue_rq:
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
The queue's max is set here:
blk_queue_max_segments(q, vblk->sg_elems-2);
Basically, what happens is that a bio is built up for the dm device
(which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
bio_add_page. That path will call into __blk_recalc_rq_segments, so
what you end up with is bi_phys_segments being much smaller than bi_vcnt
(and bi_vcnt grows beyond the maximum sg elements). Then, when the bio
is submitted, it gets cloned. When the cloned bio is submitted, it will
end up in blk_recount_segments, here:
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
bio->bi_phys_segments = bio->bi_vcnt;
and now we've set bio->bi_phys_segments to a number that is beyond what
was registered as queue_max_segments by the driver.
The right way to fix this is to propagate the queue flag up the stack.
The rules for propagating the flag are simple:
- if the flag is set for any underlying device, it must be set for the
upper device
- consequently, if the flag is not set for any underlying device, it
should not be set for the upper device.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.16+
An array can only accept a bitmap if it will call bitmap_daemon_work
periodically, which means it needs a thread running.
If there is no thread, don't allow a bitmap to be added.
Signed-off-by: NeilBrown <neilb@suse.de>
When we calculate the speed of recovery, the numerator that contains
the recovery done sectors. It's need to subtract the sectors which
don't finish recovery.
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull scheduler updates from Ingo Molnar:
- Move the nohz kick code out of the scheduler tick to a dedicated IPI,
from Frederic Weisbecker.
This necessiated quite some background infrastructure rework,
including:
* Clean up some irq-work internals
* Implement remote irq-work
* Implement nohz kick on top of remote irq-work
* Move full dynticks timer enqueue notification to new kick
* Move multi-task notification to new kick
* Remove unecessary barriers on multi-task notification
- Remove proliferation of wait_on_bit() action functions and allow
wait_on_bit_action() functions to support a timeout. (Neil Brown)
- Another round of sched/numa improvements, cleanups and fixes. (Rik
van Riel)
- Implement fast idling of CPUs when the system is partially loaded,
for better scalability. (Tim Chen)
- Restructure and fix the CPU hotplug handling code that may leave
cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
cpu. (Kirill Tkhai)
- Robustify the sched topology setup code. (Peterz Zijlstra)
- Improve sched_feat() handling wrt. static_keys (Jason Baron)
- Misc fixes.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
sched/fair: Fix 'make xmldocs' warning caused by missing description
sched: Use macro for magic number of -1 for setparam
sched: Robustify topology setup
sched: Fix sched_setparam() policy == -1 logic
sched: Allow wait_on_bit_action() functions to support a timeout
sched: Remove proliferation of wait_on_bit() action functions
sched/numa: Revert "Use effective_load() to balance NUMA loads"
sched: Fix static_key race with sched_feat()
sched: Remove extra static_key*() function indirection
sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
sched: Transform resched_task() into resched_curr()
sched/deadline: Kill task_struct->pi_top_task
sched: Rework check_for_tasks()
sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
sched/fair: Disable runtime_enabled on dying rq
sched/numa: Change scan period code to match intent
sched/numa: Rework best node setting in task_numa_migrate()
sched/numa: Examine a task move when examining a task swap
sched/numa: Simplify task_numa_compare()
sched/numa: Use effective_load() to balance NUMA loads
...
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.
Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.
Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
time_stats::btree_gc_max_duration_mc is not bit shifted by 8
Fixes BUG #138
Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.
Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.
This regression was introduced in 2d7f9531.
Change-Id: I76564d7257800583214376b4bacf236cda90c89c
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.
Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.
Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There were two issues here:
- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running
Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>