dm: add documentation for dm-vdo target

This adds the admin-guide documentation for dm-vdo. vdo.rst is the guide
to using dm-vdo. vdo-design.rst is an overview of the design of dm-vdo.

Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net>
Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

parent 54be6c6c5a
commit 04bf7ac646

Documentation/admin-guide/device-mapper/vdo-design.rst (new file, 415 lines)

@@ -0,0 +1,415 @@

.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression
rates of 14:1. All zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than in most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free. Most of a vdo's main data structures are
designed to be easily divided into "zones" such that any given bio must
only access a single zone of any zoned structure. Safety with minimal
locking is achieved by ensuring that during normal operation, each zone is
assigned to a specific thread, and only that thread will access the portion
of that data structure in that zone. Associated with each thread is a work
queue. Each bio is associated with a request object which can be added to a
work queue when the next phase of its operation requires access to the
structures in the zone associated with that queue. Although each structure
may be divided into zones, this division is not reflected in the on-disk
representation of each data structure. Therefore, the number of zones for
each structure, and hence the number of threads, is configured each time a
vdo target is started.
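
To make the relationship between zones and work queues concrete, here is a
minimal illustrative C sketch (not the actual dm-vdo code). Request objects
are routed to a per-zone queue so that each zone's state is only ever
touched by the thread that owns that queue; the structure names and the
simple modulo routing are invented for this example.

::

  /* Toy model of per-zone work queues; all names are hypothetical. */
  #include <stdio.h>

  #define NUM_ZONES 4

  struct request {
          unsigned long logical_block;   /* the logical block this bio touches */
          struct request *next;
  };

  /* One queue per zone; only that zone's thread would consume its queue. */
  struct zone_queue {
          struct request *head;
          struct request **tail;
  };

  static struct zone_queue queues[NUM_ZONES];

  /* A request is routed to the zone that owns its logical block. */
  static unsigned int zone_for_block(unsigned long logical_block)
  {
          return logical_block % NUM_ZONES;
  }

  static void enqueue(struct request *req)
  {
          struct zone_queue *q = &queues[zone_for_block(req->logical_block)];

          req->next = NULL;
          *q->tail = req;
          q->tail = &req->next;
  }

  int main(void)
  {
          struct request reqs[8];

          for (int z = 0; z < NUM_ZONES; z++)
                  queues[z].tail = &queues[z].head;
          for (unsigned long i = 0; i < 8; i++) {
                  reqs[i].logical_block = i * 3;
                  enqueue(&reqs[i]);
          }
          for (int z = 0; z < NUM_ZONES; z++) {
                  printf("zone %d:", z);
                  for (struct request *r = queues[z].head; r; r = r->next)
                          printf(" %lu", r->logical_block);
                  printf("\n");
          }
          return 0;
  }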

The Deduplication Index
-----------------------

In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data, and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones. Another important idea behind the design of the index is that the
ultimate goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. Most often, this occurs because it is
too costly to update the index when a block is over-written or discarded.
Doing so would require either storing the block name along with the blocks,
which is difficult to do efficiently in block-based storage, or reading and
rehashing each block before overwriting it. Inaccuracy can also result from
a hash collision, where two different blocks have the same name. In
practice, this is extremely unlikely, but because vdo does not use a
cryptographic hash, a malicious workload could be constructed to exploit
collisions. Because of these inaccuracies, vdo treats the locations in the
index as hints, and reads each indicated block to verify that it is indeed
a duplicate before sharing the existing block with a new one.
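
As a hedged illustration of "verify before sharing," the toy model below
keeps physical blocks in a small in-memory array and uses a deliberately
weak one-byte "block name." Only the control flow matters: an index hit is
treated as advice, the advised block is read and byte-compared, and the
data is shared only if the comparison succeeds. None of these names or
layouts come from the real implementation.

::

  /* Toy model: index advice is verified by comparing actual block data. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define BLOCK_SIZE 8            /* tiny blocks keep the toy model short */
  #define MAX_BLOCKS 16

  static uint8_t storage[MAX_BLOCKS][BLOCK_SIZE];
  static uint8_t refcount[MAX_BLOCKS];
  static uint64_t next_free;

  /* Toy "index": the block name is just the first byte of the data. */
  static int64_t advice_for_name[256];

  static uint64_t store_block(const uint8_t data[BLOCK_SIZE])
  {
          int64_t advice = advice_for_name[data[0]];

          /* Index entries are hints: verify by comparing the bytes. */
          if (advice >= 0 && !memcmp(storage[advice], data, BLOCK_SIZE)) {
                  refcount[advice]++;
                  return (uint64_t)advice;
          }
          /* No verified duplicate: consume a new physical block. */
          memcpy(storage[next_free], data, BLOCK_SIZE);
          refcount[next_free] = 1;
          advice_for_name[data[0]] = (int64_t)next_free;
          return next_free++;
  }

  int main(void)
  {
          uint8_t a[BLOCK_SIZE] = "AAAAAAA";
          uint8_t b[BLOCK_SIZE] = "BBBBBBB";

          memset(advice_for_name, -1, sizeof(advice_for_name));
          printf("a       -> pbn %llu\n", (unsigned long long)store_block(a));
          printf("b       -> pbn %llu\n", (unsigned long long)store_block(b));
          printf("a again -> pbn %llu\n", (unsigned long long)store_block(a));
          return 0;
  }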

Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.

Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. The
contents of a closed chapter are never altered in any way; these chapters
are read-only structures.

Once enough records have been written to fill up all the available index
space, the oldest chapter gets removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
to the open chapter. This ensures that useful block names remain available
in the index, while unreferenced block names are forgotten.

In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping a block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name is findable.
Older records for a block name can no longer be found even though they have
not been deleted. Like the chapter index, the volume index uses only a
subset of the block name as its key and cannot definitively say that a
record exists for a name. It can only say which chapter would contain the
record if a record exists. The volume index is stored entirely in memory
and is saved to storage only when the vdo target is shut down.

From the viewpoint of a request for a particular block name, the request
will first look up the name in the volume index, which will indicate either
that the record is new, or which chapter to search. If the latter, the
request looks up its name in the chapter index to determine if the record
is new, or which record page to search. Finally, if not new, the request
will look for its record on the indicated record page. This process may
require up to two page reads per request (one for the chapter index page
and one for the record page). However, recently accessed pages are cached
so that these page reads can be amortized across many block name requests.
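
The lookup path can be summarized with a small, self-contained toy model.
It is not the real data layout: the "names" are small integers, the subset
keys are simple modular hashes, and everything lives in memory. It only
shows how the volume index narrows the search to one chapter, the chapter
index narrows it to one record page, and the page itself is then scanned.

::

  /* Toy three-level lookup: volume index -> chapter index -> record page. */
  #include <stdint.h>
  #include <stdio.h>

  #define CHAPTERS 4
  #define PAGES_PER_CHAPTER 2
  #define RECORDS_PER_PAGE 4

  struct record { uint32_t name; uint64_t pbn; };

  /* Closed chapters: records grouped into record pages. */
  static struct record pages[CHAPTERS][PAGES_PER_CHAPTER][RECORDS_PER_PAGE];

  /* Volume index: newest chapter for (a subset of) a name, or -1 if none. */
  static int volume_index[64];

  /* Chapter index: which record page within a chapter may hold the name. */
  static int chapter_page(uint32_t name)
  {
          return name % PAGES_PER_CHAPTER;
  }

  static int lookup(uint32_t name, uint64_t *pbn)
  {
          int chapter = volume_index[name % 64];

          if (chapter < 0)
                  return 0;               /* the name is new */
          /* Both indexes use only part of the name, so the record page must
           * still be read and scanned to confirm the record exists. */
          struct record *page = pages[chapter][chapter_page(name)];

          for (int i = 0; i < RECORDS_PER_PAGE; i++) {
                  if (page[i].name == name) {
                          *pbn = page[i].pbn;
                          return 1;
                  }
          }
          return 0;
  }

  int main(void)
  {
          uint64_t pbn = 0;
          int found;

          for (int i = 0; i < 64; i++)
                  volume_index[i] = -1;
          /* Pretend chapter 2 recorded name 37 at physical block 9000. */
          pages[2][chapter_page(37)][0] = (struct record){ .name = 37, .pbn = 9000 };
          volume_index[37 % 64] = 2;

          found = lookup(37, &pbn);
          printf("37: found=%d pbn=%llu\n", found, (unsigned long long)pbn);
          printf("38: found=%d\n", lookup(38, &pbn));
          return 0;
  }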

The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire key (the block name) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be evenly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed in a Huffman code to take up even less space. The
entire sorted list of keys is called a delta list. This structure allows
the index to use many fewer bytes per entry than a traditional hash table,
but it is slightly more expensive to look up entries, because a request
must read every entry in a delta list to add up the deltas in order to find
the record it needs. The delta index reduces this lookup cost by splitting
its key space into many sub-lists, each starting at a fixed key value, so
that each individual list is short.
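
The core delta list idea fits in a few lines. The sketch below stores a
sorted set of keys as successive differences and recovers each key during
lookup by summing the deltas from the start of the list. The Huffman coding
of the deltas and the division into sub-lists are omitted, and the key
values are arbitrary.

::

  /* Sketch of a delta list: sorted keys stored as differences. */
  #include <stdbool.h>
  #include <stdio.h>

  /* Keys 3, 5, 11, 12, 21 stored as deltas from the previous key. */
  static const unsigned int deltas[] = { 3, 2, 6, 1, 9 };
  static const unsigned int count = sizeof(deltas) / sizeof(deltas[0]);

  static bool delta_list_contains(unsigned int key)
  {
          unsigned int sum = 0;

          /* Every earlier entry must be read to reconstruct its key. */
          for (unsigned int i = 0; i < count; i++) {
                  sum += deltas[i];
                  if (sum == key)
                          return true;
                  if (sum > key)          /* keys are sorted; we passed it */
                          return false;
          }
          return false;
  }

  int main(void)
  {
          printf("11 present: %d\n", delta_list_contains(11));
          printf("13 present: %d\n", delta_list_contains(13));
          return 0;
  }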

The default index size can hold 64 million records, corresponding to about
256GB. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. So if you write a 200 GB file to a vdo
target, and then immediately write it again, the two copies will
deduplicate perfectly. Doing the same with a 500 GB file will result in no
deduplication, because the beginning of the file will no longer be in the
index by the time the second write begins (assuming there is no duplication
within the file itself).
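
The 256GB figure follows directly from the record count and the block size.
A small worked example, taking "64 million" as 2^26 for round numbers:

::

  /* Worked example: index capacity times block size gives the window. */
  #include <stdio.h>

  int main(void)
  {
          unsigned long long records = 64ULL << 20;   /* ~64 million records */
          unsigned long long block_size = 4096;       /* bytes per 4K block */
          unsigned long long window = records * block_size;

          printf("deduplication window: %llu GB\n", window >> 30);
          return 0;
  }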

If you anticipate a data workload that will see useful deduplication beyond
the 256GB threshold, vdo can be configured to use a larger index with a
correspondingly larger deduplication window. (This configuration can only
be set when the target is created, not altered later. It is important to
consider the expected workload for a vdo target before configuring it.)
There are two ways to do this.

One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.

The other way is to enable sparse indexing. Sparse indexing increases the
deduplication window by a factor of 10, at the expense of also increasing
the storage size by a factor of 10. However, with sparse indexing, the
memory requirements do not increase; the trade-off is slightly more
computation per request, and a slight decrease in the amount of
deduplication detected. (For workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard, or "dense", index will detect.)

The Data Store
--------------

The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally, each slab
has a journal. Reference updates are written to the slab journal, which is
written out one block at a time as each block fills. A copy of the
reference counters is kept in memory, and is written out a block at a
time, in oldest-dirtied order, whenever there is a need to reclaim slab
journal space. The journal is used both to ensure that the main recovery
journal (see below) can regularly free up space, and also to amortize the
cost of updating individual reference blocks.

Each slab is independent of every other. They are assigned to "physical
zones" in round-robin fashion. If there are P physical zones, then slab n
is assigned to zone n mod P.
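
The round-robin assignment is a straightforward modulo. The sketch below
also derives the slab, and therefore the owning physical zone, for a
physical block number; the slab size used here is an arbitrary example, not
a vdo default.

::

  /* Sketch: slab n belongs to physical zone n mod P. */
  #include <stdio.h>

  int main(void)
  {
          unsigned long long blocks_per_slab = 8192;  /* example: 32 MB slabs */
          unsigned long long physical_zones = 3;      /* P */
          unsigned long long pbn = 123456;            /* some physical block */

          unsigned long long slab = pbn / blocks_per_slab;

          printf("pbn %llu -> slab %llu -> physical zone %llu\n",
                 pbn, slab, slab % physical_zones);
          return 0;
  }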

The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry for each slab
indicating whether or not the slab has ever been used, whether it is clean
(i.e. all of its reference count updates have been persisted to storage),
and approximately how full it is. During recovery, each physical zone will
attempt to recover at least one slab, stopping whenever it has recovered a
slab which has some free blocks. Once each zone has some space (or has
determined that none is available), the target can resume normal operation
in a degraded mode. Read and write requests can be serviced, perhaps with
degraded performance, while the remainder of the dirty slabs are recovered.

*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the compression
slots in the compressed block this logical address maps to (see below).
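
The 36-bit physical block number and the 4-bit mapping state together
occupy exactly 40 bits, which is why an entry fits in 5 bytes. The sketch
below packs and unpacks such an entry; the bit and byte order are arbitrary
choices for illustration and are not the on-disk format.

::

  /* Sketch: a 36-bit PBN plus a 4-bit state packed into 5 bytes. */
  #include <stdint.h>
  #include <stdio.h>

  #define STATE_UNMAPPED     0U   /* never written, or discarded */
  #define STATE_UNCOMPRESSED 1U
  /* States 2..15 would name one of the 14 compression slots. */

  struct packed_entry {
          uint8_t bytes[5];
  };

  static struct packed_entry pack(uint64_t pbn, unsigned int state)
  {
          uint64_t v = ((pbn & 0xFFFFFFFFFULL) << 4) | (state & 0xF);
          struct packed_entry e;

          for (int i = 0; i < 5; i++)
                  e.bytes[i] = (uint8_t)(v >> (8 * i));
          return e;
  }

  static void unpack(struct packed_entry e, uint64_t *pbn, unsigned int *state)
  {
          uint64_t v = 0;

          for (int i = 0; i < 5; i++)
                  v |= (uint64_t)e.bytes[i] << (8 * i);
          *state = (unsigned int)(v & 0xF);
          *pbn = v >> 4;
  }

  int main(void)
  {
          uint64_t pbn;
          unsigned int state;

          unpack(pack(123456789, STATE_UNCOMPRESSED), &pbn, &state);
          printf("pbn=%llu state=%u\n", (unsigned long long)pbn, state);
          return 0;
  }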

In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries (812 being the number that
fit). Each mapping page is actually a leaf of a radix tree which consists
of block map pages at each level. There are 60 radix trees which are
assigned to "logical zones" in round-robin fashion (if there are L logical
zones, tree n will belong to zone n mod L). At each level, the trees are
interleaved, so logical addresses 0-811 belong to tree 0, logical addresses
812-1623 belong to tree 1, and so on. The interleaving is maintained all
the way up the forest. 60 was chosen as the number of trees because it is
highly composite and hence results in an evenly distributed number of trees
per zone for a large number of possible logical zone counts. The storage
for the 60 tree roots is allocated at format time. All other block map
pages are allocated out of the slabs as needed. This flexible allocation
avoids the need to pre-allocate space for the entire set of logical
mappings and also makes growing the logical size of a vdo easy to
implement.
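
Given the interleaving, the tree (and hence the logical zone) covering a
logical block number follows from two divisions. A small worked example,
assuming a hypothetical count of 4 logical zones:

::

  /* Worked example: which tree and logical zone cover a logical block. */
  #include <stdio.h>

  #define ENTRIES_PER_PAGE 812
  #define NUM_TREES 60

  int main(void)
  {
          unsigned long long logical_zones = 4;   /* L, chosen at target start */
          unsigned long long lbns[] = { 0, 811, 812, 48720, 1000000 };

          for (int i = 0; i < 5; i++) {
                  unsigned long long tree = (lbns[i] / ENTRIES_PER_PAGE) % NUM_TREES;

                  printf("lbn %llu -> tree %llu -> logical zone %llu\n",
                         lbns[i], tree, tree % logical_zones);
          }
          return 0;
  }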

In operation, the block map maintains two caches. It is prohibitive to keep
the entire leaf level of the trees in memory, so each logical zone
maintains its own cache of leaf pages. The size of this cache is
configurable at target start time. The second cache is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map. This cache is populated as needed.

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it (block map
pages are never reclaimed, so the old mapping is always 0). Each journal
entry and the data write it represents must be stable on disk before the
other metadata structures may be updated to reflect the operation.

*Write Path*

A write bio is first assigned a "data_vio," the request object which will
operate on behalf of the bio. (A "vio," from Vdo I/O, is vdo's wrapper for
bios; metadata operations use a vio, whereas submitted bios require the
much larger data_vio.) There is a fixed pool of 2048 data_vios. This number
was chosen both to bound the amount of work that is required to recover
from a crash, and because measurements indicate that increasing it consumes
more resources, but does not improve performance. These measurements have
been, and should continue to be, revisited over time.

Once a data_vio is assigned, the following steps are performed (a condensed
sketch of this flow appears after the list):

1. The bio's data is checked to see if it is all zeros, and copied if not.

2. A lock is obtained on the logical address of the bio. Because
   deduplication involves sharing blocks, it is vital to prevent
   simultaneous modifications of the same block.

3. The block map tree is traversed, loading any non-leaf pages which cover
   the logical address and are not already in memory. If any of these
   pages, or the leaf page which covers the logical address, have not been
   allocated, and the block is not all zeros, they are allocated at this
   time.

4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
   made to allocate a free data block.

5. If an allocation was obtained, the bio is acknowledged.

6. The bio's data is hashed.

7. The data_vio obtains or joins a "hash lock," which represents all of
   the bios currently writing the same data.

8. If the hash lock does not already have a data_vio acting as its agent,
   the current one assumes that role. As the agent:

   a) The index is queried.

   b) If an entry is found, the indicated block is read and compared
      to the data being written.

   c) If the data matches, we have identified duplicate data. As many
      of the data_vios as there are references available for that
      block (including the agent) are shared. If there are more
      data_vios in the hash lock than there are references available,
      one of them becomes the new agent and continues as if there was
      no duplicate found.

   d) If no duplicate was found, and the agent in the hash lock does
      not have an allocation (from step 3), another data_vio in the
      hash lock will become the agent and write the data. If no
      data_vio in the hash lock has an allocation, the data_vios will
      be marked out of space and go to step 13 for cleanup.

      If there is an allocation, the data being written will be
      compressed. If the compressed size is sufficiently small, the
      data_vio will go to the packer where it may be placed in a bin
      along with other data_vios.

   e) Once a bin is full, either because it is out of space, or
      because all 14 of its slots are in use, it is written out.

   f) Each data_vio from the bin just written is the agent of some
      hash lock; it will now proceed to treat the just-written
      compressed block as if it were a duplicate and share it with as
      many other data_vios in its hash lock as possible.

   g) If the agent's data is not compressed, it will attempt to write
      its data to the block it has allocated.

   h) If the data was written, this new block is treated as a
      duplicate and shared as much as possible with any other
      data_vios in the hash lock.

   i) If the agent wrote new data (whether compressed or not), the
      index is updated to reflect the new entry.

9. The block map is queried to determine the previous mapping of the
   logical address.

10. An entry is made in the recovery journal. The data_vio will block in
    the journal until a flush has completed to ensure the data it may have
    written is stable. It must also wait until its journal entry is stable
    on disk. (Journal writes are all issued with the FUA bit set.)

11. Once the recovery journal entry is stable, the data_vio makes two slab
    journal entries: an increment entry for the new mapping, and a
    decrement entry for the old mapping, if that mapping was non-zero. For
    correctness during recovery, the slab journal entries in any given slab
    journal must be in the same order as the corresponding recovery journal
    entries. Therefore, if the two entries are in different zones, they are
    made concurrently, and if they are in the same zone, the increment is
    always made before the decrement in order to avoid underflow. After
    each slab journal entry is made in memory, the associated reference
    count is also updated in memory. Each of these updates will get written
    out as needed. (Slab journal blocks are written out either when they
    are full, or when the recovery journal requests they do so in order to
    allow the recovery journal to free up space; reference count blocks are
    written out whenever the associated slab journal requests they do so in
    order to free up slab journal space.)

12. Once all the reference count updates are done, the block map is updated
    and the write is complete.

13. If the data_vio did not use its allocation, it releases the allocated
    block, the hash lock (if it has one), and its logical lock. The
    data_vio then returns to the pool.
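
The following condensed C sketch shows the same flow as a single function.
The real implementation is asynchronous and message-passing (each phase is
queued to the appropriate zone thread), so this only illustrates the
ordering of the steps; every helper here is a trivial, hypothetical
stand-in, not a dm-vdo function.

::

  /* Condensed, hypothetical sketch of the write path ordering. */
  #include <stdbool.h>
  #include <stdint.h>

  struct data_vio {
          uint64_t lbn;            /* logical block being written */
          uint64_t old_pbn;        /* previous mapping, 0 if unmapped */
          uint64_t new_pbn;        /* new mapping, 0 for a zero block */
  };

  /* Trivial stand-ins for the structures described in this document. */
  static bool all_zeros(const uint8_t *d)
  {
          for (int i = 0; i < 4096; i++)
                  if (d[i])
                          return false;
          return true;
  }
  static uint64_t allocate_block(void) { static uint64_t next = 1; return next++; }
  static bool find_duplicate(const uint8_t *d, uint64_t *pbn) { (void)d; (void)pbn; return false; }
  static uint64_t block_map_lookup(uint64_t lbn) { (void)lbn; return 0; }
  static void journal_remapping(const struct data_vio *v) { (void)v; }
  static void adjust_reference_count(uint64_t pbn, int delta) { (void)pbn; (void)delta; }
  static void block_map_update(uint64_t lbn, uint64_t pbn) { (void)lbn; (void)pbn; }

  static void write_block(struct data_vio *vio, const uint8_t *data)
  {
          /* steps 1-3: zero check, logical lock, block map tree pages */
          if (!all_zeros(data)) {
                  vio->new_pbn = allocate_block();              /* step 4 */
                  /* steps 5-8: ack, hash, hash lock, dedup and compression;
                   * a verified duplicate replaces our own allocation. */
                  find_duplicate(data, &vio->new_pbn);
          }
          vio->old_pbn = block_map_lookup(vio->lbn);            /* step 9 */
          journal_remapping(vio);                               /* step 10 (FUA) */
          if (vio->new_pbn)                                     /* step 11: */
                  adjust_reference_count(vio->new_pbn, +1);     /*  increment first, */
          if (vio->old_pbn)
                  adjust_reference_count(vio->old_pbn, -1);     /*  then decrement */
          block_map_update(vio->lbn, vio->new_pbn);             /* step 12 */
  }

  int main(void)
  {
          static uint8_t data[4096] = { 7 };
          struct data_vio vio = { .lbn = 42 };

          write_block(&vio, data);
          return 0;
  }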

*Read Path*

Reads are much simpler than writes. After a data_vio is assigned to the
bio, and the logical lock is obtained, the block map is queried. If the
block is mapped, the appropriate physical block is read, and if necessary,
decompressed.

*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read. The increment portion of each valid entry is
played into the block map. Next, valid entries are played, in order as
required, into the slab journals. Finally, each physical zone attempts to
replay at least one slab journal to reconstruct the reference counts of one
slab. Once each zone has some free space (or has determined that it has
none), the vdo comes back online, while the remainder of the slab journals
are used to reconstruct the rest of the reference counts.

*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to rebuild as best it can in order to
return to a writable state. However, this is never done automatically due
to the likelihood that data has been lost. During a read-only rebuild, the
block map is recovered from the recovery journal as before. However, the
reference counts are not rebuilt from the slab journals. Rather, the
reference counts are zeroed, and then the entire block map is traversed,
and the reference counts are updated from it. While this may lose some
data, it ensures that the block map and reference counts are consistent.

Documentation/admin-guide/device-mapper/vdo.rst (new file, 388 lines)

@@ -0,0 +1,388 @@

.. SPDX-License-Identifier: GPL-2.0-only

dm-vdo
======

The dm-vdo (virtual data optimizer) device mapper target provides
block-level deduplication, compression, and thin provisioning. As a device
mapper target, it can add these features to the storage stack, compatible
with any file system. The vdo target does not protect against data
corruption, relying instead on integrity protection of the storage below
it. It is strongly recommended that lvm be used to manage vdo volumes. See
lvmvdo(7).

Userspace component
===================

Formatting a vdo volume requires the use of the 'vdoformat' tool, available
at:

https://github.com/dm-vdo/vdo/

In most cases, a vdo target will recover from a crash automatically the
next time it is started. In cases where it encountered an unrecoverable
error (either during normal operation or crash recovery) the target will
enter or come up in read-only mode. Because read-only mode is indicative of
data loss, a positive action must be taken to bring vdo out of read-only
mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
prepare a read-only vdo to exit read-only mode. After running this tool,
the vdo target will rebuild its metadata the next time it is started.
Although some data may be lost, the rebuilt vdo's metadata will be
internally consistent and the target will be writable again.

The repo also contains additional userspace tools which can be used to
inspect a vdo target's on-disk metadata. Fortunately, these tools are
rarely needed except by dm-vdo developers.

Target interface
================

Table line
----------

::

  <offset> <logical device size> vdo V4 <storage device>
  <storage device size> <minimum I/O size> <block map cache size>
  <block map era length> [optional arguments]

Required parameters:

offset:
        The offset, in sectors, at which the vdo volume's logical
        space begins.

logical device size:
        The size of the device which the vdo volume will service,
        in sectors. Must match the current logical size of the vdo
        volume.

storage device:
        The device holding the vdo volume's data and metadata.

storage device size:
        The size of the device holding the vdo volume, as a number
        of 4096-byte blocks. Must match the current size of the vdo
        volume.

minimum I/O size:
        The minimum I/O size for this vdo volume to accept, in
        bytes. Valid values are 512 or 4096. The recommended value
        is 4096.

block map cache size:
        The size of the block map cache, as a number of 4096-byte
        blocks. The minimum and recommended value is 32768 blocks.
        If the logical thread count is non-zero, the cache size
        must be at least 4096 blocks per logical thread.

block map era length:
        The speed with which the block map cache writes out
        modified block map pages. A smaller era length is likely to
        reduce the amount of time spent rebuilding, at the cost of
        increased block map writes during normal operation. The
        maximum and recommended value is 16380; the minimum value
        is 1.

Optional parameters:
--------------------
Some or all of these parameters may be specified as <key> <value> pairs.

Thread related parameters:

Different categories of work are assigned to separate thread groups, and
the number of threads in each group can be configured separately.

If <hash>, <logical>, and <physical> are all set to 0, the work handled by
all three thread types will be handled by a single thread. If any of these
values are non-zero, all of them must be non-zero.

ack:
        The number of threads used to complete bios. Since
        completing a bio calls an arbitrary completion function
        outside the vdo volume, threads of this type allow the vdo
        volume to continue processing requests even when bio
        completion is slow. The default is 1.

bio:
        The number of threads used to issue bios to the underlying
        storage. Threads of this type allow the vdo volume to
        continue processing requests even when bio submission is
        slow. The default is 4.

bioRotationInterval:
        The number of bios to enqueue on each bio thread before
        switching to the next thread. The value must be greater
        than 0 and not more than 1024; the default is 64.

cpu:
        The number of threads used to do CPU-intensive work, such
        as hashing and compression. The default is 1.

hash:
        The number of threads used to manage data comparisons for
        deduplication based on the hash value of data blocks. The
        default is 0.

logical:
        The number of threads used to manage caching and locking
        based on the logical address of incoming bios. The default
        is 0; the maximum is 60.

physical:
        The number of threads used to manage administration of the
        underlying storage device. At format time, a slab size for
        the vdo is chosen; the vdo storage device must be large
        enough to have at least 1 slab per physical thread. The
        default is 0; the maximum is 16.

Miscellaneous parameters:

maxDiscard:
        The maximum size of discard bio accepted, in 4096-byte
        blocks. I/O requests to a vdo volume are normally split
        into 4096-byte blocks, and processed up to 2048 at a time.
        However, discard requests to a vdo volume can be
        automatically split to a larger size, up to <maxDiscard>
        4096-byte blocks in a single bio, and are limited to 1500
        at a time. Increasing this value may provide better overall
        performance, at the cost of increased latency for the
        individual discard requests. The default and minimum is 1;
        the maximum is UINT_MAX / 4096.

deduplication:
        Whether deduplication is enabled. The default is 'on'; the
        acceptable values are 'on' and 'off'.

compression:
        Whether compression is enabled. The default is 'off'; the
        acceptable values are 'on' and 'off'.

Device modification
-------------------

A modified table may be loaded into a running, non-suspended vdo volume.
The modifications will take effect when the device is next resumed. The
modifiable parameters are <logical device size>, <physical device size>,
<maxDiscard>, <compression>, and <deduplication>.

If the logical device size or physical device size is changed, upon
successful resume vdo will store the new values and require them on future
startups. These two parameters may not be decreased. The logical device
size may not exceed 4 PB. The physical device size must increase by at
least 32832 4096-byte blocks if at all, and must not exceed the size of the
underlying storage device. Additionally, when formatting the vdo device, a
slab size is chosen: the physical device size may never increase above the
size which provides 8192 slabs, and each increase must be large enough to
add at least one new slab.
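
As a hedged sketch of these growth rules (the slab size and device sizes
below are made up, and the authoritative checks live in the target and in
vdoformat), a proposed new physical size could be validated like this:

::

  /* Sketch of the physical-growth rules described above. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define MIN_GROWTH_BLOCKS 32832ULL     /* 4096-byte blocks */
  #define MAX_SLABS 8192ULL

  static bool physical_growth_ok(uint64_t old_blocks, uint64_t new_blocks,
                                 uint64_t blocks_per_slab, uint64_t device_blocks)
  {
          if (new_blocks < old_blocks + MIN_GROWTH_BLOCKS)
                  return false;          /* must grow by at least 32832 blocks */
          if (new_blocks > device_blocks)
                  return false;          /* cannot exceed the backing device */
          if (new_blocks / blocks_per_slab > MAX_SLABS)
                  return false;          /* never more than 8192 slabs */
          if (new_blocks / blocks_per_slab == old_blocks / blocks_per_slab)
                  return false;          /* must add at least one whole slab */
          return true;
  }

  int main(void)
  {
          /* 2 GB -> 3 GB with 32768-block (128 MB) slabs on a 4 GB device. */
          printf("%d\n", physical_growth_ok(524288, 786432, 32768, 1048576));
          return 0;
  }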

Examples:

Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
physical space, storing to /dev/dm-1 which has more than 1 GB of space.

::

  dmsetup create vdo0 --table \
    "0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"

Grow the logical size to 4 GB.

::

  dmsetup reload vdo0 --table \
    "0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
  dmsetup resume vdo0

Grow the physical size to 2 GB.

::

  dmsetup reload vdo0 --table \
    "0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
  dmsetup resume vdo0

Grow the physical size by 1 GB more and increase the maximum discard size.

::

  dmsetup reload vdo0 --table \
    "0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
  dmsetup resume vdo0

Stop the vdo volume.

::

  dmsetup remove vdo0

Start the vdo volume again. Note that the logical and physical device sizes
must still match, but other parameters can change.

::

  dmsetup create vdo1 --table \
    "0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"

Messages
--------
All vdo devices accept messages in the form:

::

  dmsetup message <target-name> 0 <message-name> <message-parameters>

The messages are:

stats:
        Outputs the current view of the vdo statistics. Mostly used
        by the vdostats userspace program to interpret the output
        buffer.

dump:
        Dumps many internal structures to the system log. This is
        not always safe to run, so it should only be used to debug
        a hung vdo. Optional parameters to specify structures to
        dump are:

                viopool: The pool of I/O requests for incoming bios
                pools: A synonym of 'viopool'
                vdo: Most of the structures managing on-disk data
                queues: Basic information about each vdo thread
                threads: A synonym of 'queues'
                default: Equivalent to 'queues vdo'
                all: All of the above.

dump-on-shutdown:
        Perform a default dump next time vdo shuts down.

Status
------

::

  <device> <operating mode> <in recovery> <index state>
  <compression state> <used physical blocks> <total physical blocks>

device:
        The name of the vdo volume.

operating mode:
        The current operating mode of the vdo volume; values may be
        'normal', 'recovering' (the volume has detected an issue
        with its metadata and is attempting to repair itself), and
        'read-only' (an error has occurred that forces the vdo
        volume to only support read operations and not writes).

in recovery:
        Whether the vdo volume is currently in recovery mode;
        values may be 'recovering' or '-' which indicates not
        recovering.

index state:
        The current state of the deduplication index in the vdo
        volume; values may be 'closed', 'closing', 'error',
        'offline', 'online', 'opening', and 'unknown'.

compression state:
        The current state of compression in the vdo volume; values
        may be 'offline' and 'online'.

used physical blocks:
        The number of physical blocks in use by the vdo volume.

total physical blocks:
        The total number of physical blocks the vdo volume may use;
        the difference between this value and the
        <used physical blocks> is the number of blocks the vdo
        volume has left before being full.

Memory Requirements
===================

A vdo target requires a fixed 38 MB of RAM along with the following amounts
that scale with the target:

- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
  block map cache requires a minimum of 150 MB.
- 1.6 MB of RAM for each 1 TB of logical space.
- 268 MB of RAM for each 1 TB of physical storage managed by the volume.

The deduplication index requires additional memory which scales with the
size of the deduplication window. For dense indexes, the index requires 1
GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
of RAM per 10 TB of window. The index configuration is set when the target
is formatted and may not be modified.
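
Putting those rules together, a back-of-the-envelope estimate for a
hypothetical target (the sizes below are examples, not recommendations)
might look like this:

::

  /* Worked example of the memory formula above for a hypothetical target. */
  #include <stdio.h>

  int main(void)
  {
          double cache_mb = 128.0;       /* configured block map cache size */
          double logical_tb = 10.0;
          double physical_tb = 4.0;
          double index_window_tb = 1.0;  /* dense index: 1 GB RAM per TB of window */

          double ram_mb = 38.0
                  + 1.15 * cache_mb
                  + 1.6 * logical_tb
                  + 268.0 * physical_tb;
          double index_gb = index_window_tb * 1.0;   /* divide by 10 if sparse */

          printf("target: %.1f MB, index: %.1f GB\n", ram_mb, index_gb);
          return 0;
  }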

Run-time Usage
==============

When using dm-vdo, it is important to be aware of the ways in which its
behavior differs from other storage targets.

- There is no guarantee that over-writes of existing blocks will succeed.
  Because the underlying storage may be multiply referenced, over-writing
  an existing block generally requires a vdo to have a free block
  available.

- When blocks are no longer in use, sending a discard request for those
  blocks lets the vdo release references for those blocks. If the vdo is
  thinly provisioned, discarding unused blocks is essential to prevent the
  target from running out of space. However, due to the sharing of
  duplicate blocks, no discard request for any given logical block is
  guaranteed to reclaim space.

- Assuming the underlying storage properly implements flush requests, vdo
  is resilient against crashes. However, unflushed writes may or may not
  persist after a crash.

- Each write to a vdo target entails a significant amount of processing.
  However, much of the work is parallelizable. Therefore, vdo targets
  achieve better throughput at higher I/O depths, and can support up to
  2048 requests in parallel.

Tuning
======

The vdo device has many options, and it can be difficult to make optimal
choices without perfect knowledge of the workload. Additionally, most
configuration options must be set when a vdo target is started and cannot
be changed while the target is active; changing them requires shutting the
target down completely. Ideally, tuning with simulated workloads should be
performed before deploying vdo in production environments.

The most important value to adjust is the block map cache size. In order to
service a request for any logical address, a vdo must load the portion of
the block map which holds the relevant mapping. These mappings are cached.
Performance will suffer when the working set does not fit in the cache. By
default, a vdo allocates 128 MB of metadata cache in RAM to support
efficient access to 100 GB of logical space at a time. It should be scaled
up proportionally for larger working sets.
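
The 128 MB per 100 GB rule of thumb also translates into the <block map
cache size> table parameter, which is given in 4096-byte blocks. A small
worked example for a hypothetical 500 GB working set:

::

  /* Worked example: scaling the block map cache with the working set. */
  #include <stdio.h>

  int main(void)
  {
          double working_set_gb = 500.0;
          double cache_mb = 128.0 * (working_set_gb / 100.0);
          unsigned long cache_blocks =
                  (unsigned long)(cache_mb * 1024 * 1024 / 4096);

          printf("suggested cache: %.0f MB (%lu 4096-byte blocks)\n",
                 cache_mb, cache_blocks);
          return 0;
  }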

The logical and physical thread counts should also be adjusted. A logical
thread controls a disjoint section of the block map, so additional logical
threads increase parallelism and can increase throughput. Physical threads
control a disjoint section of the data blocks, so additional physical
threads can also increase throughput. However, excess threads can waste
resources and increase contention.

Bio submission threads control the parallelism involved in sending I/O to
the underlying storage; fewer threads mean there is more opportunity to
reorder I/O requests for performance benefit, but also that each I/O
request has to wait longer before being submitted.

Bio acknowledgment threads are used for finishing I/O requests. This is
done on dedicated threads since the amount of work required to execute a
bio's callback cannot be controlled by the vdo itself. Usually one thread
is sufficient, but additional threads may be beneficial, particularly when
bios have CPU-heavy callbacks.

CPU threads are used for hashing and for compression; in workloads with
compression enabled, more threads may result in higher throughput.

Hash threads are used to sort active requests by hash and determine whether
they should deduplicate; the most CPU-intensive action done by these
threads is the comparison of 4096-byte data blocks. In most cases, a single
hash thread is sufficient.