Write failures should not happen under normal circumstances,
so in order to bring the chunk back into a known state as soon
as possible, evacuate all the valid data out of the line and let the
fw judge if the block can be written to in the next reset cycle.
Do this by introducing a new gc list for lines with failed writes,
and ensure that the rate limiter allocates a small portion of
the write bandwidth to get the job done.
The lba list is saved in memory for use during gc as we
cannot gurantee that the emeta data is readable if a write
error occurred.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The write error recovery path is incomplete, so rework
the write error recovery handling to do resubmits directly
from the write buffer.
When a write error occurs, the remaining sectors in the chunk are
mapped out and invalidated and the request inserted in a resubmit list.
The writer thread checks if there are any requests to resubmit,
scans and invalidates any lbas that have been overwritten by later
writes and resubmits the failed entries.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove dead function for manual sync. I/O
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the namespace is unregistered before the LightNVM target is removed
(e.g., on hot unplug) it is too late for the target to store any metadata
on the device - any attempt to write to the device will fail. In this
case, pass on a "gracefull teardown" flag to the target to let it know
when this happens.
In the case of pblk, we pad the open line (close all open chunks) to
improve data retention. In the event of an ungraceful shutdown, avoid
this part and just clean up.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do the check for the chunk state after making sure that the chunk type
is supported.
Fixes: 32ef9412c1 ("lightnvm: pblk: implement get log report chunk")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call nvm_submit_io directly and remove an unnecessary indirection on the
read path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When cleaning up buffer entries as we wrap up, their state should be
"completed". If any of the entries is in "submitted" state, it means
that something bad has happened. Trigger a warning immediately instead of
waiting for the state flag to eventually be updated, thus hiding the
issue.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the event of a mismatch between the read LBA and the metadata pointer
reported by the device, improve the error message to be able to detect
the offending physical address (PPA) mapped to the corrupted LBA.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check that the lba stored in the LBA metadata is correct in the GC path
too. This requires a new helper function to check random reads in the
vector read.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bad blocks can grow at runtime. Check that the number of valid blocks in
a line are within the sanity threshold before allocating the line for
new writes.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the event of a line failing to allocate, fail gracefully and stop the
pipeline to avoid more write failing in the same place.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert lightnvm to embedded bio sets.
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Smatch complains that flush_workqueue() dereferences the work queue
pointer but then we check if it's NULL on the next line when it's too
late. These NULL checks can be removed because the module won't load if
we can't allocate the work queues.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the line has not been written to, we should not
try to recover any data from it, so check the state of the
chunks in the line before attempting to read smeta.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement 2.0 support in pblk. This includes the address formatting and
mapping paths, as well as the sysfs entries for them.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation of pblk supporting 2.0, implement the get log report
chunk in pblk. Also, define the chunk states as given in the 2.0 spec.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for 2.0 support in pblk, rename variables referring to
the address format to addrf and reserve ppaf for the 1.2 path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At this point, only 1.2 spec is supported, thus check for it. Also,
since device-side L2P is only supported in the 1.2 spec, make sure to
only check its value under 1.2.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 2.0 spec provides a report chunk log page that can be retrieved
using the stangard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.
This patch implements the helper functions to allow targets retrieve the
chunk metadata using get log page. It makes nvme_get_log_ext available
outside of nvme core so that we can use it form lightnvm.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On address conversions, use the generic device, instead of the target
device. This allows to use conversions outside of the target's realm.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for 2.0 address format. Also, align address bits for 1.2 and
2.0 to be able to operate on channel and luns without requiring a format
conversion. Use a generic address format for this purpose.
Also, convert the generic operations to the generic format in pblk.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normalize nomenclature for naming channels, luns, chunks, planes and
sectors as well as derivations in order to improve readability.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Separate the version between major and minor on the generic geometry and
represent it through sysfs in the 2.0 path. The 1.2 path only shows the
major version to preserve the existing user space interface.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the device geometry is stored redundantly in the nvm_id and
nvm_geo structures at a device level. Moreover, when instantiating
targets on a specific number of LUNs, these structures are replicated
and manually modified to fit the instance channel and LUN partitioning.
Instead, create a generic geometry around nvm_geo, which can be used by
(i) the underlying device to describe the geometry of the whole device,
and (ii) instances to describe their geometry independently.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently all functions for handling the lightnvm core ioctl commands
do a check for CAP_SYS_ADMIN.
Change this to fail early in nvm_ctl_ioctl(), so we don't have to
duplicate the permission checks all over.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
fix reading bad block device information to correctly setup the per line
blk_bitmap during lightnvm initialization
Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then sets the correct logical block and metadata
size. Due to the OCSSD 2.0 specification relies on the namespace to
expose these sizes for correct initialization, let it be updated
appropriately on the LightNVM side as well.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The value of max_phys_sect is always static. Instead of
defining it in the nvm_dev_ops structure, declare it as a global
value.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no groups in the 2.0 specification, make sure that the
nvm_id structure is flattened before 2.0 data structures are added.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for the OCSSD 2.0 spec. bad block identification,
refactor the current code to generalize bad block get/set functions and
structures.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure that we are not advancing the sync pointer while
we're adding bios to the write buffer entry completion list.
This race condition results in bios not completing and was identified
by a hang when running xfstest generic/113.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When shutting down pblk the write buffer is flushed and if the
current line can't fit the data in the write buffer we need
to allocate a new line, so remove the check that prevents this.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unless we delete the timer that wakes up the write thread
before we stop the thread we risk re-starting the thread, so
delete the timer first.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When pblk receives a sync, all data up to that point in the write buffer
must be comitted to persistent storage, and as flash memory comes with a
minimal write size there is a significant cost involved both in terms
of time for completing the sync and in terms of write amplification
padded sectors for filling up to the minimal write size.
In order to get a better understanding of the costs involved for syncs,
Add a sysfs attribute to pblk: padded_dist, showing a normalized
distribution of sectors padded. In order to facilitate measurements of
specific workloads during the lifetime of the pblk instance, the
distribution can be reset by writing 0 to the attribute.
Do this by introducing counters for each possible padding:
{0..(minimal write size - 1)} and calculate the normalized distribution
when showing the attribute.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Rearranged total_buckets statement in pblk_sysfs_get_padding_dist
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In a SSD, write amplification, WA, is defined as the average
number of page writes per user page write. Write amplification
negatively affects write performance and decreases the lifetime
of the disk, so it's a useful metric to add to sysfs.
In plkb's case, the number of writes per user sector is the sum of:
(1) number of user writes
(2) number of sectors written by the garbage collector
(3) number of sectors padded (i.e. due to syncs)
This patch adds persistent counters for 1-3 and two sysfs attributes
to export these along with WA calculated with five decimals:
write_amp_mileage: the accumulated write amplification stats
for the lifetime of the pblk instance
write_amp_trip: resetable stats to facilitate delta measurements,
values reset at creation and if 0 is written
to the attribute.
64-bit counters are used as a 32 bit counter would wrap around
already after about 17 TB worth of user data. It will take a
long long time before the 64 bit sector counters wrap around.
The counters are stored after the bad block bitmap in the first
emeta sector of each written line. There is plenty of space in the
first emeta sector, so we don't need to bump the major version of
the line data format.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As a preparation for future bumps of data line persistent storage
versions, we need to start checking the emeta line version during
recovery. Also slit up the current emeta/smeta version into two
bytes (major,minor).
Recovering lines with the same major number as the current pblk data
line version must succeed. This means that any changes in the
persistent format must be:
(1) Backward compatible: if we switch back to and older
kernel, recovery of lines stored with major == current_major
and minor > current_minor must succeed.
(2) Forward compatible: switching to a newer kernel,
recovery of lines stored with major=current_major and
minor < minor must handle the data format differences
gracefully(i.e. initialize new data structures to default values).
If we detect lines that have a different major number than
the current we must abort recovery. The user must manually
migrate the data in this case.
Previously the version stored in the emeta header was copied
from smeta, which has version 1, so we need to set the minor
version to 1.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unless we check if there are bad sectors in the entire emeta-area
we risk ending up with valid bitmap / available sector count inconsistency.
This results in lines with a bad chunk at the last LUN marked as bad,
so go through the whole emeta area and mark up the invalid sectors.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch has been generated as follows:
for verb in set_unlocked clear_unlocked set clear; do
replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
$(git grep -lw queue_flag_${verb} drivers block/bsg*)
done
Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shorten function to simply return the value of the if statement.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since pblk registers its own block device, the iostat accounting is
not automatically done for us. Therefore, add the necessary
accounting logic to satisfy the iostat interface.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the instance name to the information printed out on target creation.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor the way we free the write buffer to ensure that all entries get
freed in case of an error on the init sequence.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>