A partial read I/O in pblk is an I/O where some sectors reside in the
write buffer in main memory and some are persisted on the device. Such
an I/O must at least contain 2 lbas, therefore checking for the case
where a single lba is mapped is not necessary.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a line is recycled during garbage collection, reads can still be
issued to the line. If the line is freed in the middle of this process,
data corruption might occur.
This patch guarantees that lines are not freed in the middle of reads
that target them (lines). Specifically, we use the existing line
reference to decide when a line is eligible for being freed after the
recycle process.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As part of pblk's recovery scheme, we store the lba mapped to each
physical sector on the device's out-of-bound (OOB) area.
On the read path, we can use this information to validate that the data
being delivered to the upper layers corresponds to the lba being
requested. The cost of this check is an extra copy on the DMA region on
the device and an extra comparison in the host, given that (i) the OOB
area is being read together with the data in the media, and (ii) the DMA
region allocated for the ppa list can be reused for the metadata stored
on the OOB area.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For consistency with the rest of pblk, use rqd->end_io to point to the
function taking care of ending the request on the completion path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor the rqd allocation and free functions so that all I/O types can
use these helper functions.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Each request type sent to the LightNVM subsystem requires different
metadata. Until now, we have tailored this metadata based on write, read
and erase commands. However, pblk uses different metadata for internal
writes that do not hit the write buffer. Instead of abusing the metadata
for reads, create a new request type - internal write to improve
code readability.
In the process, create internal values for each I/O type instead of
abusing the READ/WRITE macros, as suggested by Christoph.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wait until we know the exact number of ppas to be sent to the device,
before allocating the bio.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On REQ_PREFLUSH, directly tag the I/O context flags to signal a flush in
the write to cache path, instead of finding the correct entry context
and imposing a memory barrier. This simplifies the code and might
potentially prevent race conditions when adding functionality to the
write path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify put bio by doing it on bio end_io instead of manually putting
it on the completion path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify the part of the garbage collector where data is read from the
line being recycled and moved into an internal queue before being copied
to the memory buffer. This allows to get rid of a dedicated function,
which introduces an unnecessary dependency on the code.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a line is selected for recycling by the garbage collector (GC), the
line state changes and the invalid bitmap is frozen, preventing
invalidations from happening. Throughout the GC, the L2P map is checked
to verify that not data being recycled has been updated. The last check
is done before the new map is being stored on the L2P table. Though
this algorithm works, it requires a number of corner cases to be checked
each time the L2P table is being updated. This complicates readability
and is error prone in case that the recycling algorithm is modified.
Instead, this patch makes the invalid bitmap accessible even when the
line is being recycled. When recycled data is being remapped, it is
enough to check the invalid bitmap for the line before updating the L2P
table.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normalize the way we name ppa variables to improve code readability.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a constant to set the maximum number of inflight GC requests
allowed.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As part of the mempool audit on pblk, remove unnecessary mempool
allocation checks on mempools.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
pblk holds two sector bitmaps: one to keep track of the mapped sectors
while the line is active and another one to keep track of the invalid
sectors. The latter is kept during the whole live of the line, until it
is recycled. Since we cannot guarantee forward progress for the mempool
in this case, get rid of the mempool and simply allocate memory through
kmalloc.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since read and erase paths offer different guarantees for inflight I/Os,
separate the mempools to set the right min_nr for each on creation.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In pblk, we have a mempool to allocate a generic structure that we
pass along workqueues. This is heavily used in the GC path in order
to have enough inflight reads and fully utilize the GC bandwidth.
However, the current GC path copies data to the host memory and puts it
back into the write buffer. This requires a vmalloc allocation for the
data and a memory copy. Thus, guaranteeing the allocation by using a
mempool for the structure in itself does not give us much. Until we
implement support for vector copy to avoid moving data through the host,
just allocate the workqueue structure using kmalloc.
This allows us to have a much smaller mempool.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
pblk uses an internal page mempool for allocating pages on internal
bios. The main two users of this memory pool are partial reads (reads
with some sectors in cache and some on media) and padded writes, which
need to add dummy pages to an existing bio already containing valid
data (and with a large enough bioset allocated). In both cases, the
maximum number of pages per bio is defined by the maximum number of
physical sectors supported by the underlying device.
This patch fixes a bad mempool allocation, where the min_nr of elements
on the pool was fixed (to 16), which is lower than the maximum number
of sectors supported by NVMe (as of the time for this patch). Instead,
use the maximum number of allowed sectors reported by the device.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On low LUN configurations, make sure not to send bios that are bigger
than the buffer size.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix stat counter to collect the right number of I/Os being synced on the
completion path.
Fixes: 0880a9aa2d ("lightnvm: pblk: delete redundant buffer pointer")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a REQ_FLUSH reaches pblk, the bio cannot be directly completed.
Instead, data on the write buffer is flushed and the bio is completed on
the completion pah. This might require some sectors to be padded in
order to guarantee a successful write.
This patch fixes a memory leak on the padded pages. A consequence of
this bad free was that internal bios not containing data (only a flush)
were not being completed.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The data buffer for the GC path allocates virtual memory through
vmalloc. When this change was introduced, a flag signaling kmalloc'ed
memory was wrongly introduced. Use the right flag when creating a bio
from this buffer.
Fixes: de54e703a4 ("lightnvm: pblk: use vmalloc for GC data buffer")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a trivial change which reuses pblk_gc_should_kick instead of
repeating it again in pblk_rl_free_lines_inc.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Made it apply to the common case.
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Correct it by converting little endian to cpu endian and also define
a macro for line version so that maintenance is easy.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The two pr_err messages are useless as they don't differentiate
error code.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This usually happens if we are developing with qemu and ll2pmode has
default value. Improve description.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It seems pblk_dealloc_page would race against pblk_alloc_pages for
line bitmap for sector allocation.The chances are very low but might
as well protect the bitmap properly.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in config file before building kernel. Also append PCI to depends as
select doesn't automatically add dependencies.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use appropriate memory free calls based on allocation type used and
also fix number of times free is called if kmalloc fails.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove repeated calculation for number of channels while creating a
target device.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvm_tgt_types list was protected by wrong lock for NVM_INFO ioctl call
and can race with addition or removal of target types. Also
unregistering target type was not protected correctly.
Fixes: 5cd907853 ("lightnvm: remove nested lock conflict with mm")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a virtual block device is formatted and mounted after creating
with "nvme lnvm create... -t pblk", a removal from "nvm lnvm remove"
would result in this:
446416.309757] bdi-block not registered
[446416.309773] ------------[ cut here ]------------
[446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159
__mark_inode_dirty+0x268/0x340
Ideally removal should return -EBUSY as block device is mounted after
formatting. This patch tries to address this checking if whole device
or any partition of it already mounted or not before removal.
Whole device is checked using "bd_super" member of block device. This
member is always set once block device has been mounted using a
filesystem. Another member "bd_part_count" takes care of checking any
if any partitions are under use. "bd_part_count" is only updated
under locks when partitions are opened or closed (first open and last
release). This at least does take care sending -EBUSY if removal is
being attempted while whole block device or any partition is mounted.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If target type module e.g. pblk here is unloaded (rmmod) while module
is in use (after creating target) system crashes. We fix this by
using module API refcnt.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a lba either hits the cache or corresponds to an empty entry in the
L2P table, we need to advance the bio according to the position in which
the lba is located. Otherwise, we will copy data in the wrong page, thus
causing data corruption for the application.
In case of a cache hit, we assumed that bio->bi_iter.bi_idx would
contain the correct index, but this is no necessarily true. Instead, use
the local bio advance counter and iterator. This guarantees that lbas
hitting the cache are copied into the right bv_page.
In case of an empty L2P entry, we omitted to advance the bio. In the
cases when the same I/O also contains a cache hit, data corresponding
to this lba will be copied to the wrong bv_page. Fix this by advancing
the bio as we do in the case of a cache hit.
Fixes: a4bd217b43 lightnvm: physical block device (pblk) target
Signed-off-by: Javier González <javier@javigon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When removing a pblk instance, control the write I/O flow to the
controller as we do in the fast path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do bitmap checks only when debug mode is enable. The line bitmap used
for mapping to physical addresses is fairly large (~512KB) and it is
expensive to do this checks on the fast path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a read is directed to the cache, we risk that the lba has been
updated during the time we made the L2P table lookup and the time we are
actually reading form the cache. We intentionally not hold the L2P lock
not to block other threads.
While strict ordering is not a guarantee at this level (unless REQ_FLUSH
has been previously issued), we have experience that some databases that
have recently implemented direct I/O support, issue metadata reads very
close to the writes, without issuing a fsync in the middle. An easy way
to support them while they is to make an extra effort and check the L2P
map right before reading the cache.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a sanity check to the pblk initialization sequence in order to
ensure that enough LUNs have been allocated to store the line metadata.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When removing a pblk instance, pad the current line using asynchronous
I/O. This reduces the removal time from ~1 minute in the worst case to a
couple of seconds.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For now, we allocate a per I/O buffer for GC data. Since the potential
size of the buffer is 256KB and GC is not in the fast path, do this
allocation with vmalloc. This puts lets pressure on the memory
allocator at no performance cost.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When user threads place data into the write buffer, they reserve space
and do the memory copy out of the lock. As a consequence, when the write
thread starts persisting data, there is a chance that it is not copied
yet. In this case, avoid polling, and schedule before retrying.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prevent pblk->lines being double freed in case of an error during pblk
initialization.
Fixes: dd2a434373: "lightnvm: pblk: sched. metadata on write thread"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the right types and conversions on le64 variables. Reported by
sparse.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While creating new device with NVM_DEV_CREATE if LUNs are already
allocated ioctl would return -ENOMEM which is wrong. This patch
propagates -EBUSY from nvm_reserve_luns which is correct response.
Fixes: ade69e243 ("lightnvm: merge gennvm with core")
Reviewed-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Due to user writes being decoupled from media writes because of the need
of an intermediate write buffer, irrecoverable media write errors lead
to pblk stalling; user writes fill up the buffer and end up in an
infinite retry loop.
In order to let user writes fail gracefully, it is necessary for pblk to
keep track of its own internal state and prevent further writes from
being placed into the write buffer.
This patch implements a state machine to keep track of internal errors
and, in case of failure, fail further user writes in an standard way.
Depending on the type of error, pblk will do its best to persist
buffered writes (which are already acknowledged) and close down on a
graceful manner. This way, data might be recovered by re-instantiating
pblk. Such state machine paves out the way for a state-based FTL log.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make constants to define sizes for internal mempools and workqueues. In
this process, adjust the values to be more meaningful given the internal
constrains of the FTL. In order to do this for workqueues, separate the
current auxiliary workqueue into two dedicated workqueues to manage
lines being closed and bad blocks.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At the moment, in order to get enough read parallelism, we have recycled
several lines at the same time. This approach has proven not to work
well when reaching capacity, since we end up mixing valid data from all
lines, thus not maintaining a sustainable free/recycled line ratio.
The new design, relies on a two level workqueue mechanism. In the first
level, we read the metadata for a number of lines based on the GC list
they reside on (this is governed by the number of valid sectors in each
line). In the second level, we recycle a single line at a time. Here, we
issue reads in parallel, while a single GC write thread places data in
the write buffer. This design allows to (i) only move data from one line
at a time, thus maintaining a sane free/recycled ration and (ii)
maintain the GC writer busy with recycled data.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set a dma area for all I/Os in order to read/write from/to the metadata
stored on the per-sector out-of-bound area.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At the moment, we separate the closed lines on three different list
based on their number of valid sectors. GC recycles lines from each list
based on capacity. Lines from each list are taken in a FIFO fashion.
Since the number of lines is limited (it corresponds to the number of
blocks in a LUN, which is somewhere between 1000-2000), we can afford
scanning the lists to choose the optimal line to be recycled. This helps
specially in lines with a high number of valid sectors.
If the number of blocks per LUN increases, we will consider a more
efficient policy.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Decouple bad block discovery from line allocation logic. This allows to
return meaningful error codes in case of bad block discovery failure.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
smeta size will always be suitable for a kmalloc allocation. Simplify
the code and leave the vmalloc fallback only for emeta, where the pblk
configuration has an impact.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a read request is sequential and its size aligns with a
multi-plane page size, use the multi-plane hint to process the I/O in
parallel in the controller.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After refactoring the metadata path, the backpointer controlling
synced I/Os in a line becomes unnecessary; metadata is scheduled
on the write thread, thus we know when the end of the line is reached
and act on it directly.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove a legacy variable that helped verifying the consistency of the
run-time metadata for the free line list. With the new metadata layout,
this check is no longer necessary.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At the moment, line metadata is persisted on a separate work queue, that
is kicked each time that a line is closed. The assumption when designing
this was that freeing the write thread from creating a new write request
was better than the potential impact of writes colliding on the media
(user I/O and metadata I/O). Experimentation has proven that this
assumption is wrong; collision can cause up to 25% of bandwidth and
introduce long tail latencies on the write thread, which potentially
cause user write threads to spend more time spinning to get a free entry
on the write buffer.
This patch moves the metadata logic to the write thread. When a line is
closed, remaining metadata is written in memory and is placed on a
metadata queue. The write thread then takes the metadata corresponding
to the previous line, creates the write request and schedules it to
minimize collisions on the media. Using this approach, we see that we
can saturate the media's bandwidth, which helps reducing both write
latencies and the spinning time for user writer threads.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Read requests allocate some extra memory to store its per I/O context.
Instead of requiring yet another memory pool for other type of requests,
generalize this context allocation (and change naming accordingly).
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Erase I/Os are scheduled with the following goals in mind: (i) minimize
LUNs collisions with write I/Os, and (ii) even out the price of erasing
on every write, instead of putting all the burden on when garbage
collection runs. This works well on the current design, but is specific
to the default mapping algorithm.
This patch generalizes the erase path so that other mapping algorithms
can select an arbitrary line to be erased instead. It also gets rid of
the erase semaphore since it creates jittering for user writes.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow to configure the number of maximum sectors per write command
through sysfs. This makes it easier to tune write command sizes for
different controller configurations.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new debug counter to measure cache hits on the read path
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Spare a double calculation on the fast write path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of a failure when submitting a request, convert the ppa_list
addresses to the target format so that it can interpret ppas for
recovery
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
pblk_submit_read() uses bio_clone_bioset() but doesn't change the
io_vec, so bio_clone_fast() is a better choice.
It also uses fs_bio_set which is intended for filesystems. Using it
in a device driver can deadlock.
So allocate a new bioset, and and use bio_clone_fast().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.
Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.
This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Free memory correctly when an allocation fails on a loop and we free
backwards previously successful allocations.
Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
'blks' is malloced in pblk_bb_discovery() and should be freed
before leaving from the nvm_get_tgt_bb_tbl() error handling cases,
otherwise it will cause memory leak. Also skip assign blks to
rlun->bb_list when error.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When block erases fail, these blocks are marked bad. The number of valid
blocks in the line was not updated, which could cause an infinite loop
on the erase path.
Fix this atomic counter and, in order to avoid taking an irq lock on the
interrupt context, make the erase counters atomic too.
Also, in the case that a significant number of blocks become bad in a
line, the result is the double shared metadata buffer (emeta) to stop
the pipeline until all metadata is flushed to the media. Increase the
number of metadata lines from 2 to 4 to avoid this case.
Fixes: a4bd217b43 "lightnvm: physical block device (pblk) target"
Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a line allocation fails, for example, due to having too many bad
blocks, free its metadata correctly.
Fixes: a4bd217b43 "lightnvm: physical block device (pblk) target"
Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a pblk line fails (or is recovered), make sure to take the line
management lock.
Fixes: a4bd217b43 "lightnvm: physical block device (pblk) target"
Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Reading from ADDR_EMPTY is out of bounds. The current code generates a
static checker warning because we check for out of bounds "lba" before
we check for ADDR_EMPTY, so the second check is always false. It looks
like we intended ADDR_EMPTY to be a no-op without printing a warning.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is a static checker fix, and perhaps not a real bug. The static
checker thinks that nr_secs could be negative. It would result in
zeroing more memory than intended. Anyway, even if it's not a bug,
changing this variable to unsigned makes the code easier to audit.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
From userspace calling ioctl(NVM_DEV_CREATE) was returning ENOMEM for
invalid arguments even though pblk (pblk_init) was returning correctly
-EINVAL to nvm_create_tgt inside core. This patch propagates the
correct return value to userspace.
Because pblk was introduced recently this only needs to go in 4.12.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The driver uses both u64 and sector_t to refer to offsets, and assigns between the
two. This causes one harmless warning when sector_t is 32-bit:
drivers/lightnvm/pblk-rb.c: In function 'pblk_rb_write_entry_gc':
include/linux/lightnvm.h:215:20: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
drivers/lightnvm/pblk-rb.c:324:22: note: in expansion of macro 'ADDR_EMPTY'
As the driver is already doing this inconsistently, changing the type
won't make it worse and is an easy way to avoid the warning.
Fixes: a4bd217b43 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
There were a bunch of places in pblk_lines_init() where we didn't set an
error code. And in pblk_writer_init() we accidentally return 1 instead
of a correct error code, which would result in a Oops later.
Fixes: 11a5d6fdf919 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
WARN_ON() takes a condition, not an error message. I slightly tweaked
some conditions so hopefully it's more clear.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These labels are reversed so we could end up dereferencing an error
pointer or leaking.
Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch introduces pblk, a host-side translation layer for
Open-Channel SSDs to expose them like block devices. The translation
layer allows data placement decisions, and I/O scheduling to be
managed by the host, enabling users to optimize the SSD for their
specific workloads.
An open-channel SSD has a set of LUNs (parallel units) and a
collection of blocks. Each block can be read in any order, but
writes must be sequential. Writes may also fail, and if a block
requires it, must also be reset before new writes can be
applied.
To manage the constraints, pblk maintains a logical to
physical address (L2P) table, write cache, garbage
collection logic, recovery scheme, and logic to rate-limit
user I/Os versus garbage collection I/Os.
The L2P table is fully-associative and manages sectors at a
4KB granularity. Pblk stores the L2P table in two places, in
the out-of-band area of the media and on the last page of a
line. In the cause of a power failure, pblk will perform a
scan to recover the L2P table.
The user data is organized into lines. A line is data
striped across blocks and LUNs. The lines enable the host to
reduce the amount of metadata to maintain besides the user
data and makes it easier to implement RAID or erasure coding
in the future.
pblk implements multi-tenant support and can be instantiated
multiple times on the same drive. Each instance owns a
portion of the SSD - both regarding I/O bandwidth and
capacity - providing I/O isolation for each case.
Finally, pblk also exposes a sysfs interface that allows
user-space to peek into the internals of pblk. The interface
is available at /dev/block/*/pblk/ where * is the block
device name exposed.
This work also contains contributions from:
Matias Bjørling <matias@cnexlabs.com>
Simon A. F. Lund <slund@cnexlabs.com>
Young Tack Jin <youngtack.jin@gmail.com>
Huaicheng Li <huaicheng@cs.uchicago.edu>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Convert sprintf calls to strlcpy in order to make possible buffer
overflow more obvious.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Prefix the nvm_free static function with a missing static keyword.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Target initialization has two responsibilities: creating the target
partition and instantiating the target. This patch enables to create a
factory partition (e.g., do not trigger recovery on the given target).
This is useful for target development and for being able to restore the
device state at any moment in time without requiring a full-device
erase.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Reorder disk allocation such that the disk structure can be put
safely.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The dev->lun_map bits are cleared twice if an target init error occurs.
First in the target clean routine, and then next in the nvm_tgt_create
error function. Make sure that it is only cleared once by extending
nvm_remove_tgt_devi() with a clear bit, such that clearing of bits can
ignored when cleaning up a successful initialized target.
Signed-off-by: Javier González <javier@cnexlabs.com>
Fix style.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and both GFP_KERNEL and GFP_NOIO allows for sleeping.
So rrpc_move_valid_pages() and rrpc_make_rq() don't need to
test the return value.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Until now erases have been submitted as synchronous commands through a
dedicated erase function. In order to enable targets implementing
asynchronous erases, refactor the erase path so that it uses the normal
async I/O submission functions. If a target requires sync I/O, it can
implement it internally. Also, adapt rrpc to use the new erase path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Fixed spelling error.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
According to error handling in this function, it is likely that going to
'out' was expected here.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The create target ioctl takes a lun begin and lun end parameter, which
defines the range of luns to initialize a target with. If the user does
not set the parameters, it default to only using lun 0. Instead,
defaults to use all luns in the OCSSD, as it is the usual behaviour
users want.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If one specifies the end lun id to be the absolute number of luns,
without taking zero indexing into account, the lightnvm core will pass
the off-by-one end lun id to target creation, which then panics during
nvm_ioctl_dev_create.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>