libnvdimm: documentation clarifications

A bunch of changes that I hope will help in understanding it better for
first-time readers.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
parent 589e75d157
commit 8de5dff8ba

@@ -62,6 +62,12 @@ DAX: File system extensions to bypass the page cache and block layer to
 mmap persistent memory, from a PMEM block device, directly into a
 process address space.
 
+DSM: Device Specific Method: ACPI method to to control specific
+device - in this case the firmware.
+
+DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
+It defines a vendor-id, device-id, and interface format for a given DIMM.
+
 BTT: Block Translation Table: Persistent memory is byte addressable.
 Existing software may have an expectation that the power-fail-atomicity
 of writes is at least one sector, 512 bytes. The BTT is an indirection
@@ -133,16 +139,16 @@ device driver:
 registered, can be immediately attached to nd_pmem.
 
 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
-defined apertures. A set of apertures will all access just one DIMM.
-Multiple windows allow multiple concurrent accesses, much like
+defined apertures. A set of apertures will access just one DIMM.
+Multiple windows (apertures) allow multiple concurrent accesses, much like
 tagged-command-queuing, and would likely be used by different threads or
 different CPUs.
 
 The NFIT specification defines a standard format for a BLK-aperture, but
 the spec also allows for vendor specific layouts, and non-NFIT BLK
-implementations may other designs for BLK I/O. For this reason "nd_blk"
-calls back into platform-specific code to perform the I/O. One such
-implementation is defined in the "Driver Writer's Guide" and "DSM
+implementations may have other designs for BLK I/O. For this reason
+"nd_blk" calls back into platform-specific code to perform the I/O.
+One such implementation is defined in the "Driver Writer's Guide" and "DSM
 Interface Example".
 
 
@@ -152,7 +158,7 @@ Why BLK?
 While PMEM provides direct byte-addressable CPU-load/store access to
 NVDIMM storage, it does not provide the best system RAS (recovery,
 availability, and serviceability) model. An access to a corrupted
-system-physical-address address causes a cpu exception while an access
+system-physical-address address causes a CPU exception while an access
 to a corrupted address through an BLK-aperture causes that block window
 to raise an error status in a register. The latter is more aligned with
 the standard error model that host-bus-adapter attached disks present.
@@ -162,7 +168,7 @@ data could be interleaved in an opaque hardware specific manner across
 several DIMMs.
 
 PMEM vs BLK
-BLK-apertures solve this RAS problem, but their presence is also the
+BLK-apertures solve these RAS problems, but their presence is also the
 major contributing factor to the complexity of the ND subsystem. They
 complicate the implementation because PMEM and BLK alias in DPA space.
 Any given DIMM's DPA-range may contribute to one or more
@@ -220,8 +226,8 @@ socket. Each unique interface (BLK or PMEM) to DPA space is identified
 by a region device with a dynamically assigned id (REGION0 - REGION5).
 
 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
-single PMEM namespace is created in the REGION0-SPA-range that spans
-DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+single PMEM namespace is created in the REGION0-SPA-range that spans most
+of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
 interleaved system-physical-address range is reclaimed as BLK-aperture
 accessed space starting at DPA-offset (a) into each DIMM. In that
 reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
@@ -230,13 +236,13 @@ by a region device with a dynamically assigned id (REGION0 - REGION5).
 
 2. In the last portion of DIMM0 and DIMM1 we have an interleaved
 system-physical-address range, REGION1, that spans those two DIMMs as
-well as DIMM2 and DIMM3. Some of REGION1 allocated to a PMEM namespace
-named "pm1.0" the rest is reclaimed in 4 BLK-aperture namespaces (for
+well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
+named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
 each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
 "blk5.0".
 
 3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
-interleaved system-physical-address range (i.e. the DPA address below
+interleaved system-physical-address range (i.e. the DPA address past
 offset (b) are also included in the "blk4.0" and "blk5.0" namespaces.
 Note, that this example shows that BLK-aperture namespaces don't need to
 be contiguous in DPA-space.
@@ -252,15 +258,15 @@ LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
 
 What follows is a description of the LIBNVDIMM sysfs layout and a
 corresponding object hierarchy diagram as viewed through the LIBNDCTL
-api. The example sysfs paths and diagrams are relative to the Example
+API. The example sysfs paths and diagrams are relative to the Example
 NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
 test.
 
 LIBNDCTL: Context
-Every api call in the LIBNDCTL library requires a context that holds the
+Every API call in the LIBNDCTL library requires a context that holds the
 logging parameters and other library instance state. The library is
 based on the libabc template:
-https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
 
 LIBNDCTL: instantiate a new library context example
 
@@ -409,7 +415,7 @@ Bit 31:28 Reserved
 LIBNVDIMM/LIBNDCTL: Region
 ----------------------
 
-A generic REGION device is registered for each PMEM range orBLK-aperture
+A generic REGION device is registered for each PMEM range or BLK-aperture
 set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
 sets on the "nfit_test.0" bus. The primary role of regions are to be a
 container of "mappings". A mapping is a tuple of <DIMM,
@@ -509,7 +515,7 @@ At first glance it seems since NFIT defines just PMEM and BLK interface
 types that we should simply name REGION devices with something derived
 from those type names. However, the ND subsystem explicitly keeps the
 REGION name generic and expects userspace to always consider the
-region-attributes for 4 reasons:
+region-attributes for four reasons:
 
 1. There are already more than two REGION and "namespace" types. For
 PMEM there are two subtypes. As mentioned previously we have PMEM where
@@ -698,8 +704,8 @@ static int configure_namespace(struct ndctl_region *region,
 
 Why the Term "namespace"?
 
-1. Why not "volume" for instance? "volume" ran the risk of confusing ND
-as a volume manager like device-mapper.
+1. Why not "volume" for instance? "volume" ran the risk of confusing
+ND (libnvdimm subsystem) to a volume manager like device-mapper.
 
 2. The term originated to describe the sub-devices that can be created
 within a NVME controller (see the nvme specification:
@@ -774,13 +780,14 @@ block" needs to be destroyed. Note, that to destroy a BTT the media
 needs to be written in raw mode. By default, the kernel will autodetect
 the presence of a BTT and disable raw mode. This autodetect behavior
 can be suppressed by enabling raw mode for the namespace via the
-ndctl_namespace_set_raw_mode() api.
+ndctl_namespace_set_raw_mode() API.
 
 
 Summary LIBNDCTL Diagram
 ------------------------
 
-For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+For the given example above, here is the view of the objects as seen by the
+LIBNDCTL API:
 +---+
 |CTX| +---------+ +--------------+ +---------------+
 +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |