99 Commits

Author SHA1 Message Date
Dan Williams
b3fde74ea1 libnvdimm, label: add address abstraction identifiers
Starting with v1.2 labels, 'address abstractions' can be hinted via an
address abstraction id that implies an info-block format. The standard
address abstraction in the specification is the v2 format of the
Block-Translation-Table (BTT). Support for that is saved for a later
patch; for now we add support for the Linux-supported address
abstractions BTT (v1), PFN, and DAX.

The new 'holder_class' attribute for namespace devices is added for
tooling to specify the 'abstraction_guid' to store in the namespace label.
For v1.1 labels this field is undefined and any setting of
'holder_class' away from the default 'none' value will only have effect
until the driver is unloaded. Setting 'holder_class' requires that
whatever device tries to claim the namespace must be of the specified
class.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:40 -07:00
Dan Williams
8f2bc2430e libnvdimm, label: populate 'isetcookie' for blk-aperture namespaces
Starting with the v1.2 definition of namespace labels, the isetcookie
field is populated and validated for blk-aperture namespaces. This adds
some safety against inadvertent copying of namespace labels from one
DIMM-device to another.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:40 -07:00
Dan Williams
faec6f8a1c libnvdimm, label: populate the type_guid property for v1.2 namespaces
The type_guid refers to the "Address Range Type GUID" for the region
backing a namespace as defined in the ACPI NFIT (NVDIMM Firmware Interface
Table). This 'type' identifier specifies an access mechanism for the
given namespace. This capability replaces the confusing usage of the
'NSLABEL_FLAG_LOCAL' flag to indicate a block-aperture-mode namespace.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:40 -07:00
Dan Williams
f979b13c3c libnvdimm, label: honor the lba size specified in v1.2 labels
Previously we only honored the lba size for blk-aperture mode
namespaces. For pmem namespaces the lba size was just assumed to be 512.
With the new v1.2 label definition and compatibility with other
operating environments, the ->lbasize property is now respected for pmem
namespaces.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:39 -07:00
Dan Williams
c12c48ce86 libnvdimm, label: add v1.2 interleave-set-cookie algorithm
The interleave-set-cookie algorithm is extended to incorporate all the
same components that are used to generate an nvdimm unique-id. For
backwards compatibility we still maintain the old v1.1 definition.

Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com>
Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:39 -07:00
Dan Williams
9d62ed9651 libnvdimm: handle locked label storage areas
Per the latest version of the "NVDIMM DSM Interface Example" [1], the
label data retrieval routine can report a "locked" status. In this case
all regions associated with that DIMM are disabled until the label area
is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label
data area locking capabilities.

[1]: http://pmem.io/documents/

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-04 15:41:39 -07:00
Dan Williams
8f078b38dd libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
This is a preparation patch for handling locked nvdimm label regions, a
new concept as introduced by the latest DSM document on pmem.io [1]. A
future patch will leverage nvdimm_set_locked() at DIMM probe time to
flag regions that can not be enabled. There should be no functional
difference resulting from this change.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-04 14:01:24 -07:00
Dan Williams
86ef58a4e3 nfit, libnvdimm: fix interleave set cookie calculation
The interleave-set cookie is a checksum that sanity checks that the
composition of an interleave set has not changed from when the namespace
was initially created.  The checksum is calculated by sorting the DIMMs
by their location in the interleave-set. The comparison for the sort
must be 64-bit wide, not byte-by-byte as performed by memcmp() in the
broken case.
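
To see why the comparison width matters, here is a minimal user-space
sketch. It is not the kernel's code: the helper names are hypothetical,
and it assumes each sort key is a 64-bit value laid out little-endian in
memory. A byte-wise comparison looks at the low-order byte first, so it
ranks 0x0100 ahead of 0x0002, while a 64-bit comparison ranks them
numerically.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: serialize v as 8 little-endian bytes. */
void put_le64(uint64_t v, unsigned char out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (8 * i));
}

/* Byte-by-byte order, as memcmp() sees a little-endian u64 in memory. */
int cmp_bytewise(const void *a, const void *b)
{
	unsigned char x[8], y[8];

	put_le64(*(const uint64_t *)a, x);
	put_le64(*(const uint64_t *)b, y);
	return memcmp(x, y, sizeof(x));
}

/* Numeric 64-bit order. */
int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Sort a copy of two keys with the given comparator; return the first. */
uint64_t first_after_sort(uint64_t a, uint64_t b,
			  int (*cmp)(const void *, const void *))
{
	uint64_t v[2] = { a, b };

	assert(cmp != NULL);
	qsort(v, 2, sizeof(v[0]), cmp);
	return v[0];
}
```

Sorting the pair (0x0100, 0x0002) puts 0x0002 first with cmp_u64() and
0x0100 first with cmp_bytewise(), so any checksum computed over the
sorted order can differ between the two.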

Fix the implementation to accept correct cookie values in addition to
the Linux "memcmp" order cookies, but only allow correct cookies to be
generated going forward. It does mean that namespaces created by
third-party-tooling, or created by newer kernels with this fix, will not
validate on older kernels. However, there are a couple of mitigating
conditions:

    1/ platforms with namespace-label capable NVDIMMs are not widely
       available.

    2/ interleave-sets with a single-dimm are by definition not affected
       (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.

The cookie stored in the namespace label will be fixed by any write to
the namespace label; the most straightforward way to achieve this is to
write to the "alt_name" attribute of a namespace in sysfs.

Cc: <stable@vger.kernel.org>
Fixes: eaf961536e ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-03-01 00:49:42 -08:00
Dan Williams
9d032f4201 libnvdimm, namespace: do not delete namespace-id 0
Given that the naming of pmem devices changes from the pmemX form to the
pmemX.Y form when namespace id is greater than 0, arrange for namespaces
with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
of an existing namespace to a new mode results in a name change of the
resulting block device:

    # ndctl list --namespace=namespace1.0
    {
      "dev":"namespace1.0",
      "mode":"raw",
      "size":2147483648,
      "uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
      "blockdev":"pmem1"
    }

    # ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
    {
      "dev":"namespace1.1",
      "mode":"memory",
      "size":2111832064,
      "uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
      "blockdev":"pmem1.1"
    }

This change does require tooling changes to explicitly look for
namespaceX.0 if the seed has already advanced to another namespace.

Cc: <stable@vger.kernel.org>
Fixes: 98a29c39dc ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-01-31 18:18:21 -08:00
Bhumika Goyal
970d14e398 nvdimm: constify device_type structures
Declare device_type structure as const as it is only stored in the
type field of a device structure. This field is of type const, so add
const to declaration of device_type structure.

File size before:
  text	   data	    bss	    dec	    hex	filename
  19278	   3199	     16	  22493	   57dd	nvdimm/namespace_devs.o

File size after:
  text	   data	    bss	    dec	    hex	filename
  19929	   3160	     16	  23105	   5a41	nvdimm/namespace_devs.o

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-01-31 18:16:30 -08:00
Dan Williams
1f19b983a8 libnvdimm, namespace: fix pmem namespace leak, delete when size set to zero
Commit 98a29c39dc ("libnvdimm, namespace: allow creation of multiple
pmem-namespaces per region") added support for establishing additional
pmem namespaces beyond the seed device, similar to blk namespaces.
However, it neglected to delete the namespace when the size is set to
zero.

Fixes: 98a29c39dc ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-01-13 09:50:33 -08:00
Dan Williams
c44ef859ce Merge branch 'for-4.10/libnvdimm' into libnvdimm-for-next
2016-12-17 15:08:10 -08:00
Dan Williams
9cf8bd529c libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held
For warnings that should only ever trigger during development and
testing, replace WARN statements with lockdep_assert_held. The lockdep
pattern is prevalent, and these paths are well covered by libnvdimm
unit tests.

Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-15 20:04:31 -08:00
Fabian Frederick
b44fe76043 libnvdimm, namespace: use octal for permissions
According to commit f90774e1fd
("checkpatch: look for symbolic permissions and suggest octal instead")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-04 10:54:08 -08:00
Nicolas Iooss
238b323a68 libnvdimm, namespace: fix the type of name variable
In create_namespace_blk(), the local variable "name" is defined as an
array of NSLABEL_NAME_LEN pointers:

    char *name[NSLABEL_NAME_LEN];

This variable is then used in calls to memcpy() and kmemdup() as if it
were char[NSLABEL_NAME_LEN]. Remove the star in the variable definition
to make it look right.
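
The difference between the two declarations can be checked in user
space. In this sketch the value 64 for NSLABEL_NAME_LEN is an assumption
for illustration, and the function names are hypothetical. The pointer
array is never smaller than the char array, which is presumably why
copying NSLABEL_NAME_LEN bytes into it did not overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration; the kernel's definition may differ. */
#define NSLABEL_NAME_LEN 64

/* The mistyped buffer is at least as large as the intended one. */
static_assert(sizeof(char[NSLABEL_NAME_LEN]) <=
	      sizeof(char *[NSLABEL_NAME_LEN]),
	      "pointer array is never smaller than the char array");

/* Size of the mistyped declaration: NSLABEL_NAME_LEN pointers. */
size_t sizeof_pointer_array(void)
{
	return sizeof(char *[NSLABEL_NAME_LEN]);
}

/* Size of the intended declaration: NSLABEL_NAME_LEN chars. */
size_t sizeof_char_array(void)
{
	return sizeof(char[NSLABEL_NAME_LEN]);
}
```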

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-11-28 13:41:17 -08:00
Dan Carpenter
75d29713b7 libnvdimm, namespace: potential NULL deref on allocation error
If the kcalloc() fails then "devs" can be NULL and we dereference it
checking "devs[i]".
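
The same pattern in user space, as a sketch only: calloc() stands in for
kcalloc(), and free_dev_array() is a hypothetical helper, not a function
from the patch. The point is that the NULL test on the array has to come
before any devs[i] access.

```c
#include <stdlib.h>

/*
 * Hypothetical cleanup helper: free the leading non-NULL entries of a
 * calloc()'d pointer array, then the array itself. Returns the number
 * of entries freed.
 */
size_t free_dev_array(void **devs, size_t count)
{
	size_t freed = 0;

	if (!devs)
		return 0;	/* allocation failed: nothing to walk */

	for (size_t i = 0; i < count && devs[i]; i++) {
		free(devs[i]);
		freed++;
	}
	free(devs);
	return freed;
}
```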

Fixes: 1b40e09a12 ('libnvdimm: blk labels and namespace instantiation')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-19 10:35:51 -07:00
Dan Williams
98a29c39dc libnvdimm, namespace: allow creation of multiple pmem-namespaces per region
Similar to BLK regions, publish new seed namespace devices to allow
unused PMEM region capacity to be consumed by additional namespaces.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams
991d9020f3 libnvdimm, namespace: lift single pmem limit in scan_labels()
Now that the rest of the infrastructure has been converted to handle
multi-pmem configurations, lift the artificial barrier at scan time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams
c969e24c1b libnvdimm, namespace: filter out of range labels in scan_labels()
Short-circuit doomed-to-fail label validation attempts by skipping
labels that are outside the given region.  For example a DIMM that has
multiple PMEM regions will waste time attempting to create namespaces
only to find that the interleave-set-cookie does not validate, e.g.:

    nd_region region6: invalid cookie in label: 73e608dc-47b9-4b2a-b5c7-2d55a32e0c2

Similar to how we skip BLK labels when performing PMEM validation we can
skip out-of-range labels early.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams
762d067dba libnvdimm, namespace: enable allocation of multiple pmem namespaces
Now that we have nd_region_available_dpa() able to handle the presence
of multiple PMEM allocations in aliased PMEM regions, reuse that same
infrastructure to track allocations from free space.  In particular
handle allocating from an aliased PMEM region in the case where there
are discontiguous holes.  The allocation rules for BLK and PMEM are
documented in the space_valid() helper:

    BLK-space is valid as long as it does not precede a PMEM
    allocation in a given region. PMEM-space must be contiguous
    and adjacent to an existing allocation (if one
    exists).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams
012207334a libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
pmem devices are currently named /dev/pmem<region-index>. Preserve the
naming of the 0th device, but add a ".<namespace-index>" for other
devices.
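
The naming rule can be sketched in a few lines of user-space C. The
function names here are hypothetical and this is not the kernel's
implementation; it only illustrates the scheme described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a pmem block-device name: "pmem<region>" for namespace 0,
 * "pmem<region>.<namespace>" otherwise. Returns snprintf()'s result.
 */
int pmem_disk_name(char *buf, size_t len, int region_id, int ns_id)
{
	assert(buf != NULL && len > 0);
	if (ns_id == 0)
		return snprintf(buf, len, "pmem%d", region_id);
	return snprintf(buf, len, "pmem%d.%d", region_id, ns_id);
}

/* Convenience check: does (region_id, ns_id) format to the expected name? */
int pmem_name_is(int region_id, int ns_id, const char *expected)
{
	char buf[32];

	pmem_disk_name(buf, sizeof(buf), region_id, ns_id);
	return strcmp(buf, expected) == 0;
}
```

With this rule, namespace 0 of region 1 stays "pmem1", while namespace 1
of the same region becomes "pmem1.1".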

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams
6ff3e912d3 libnvdimm, namespace: sort namespaces by dpa at init
Add more determinism to initial namespace device-name assignments by
sorting the namespaces by starting dpa.
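
A minimal sketch of the idea, assuming a simplified stand-in structure
(struct ns_stub and its fields are hypothetical): sort by starting dpa
so that the resulting index, and therefore the device name, does not
depend on the order in which labels were discovered.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a namespace: only the starting DPA matters. */
struct ns_stub {
	uint64_t start_dpa;
	int found_order;	/* order in which the label was discovered */
};

int cmp_start_dpa(const void *a, const void *b)
{
	const struct ns_stub *x = a, *y = b;

	return (x->start_dpa > y->start_dpa) - (x->start_dpa < y->start_dpa);
}

/* Sort in place so that array index i can serve as the namespace id. */
void sort_by_dpa(struct ns_stub *ns, size_t count)
{
	assert(ns != NULL || count == 0);
	qsort(ns, count, sizeof(*ns), cmp_start_dpa);
}

/* Demo: three namespaces found out of order; which one becomes id 0? */
int demo_first_namespace(void)
{
	struct ns_stub ns[] = {
		{ 0x40000000, 0 }, { 0x0, 1 }, { 0x20000000, 2 },
	};

	sort_by_dpa(ns, 3);
	return ns[0].found_order;
}
```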

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:20:53 -07:00
Dan Williams
0e3b0d123c libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time
If label scanning finds multiple valid pmem namespaces allow them to be
surfaced rather than fail namespace scanning. Support for creating
multiple namespaces per region is saved for a later patch.

Note that this adds some new error messages to clarify which of the pmem
namespaces in the set are potentially impacted by invalid labels.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:20:53 -07:00
Dan Williams
8a5f50d3b7 libnvdimm, namespace: unify blk and pmem label scanning
In preparation for allowing multiple namespaces per pmem region, unify
blk and pmem label scanning.  Given that blk regions already support
multiple namespaces, teaching that path how to do pmem namespace
scanning is an incremental step towards multiple pmem namespace support.
This should be functionally equivalent to the previous state in that
it stops after finding the first valid pmem label set.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-05 20:24:18 -07:00
Dan Williams
f95b4bca9e libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper
The ability to translate a generic struct device pointer into a
namespace uuid is a useful utility as we go to unify the blk and pmem
label scanning paths.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-05 20:24:18 -07:00
Dan Williams
ae8219f186 libnvdimm, label: convert label tracking to a linked list
In preparation for enabling multiple namespaces per pmem region, convert
the label tracking to use a linked list.  In particular this will allow
select_pmem_id() to move labels from the unvalidated state to the
validated state.  Currently we only track one validated set per-region.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 19:13:42 -07:00
Dan Williams
4765218db7 libnvdimm, namespace: debug invalid interleave-set-cookie values
If platform firmware fails to populate unique / non-zero serial number
data for each nvdimm in an interleave-set it may cause pmem region
initialization to fail.  Add a debug message for this case.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-21 09:36:36 -07:00
Geert Uytterhoeven
ae551e9ca2 nvdimm: Spelling s/unacknoweldged/unacknowledged/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-01 18:20:39 -07:00
Dan Williams
cd03412a51 libnvdimm, dax: introduce device-dax infrastructure
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX).  It allows persistent memory ranges to be allocated and
mapped without need of an intervening file system.  This initial
infrastructure arranges for a libnvdimm pfn-device to be represented as
a different device-type so that it can be attached to a driver other
than the pmem driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-09 15:35:42 -07:00
Dan Williams
0bfb8dd3ed libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host'
The 'host' variable can be killed as it is always the same as the passed
in device.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:24 -07:00
Dan Williams
cfe30b8720 libnvdimm, pmem: adjust for section collisions with 'System RAM'
On a platform where 'Persistent Memory' and 'System RAM' are mixed
within a given sparsemem section, trim the namespace and notify about the
sub-optimal alignment.

Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:25:45 -08:00
Dan Williams
9c41242817 libnvdimm: fix mode determination for e820 devices
Correctly display "safe" mode when a btt is established on an
e820/memmap defined pmem namespace.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-26 09:40:32 -08:00
Dan Williams
e07ecd76d4 libnvdimm: fix namespace object confusion in is_uuid_busy()
When btt devices were re-worked to be child devices of regions this
routine was overlooked.  It mistakenly attempts to_nd_namespace_pmem()
or to_nd_namespace_blk() conversions on btt and pfn devices.  By luck to
date we have happened to be hitting valid memory leading to a uuid
miscompare, but a recent change to struct nd_namespace_common causes:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
 IP: [<ffffffff814610dc>] memcmp+0xc/0x40
 [..]
 Call Trace:
  [<ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm]
  [<ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm]
  [<ffffffff8158c9c0>] device_for_each_child+0x50/0x90

Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-05 18:37:23 -08:00
Dan Williams
0731de0dd9 libnvdimm, pfn: move 'memory mode' indication to sysfs
'Memory mode' is defined as the capability of a DAX mapping to be the
source/target of DMA and other "direct I/O" scenarios.  While it
currently requires allocating 'struct page' for each page frame of
persistent memory in the namespace it will not always be the case.  Work
continues on reducing the kernel's dependency on 'struct page'.

Let's not maintain a suffix that is expected to lose meaning over time.
In other words a future 'raw mode' pmem namespace may be as capable as
today's 'memory mode' namespace.  Undo the encoding of the mode in the
device name and leave it to other tooling to determine the mode of the
namespace from its attributes.

Reported-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24 12:20:20 -08:00
Dan Williams
2dc43331e3 libnvdimm, pfn: fix pfn seed creation
Similar to btt, plant a new pfn seed when the existing one is activated.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-13 11:41:36 -08:00
Dmitry Krivenok
bd26d0d0ce nvdimm: improve diagnosability of namespaces
In order to bind a namespace to the driver, the user must first
set all mandatory attributes in the following order:
- uuid
- size
- sector_size (for blk namespace only)

If the order is wrong, then the user either won't be able to set
the attribute or won't be able to bind the namespace.

This simple patch improves diagnosability of common operations
with namespaces by printing some details about the error
instead of failing silently.

Below are examples of error messages (assuming dyndbg is
enabled for nvdimms):

[/]# echo 4194304 > /sys/bus/nd/devices/region5/namespace5.0/size
[  288.372612] nd namespace5.0: __size_store: uuid not set
[  288.374839] nd namespace5.0: size_store: 400000 fail (-6)
sh: write error: No such device or address
[/]#

[/]# echo namespace5.0 > /sys/bus/nd/drivers/nd_blk/bind
[  554.671648] nd_blk namespace5.0: nvdimm_namespace_common_probe: sector size not set
[  554.674688]  ndbus1: nd_blk.probe(namespace5.0) = -19
sh: write error: No such device
[/]#

Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-08 16:27:30 -08:00
Dan Williams
004f1afbe1 libnvdimm, pmem: direct map legacy pmem by default
The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory.  Larger capacities will be described via the ACPI
NFIT.  When "allocate struct page from pmem" support is added, this
default policy can be overridden by assigning a legacy pmem namespace to
a pfn device; however, this would only be necessary if a platform used
the legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-28 23:40:05 -04:00
Dan Williams
e1455744b2 libnvdimm, pfn: 'struct page' provider infrastructure
Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving space
for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm-device-model for
configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly identical
to BTT devices drivers/nvdimm/claim.c is created to house the common
bits.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-28 23:39:36 -04:00
Vishal Verma
6ec689542b libnvdimm, btt: write and validate parent_uuid
When a BTT is instantiated on a namespace it must validate that the
namespace uuid matches the 'parent_uuid' stored in the btt superblock. This
property enforces that changing the namespace UUID invalidates all
former BTT instances on that storage. For "IO namespaces" that don't
have a label or UUID, the parent_uuid is set to zero, and this
validation is skipped. For such cases, old BTTs have to be invalidated
by forcing the namespace to raw mode, and overwriting the BTT info
blocks.
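
The validation rule can be sketched as follows. This is an illustration
under assumptions, not the driver's code: the function names are
hypothetical and UUIDs are modeled as plain 16-byte arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16

/* True when all 16 bytes are zero, i.e. no parent UUID was recorded. */
bool uuid_is_zero(const uint8_t uuid[UUID_LEN])
{
	static const uint8_t zero[UUID_LEN];

	return memcmp(uuid, zero, UUID_LEN) == 0;
}

/*
 * Accept the BTT if its stored parent_uuid is zero (label-less
 * namespace, validation skipped) or equals the namespace's current UUID.
 */
bool btt_parent_ok(const uint8_t stored[UUID_LEN],
		   const uint8_t ns_uuid[UUID_LEN])
{
	assert(stored != NULL && ns_uuid != NULL);
	if (uuid_is_zero(stored))
		return true;
	return memcmp(stored, ns_uuid, UUID_LEN) == 0;
}
```

Under this rule, changing the namespace UUID makes the comparison fail
for every BTT written under the old UUID, while a zero parent_uuid can
only be invalidated by overwriting the info blocks.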

Based on a patch by Dan Williams <dan.j.williams@intel.com>

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 13:43:04 -04:00
Toshi Kani
74ae66c3b1 libnvdimm: Add sysfs numa_node to NVDIMM devices
Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
  /sys/bus/nd/devices
  |-- btt0.0/numa_node:0
  |-- btt1.0/numa_node:1
  |-- btt1.1/numa_node:1
  |-- namespace0.0/numa_node:0
  |-- namespace1.0/numa_node:1
  |-- region0/numa_node:0
  |-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
  /sys/class/block/pmem0/device/numa_node:0
  /sys/class/block/pmem1s/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
  numactl --preferred block:pmem0 --show
  numactl --preferred file:/dev/pmem1s --show

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Vishal Verma
fcae695737 libnvdimm, blk: add support for blk integrity
Support multiple block sizes (sector + metadata) for nd_blk in the
same way as done for the BTT. Add the idea of an 'internal' lbasize,
which is properly aligned and padded, and store metadata in this space.
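
As a rough sketch of the arithmetic, assuming a 64-byte alignment and
512- or 4096-byte sectors (both are assumptions for illustration, as are
the function names), the padded 'internal' size and the metadata size
could be derived like this:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed alignment of the padded on-media block; illustrative only. */
#define INTERNAL_ALIGN 64u

/* Round the exposed block size (sector + metadata) up to the alignment. */
uint32_t internal_lbasize(uint32_t lbasize)
{
	return (lbasize + INTERNAL_ALIGN - 1) / INTERNAL_ALIGN * INTERNAL_ALIGN;
}

/* Metadata bytes per block: whatever exceeds the 512- or 4096-byte sector. */
uint32_t meta_size(uint32_t lbasize)
{
	assert(lbasize >= 512);
	return lbasize - (lbasize >= 4096 ? 4096 : 512);
}
```

For example, a 520-byte block (512-byte sector plus 8 bytes of metadata)
would occupy 576 bytes on media under these assumptions.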

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Ross Zwisler
047fc8a1f9 libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnvdimm generic nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.
[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Vishal Verma
5212e11fde nd_btt: atomic sector updates
BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.
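
The two lane-selection strategies compared above can be sketched in user
space with C11 atomics. This is an assumption-laden illustration: the
function names are hypothetical, and a plain modulo stands in for the
cpu -> lane hash. In both cases a per-lane lock would still be needed
once CPUs outnumber lanes.

```c
#include <assert.h>
#include <stdatomic.h>

/* Strategy 1: hand out lanes round-robin from a shared atomic counter. */
static atomic_uint last_lane;

unsigned int lane_round_robin(unsigned int nr_lanes)
{
	assert(nr_lanes > 0);
	/* Every caller writes the same cache line, which bounces between CPUs. */
	return atomic_fetch_add(&last_lane, 1) % nr_lanes;
}

/* Strategy 2: derive the lane directly from the CPU number. */
unsigned int lane_from_cpu(unsigned int cpu, unsigned int nr_lanes)
{
	assert(nr_lanes > 0);
	/* No shared state is touched, but CPUs 1 and 5 collide when nr_lanes is 4. */
	return cpu % nr_lanes;
}
```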

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams
8c2f7e8658 libnvdimm: infrastructure for btt devices
NVDIMM namespaces, in addition to accepting "struct bio" based requests,
also have the capability to perform byte-aligned accesses.  By default
only the bio/block interface is used.  However, if another driver can
make effective use of the byte-aligned capability it can claim the
namespace interface and use the byte-aligned ->rw_bytes() interface.

The BTT driver is the first consumer of this mechanism to allow
adding atomic sector update semantics to a pmem or blk namespace.  This
patch is the sysfs infrastructure to allow configuring a BTT instance
for a namespace.  Enabling that BTT and performing i/o is in a
subsequent patch.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-25 04:20:04 -04:00
Dan Williams
0ba1c63489 libnvdimm: write blk label set
After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values the labels on the dimm can be updated.  The
difference with the pmem case is that blk namespaces are limited to one
dimm and can cover discontiguous ranges in dpa space.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
f524bf271a libnvdimm: write pmem label set
After 'uuid', 'size', and optionally 'alt_name' have been set to valid
values the labels on the dimms can be updated.

Write procedure is:
1/ Allocate and write new labels in the "next" index
2/ Free the old labels in the working copy
3/ Write the bitmap and the label space on the dimm
4/ Write the index to make the update valid

Label ranges directly mirror the dpa resource values for the given
label_id of the namespace.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
1b40e09a12 libnvdimm: blk labels and namespace instantiation
A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm.  They may alias with one or
more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0 new child namespaces may be created.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
bf9bccc14c libnvdimm: pmem label sets and namespace instantiation.
A complete label set is a PMEM-label per-dimm per-interleave-set where
all the UUIDs match and the interleave set cookie matches the hosting
interleave set.

Present sysfs attributes for manipulation of a PMEM-namespace's
'alt_name', 'uuid', and 'size' attributes.  A later patch will make
these settings persistent by writing back the label.

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.
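
The two growth directions can be pictured with a toy allocator. This is
a simplified sketch, not the libnvdimm resource-tracking code: struct
dpa_space and the function names are hypothetical, and free space is
modeled as a single range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical free-space tracker for one DPA range [lo, hi). */
struct dpa_space {
	uint64_t lo;	/* first free byte; PMEM grows up from here */
	uint64_t hi;	/* one past the last free byte; BLK grows down */
};

/* Allocate size bytes for PMEM; returns the start DPA or UINT64_MAX. */
uint64_t alloc_pmem(struct dpa_space *s, uint64_t size)
{
	uint64_t start = s->lo;

	assert(s->lo <= s->hi);
	if (size > s->hi - s->lo)
		return UINT64_MAX;
	s->lo += size;
	return start;
}

/* Allocate size bytes for BLK; returns the start DPA or UINT64_MAX. */
uint64_t alloc_blk(struct dpa_space *s, uint64_t size)
{
	assert(s->lo <= s->hi);
	if (size > s->hi - s->lo)
		return UINT64_MAX;
	s->hi -= size;
	return s->hi;
}

/* Demo: in a 1024-byte range, where does a BLK allocation land after PMEM? */
uint64_t demo_blk_after_pmem(void)
{
	struct dpa_space s = { 0, 1024 };

	alloc_pmem(&s, 256);		/* occupies [0, 256) */
	return alloc_blk(&s, 256);	/* occupies [768, 1024) */
}
```

Because the two kinds of allocation approach each other from opposite
ends, the free space between them stays contiguous in this model.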

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
3d88002e4a libnvdimm: support for legacy (non-aliasing) nvdimms
The libnvdimm region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media through direct PMEM load/store
access, or windowed BLK mode.  Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
If an nvdimm is single interfaced, then there is no need for dimm
metadata labels.  For these devices we can take the region boundaries
directly to create a child namespace device (nd_namespace_io).

Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00