linux/drivers
Dave Hansen c221c0b030 device-dax: "Hotplug" persistent memory for use like normal RAM
This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then tell the new driver that it can bind to the device:

	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Think of if you had good data on a pmem device.  It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver.  kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-02-28 10:41:23 -08:00
..
accessibility
acpi acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node 2019-01-06 21:41:57 -08:00
amba
android
ata for-4.21/libata-20190102 2019-01-02 18:47:56 -08:00
atm
auxdisplay auxdisplay: charlcd: fix x/y command parsing 2018-12-21 21:27:21 +01:00
base device-dax: "Hotplug" persistent memory for use like normal RAM 2019-02-28 10:41:23 -08:00
bcma
block block: sunvdc: don't run hw queue synchronously from irq context 2019-01-03 08:21:47 -07:00
bluetooth
bus ARM: SoC driver updates 2018-12-31 17:32:35 -08:00
cdrom gdrom: fix a memory leak bug 2018-12-29 08:20:44 -07:00
char Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
clk Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2019-01-02 18:56:59 -08:00
clocksource arch/csky patches for 4.21-rc1 2019-01-05 09:50:07 -08:00
connector
cpufreq powerpc updates for 4.21 2018-12-27 10:43:24 -08:00
cpuidle powerpc updates for 4.21 2018-12-27 10:43:24 -08:00
crypto Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
dax device-dax: "Hotplug" persistent memory for use like normal RAM 2019-02-28 10:41:23 -08:00
dca
devfreq
dio
dma Merge branch 'next/drivers' into next/late 2019-01-04 14:31:38 -08:00
dma-buf drivers/dma-buf/udmabuf.c: convert to use vm_fault_t 2019-01-04 13:13:46 -08:00
edac
eisa
extcon
firewire FireWire (IEEE 1394) subsystem patch: 2019-01-05 18:33:21 -08:00
firmware arm64 fixes for -rc1 2019-01-05 11:28:39 -08:00
fmc
fpga Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
fsi
gnss
gpio This is the bulk of GPIO changes for the v4.21 kernel series: 2018-12-28 20:00:21 -08:00
gpu drm i915 gvt, amdgpu, core fixes 2019-01-05 18:25:19 -08:00
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid 2019-01-05 17:53:40 -08:00
hsi
hv Char/Misc driver patches for 4.21-rc1 2018-12-28 20:54:57 -08:00
hwmon Kconfig updates for v4.21 2018-12-29 13:03:29 -08:00
hwspinlock hwspinlock: fix return value check in stm32_hwspinlock_probe() 2019-01-03 11:42:10 -08:00
hwtracing
i2c Merge branch 'i2c/for-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2019-01-05 18:13:35 -08:00
i3c
ide for-4.21/block-20181221 2018-12-28 13:19:59 -08:00
idle
iio Staging/IIO driver patches for 4.21-rc1 2018-12-28 20:39:58 -08:00
infiniband 4.21 merge window 2nd pull request 2019-01-05 18:20:51 -08:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2019-01-02 18:56:59 -08:00
iommu IOMMU Updates for Linux v4.21 2019-01-01 15:55:29 -08:00
ipack
irqchip Xtensa updates for v4.21: 2018-12-29 09:40:40 -08:00
isdn isdn: fix kernel-infoleak in capi_unlocked_ioctl 2019-01-02 10:31:39 -08:00
leds LEDs for 4.21-rc1 2018-12-25 14:52:50 -08:00
lightnvm lightnvm: pblk: fix use-after-free bug 2018-12-22 14:45:35 -07:00
macintosh Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
mailbox mailbox: tegra-hsp: Use device-managed registration API 2018-12-21 22:31:26 -06:00
mcb
md Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md into for-linus 2019-01-03 08:21:02 -07:00
media Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
memory ARM: SoC: late updates 2019-01-05 11:30:37 -08:00
memstick MMC core: 2018-12-28 16:52:18 -08:00
message
mfd
misc Merge branch 'i2c/for-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2019-01-05 18:13:35 -08:00
mmc MMC core: 2018-12-28 16:52:18 -08:00
mtd Kbuild updates for v4.21 2018-12-29 12:03:17 -08:00
mux
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-01-03 12:53:47 -08:00
nfc
ntb
nubus
nvdimm acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node 2019-01-06 21:41:57 -08:00
nvme
nvmem
of Merge tag 'devicetree-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux 2018-12-28 20:08:34 -08:00
opp
oprofile
parisc Kconfig file consolidation for v4.21 2018-12-29 13:40:29 -08:00
parport
pci pci-v4.21-changes 2019-01-05 17:57:34 -08:00
pcmcia Included in this update: 2019-01-05 11:23:17 -08:00
perf drivers/perf: hisi: Fixup one DDRC PMU register offset 2019-01-04 10:13:27 +00:00
phy
pinctrl Pin control bulk changes for the v4.21 kernel cycle: 2019-01-01 13:19:16 -08:00
platform chrome platform changes for v4.21 2019-01-06 11:40:06 -08:00
pnp Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
power power supply and reset changes for the v4.21 series 2018-12-28 20:22:45 -08:00
powercap
pps Kconfig updates for v4.21 2018-12-29 13:03:29 -08:00
ps3
ptp Char/Misc driver patches for 4.21-rc1 2018-12-28 20:54:57 -08:00
pwm pwm: imx: Add ipg clock operation 2018-12-24 12:06:56 +01:00
rapidio
ras treewide: surround Kconfig file paths with double quotes 2018-12-22 00:25:54 +09:00
regulator Merge remote-tracking branch 'regulator/topic/coupled' into regulator-next 2018-12-21 13:43:35 +00:00
remoteproc
reset
rpmsg
rtc RTC for 4.21 2019-01-01 13:24:31 -08:00
s390 s390 updates for the 4.21 merge window 2019-01-02 18:37:01 -08:00
sbus Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next 2018-12-26 10:32:18 -08:00
scsi Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
sfi
sh
siox
slimbus
sn
soc ARM: SoC driver updates 2018-12-31 17:32:35 -08:00
soundwire
spi spi: Updates for v4.21 2018-12-25 14:43:54 -08:00
spmi
ssb
staging Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
target SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
tc
tee OP-TEE dynamic shm log message 2018-12-31 13:06:30 -08:00
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2019-01-05 16:07:28 -08:00
thunderbolt
tty ARM: SoC: late updates 2019-01-05 11:30:37 -08:00
uio Char/Misc driver patches for 4.21-rc1 2018-12-28 20:54:57 -08:00
usb pci-v4.21-changes 2019-01-05 17:57:34 -08:00
uwb
vfio IOMMU Updates for Linux v4.21 2019-01-01 15:55:29 -08:00
vhost Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
video fbdev changes for v4.21: 2019-01-05 18:15:37 -08:00
virt
virtio virtio, vhost: features, fixes, cleanups 2019-01-02 18:54:45 -08:00
visorbus
vlynq
vme
w1 treewide: surround Kconfig file paths with double quotes 2018-12-22 00:25:54 +09:00
watchdog linux-watchdog 4.21-rc1 tag 2019-01-01 13:16:45 -08:00
xen Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
zorro
Kconfig Kconfig file consolidation for v4.21 2018-12-29 13:40:29 -08:00
Makefile