linux

Author	SHA1	Message	Date
Toshi Kani	4e3f0701f2	libnvdimm: fix badblock range handling of ARS range __add_badblock_range() does not account sector alignment when it sets 'num_sectors'. Therefore, an ARS error record range spanning across two sectors is set to a single sector length, which leaves the 2nd sector unprotected. Change __add_badblock_range() to set 'num_sectors' properly. Cc: <stable@vger.kernel.org> Fixes: `0caeef63e6` ("libnvdimm: Add a poison list and export badblocks") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-17 11:43:58 -07:00
Linus Torvalds	130568d5ea	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "This is a followup for block changes, that didn't make the initial pull request. It's a bit of a mixed bag, this contains: - A followup pull request from Sagi for NVMe. Outside of fixups for NVMe, it also includes a series for ensuring that we properly quiesce hardware queues when browsing live tags. - Set of integrity fixes from Dmitry (mostly), fixing various issues for folks using DIF/DIX. - Fix for a bug introduced in cciss, with the req init changes. From Christoph. - Fix for a bug in BFQ, from Paolo. - Two followup fixes for lightnvm/pblk from Javier. - Depth fix from Ming for blk-mq-sched. - Also from Ming, performance fix for mtip32xx that was introduced with the dynamic initialization of commands" * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits) block: call bio_uninit in bio_endio nvmet: avoid unneeded assignment of submit_bio return value nvme-pci: add module parameter for io queue depth nvme-pci: compile warnings in nvme_alloc_host_mem() nvmet_fc: Accept variable pad lengths on Create Association LS nvme_fc/nvmet_fc: revise Create Association descriptor length lightnvm: pblk: remove unnecessary checks lightnvm: pblk: control I/O flow also on tear down cciss: initialize struct scsi_req null_blk: fix error flow for shared tags during module_init block: Fix __blkdev_issue_zeroout loop nvme-rdma: unconditionally recycle the request mr nvme: split nvme_uninit_ctrl into stop and uninit virtio_blk: quiesce/unquiesce live IO when entering PM states mtip32xx: quiesce request queues to make sure no submissions are inflight nbd: quiesce request queues to make sure no submissions are inflight nvme: kick requeue list when requeueing a request instead of when starting the queues nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues ...	2017-07-11 15:36:52 -07:00
Linus Torvalds	b6ffe9ba46	libnvdimm for 4.13 * Introduce the _flushcache() family of memory copy helpers and use them for persistent memory write operations on x86. The _flushcache() semantic indicates that the cache is either bypassed for the copy operation (movnt) or any lines dirtied by the copy operation are written back (clwb, clflushopt, or clflush). * Extend dax_operations with ->copy_from_iter() and ->flush() operations. These operations and other infrastructure updates allow all persistent memory specific dax functionality to be pushed into libnvdimm and the pmem driver directly. It also allows dax-specific sysfs attributes to be linked to a host device, for example: /sys/block/pmem0/dax/write_cache * Add support for the new NVDIMM platform/firmware mechanisms introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace label format, extensions to the address-range-scrub command set, new error injection commands, and a new BTT (block-translation-table) layout. These updates support inter-OS and pre-OS compatibility. * Fix a longstanding memory corruption bug in nfit_test. * Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2) capable. * Miscellaneous fixes and small updates across libnvdimm and the nfit driver. Acknowledgements that came after the branch was pushed: commit `6aa734a2f3` "libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime" Reviewed-by: Toshi Kani <toshi.kani@hpe.com> -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7 5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i 2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6 MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7 JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ pMHuMAqIcIyLhIe2/sRF =xKQB -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "libnvdimm updates for the latest ACPI and UEFI specifications. This pull request also includes new 'struct dax_operations' enabling to undo the abuse of copy_user_nocache() for copy operations to pmem. The dax work originally missed 4.12 to address concerns raised by Al. Summary: - Introduce the _flushcache() family of memory copy helpers and use them for persistent memory write operations on x86. The _flushcache() semantic indicates that the cache is either bypassed for the copy operation (movnt) or any lines dirtied by the copy operation are written back (clwb, clflushopt, or clflush). - Extend dax_operations with ->copy_from_iter() and ->flush() operations. These operations and other infrastructure updates allow all persistent memory specific dax functionality to be pushed into libnvdimm and the pmem driver directly. It also allows dax-specific sysfs attributes to be linked to a host device, for example: /sys/block/pmem0/dax/write_cache - Add support for the new NVDIMM platform/firmware mechanisms introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace label format, extensions to the address-range-scrub command set, new error injection commands, and a new BTT (block-translation-table) layout. These updates support inter-OS and pre-OS compatibility. - Fix a longstanding memory corruption bug in nfit_test. - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2) capable. - Miscellaneous fixes and small updates across libnvdimm and the nfit driver. Acknowledgements that came after the branch was pushed: commit `6aa734a2f3` ("libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits) libnvdimm, namespace: record 'lbasize' for pmem namespaces acpi/nfit: Issue Start ARS to retrieve existing records libnvdimm: New ACPI 6.2 DSM functions acpi, nfit: Show bus_dsm_mask in sysfs libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru. acpi, nfit: Enable DSM pass thru for root functions. libnvdimm: passthru functions clear to send libnvdimm, btt: convert some info messages to warn/err libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime libnvdimm: fix the clear-error check in nsio_rw_bytes libnvdimm, btt: fix btt_rw_page not returning errors acpi, nfit: quiet invalid block-aperture-region warnings libnvdimm, btt: BTT updates for UEFI 2.7 format acpi, nfit: constify *_attribute_group libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region libnvdimm, pmem, dax: export a cache control attribute dax: convert to bitmask for flags dax: remove default copy_from_iter fallback libnvdimm, nfit: enable support for volatile ranges libnvdimm, pmem: fix persistence warning ...	2017-07-07 09:44:06 -07:00
Dan Williams	9d92573fff	Merge branch 'for-4.13/dax' into libnvdimm-for-next	2017-07-03 16:54:58 -07:00
Dan Williams	2de5148ffb	libnvdimm, namespace: record 'lbasize' for pmem namespaces Commit `f979b13c3c` "libnvdimm, label: honor the lba size specified in v1.2 labels") neglected to update the 'lbasize' in the label when the namespace sector_size attribute was written. We need this value in the label for inter-OS / pre-OS compatibility. Fixes: `f979b13c3c` ("libnvdimm, label: honor the lba size specified in v1.2 labels") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-03 16:30:44 -07:00
Dmitry Monakhov	b1fb2c52b2	block: guard bvec iteration logic Currently if some one try to advance bvec beyond it's size we simply dump WARN_ONCE and continue to iterate beyond bvec array boundaries. This simply means that we endup dereferencing/corrupting random memory region. Sane reaction would be to propagate error back to calling context But bvec_iter_advance's calling context is not always good for error handling. For safity reason let truncate iterator size to zero which will break external iteration loop which prevent us from unpredictable memory range corruption. And even it caller ignores an error, it will corrupt it's own bvecs, not others. This patch does: - Return error back to caller with hope that it will react on this - Truncate iterator size Code was added long time ago here `4550dd6c`, luckily no one hit it in real life :) Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: switch to true/false returns instead of errno values] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-03 16:56:26 -06:00
Dmitry Monakhov	e23947bd76	bio-integrity: fold bio_integrity_enabled to bio_integrity_prep Currently all integrity prep hooks are open-coded, and if prepare fails we ignore it's code and fail bio with EIO. Let's return real error to upper layer, so later caller may react accordingly. In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled, so it is reasonable to fold it in to one function. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: merged with the latest block tree, return bool from bio_integrity_prep] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-03 16:56:24 -06:00
Jerry Hoemann	53b85a449b	libnvdimm: passthru functions clear to send Have dsm functions called via the pass thru mechanism also be checked against clear to send. Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-01 08:49:59 -07:00
Vishal Verma	e6be2dcbef	libnvdimm, btt: convert some info messages to warn/err Some critical messages such as IO errors, metadata failures were printed with dev_info. Make them louder by upgrading them to dev_warn or dev_error. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-01 08:49:59 -07:00
Dan Williams	6aa734a2f3	libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime We need to hold a reference on the 'dirent' until we are sure there are no more notifications that will be sent. As noted in the new comments we take advantage of the fact that the references are taken and dropped under device_lock() and that nd_device_notify() holds device_lock() over new badblocks notifications. The notifications that happen when badblocks are cleared only occur while the device is active. Also take the opportunity to fix up the error messages to report the user visible effect of a sysfs_get_dirent() failure. Fixes: `975750a98c` ("libnvdimm, pmem: Add sysfs notifications to badblocks") Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-30 18:56:03 -07:00
Vishal Verma	7e5a21dfe5	libnvdimm: fix the clear-error check in nsio_rw_bytes A leftover from the 'bandaid' fix that disabled BTT error clearing in rw_bytes resulted in an incorrect check. After we converted these checks over to use the NVDIMM_IO_ATOMIC flag, the ndns->claim check was both redundant, and incorrect. Remove it. Fixes: `3ae3d67ba7` ("libnvdimm: add an atomic vs process context flag to rw_bytes") Cc: <stable@vger.kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-30 18:50:34 -07:00
Vishal Verma	c13c43d54f	libnvdimm, btt: fix btt_rw_page not returning errors btt_rw_page was not propagating errors frm btt_do_bvec, resulting in any IO errors via the rw_page path going unnoticed. the pmem driver recently fixed this in `e10624f` pmem: fail io-requests to known bad blocks but same problem in BTT went neglected. Fixes: `5212e11fde` ("nd_btt: atomic sector updates") Cc: <stable@vger.kernel.org> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 18:24:28 -07:00
Dan Williams	d5d51fece7	acpi, nfit: quiet invalid block-aperture-region warnings This state is already visible by userspace since the BLK region will not be enabled, and it is otherwise benign as it usually indicates that the DIMM is not configured. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 13:50:38 -07:00
Vishal Verma	14e4945426	libnvdimm, btt: BTT updates for UEFI 2.7 format The UEFI 2.7 specification defines an updated BTT metadata format, bumping the revision to 2.0. Add support for the new format, while retaining compatibility for the old 1.1 format. Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Linda Knippers <linda.knippers@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 13:50:38 -07:00
Dan Williams	0b277961f4	libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region The pmem driver attaches to both persistent and volatile memory ranges advertised by the ACPI NFIT. When the region is volatile it is redundant to spend cycles flushing caches at fsync(). Check if the hosting region is volatile and do not set dax_write_cache() if it is. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 09:29:50 -07:00
Dan Williams	6e0c90d691	libnvdimm, pmem, dax: export a cache control attribute The dax_flush() operation can be turned into a nop on platforms where firmware arranges for cpu caches to be flushed on a power-fail event. The ACPI 6.2 specification defines a mechanism for the platform to indicate this capability so the kernel can select the proper default. However, for other platforms, the administrator must toggle this setting manually. Given this flush setting is a dax-specific mechanism we advertise it through a 'dax' attribute group hanging off a host device. For example, a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a 'write_cache' attribute to control response to dax cache flush requests. This is similar to the 'queue/write_cache' attribute that appears under block devices. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 09:29:50 -07:00
Dan Williams	c9e582aa68	libnvdimm, nfit: enable support for volatile ranges Allow volatile nfit ranges to participate in all the same infrastructure provided for persistent memory regions. A resulting resulting namespace device will still be called "pmem", but the parent region type will be "nd_volatile". This is in preparation for disabling the dax ->flush() operation in the pmem driver when it is hosted on a volatile range. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:44:13 -07:00
Dan Williams	c00b396ef7	libnvdimm, pmem: fix persistence warning The pmem driver assumes if platform firmware describes the memory devices associated with a persistent memory range and CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to flush data to a power-fail safe zone. We warn if the firmware does not describe memory devices, but we also need to warn if the architecture does not claim pmem support. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:44:01 -07:00
Dan Williams	ca6a4657e5	x86, libnvdimm, pmem: remove global pmem api Now that all callers of the pmem api have been converted to dax helpers that call back to the pmem driver, we can remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Oliver O'Halloran <oohall@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:29:54 -07:00
Dan Williams	f2b612578e	x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm Kill this globally defined wrapper and move to libnvdimm so that we can ultimately remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:29:00 -07:00
Christoph Hellwig	0b0bcacc3b	block: don't bother with bounce limits for make_request drivers We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-27 12:13:45 -06:00
Dan Williams	4e4f00a9b5	x86, dax, libnvdimm: remove wb_cache_pmem() indirection With all handling of the CONFIG_ARCH_HAS_PMEM_API case being moved to libnvdimm and the pmem driver directly we do not need to provide global wrappers and fallbacks in the CONFIG_ARCH_HAS_PMEM_API=n case. The pmem driver will simply not link to arch_wb_cache_pmem() in that case. Same as before, pmem flushing is only defined for x86_64, via clean_cache_range(), but it is straightforward to add other archs in the future. arch_wb_cache_pmem() is an exported function since the pmem module needs to find it, but it is privately declared in drivers/nvdimm/pmem.h because there are no consumers outside of the pmem driver. Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oliver O'Halloran <oohall@gmail.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:35:24 -07:00
Dan Williams	3c1cebff23	dax, pmem: introduce an optional 'flush' dax_operation Filesystem-DAX flushes caches whenever it writes to the address returned through dax_direct_access() and when writing back dirty radix entries. That flushing is only required in the pmem case, so add a dax operation to allow pmem to take this extra action, but skip it for other dax capable devices that do not provide a flush routine. An example for this differentiation might be a volatile ram disk where there is no expectation of persistence. In fact the pmem driver itself might front such an address range specified by the NFIT. So, this "no flush" property might be something passed down by the bus / libnvdimm. Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:34:59 -07:00
Toshi Kani	975750a98c	libnvdimm, pmem: Add sysfs notifications to badblocks Sysfs "badblocks" information may be updated during run-time that: - MCE, SCI, and sysfs "scrub" may add new bad blocks - Writes and ioctl() may clear bad blocks Add support to send sysfs notifications to sysfs "badblocks" file under region and pmem directories when their badblocks information is re-evaluated (but is not necessarily changed) during run-time. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Linda Knippers <linda.knippers@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:41 -07:00
Dan Williams	8990cdf10c	libnvdimm, label: switch to using v1.2 labels by default The rules for which version of the label specification are in effect at any given point in time are as follows: 1/ If a DIMM has an existing / valid index block then the version specified is used regardless if it is a previous version. 2/ By default when the kernel is initializing new index blocks the latest specification version (v1.2 at time of writing) is used. 3/ An environment that wants to force create v1.1 label-sets must arrange for userspace to disable all active regions / namespaces / dimms and write a valid set of v1.1 index blocks to the dimms. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:41 -07:00
Dan Williams	b3fde74ea1	libnvdimm, label: add address abstraction identifiers Starting with v1.2 labels, 'address abstractions' can be hinted via an address abstraction id that implies an info-block format. The standard address abstraction in the specification is the v2 format of the Block-Translation-Table (BTT). Support for that is saved for a later patch, for now we add support for the Linux supported address abstractions BTT (v1), PFN, and DAX. The new 'holder_class' attribute for namespace devices is added for tooling to specify the 'abstraction_guid' to store in the namespace label. For v1.1 labels this field is undefined and any setting of 'holder_class' away from the default 'none' value will only have effect until the driver is unloaded. Setting 'holder_class' requires that whatever device tries to claim the namespace must be of the specified class. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	355d838878	libnvdimm, label: add v1.2 label checksum support The v1.2 namespace label specification adds a fletcher checksum to each label instance. Add generation and validation support for the new field. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	3934d8410c	libnvdimm, label: update 'nlabel' and 'position' handling for local namespaces The v1.2 namespace label specification requires 'nlabel' and 'position' to be valid for the first ("lowest dpa") label in the set. It also requires all non-first labels to set those fields to 0xff. Linux does not much care if these values are correct, because we can just trust the count of labels with the matching uuid like the v1.1 case. However, we set them correctly in case other environments care. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	8f2bc2430e	libnvdimm, label: populate 'isetcookie' for blk-aperture namespaces Starting with the v1.2 definition of namespace labels, the isetcookie field is populated and validated for blk-aperture namespaces. This adds some safety against inadvertent copying of namespace labels from one DIMM-device to another. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	faec6f8a1c	libnvdimm, label: populate the type_guid property for v1.2 namespaces The type_guid refers to the "Address Range Type GUID" for the region backing a namespace as defined the ACPI NFIT (NVDIMM Firmware Interface Table). This 'type' identifier specifies an access mechanism for the given namespace. This capability replaces the confusing usage of the 'NSLABEL_FLAG_LOCAL' flag to indicate a block-aperture-mode namespace. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	f979b13c3c	libnvdimm, label: honor the lba size specified in v1.2 labels Previously we only honored the lba size for blk-aperture mode namespaces. For pmem namespaces the lba size was just assumed to be 512. With the new v1.2 label definition and compatibility with other operating environments, the ->lbasize property is now respected for pmem namespaces. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:39 -07:00
Dan Williams	c12c48ce86	libnvdimm, label: add v1.2 interleave-set-cookie algorithm The interleave-set-cookie algorithm is extended to incorporate all the same components that are used to generate an nvdimm unique-id. For backwards compatibility we still maintain the old v1.1 definition. Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com> Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:39 -07:00
Dan Williams	564e871aa6	libnvdimm, label: add v1.2 nvdimm label definitions In support of improved interoperability between operating systems and pre-boot environments the Intel proposed NVDIMM Namespace Specification [1], has been adopted and modified to the the UEFI 2.7 NVDIMM Label Protocol [2]. Update the definitions of the namespace label data structures so that the new format can be supported alongside the existing label format. The new specification changes the default label size to 256 bytes, so everywhere that relied on sizeof(struct nd_namespace_label) must now use the sizeof_namespace_label() helper. There should be no functional differences from these changes as the default is still the v1.1 128-byte format. Future patches will move the default to the v1.2 definition. [1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf [2]: http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:39 -07:00
Christoph Hellwig	fdd050b5b3	Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base	2017-06-13 11:45:14 +02:00
Dan Williams	0aed55af88	x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations The pmem driver has a need to transfer data with a persistent memory destination and be able to rely on the fact that the destination writes are not cached. It is sufficient for the writes to be flushed to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync() to ensure data-writes have reached a power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn around and fence previous writes with an "sfence". Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and memcpy_flushcache, that guarantee that the destination buffer is not dirty in the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines will be used to replace the "pmem api" (include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache() and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE config symbol, and fallback to copy_from_iter_nocache() and plain memcpy() otherwise. This is meant to satisfy the concern from Linus that if a driver wants to do something beyond the normal nocache semantics it should be something private to that driver [1], and Al's concern that anything uaccess related belongs with the rest of the uaccess code [2]. The first consumer of this interface is a new 'copy_from_iter' dax operation so that pmem can inject cache maintenance operations without imposing this overhead on other dax-capable drivers. [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-09 09:09:56 -07:00
Christoph Hellwig	4e4cbee93d	block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-06-09 09:27:32 -06:00
Christoph Hellwig	ef40dda5bb	uuid: hoist uuid_is_null() helper from libnvdimm Hoist the libnvdimm helper as an inline helper to linux/uuid.h using an auxiliary const variable uuid_null in lib/uuid.c. [hch: also add the guid variant. Both do the same but I'd like to keep casts to a minimum] The common helper uses the new abstract type uuid_t * instead of u8 *. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Amir Goldstein <amir73il@gmail.com> [hch: added guid_is_null] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>	2017-06-05 16:59:05 +02:00
Linus Torvalds	0fcc3ab23d	Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "Incremental fixes and a small feature addition on top of the main libnvdimm 4.12 pull request: - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX. The size regression is fixed by moving all dax helpers into the dax-core and only specifying "select DAX" for FS_DAX and dax-capable drivers. He also asked for clarification of the NR_DEV_DAX config option which, on closer look, does not need to be a config option at all. Mike also throws in a DEV_DAX_PMEM fixup for good measure. - Ben's attention to detail on -stable patch submissions caught a case where the recent fixes to arch_copy_from_iter_pmem() missed a condition where we strand dirty data in the cache. This is tagged for -stable and will also be included in the rework of the pmem api to a proposed {memcpy,copy_user}_flushcache() interface for 4.13. - Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace. - Ross noticed that the dax_device + dax_operations conversion broke __dax_zero_page_range(). The nvdimm unit tests fail to check this path, but xfstests immediately trips over it. No excuse for missing this before submitting the 4.12 pull request. These all pass the nvdimm unit tests and an xfstests spot check. The set has received a build success notification from the kbuild robot" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: filesystem-dax: fix broken __dax_zero_page_range() conversion libnvdimm, btt: ensure that initializing metadata clears poison libnvdimm: add an atomic vs process context flag to rw_bytes x86, pmem: Fix cache flushing for iovec write < 8 bytes device-dax: kill NR_DEV_DAX block, dax: move "select DAX" from BLOCK to FS_DAX device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX	2017-05-12 15:43:10 -07:00
Vishal Verma	b177fe85dd	libnvdimm, btt: ensure that initializing metadata clears poison If we had badblocks/poison in the metadata area of a BTT, recreating the BTT would not clear the poison in all cases, notably the flog area. This is because rw_bytes will only clear errors if the request being sent down is 512B aligned and sized. Make sure that when writing the map and info blocks, the rw_bytes being sent are of the correct size/alignment. For the flog, instead of doing the smaller log_entry writes only, first do a 'wipe' of the entire area by writing zeroes in large enough chunks so that errors get cleared. Cc: Andy Rudoff <andy.rudoff@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-10 21:46:22 -07:00
Vishal Verma	3ae3d67ba7	libnvdimm: add an atomic vs process context flag to rw_bytes nsio_rw_bytes can clear media errors, but this cannot be done while we are in an atomic context due to locking within ACPI. From the BTT, ->rw_bytes may be called either from atomic or process context depending on whether the calls happen during initialization or during IO. During init, we want to ensure error clearing happens, and the flag marking process context allows nsio_rw_bytes to do that. When called during IO, we're in atomic context, and error clearing can be skipped. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-10 21:46:22 -07:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Linus Torvalds	53ef7d0e20	libnvdimm for 4.12 * Region media error reporting: A libnvdimm region device is the parent to one or more namespaces. To date, media errors have been reported via the "badblocks" attribute attached to pmem block devices for namespaces in "raw" or "memory" mode. Given that namespaces can be in "device-dax" or "btt-sector" mode this new interface reports media errors generically, i.e. independent of namespace modes or state. This subsequently allows userspace tooling to craft "ACPI 6.1 Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and submit them via the ioctl path for NVDIMM root bus devices. * Introduce 'struct dax_device' and 'struct dax_operations': Prompted by a request from Linus and feedback from Christoph this allows for dax capable drivers to publish their own custom dax operations. This fixes the broken assumption that all dax operations are related to a persistent memory device, and makes it easier for other architectures and platforms to add customized persistent memory support. * 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is available for storage appliance applications to manually trigger memory controllers to drain write-pending buffers that would otherwise be flushed automatically by the platform ADR (asynchronous-DRAM-refresh) mechanism at a power loss event. Support for "locked" DIMMs is included to prevent namespaces from surfacing when the namespace label data area is locked. Finally, fixes for various reported deadlocks and crashes, also tagged for -stable. * ACPI / nfit driver updates: General updates of the nfit driver to add DSM command overrides, ACPI 6.1 health state flags support, DSM payload debug available by default, and various fixes. Acknowledgements that came after the branch was pushed: commmit `565851c972` "device-dax: fix sysfs attribute deadlock" Tested-by: Yi Zhang <yizhan@redhat.com> commit `23f4984483` "libnvdimm: rework region badblocks clearing" Tested-by: Toshi Kani <toshi.kani@hpe.com> -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4 W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7 ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4 59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh TmJQ3Ly9t3o3/weHSzmn =Ox0s -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has been in multiple -next releases. There were a few late breaking fixes and small features that got added in the last couple days, but the whole set has received a build success notification from the kbuild robot. Change summary: - Region media error reporting: A libnvdimm region device is the parent to one or more namespaces. To date, media errors have been reported via the "badblocks" attribute attached to pmem block devices for namespaces in "raw" or "memory" mode. Given that namespaces can be in "device-dax" or "btt-sector" mode this new interface reports media errors generically, i.e. independent of namespace modes or state. This subsequently allows userspace tooling to craft "ACPI 6.1 Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and submit them via the ioctl path for NVDIMM root bus devices. - Introduce 'struct dax_device' and 'struct dax_operations': Prompted by a request from Linus and feedback from Christoph this allows for dax capable drivers to publish their own custom dax operations. This fixes the broken assumption that all dax operations are related to a persistent memory device, and makes it easier for other architectures and platforms to add customized persistent memory support. - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is available for storage appliance applications to manually trigger memory controllers to drain write-pending buffers that would otherwise be flushed automatically by the platform ADR (asynchronous-DRAM-refresh) mechanism at a power loss event. Support for "locked" DIMMs is included to prevent namespaces from surfacing when the namespace label data area is locked. Finally, fixes for various reported deadlocks and crashes, also tagged for -stable. - ACPI / nfit driver updates: General updates of the nfit driver to add DSM command overrides, ACPI 6.1 health state flags support, DSM payload debug available by default, and various fixes. Acknowledgements that came after the branch was pushed: - commmit `565851c972` "device-dax: fix sysfs attribute deadlock": Tested-by: Yi Zhang <yizhan@redhat.com> - commit `23f4984483` "libnvdimm: rework region badblocks clearing" Tested-by: Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits) libnvdimm, pfn: fix 'npfns' vs section alignment libnvdimm: handle locked label storage areas libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED brd: fix uninitialized use of brd->dax_dev block, dax: use correct format string in bdev_dax_supported device-dax: fix sysfs attribute deadlock libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering libnvdimm: rework region badblocks clearing acpi, nfit: kill ACPI_NFIT_DEBUG libnvdimm: fix clear length of nvdimm_forget_poison() libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify libnvdimm, region: sysfs trigger for nvdimm_flush() libnvdimm: fix phys_addr for nvdimm_clear_poison x86, dax, pmem: remove indirection around memcpy_from_pmem() block: remove block_device_operations ->direct_access() block, dax: convert bdev_dax_supported() to dax_direct_access() filesystem-dax: convert to dax_direct_access() Revert "block: use DAX for partition table reads" ext2, ext4, xfs: retrieve dax_device for iomap operations ...	2017-05-05 18:49:20 -07:00
Dan Williams	736163671b	Merge branch 'for-4.12/dax' into libnvdimm-for-next	2017-05-04 23:38:43 -07:00
Dan Williams	d5483feda8	libnvdimm, pfn: fix 'npfns' vs section alignment Fix failures to create namespaces due to the vmem_altmap not advertising enough free space to store the memmap. WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0 [..] Call Trace: dump_stack+0x63/0x83 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 arch_add_memory+0xde/0xf0 devm_memremap_pages+0x244/0x440 pmem_attach_disk+0x37e/0x490 [nd_pmem] nd_pmem_probe+0x7e/0xa0 [nd_pmem] nvdimm_bus_probe+0x71/0x120 [libnvdimm] driver_probe_device+0x2bb/0x460 bind_store+0x114/0x160 drv_attr_store+0x25/0x30 In commit `658922e57b` "libnvdimm, pfn: fix memmap reservation sizing" we arranged for the capacity to be allocated, but failed to also update the 'npfns' parameter. This leads to cases where there is enough capacity reserved to hold all the allocated sections, but vmemmap_populate_hugepages() still encounters -ENOMEM from altmap_alloc_block_buf(). This fix is a stop-gap until we can teach the core memory hotplug implementation to permit sub-section hotplug. Cc: <stable@vger.kernel.org> Fixes: `658922e57b` ("libnvdimm, pfn: fix memmap reservation sizing") Reported-by: Anisha Allada <anisha.allada@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-04 19:54:42 -07:00
Dan Williams	9d62ed9651	libnvdimm: handle locked label storage areas Per the latest version of the "NVDIMM DSM Interface Example" [1], the label data retrieval routine can report a "locked" status. In this case all regions associated with that DIMM are disabled until the label area is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label data area locking capabilities. [1]: http://pmem.io/documents/ Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-04 15:41:39 -07:00
Dan Williams	8f078b38dd	libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED This is a preparation patch for handling locked nvdimm label regions, a new concept as introduced by the latest DSM document on pmem.io [1]. A future patch will leverage nvdimm_set_locked() at DIMM probe time to flag regions that can not be enabled. There should be no functional difference resulting from this change. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-04 14:01:24 -07:00
Linus Torvalds	d3b5d35290	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...	2017-05-01 23:54:56 -07:00
Dan Williams	a3e9af95f7	libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" This continues the 4.11 status quo of disabling of error clearing from the BTT I/O path. Toshi found that even though we have eliminated all the libnvdimm sources of sleeping-while-atomic triggers, we still have sleeping operations that will occur in the path to send the ACPI DSM to the DIMM to clear the error: BUG: sleeping function called from invalid context at mm/slab.h:432 in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd Call Trace: dump_stack+0x86/0xc3 ___might_sleep+0x17d/0x250 __might_sleep+0x4a/0x80 __kmalloc+0x1c0/0x2e0 acpi_os_allocate_zeroed+0x2d/0x2f acpi_evaluate_object+0x59/0x3b1 acpi_evaluate_dsm+0xbd/0x10c acpi_nfit_ctl+0x1ef/0x7c0 [nfit] ? nsio_rw_bytes+0x152/0x280 nvdimm_clear_poison+0x77/0x140 nsio_rw_bytes+0x18f/0x280 btt_write_pg+0x1d4/0x3d0 [nd_btt] btt_make_request+0x119/0x2d0 [nd_btt] A solution for tracking and handling media errors natively in the BTT is needed. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-01 10:00:02 -07:00
Dan Williams	452bae0aed	libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering A debug patch to turn the standard device_lock() into something that lockdep can analyze yielded the following: ====================================================== [ INFO: possible circular locking dependency detected ] 4.11.0-rc4+ #106 Tainted: G O ------------------------------------------------------- lt-libndctl/1898 is trying to acquire lock: (&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm] but task is already holding lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x88/0x980 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm] nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm] nd_pmem_probe+0x14/0x180 [nd_pmem] nvdimm_bus_probe+0xa9/0x260 [libnvdimm] -> #0 (&dev->nvdimm_mutex/3){+.+.+.}: __lock_acquire+0x1107/0x1280 lock_acquire+0xf6/0x1f0 __mutex_lock+0x88/0x980 mutex_lock_nested+0x1b/0x20 nd_attach_ndns+0x178/0x1b0 [libnvdimm] nd_namespace_store+0x308/0x3c0 [libnvdimm] namespace_store+0x87/0x220 [libnvdimm] In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'. Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect nd_{attach,detach}_ndns() operations. Cc: <stable@vger.kernel.org> Fixes: `8c2f7e8658` ("libnvdimm: infrastructure for btt devices") Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-01 08:29:37 -07:00
Dan Williams	7138970383	mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery: Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur after devm_memremap_page_release(), since it now maintains its own elevated reference. This speeds up things while also making the pmem refcounting more robust going forward. Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-01 09:15:53 +02:00
Dan Williams	23f4984483	libnvdimm: rework region badblocks clearing Toshi noticed that the new support for a region-level badblocks missed the case where errors are cleared due to BTT I/O. An initial attempt to fix this ran into a "sleeping while atomic" warning due to taking the nvdimm_bus_lock() in the BTT I/O path to satisfy the locking requirements of __nvdimm_bus_badblocks_clear(). However, that lock is not needed since we are not acting on any data that is subject to change under that lock. The badblocks instance has its own internal lock to handle mutations of the error list. So, in order to make it clear that we are just acting on region devices, rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions(). Eliminate the lock and consolidate all support routines for the new nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the opportunity to cleanup to some unnecessary casts, make the calling convention of nvdimm_clear_badblocks_regions() clearer by replacing struct resource with the minimal struct clear_badblocks_context, and use the DEVICE_ATTR macro. Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-29 15:24:03 -07:00
Toshi Kani	8d13c02906	libnvdimm: fix clear length of nvdimm_forget_poison() ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length of error actually cleared, which may be smaller than its requested 'len'. Change nvdimm_clear_poison() to call nvdimm_forget_poison() with 'clear_err.cleared' when this value is valid. Cc: <stable@vger.kernel.org> Fixes: `e046114af5` ("libnvdimm: clear the internal poison_list when clearing badblocks") Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-28 15:56:26 -07:00
Toshi Kani	b2518c78ce	libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify The following BUG was observed when nd_pmem_notify() was called for a BTT device. The use of a pmem_device pointer is not valid with BTT. BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 IP: nd_pmem_notify+0x30/0xf0 [nd_pmem] Call Trace: nd_device_notify+0x40/0x50 child_notify+0x10/0x20 device_for_each_child+0x50/0x90 nd_region_notify+0x20/0x30 nd_device_notify+0x40/0x50 nvdimm_region_notify+0x27/0x30 acpi_nfit_scrub+0x341/0x590 [nfit] process_one_work+0x197/0x450 worker_thread+0x4e/0x4a0 kthread+0x109/0x140 Fix nd_pmem_notify() by setting nd_region and badblocks pointers properly for BTT. Cc: <stable@vger.kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Fixes: `719994660c` ("libnvdimm: async notification support") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-28 12:46:47 -07:00
Dan Williams	ab630891ce	libnvdimm, region: sysfs trigger for nvdimm_flush() The nvdimm_flush() mechanism helps to reduce the impact of an ADR (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing platform WPQ (write-pending-queue) buffers when power is removed. The nvdimm_flush() mechanism performs that same function on-demand. When a pmem namespace is associated with a block device, an nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH request. These requests are typically associated with filesystem metadata updates. However, when a namespace is in device-dax mode, userspace (think database metadata) needs another path to perform the same flushing. In other words this is not required to make data persistent, but in the case of metadata it allows for a smaller failure domain in the unlikely event of an ADR failure. The new 'deep_flush' attribute is visible when the individual DIMMs backing a given interleave-set are described by platform firmware. In ACPI terms this is "NVDIMM Region Mapping Structures" and associated "Flush Hint Address Structures". Reads return "1" if the region supports triggering WPQ flushes on all DIMMs. Reads return "0" the flush operation is a platform nop, and in that case the attribute is read-only. Why sysfs and not an ioctl? An ioctl requires establishing a new ioctl function number space for device-dax. Given that this would be called on a device-dax fd an application could be forgiven for accidentally calling this on a filesystem-dax fd. Placing this interface in libnvdimm sysfs removes that potential for collision with a filesystem ioctl, and it keeps ioctls out of the generic device-dax implementation. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-28 12:46:46 -07:00
Toshi Kani	97681f9b08	libnvdimm: fix phys_addr for nvdimm_clear_poison nvdimm_clear_poison() expects a physical address, not an offset. Fix nsio_rw_bytes() to call nvdimm_clear_poison() with a physical address. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-27 13:51:18 -07:00
Dan Williams	6abccd1bfe	x86, dax, pmem: remove indirection around memcpy_from_pmem() memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper serves no real benefit aside from affording a more generic function name than the x86-specific 'mcsafe'. However this would not be the first time that x86 terminology leaked into the global namespace. For lack of better name, just use memcpy_mcsafe() directly. This conversion also catches a place where we should have been using plain memcpy, acpi_nfit_blk_single_io(). Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-25 13:20:46 -07:00
Dan Williams	d4b29fd78e	block: remove block_device_operations ->direct_access() Now that all the producers and consumers of dax interfaces have been converted to using dax_operations on a dax_device, remove the block device direct_access enabling. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-25 13:20:46 -07:00
Dan Williams	bc042fdfbb	libnvdimm, region: fix flush hint detection crash In the case where a dimm does not have any associated flush hints the ndrd->flush_wpq array may be uninitialized leading to crashes with the following signature: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: region_visible+0x10f/0x160 [libnvdimm] Call Trace: internal_create_group+0xbe/0x2f0 sysfs_create_groups+0x40/0x80 device_add+0x2d8/0x650 nd_async_device_register+0x12/0x40 [libnvdimm] async_run_entry_fn+0x39/0x170 process_one_work+0x212/0x6c0 ? process_one_work+0x197/0x6c0 worker_thread+0x4e/0x4a0 kthread+0x10c/0x140 ? process_one_work+0x6c0/0x6c0 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x31/0x40 Cc: <stable@vger.kernel.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Fixes: `f284a4f237` ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-24 16:01:56 -07:00
Dan Williams	c1d6e828a3	pmem: add dax_operations support Setup a dax_device to have the same lifetime as the pmem block device and add a ->direct_access() method that is equivalent to pmem_direct_access(). Once fs/dax.c has been converted to use dax_operations the old pmem_direct_access() will be removed. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-19 15:14:35 -07:00
Dan Williams	e88da7998d	Revert "libnvdimm: band aid btt vs clear poison locking" This reverts commit `4aa5615e08` "libnvdimm: band aid btt vs clear poison locking". Now that poison list locking has been converted to a spinlock and poison list entry allocation during i/o has been converted to GFP_NOWAIT, revert the band-aid that disabled error clearing from btt i/o. Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-14 13:29:01 -07:00
Dave Jiang	b3b454f694	libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation The following warning results from holding a lane spinlock, preempt_disable(), or the btt map spinlock and then trying to take the reconfig_mutex to walk the poison list and potentially add new entries. BUG: sleeping function called from invalid context at kernel/locking/mutex. c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd [..] Call Trace: dump_stack+0x85/0xc8 ___might_sleep+0x184/0x250 __might_sleep+0x4a/0x90 __mutex_lock+0x58/0x9b0 ? nvdimm_bus_lock+0x21/0x30 [libnvdimm] ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm] ? acpi_nfit_forget_poison+0x79/0x80 [nfit] ? _raw_spin_unlock+0x27/0x40 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] nsio_rw_bytes+0x164/0x270 [libnvdimm] btt_write_pg+0x1de/0x3e0 [nd_btt] ? blk_queue_enter+0x30/0x290 btt_make_request+0x11a/0x310 [nd_btt] ? blk_queue_enter+0xb7/0x290 ? blk_queue_enter+0x30/0x290 generic_make_request+0x118/0x3b0 A spinlock is introduced to protect the poison list. This allows us to not having to acquire the reconfig_mutex for touching the poison list. The add_poison() function has been broken out into two helper functions. One to allocate the poison entry and the other to apppend the entry. This allows us to unlock the poison_lock in non-I/O path and continue to be able to allocate the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in order to satisfy being in atomic context. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-13 14:23:51 -07:00
Dave Jiang	006358b35c	libnvdimm: add support for clear poison list and badblocks for device dax Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR call. We will update the poison list and also the badblocks at region level if the region is in dax mode or in pmem mode and not active. In other words we force badblocks to be cleared through write requests if the address is currently accessed through a block device, otherwise it can only be done via the ioctl+dsm path. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-12 21:56:43 -07:00
Dave Jiang	802f4be6fe	libnvdimm: Add 'resource' sysfs attribute to regions Adding sysfs attribute in order to export the physical address of the region. This is for supporting of user app poison clear via ND_IOCTL_CLEAR_ERROR. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-12 21:56:42 -07:00
Dave Jiang	6a6bef9042	libnvdimm: add mechanism to publish badblocks at the region level badblocks sysfs file will be export at region level. When nvdimm event notifier happens for NVDIMM_REVALIATE_POISON, the badblocks in the region will be updated. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-12 21:56:42 -07:00
Dan Williams	4aa5615e08	libnvdimm: band aid btt vs clear poison locking The following warning results from holding a lane spinlock, preempt_disable(), or the btt map spinlock and then trying to take the reconfig_mutex to walk the poison list and potentially add new entries. BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd [..] Call Trace: dump_stack+0x85/0xc8 ___might_sleep+0x184/0x250 __might_sleep+0x4a/0x90 __mutex_lock+0x58/0x9b0 ? nvdimm_bus_lock+0x21/0x30 [libnvdimm] ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm] ? acpi_nfit_forget_poison+0x79/0x80 [nfit] ? _raw_spin_unlock+0x27/0x40 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] nsio_rw_bytes+0x164/0x270 [libnvdimm] btt_write_pg+0x1de/0x3e0 [nd_btt] ? blk_queue_enter+0x30/0x290 btt_make_request+0x11a/0x310 [nd_btt] ? blk_queue_enter+0xb7/0x290 ? blk_queue_enter+0x30/0x290 generic_make_request+0x118/0x3b0 As a minimal fix, disable error clearing when the BTT is enabled for the namespace. For the final fix a larger rework of the poison list locking is needed. Note that this is not a problem in the blk case since that path never calls nvdimm_clear_poison(). Cc: <stable@vger.kernel.org> Fixes: `82bf1037f2` ("libnvdimm: check and clear poison before writing to pmem") Cc: Dave Jiang <dave.jiang@intel.com> [jeff: dynamically disable error clearing in the btt case] Suggested-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-10 17:21:45 -07:00
Dan Williams	0beb2012a1	libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat Holding the reconfig_mutex over a potential userspace fault sets up a lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl path. Move the user access outside of the lock. [ INFO: possible circular locking dependency detected ] 4.11.0-rc3+ #13 Tainted: G W O ------------------------------------------------------- fallocate/16656 is trying to acquire lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] but task is already holding lock: (jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (jbd2_handle){++++..}: lock_acquire+0xbd/0x200 start_this_handle+0x16a/0x460 jbd2__journal_start+0xe9/0x2d0 __ext4_journal_start_sb+0x89/0x1c0 ext4_dirty_inode+0x32/0x70 __mark_inode_dirty+0x235/0x670 generic_update_time+0x87/0xd0 touch_atime+0xa9/0xd0 ext4_file_mmap+0x90/0xb0 mmap_region+0x370/0x5b0 do_mmap+0x415/0x4f0 vm_mmap_pgoff+0xd7/0x120 SyS_mmap_pgoff+0x1c5/0x290 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #1 (&mm->mmap_sem){++++++}: lock_acquire+0xbd/0x200 __might_fault+0x70/0xa0 __nd_ioctl+0x683/0x720 [libnvdimm] nvdimm_ioctl+0x8b/0xe0 [libnvdimm] do_vfs_ioctl+0xa8/0x740 SyS_ioctl+0x79/0x90 do_syscall_64+0x6c/0x200 return_from_SYSCALL_64+0x0/0x7a -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}: __lock_acquire+0x16b6/0x1730 lock_acquire+0xbd/0x200 __mutex_lock+0x88/0x9b0 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] pmem_do_bvec+0x1c2/0x2b0 [nd_pmem] pmem_make_request+0xf9/0x270 [nd_pmem] generic_make_request+0x118/0x3b0 submit_bio+0x75/0x150 Cc: <stable@vger.kernel.org> Fixes: `62232e45f4` ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices") Cc: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-10 17:21:45 -07:00
Dan Williams	fe514739d8	libnvdimm: fix blk free space accounting Commit `a1f3e4d6a0` "libnvdimm, region: update nd_region_available_dpa() for multi-pmem support" reworked blk dpa (DIMM Physical Address) accounting to comprehend multiple pmem namespace allocations aliasing with a given blk-dpa range. The following call trace is a result of failing to account for allocated blk capacity. WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names 4 size_store+0x6f3/0x930 [libnvdimm] nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 size_store+0x6f3/0x930 [libnvdimm] dev_attr_store+0x18/0x30 If a given blk-dpa allocation does not alias with any pmem ranges then the full allocation should be accounted as busy space, not the size of the current pmem contribution to the region. The thinkos that led to this confusion was not realizing that the struct resource management is already guaranteeing no collisions between pmem allocations and blk allocations on the same dimm. Also, we do not try to support blk allocations in aliased pmem holes. This patch also fixes a case where the available blk goes negative. Cc: <stable@vger.kernel.org> Fixes: `a1f3e4d6a0` ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support"). Reported-by: Dariusz Dokupil <dariusz.dokupil@intel.com> Reported-by: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-04 15:08:36 -07:00
Dan Williams	86ef58a4e3	nfit, libnvdimm: fix interleave set cookie calculation The interleave-set cookie is a sum that sanity checks the composition of an interleave set has not changed from when the namespace was initially created. The checksum is calculated by sorting the DIMMs by their location in the interleave-set. The comparison for the sort must be 64-bit wide, not byte-by-byte as performed by memcmp() in the broken case. Fix the implementation to accept correct cookie values in addition to the Linux "memcmp" order cookies, but only allow correct cookies to be generated going forward. It does mean that namespaces created by third-party-tooling, or created by newer kernels with this fix, will not validate on older kernels. However, there are a couple mitigating conditions: 1/ platforms with namespace-label capable NVDIMMs are not widely available. 2/ interleave-sets with a single-dimm are by definition not affected (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case. The cookie stored in the namespace label will be fixed by any write the namespace label, the most straightforward way to achieve this is to write to the "alt_name" attribute of a namespace in sysfs. Cc: <stable@vger.kernel.org> Fixes: `eaf961536e` ("libnvdimm, nfit: add interleave-set state-tracking infrastructure") Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-03-01 00:49:42 -08:00
Dan Williams	bfb34527a3	libnvdimm, pfn: fix memmap reservation size versus 4K alignment When vmemmap_populate() allocates space for the memmap it does so in 2MB sized chunks. The libnvdimm-pfn driver incorrectly accounts for this when the alignment of the device is set to 4K. When this happens we trigger memory allocation failures in altmap_alloc_block_buf() and trigger warnings of the form: WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0 [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 arch_add_memory+0xe4/0xf0 devm_memremap_pages+0x29b/0x4e0 Fixes: `315c562536` ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-02-04 14:47:31 -08:00
Dan Williams	9d032f4201	libnvdimm, namespace: do not delete namespace-id 0 Given that the naming of pmem devices changes from the pmemX form to the pmemX.Y form when namespace id is greater than 0, arrange for namespaces with id-0 to be exempt from deletion. Otherwise a simple reconfiguration of an existing namespace to a new mode results in a name change of the resulting block device: # ndctl list --namespace=namespace1.0 { "dev":"namespace1.0", "mode":"raw", "size":2147483648, "uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3", "blockdev":"pmem1" } # ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force { "dev":"namespace1.1", "mode":"memory", "size":2111832064, "uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613", "blockdev":"pmem1.1" } This change does require tooling changes to explicitly look for namespaceX.0 if the seed has already advanced to another namespace. Cc: <stable@vger.kernel.org> Fixes: `98a29c39dc` ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region") Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-31 18:18:21 -08:00
Bhumika Goyal	970d14e398	nvdimm: constify device_type structures Declare device_type structure as const as it is only stored in the type field of a device structure. This field is of type const, so add const to declaration of device_type structure. File size before: text data bss dec hex filename 19278 3199 16 22493 57dd nvdimm/namespace_devs.o File size after: text data bss dec hex filename 19929 3160 16 23105 5a41 nvdimm/namespace_devs.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-31 18:16:30 -08:00
Dan Williams	1f19b983a8	libnvdimm, namespace: fix pmem namespace leak, delete when size set to zero Commit `98a29c39dc` ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region") added support for establishing additional pmem namespace beyond the seed device, similar to blk namespaces. However, it neglected to delete the namespace when the size is set to zero. Fixes: `98a29c39dc` ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-13 09:50:33 -08:00
Stefan Hajnoczi	d47d1d27fd	pmem: return EIO on read_pmem() failure The read_pmem() function uses memcpy_mcsafe() on x86 where an EFAULT error code indicates a failed read. Block I/O should use EIO to indicate failure. Other pmem code paths (like bad blocks) already use EIO so let's be consistent. This fixes compatibility with consumers like btrfs that try to parse the specific error code rather than treat all errors the same. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-12 16:40:29 -08:00
Linus Torvalds	3be134e515	libnvdimm for 4.10 * Dynamic label support: To date namespace label support has been limited to disambiguating cases where PMEM (direct load/store) and BLK (mmio aperture) accessed-capacity alias on the same DIMM. Since 4.9 added support for multiple namespaces per PMEM-region there is value to support namespace labels even in the non-aliasing case. The presence of a valid namespace index block force-enables label support when the kernel would otherwise rely on region boundaries, and permits the region to be sub-divided. * Handle media errors in namespace metadata: Complement the error handling for media errors in namespace data areas with support for clearing errors on writes, and downgrading potential machine-check exceptions to simple i/o errors on read. * Device-DAX region attributes: Add 'align', 'id', and 'size' as attributes for device-dax regions. In particular this enables userspace tooling to generically size memory mapping and i/o operations. Prevent userspace from growing assumptions / dependencies about the parent device topology for a dax region. A libnvdimm namespace may not always be the parent device of a dax region. * Various cleanups and small fixes. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJYVxRcAAoJEB7SkWpmfYgCSnoQAKNs7IJRtGqKFCkMB5VDAW79 Lifi35Hqm+mfFaLMIyRoKJMnhXKQBsthfcHZIy5t63kDWEV/tJ+riZ+m2Ibz4GPE AUn07jxge2q9v0RwMylqpZ/EHyaK/xD1z0+SdIwVF2VaEDoVdnmu2WEqjuir/lfM t70B/YoWLg6CcHeTzaV5vdxlRKyG4pzOV3UzPlYLROr3wFJBHD88j/4zcGy3F1ue DpzGThhtEQXZUZCJLp54E0ch7V2cz/DNUDYwsjbCZrdYYmUJFqjZQxNyftn28aEJ HI/BERXLhm2HF2qRqC+0C1lJmUylYk4wXUGco+b/u1GYN59eFe/JT3dJgxEHFKL2 nVOUmR8XsRE6gkFWQGcFesRjjI1Rqz2Wgql7oqkSL9grGOwY4rvKX0LhJxgvfuZ0 Itb5h0dMFUriAP0DaaJFMYfANIOx+mqr4RxDMwP5ZADku6+Ga0LXxvcBt06K/mYO /zLo27MhsuuFy/odXm1A21OjFiH2pM3pSCaytKvDjy7DwzgE7ZzXmixXQ7j8YVp6 OtLYzMjIt8nt4xh0hmav5tb0r9l6mgNqlifacMC9lEM/7SDAiBXXBLuQngJN/j0s iXllBk0pYVWibf3VcD6oY+qKdmBWvXxPjgj1lHE6j9A7Gw2jwIt/W2NWzV94nKmC f6d+faHiRokqyVhdgfx3 =YWfP -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The libnvdimm pull request is relatively small this time around due to some development topics being deferred to 4.11. As for this pull request the bulk of it has been in -next for several releases leading to one late fix being added (commit `868f036fee` ("libnvdimm: fix mishandled nvdimm_clear_poison() return value")). It has received a build success notification from the 0day-kbuild robot and passes the latest libnvdimm unit tests. Summary: - Dynamic label support: To date namespace label support has been limited to disambiguating cases where PMEM (direct load/store) and BLK (mmio aperture) accessed-capacity alias on the same DIMM. Since 4.9 added support for multiple namespaces per PMEM-region there is value to support namespace labels even in the non-aliasing case. The presence of a valid namespace index block force-enables label support when the kernel would otherwise rely on region boundaries, and permits the region to be sub-divided. - Handle media errors in namespace metadata: Complement the error handling for media errors in namespace data areas with support for clearing errors on writes, and downgrading potential machine-check exceptions to simple i/o errors on read. - Device-DAX region attributes: Add 'align', 'id', and 'size' as attributes for device-dax regions. In particular this enables userspace tooling to generically size memory mapping and i/o operations. Prevent userspace from growing assumptions / dependencies about the parent device topology for a dax region. A libnvdimm namespace may not always be the parent device of a dax region. - Various cleanups and small fixes" * tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: add region 'id', 'size', and 'align' attributes libnvdimm: fix mishandled nvdimm_clear_poison() return value libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held libnvdimm, pfn: fix align attribute libnvdimm, e820: use module_platform_driver libnvdimm, namespace: use octal for permissions libnvdimm, namespace: avoid multiple sector calculations libnvdimm: remove else after return in nsio_rw_bytes() libnvdimm, namespace: fix the type of name variable libnvdimm: use consistent naming for request_mem_region() nvdimm: use the right length of "pmem" libnvdimm: check and clear poison before writing to pmem tools/testing/nvdimm: dynamic label support libnvdimm: allow a platform to force enable label support libnvdimm: use generic iostat interfaces	2016-12-18 15:49:10 -08:00
Dan Williams	c44ef859ce	Merge branch 'for-4.10/libnvdimm' into libnvdimm-for-next	2016-12-17 15:08:10 -08:00
Dan Williams	868f036fee	libnvdimm: fix mishandled nvdimm_clear_poison() return value Colin, via static analysis, reports that the length could be negative from nvdimm_clear_poison() in the error case. There was a similar problem with commit `0a3f27b9a6` "libnvdimm, namespace: avoid multiple sector calculations" that I noticed when merging the for-4.10/libnvdimm topic branch into libnvdimm-for-next, but I missed this one. Fix both of them to the following procedure: * if we clear a block's worth of media, clear that many blocks in badblocks * if we clear less than the requested size of the transfer return an error * always invalidate cache after any non-error / non-zero nvdimm_clear_poison result Fixes: `82bf1037f2` ("libnvdimm: check and clear poison before writing to pmem") Fixes: `0a3f27b9a6` ("libnvdimm, namespace: avoid multiple sector calculations") Cc: Fabian Frederick <fabf@skynet.be> Cc: Dave Jiang <dave.jiang@intel.com> Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-16 08:10:31 -08:00
Dan Williams	9cf8bd529c	libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held For warnings that should only ever trigger during development and testing replace WARN statements with lockdep_assert_held. The lockdep pattern is prevalent, and these paths are are well covered by libnvdimm unit tests. Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-15 20:04:31 -08:00
Linus Torvalds	e7aa8c2eb1	These are the documentation changes for 4.10. It's another busy cycle for the docs tree, as the sphinx conversion continues. Highlights include: - Further work on PDF output, which remains a bit of a pain but should be more solid now. - Five more DocBook template files converted to Sphinx. Only 27 to go... Lots of plain-text files have also been converted and integrated. - Images in binary formats have been replaced with more source-friendly versions. - Various bits of organizational work, including the renaming of various files discussed at the kernel summit. - New documentation for the device_link mechanism. ...and, of course, lots of typo fixes and small updates. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9 sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/ T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7 RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf pmSJR0hVeRUmA4uw6+Su =A0EV -----END PGP SIGNATURE----- Merge tag 'docs-4.10' of git://git.lwn.net/linux Pull documentation update from Jonathan Corbet: "These are the documentation changes for 4.10. It's another busy cycle for the docs tree, as the sphinx conversion continues. Highlights include: - Further work on PDF output, which remains a bit of a pain but should be more solid now. - Five more DocBook template files converted to Sphinx. Only 27 to go... Lots of plain-text files have also been converted and integrated. - Images in binary formats have been replaced with more source-friendly versions. - Various bits of organizational work, including the renaming of various files discussed at the kernel summit. - New documentation for the device_link mechanism. ... and, of course, lots of typo fixes and small updates" * tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits) dma-buf: Extract dma-buf.rst Update Documentation/00-INDEX docs: 00-INDEX: document directories/files with no docs docs: 00-INDEX: remove non-existing entries docs: 00-INDEX: add missing entries for documentation files/dirs docs: 00-INDEX: consolidate process/ and admin-guide/ description scripts: add a script to check if Documentation/00-INDEX is sane Docs: change sh -> awk in REPORTING-BUGS Documentation/core-api/device_link: Add initial documentation core-api: remove an unexpected unident ppc/idle: Add documentation for powersave=off Doc: Correct typo, "Introdution" => "Introduction" Documentation/atomic_ops.txt: convert to ReST markup Documentation/local_ops.txt: convert to ReST markup Documentation/assoc_array.txt: convert to ReST markup docs-rst: parse-headers.pl: cleanup the documentation docs-rst: fix media cleandocs target docs-rst: media/Makefile: reorganize the rules docs-rst: media: build SVG from graphviz files docs-rst: replace bayer.png by a SVG image ...	2016-12-12 21:58:13 -08:00
Dan Williams	af7d9f0c57	libnvdimm, pfn: fix align attribute Fix the format specifier so that the attribute can be parsed correctly. Currently it returns decimal 1000 for a 4096-byte alignment. Cc: <stable@vger.kernel.org> Reported-by: Dave Jiang <dave.jiang@intel.com> Fixes: `315c562536` ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-10 08:12:05 -08:00
Dan Williams	efda1b5d87	acpi, nfit, libnvdimm: fix / harden ars_status output length handling Given ambiguities in the ACPI 6.1 definition of the "Output (Size)" field of the ARS (Address Range Scrub) Status command, a firmware implementation may in practice return 0, 4, or 8 to indicate that there is no output payload to process. The specification states "Size of Output Buffer in bytes, including this field.". However, 'Output Buffer' is also the name of the entire payload, and earlier in the specification it states "Max Query ARS Status Output Buffer Size: Maximum size of buffer (including the Status and Extended Status fields)". Without this fix if the BIOS happens to return 0 it causes memory corruption as evidenced by this result from the acpi_nfit_ctl() unit test. ars_status00000000: 00020000 00000000 ........ BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff) kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC task: ffff8803332d2ec0 task.stack: ffffc9000174c000 RIP: 0010:[<ffffffff814cfe72>] [<ffffffff814cfe72>] __memcpy+0x12/0x20 RSP: 0018:ffffc9000174f9a8 EFLAGS: 00010246 RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56 RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000 RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8 R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0 FS: 00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0 Stack: ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac 0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000 0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246 Call Trace: [<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit] [<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test] Cc: <stable@vger.kernel.org> Fixes: `747ffe11b4` ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-06 16:08:10 -08:00
Johannes Thumshirn	3a71b3c894	libnvdimm, e820: use module_platform_driver Use module_platform_driver for the e820 driver instead of open-coding it. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-05 08:52:21 -08:00
Fabian Frederick	b44fe76043	libnvdimm, namespace: use octal for permissions According to commit `f90774e1fd` ("checkpatch: look for symbolic permissions and suggest octal instead") Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-04 10:54:08 -08:00
Fabian Frederick	0a3f27b9a6	libnvdimm, namespace: avoid multiple sector calculations Use sector_t for cleared Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-04 10:48:58 -08:00
Fabian Frederick	d37806dc37	libnvdimm: remove else after return in nsio_rw_bytes() else after return is not needed. Signed-off-by: Fabian Frederick <fabf@skynet.be> [djbw: removed some now unnecessary newlines] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-04 10:45:13 -08:00
Nicolas Iooss	238b323a68	libnvdimm, namespace: fix the type of name variable In create_namespace_blk(), the local variable "name" is defined as an array of NSLABEL_NAME_LEN pointers: char *name[NSLABEL_NAME_LEN]; This variable is then used in calls to memcpy() and kmemdup() as if it were char[NSLABEL_NAME_LEN]. Remove the star in the variable definition to makes it look right. Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-28 13:41:17 -08:00
Dan Williams	450c6633e8	libnvdimm: use consistent naming for request_mem_region() Here is an example /proc/iomem listing for a system with 2 namespaces, one in "sector" mode and one in "memory" mode: 1fc000000-2fbffffff : Persistent Memory (legacy) 1fc000000-2fbffffff : namespace1.0 340000000-34fffffff : Persistent Memory 340000000-34fffffff : btt0.1 Here is the corresponding ndctl listing: # ndctl list [ { "dev":"namespace1.0", "mode":"memory", "size":4294967296, "blockdev":"pmem1" }, { "dev":"namespace0.0", "mode":"sector", "size":267091968, "uuid":"f7594f86-badb-4592-875f-ded577da2eaf", "sector_size":4096, "blockdev":"pmem0s" } ] Notice that the ndctl listing is purely in terms of namespace devices, while the iomem listing leaks the internal "btt0.1" implementation detail. Given that ndctl requires the namespace device name to change the mode, for example: # ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force ...use the namespace name in the iomem listing to keep the claiming device name consistent across different mode settings. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-28 11:15:18 -08:00
Jonathan Corbet	917fef6f7e	Linux 4.9-rc4 -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJYHmoCAAoJEHm+PkMAQRiG7RMIAI2i7Y5hpL5yCxK5AFaL4u/G KxXfp1B1UanUTgjOmd7zGqtDYcFX9t7GTTUFixQ7/9Opr4PD9qbnatoDGSc3xjbT msDgA1B78F1/Q3kHWfeGq32MihQ4mj5NwUCo+igUcUvvWG7mHgzErj/Nh5RoobQX p/izdpTbrw3GX6xXB8olbG7XWHaVye/+TT3q6+gmgm8I/QEujcLeGoycE0zlhPN8 FG/JX76At/+ZM2Py7Oxo3k+oKL9CHrtOQYDp/wN0uslV5eYvvkZz0/M1HMOGZt+c gZU5jzM17K7C4Nzo06WAuBU9wUBGc25m+cPicLlOmljnzfU+f50SKaDjZq3p7QI= =2KUF -----END PGP SIGNATURE----- Merge tag 'v4.9-rc4' into sound Bring in -rc4 patches so I can successfully merge the sound doc changes.	2016-11-18 16:13:41 -07:00
Nicolas Iooss	2d9a02744f	nvdimm: use the right length of "pmem" In order to test that the name of a resource begins with "pmem", call strncmp() with 4 as length instead of 3 to match the whole prefix. Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-11 20:37:42 -08:00
Dave Jiang	82bf1037f2	libnvdimm: check and clear poison before writing to pmem We need to clear any poison when we are writing to pmem. The granularity will be sector size. If it's less then we can't do anything about it barring corruption. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> [djbw: fixup 0-length write request to succeed] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-11 20:35:31 -08:00
Arnd Bergmann	867dfe3421	nvdimm: make CONFIG_NVDIMM_DAX 'bool' A bugfix just tried to address a randconfig build problem and introduced a variant of the same problem: with CONFIG_LIBNVDIMM=y and CONFIG_NVDIMM_DAX=m, the nvdimm module now fails to link: drivers/nvdimm/built-in.o: In function `to_nd_device_type': bus.c:(.text+0x1b5d): undefined reference to `is_nd_dax' drivers/nvdimm/built-in.o: In function `nd_region_notify_driver_action.constprop.2': region_devs.c:(.text+0x6b6c): undefined reference to `is_nd_dax' region_devs.c:(.text+0x6b8c): undefined reference to `to_nd_dax' drivers/nvdimm/built-in.o: In function `nd_region_probe': region.c:(.text+0x70f3): undefined reference to `nd_dax_create' drivers/nvdimm/built-in.o: In function `mode_show': namespace_devs.c:(.text+0xa196): undefined reference to `is_nd_dax' drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe': (.text+0xa55f): undefined reference to `is_nd_dax' drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe': (.text+0xa56e): undefined reference to `to_nd_dax' This reverts the earlier fix, making NVDIMM_DAX a 'bool' option again as it should be (it gets linked into the libnvdimm module). To fix the original problem, I'm adding a dependency on LIBNVDIMM to DEV_DAX_PMEM, which ensures we can't have that one built-in if the rest is a module. Fixes: `4e65e9381c` ("/dev/dax: fix Kconfig dependency build breakage") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-27 16:16:21 -07:00
Mauro Carvalho Chehab	8c27ceff36	docs: fix locations of several documents that got moved The previous patch renamed several files that are cross-referenced along the Kernel documentation. Adjust the links to point to the right places. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>	2016-10-24 08:12:35 -02:00
Toshi Kani	3115bb02b5	pmem: report error on clear poison failure ACPI Clear Uncorrectable Error DSM function may fail or may be unsupported on a platform. pmem_clear_poison() returns without clearing badblocks in such cases. This failure is detected at the next read (-EIO). This behavior can lead to an issue when user keeps writing but does not read immediately. For instance, flight recorder file may be only read when it is necessary for troubleshooting. Change pmem_do_bvec() and pmem_clear_poison() to return -EIO so that filesystem can log an error message on a write error. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 10:35:52 -07:00
Dan Carpenter	75d29713b7	libnvdimm, namespace: potential NULL deref on allocation error If the kcalloc() fails then "devs" can be NULL and we dereference it checking "devs[i]". Fixes: `1b40e09a12` ('libnvdimm: blk labels and namespace instantiation') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 10:35:51 -07:00
Dan Williams	42237e393f	libnvdimm: allow a platform to force enable label support Platforms like QEMU-KVM implement an NFIT table and label DSMs. However, since that environment does not define an aliased configuration, the labels are currently ignored and the kernel registers a single full-sized pmem-namespace per region. Now that the kernel supports sub-divisions of pmem regions the labels have a purpose. Arrange for the labels to be honored when we find an existing / valid namespace index block. Cc: <qemu-devel@nongnu.org> Cc: Haozhong Zhang <haozhong.zhang@intel.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 08:57:33 -07:00
Toshi Kani	8d7c22ac0c	libnvdimm: use generic iostat interfaces nd_iostat_start() and nd_iostat_end() implement the same functionality that generic_start_io_acct() and generic_end_io_acct() already provide. Change nd_iostat_start() and nd_iostat_end() to call the generic iostat interfaces. There is no change in the nd interfaces. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 08:53:26 -07:00
Dan Williams	e476f94482	Merge branch 'for-4.9/dax' into libnvdimm-for-next	2016-10-07 16:46:30 -07:00
Dan Williams	178d6f4be8	Merge branch 'for-4.9/libnvdimm' into libnvdimm-for-next	2016-10-07 16:46:24 -07:00
Ross Zwisler	4e65e9381c	/dev/dax: fix Kconfig dependency build breakage The function dax_pmem_probe() in drivers/dax/pmem.c is compiled under the CONFIG_DEV_DAX_PMEM tri-state config option. This config option currently only depends on CONFIG_NVDIMM_DAX, a bool, which means that the following configuration is possible: CONFIG_LIBNVDIMM=m ... CONFIG_NVDIMM_DAX=y CONFIG_DEV_DAX=y CONFIG_DEV_DAX_PMEM=y With this config LIBNVDIMM is compiled as a module with NVDIMM_DAX=y just meaning that we will compile drivers/nvdimm/dax_devs.c into that module. However, dax_pmem_probe() depends on several symbols defined in drivers/nvdimm/dax_devs.c, which results in the following build errors: drivers/built-in.o: In function `dax_pmem_probe': linux/drivers/dax/pmem.c:70: undefined reference to `to_nd_dax' linux/drivers/dax/pmem.c:74: undefined reference to `nvdimm_namespace_common_probe' linux/drivers/dax/pmem.c:80: undefined reference to `devm_nsio_enable' linux/drivers/dax/pmem.c:81: undefined reference to `nvdimm_setup_pfn' linux/drivers/dax/pmem.c:84: undefined reference to `devm_nsio_disable' linux/drivers/dax/pmem.c:122: undefined reference to `to_nd_region' drivers/built-in.o: In function `dax_pmem_init': linux/drivers/dax/pmem.c:147: undefined reference to `__nd_driver_register' Fix this by making NVDIMM_DAX a tristate. DEV_DAX_PMEM depends on NVDIMM_DAX which depends on LIBNVDIMM. Since they are all now tristates, if LIBNVDIMM is built as a kernel module DEV_DAX_PMEM will be as well. This prevents dax_devs.c from being built as a built-in while its dependencies are in the libnvdimm.ko module. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 16:46:00 -07:00
Dan Williams	98a29c39dc	libnvdimm, namespace: allow creation of multiple pmem-namespaces per region Similar to BLK regions, publish new seed namespace devices to allow unused PMEM region capacity to be consumed by additional namespaces. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	991d9020f3	libnvdimm, namespace: lift single pmem limit in scan_labels() Now that the rest of the infrastructure has been converted to handle multi-pmem configurations, lift the artificial barrier at scan time. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	c969e24c1b	libnvdimm, namespace: filter out of range labels in scan_labels() Short-circuit doomed-to-fail label validation attempts by skipping labels that are outside the given region. For example a DIMM that has multiple PMEM regions will waste time attempting to create namespaces only to find that the interleave-set-cookie does not validate, e.g.: nd_region region6: invalid cookie in label: 73e608dc-47b9-4b2a-b5c7-2d55a32e0c2 Similar to how we skip BLK labels when performing PMEM validation we can skip out-of-range labels early. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	762d067dba	libnvdimm, namespace: enable allocation of multiple pmem namespaces Now that we have nd_region_available_dpa() able to handle the presence of multiple PMEM allocations in aliased PMEM regions, reuse that same infrastructure to track allocations from free space. In particular handle allocating from an aliased PMEM region in the case where there are dis-contiguous holes. The allocation for BLK and PMEM are documented in the space_valid() helper: BLK-space is valid as long as it does not precede a PMEM allocation in a given region. PMEM-space must be contiguous and adjacent to an existing existing allocation (if one exists). Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	16660eaea0	libnvdimm, namespace: update label implementation for multi-pmem Instead of assuming that there will only ever be one allocated range at the start of the region, account for additional namespaces that might start at an offset from the region base. After this change pmem namespaces now have a reason to carry an array of resources similar to blk. Unifying the resource tracking infrastructure in nd_namespace_common is a future cleanup candidate. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	012207334a	libnvdimm, namespace: expand pmem device naming scheme for multi-pmem pmem devices are currently named /dev/pmem<region-index>. Preserve the naming of the 0th device, but add a ".<namespace-index>" for other devices. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	a1f3e4d6a0	libnvdimm, region: update nd_region_available_dpa() for multi-pmem support The free dpa (dimm-physical-address) space calculation reports how much free space is available with consideration for aliased BLK + PMEM regions. Recall that BLK capacity is allocated from high addresses and PMEM is allocated from low addresses in their respective regions. nd_region_available_dpa() accounts for the fact that the largest encroachment (lowest starting address) into PMEM capacity by a BLK allocation limits the available capacity to that point, regardless if there is BLK allocation hole at a higher address. Similarly, for the multi-pmem case we need to track the largest encroachment (highest ending address) of a PMEM allocation in BLK capacity regardless of whether there is an allocation hole that a BLK allocation could fill at a lower address. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:20:53 -07:00
Dan Williams	6ff3e912d3	libnvdimm, namespace: sort namespaces by dpa at init Add more determinism to initial namespace device-name assignments by sorting the namespaces by starting dpa. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:20:53 -07:00
Dan Williams	0e3b0d123c	libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time If label scanning finds multiple valid pmem namespaces allow them to be surfaced rather than fail namespace scanning. Support for creating multiple namespaces per region is saved for a later patch. Note that this adds some new error messages to clarify which of the pmem namespaces in the set are potentially impacted by invalid labels. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:20:53 -07:00
Dan Williams	8a5f50d3b7	libnvdimm, namespace: unify blk and pmem label scanning In preparation for allowing multiple namespace per pmem region, unify blk and pmem label scanning. Given that blk regions already support multiple namespaces, teaching that path how to do pmem namespace scanning is an incremental step towards multiple pmem namespace support. This should be functionally equivalent to the previous state in that stops after finding the first valid pmem label set. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-05 20:24:18 -07:00
Dan Williams	f95b4bca9e	libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper The ability to translate a generic struct device pointer into a namespace uuid is a useful utility as we go to unify the blk and pmem label scanning paths. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-05 20:24:18 -07:00
Dan Williams	ae8219f186	libnvdimm, label: convert label tracking to a linked list In preparation for enabling multiple namespaces per pmem region, convert the label tracking to use a linked list. In particular this will allow select_pmem_id() to move labels from the unvalidated state to the validated state. Currently we only track one validated set per-region. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 19:13:42 -07:00
Dan Williams	44c462eb9e	libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc Before we add more libnvdimm-private fields to nd_mapping make it clear which parameters are input vs libnvdimm internals. Use struct nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make struct nd_mapping private to libnvdimm. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 19:13:42 -07:00
Dave Jiang	db58028ee4	nvdimm: reduce duplicated wpq flushes Existing implemenetation writes to all the flush hint addresses for a given ND region. This is not necessary as the flushes are per imc and not per DIMM. Search the mappings and clear out the duplicates at init to avoid multiple flush to the same imc. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 19:12:52 -07:00
Vishal Verma	e046114af5	libnvdimm: clear the internal poison_list when clearing badblocks nvdimm_clear_poison cleared the user-visible badblocks, and sent commands to the NVDIMM to clear the areas marked as 'poison', but it neglected to clear the same areas from the internal poison_list which is used to marshal ARS results before sorting them by namespace. As a result, once on-demand ARS functionality was added: `37b137f` nfit, libnvdimm: allow an ARS scrub to be triggered on demand A scrub triggered from either sysfs or an MCE was found to be adding stale entries that had been cleared from gendisk->badblocks, but were still present in nvdimm_bus->poison_list. Additionally, the stale entries could be triggered into producing stale disk->badblocks by simply disabling and re-enabling the namespace or region. This adds the missing step of clearing poison_list entries when clearing poison, so that it is always in sync with badblocks. Fixes: `37b137f` ("nfit, libnvdimm: allow an ARS scrub to be triggered on demand") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 17:03:45 -07:00
Vishal Verma	bd697a80c3	pmem: reduce kmap_atomic sections to the memcpys only pmem_do_bvec used to kmap_atomic at the begin, and only unmap at the end. Things like nvdimm_clear_poison may want to do nvdimm subsystem bookkeeping operations that may involve taking locks or doing memory allocations, and we can't do that from the atomic context. Reduce the atomic context to just what needs it - the memcpy to/from pmem. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 17:03:45 -07:00
Dan Williams	595c73071e	libnvdimm, region: fix flush hint table thinko The definition of the flush hint table as: void __iomem *flush_wpq[0][0]; ...passed the unit test, but is broken as flush_wpq[0][1] and flush_wpq[1][0] refer to the same entry. Fix this to use a helper that calculates a slot in the table based on the geometry of flush hints in the region. This is important to get right since virtualization solutions use this mechanism to trigger hypervisor flushes to platform persistence. Reported-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-24 11:45:38 -07:00
Dave Jiang	a0056afe21	nvdimm: remove duplicate nd_mapping declaration Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-21 15:28:29 -07:00
Dan Williams	4765218db7	libnvdimm, namespace: debug invalid interleave-set-cookie values If platform firmware fails to populate unique / non-zero serial number data for each nvdimm in an interleave-set it may cause pmem region initialization to fail. Add a debug message for this case. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-21 09:36:36 -07:00
Dan Williams	ecfb6d8a04	libnvdimm: fix devm_nvdimm_memremap() error path The internal alloc_nvdimm_map() helper might fail, particularly if the memory region is already busy. Report request_mem_region() failures and check for the failure. Reported-by: Ryan Chen <ryan.chan105@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-21 09:35:15 -07:00
Oliver O'Halloran	480b6837aa	nvdimm: fix PHYS_PFN/PFN_PHYS mixup nd_activate_region() iomaps any hint addresses required when activating a region. To prevent duplicate mappings it checks the PFN of the hint to be mapped against the PFNs of the already mapped hints. Unfortunately it doesn't convert the PFN back into a physical address before passing it to devm_nvdimm_ioremap(). Instead it applies PHYS_PFN a second time which ends about as well as you would imagine. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-19 08:54:27 -07:00
Dave Jiang	1e8b8d9619	libnvdimm: allow legacy (e820) pmem region to clear bad blocks Bad blocks can be injected via /sys/block/pmemN/badblocks. In a situation where legacy pmem is being used or a pmem region created by using memmap kernel parameter, the injected bad blocks are not cleared due to nvdimm_clear_poison() failing from lack of ndctl function pointer. In this case we need to just return as handled and allow the bad blocks to be cleared rather than fail. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-09 17:34:46 -07:00
Toshi Kani	aee6598748	libnvdimm: Fix nvdimm_probe error on NVDIMM-N 'ndctl list --buses --dimms' does not list any NVDIMM-Ns since they are considered as idle. ndctl checks if any driver is attached to nmem device. nvdimm_probe() always fails in nvdimm_init_nsarea() since NVDIMM-Ns do not implement optinal ND_CMD_GET_CONFIG_DATA command. Change nvdimm_probe() to accept the case that the CONFIG_DATA command is not implemented for NVDIMM-Ns. The driver attaches without ndd, which keeps it no-op to the device. Reported-by: Brian Boylston <brian.boylston@hpe.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-01 18:20:39 -07:00
Geert Uytterhoeven	ae551e9ca2	nvdimm: Spelling s/unacknoweldged/unacknowledged/ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-01 18:20:39 -07:00
Dan Williams	ba9c8dd3c2	acpi, nfit: add dimm device notification support Per "ACPI 6.1 Section 9.20.3" NVDIMM devices, children of the ACPI0012 NVDIMM Root device, can receive health event notifications. Given that these devices are precluded from registering a notification handler via acpi_driver.acpi_device_ops (due to no _HID), we use acpi_install_notify_handler() directly. The registered handler, acpi_nvdimm_notify(), triggers a poll(2) event on the nmemX/nfit/flags sysfs attribute when a health event notification is received. Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-08-29 14:55:17 -07:00
Vishal Verma	abe8b4e3ce	nvdimm, btt: add a size attribute for BTTs To be consistent with other namespaces, expose a 'size' attribute for BTT devices also. Cc: Dan Williams <dan.j.williams@intel.com> Reported-by: Linda Knippers <linda.knippers@hpe.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-08-08 09:26:14 -07:00
Jens Axboe	1eff9d322a	block: rename bio bi_rw to bi_opf Since commit `63a4cc2486`, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Jens Axboe	c11f0c0b5b	block/mm: make bdev_ops->rw_page() take a bool for read/write Commit `abf545484d` changed it from an 'rw' flags type to the newer ops based interface, but now we're effectively leaking some bdev internals to the rest of the kernel. Since we only care about whether it's a read or a write at that level, just pass in a bool 'is_write' parameter instead. Then we can also move op_is_write() and friends back under CONFIG_BLOCK protection. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Mike Christie	abf545484d	mm/block: convert rw_page users to bio op use The rw_page users were not converted to use bio/req ops. As a result bdev_write_page is not passing down REQ_OP_WRITE and the IOs will be sent down as reads. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixes: `4e1b2d52a8` ("block, fs, drivers: remove REQ_OP compat defs and related code") Modified by me to: 1) Drop op_flags passing into ->rw_page(), as we don't use it. 2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-04 14:25:33 -06:00
Linus Torvalds	f0c98ebc57	libnvdimm for 4.8 1/ Replace pcommit with ADR / directed-flushing: The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. 2/ On-demand ARS (address range scrub): Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. 3/ Support for the Microsoft DSM (device specific method) command format. 4/ Support for EDK2/OVMF virtual disk device memory ranges. 5/ Various fixes and cleanups across the subsystem. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV cN3edFVIdxvZeMgM5Ubq =xCBG -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date\|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...	2016-07-28 17:38:16 -07:00
Linus Torvalds	3fc9d69093	Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "This branch also contains core changes. I've come to the conclusion that from 4.9 and forward, I'll be doing just a single branch. We often have dependencies between core and drivers, and it's hard to always split them up appropriately without pulling core into drivers when that happens. That said, this contains: - separate secure erase type for the core block layer, from Christoph. - set of discard fixes, from Christoph. - bio shrinking fixes from Christoph, as a followup up to the op/flags change in the core branch. - map and append request fixes from Christoph. - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty exciting! - nvme-loop fixes from Arnd. - removal of ->driverfs_dev from Dan, after providing a device_add_disk() helper. - bcache fixes from Bhaktipriya and Yijing. - cdrom subchannel read fix from Vchannaiah. - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier. - set of drbd updates and fixes from Fabian, Lars, and Philipp. - mg_disk error path fix from Bart. - user notification for failed device add for loop, from Minfei. - NVMe in general: + NVMe delay quirk from Guilherme. + SR-IOV support and command retry limits from Keith. + fix for memory-less NUMA node from Masayoshi. + use UINT_MAX for discard sectors, from Minfei. + cancel IO fixes from Ming. + don't allocate unused major, from Neil. + error code fixup from Dan. + use constants for PSDT/FUSE from James. + variable init fix from Jay. + fabrics fixes from Ming, Sagi, and Wei. + various fixes" * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits) nvme/pci: Provide SR-IOV support nvme: initialize variable before logical OR'ing it block: unexport various bio mapping helpers scsi/osd: open code blk_make_request target: stop using blk_make_request block: simplify and export blk_rq_append_bio block: ensure bios return from blk_get_request are properly initialized virtio_blk: use blk_rq_map_kern memstick: don't allow REQ_TYPE_BLOCK_PC requests block: shrink bio size again block: simplify and cleanup bvec pool handling block: get rid of bio_rw and READA block: don't ignore -EOPNOTSUPP blkdev_issue_write_same block: introduce BLKDEV_DISCARD_ZERO to fix zeroout NVMe: don't allocate unused nvme_major nvme: avoid crashes when node 0 is memoryless node. nvme: Limit command retries loop: Make user notify for adding loop device failed nvme-loop: fix nvme-loop Kconfig dependencies nvmet: fix return value check in nvmet_subsys_alloc() ...	2016-07-26 15:37:51 -07:00
Linus Torvalds	d05d7f4079	Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...	2016-07-26 15:03:07 -07:00
Dan Williams	0606263f24	Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next	2016-07-24 08:05:44 -07:00
Markus Elfring	d4c5725d57	libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" The __nd_device_register() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-24 08:04:55 -07:00
Vishal Verma	37b137ff8c	nfit, libnvdimm: allow an ARS scrub to be triggered on demand Normally, an ARS (Address Range Scrub) only happens at boot/initialization time. There can however arise situations where a bus-wide rescan is needed - notably, in the case of discovering a latent media error, we should do a full rescan to figure out what other sectors are bad, and thus potentially avoid triggering an mce on them in the future. Also provide a sysfs trigger to start a bus-wide scrub. Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-23 21:51:42 -07:00
Dan Williams	18515942d6	libnvdimm: register nvdimm_bus devices with an nd_bus driver A recent effort to add a new nvdimm bus provider attribute highlighted a race between interrogating nvdimm_bus->nd_desc and nvdimm_bus tear down. The typical way to handle these races is to take the device_lock() in the attribute method and validate that the device is still active. In order for a device to be 'active' it needs to be associated with a driver. So, we create the small boilerplate for a driver and register nvdimm_bus devices on the 'nvdimm_bus_type' bus. A result of this change is that ndbusX devices now appear under /sys/bus/nd/devices. In fact this makes /sys/class/nd somewhat redundant, but removing that will need to take a long deprecation period given its use by ndctl binaries in the field. This change naturally pulls code from drivers/nvdimm/core.c to drivers/nvdimm/bus.c, so it is a nice code organization clean-up as well. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-23 11:06:33 -07:00
Vishal Verma	5bf0b6e1af	pmem: clarify a debug print in pmem_clear_poison Prefix the sector number being cleared with a '0x' to make it clear that this is a hex value. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-23 11:06:33 -07:00
Dan Williams	bc9775d869	libnvdimm: move ->module to struct nvdimm_bus_descriptor Let the provider module be explicitly passed in rather than implicitly assumed by the module that calls nvdimm_bus_register(). This is in preparation for unifying the nfit and nfit_test driver teardown paths. Reviewed-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-21 20:03:19 -07:00
Toshi Kani	163d4baaeb	block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Currently, presence of direct_access() in block_device_operations indicates support of DAX on its block device. Because block_device_operations is instantiated with 'const', this DAX capablity may not be enabled conditinally. In preparation for supporting DAX to device-mapper devices, add QUEUE_FLAG_DAX to request_queue flags to advertise their DAX support. This will allow to set the DAX capability based on how mapped device is composed. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <linux-s390@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-20 21:01:01 -06:00
Dan Williams	7a9eb20666	pmem: kill __pmem address space The __pmem address space was meant to annotate codepaths that touch persistent memory and need to coordinate a call to wmb_pmem(). Now that wmb_pmem() is gone, there is little need to keep this annotation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-12 19:25:38 -07:00
Dan Williams	91131dbd1d	libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes nsio_rw_bytes() is used to write info block metadata to the namespace, so it should trigger a flush after every write. Replace wmb_pmem() with nvdimm_flush() in this path. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-12 15:13:48 -07:00
Dan Williams	476f848aae	libnvdimm, pmem: flush posted-write queues on shutdown Commit writes to media on system shutdown or pmem driver unload. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-12 15:13:48 -07:00
Dan Williams	7e267a8c79	libnvdimm, pmem: use REQ_FUA, REQ_FLUSH for nvdimm_flush() Given that nvdimm_flush() has higher overhead than wmb_pmem() (pointer chasing through nd_region), and that we otherwise assume a platform has ADR capability when flush hints are not present, move nvdimm_flush() to REQ_FLUSH context. Note that we still arrange for nvdimm_flush() to be called even in the ADR case. We need at least once wmb() fence to push buffered writes in the cpu out to the ADR protected domain. Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:16:03 -07:00
Dan Williams	0c27af60d1	libnvdimm: cycle flush hints When the NFIT provides multiple flush hint addresses per-dimm it is expressing that the platform is capable of processing multiple flush requests in parallel. There is some fixed cost per flush request, let the cost be shared in parallel on multiple cpus. Since there may not be enough flush hint addresses for each cpu to have one, keep a per-cpu index of the last used hint, hash it with current pid, and assume that access pattern and scheduler randomness will keep the flush-hint usage somewhat staggered across cpus. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:16:03 -07:00
Dan Williams	f284a4f237	libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush() nvdimm_flush() is a replacement for the x86 'pcommit' instruction. It is an optional write flushing mechanism that an nvdimm bus can provide for the pmem driver to consume. In the case of the NFIT nvdimm-bus-provider nvdimm_flush() is implemented as a series of flush-hint-address [1] writes to each dimm in the interleave set (region) that backs the namespace. The nvdimm_has_flush() routine relies on platform firmware to describe the flushing capabilities of a platform. It uses the heuristic of whether an nvdimm bus provider provides flush address data to return a ternary result: 1: flush addresses defined 0: dimm topology described without flush addresses (assume ADR) -errno: no topology information, unable to determine flush mechanism The pmem driver is expected to take the following actions on this ternary result: 1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown 0: do not set, WC or FUA on the queue, take no further action -errno: warn and then operate as if nvdimm_has_flush() returned '0' The caveat of this heuristic is that it can not distinguish the "dimm does not have flush address" case from the "platform firmware is broken and failed to describe a flush address". Given we are already explicitly trusting the NFIT there's not much more we can do beyond blacklisting broken firmwares if they are ever encountered. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:13:42 -07:00
Dan Williams	a8f720224e	libnvdimm: keep region data alive over namespace removal nd_region device driver data will be used in the namespace i/o path. Re-order nd_region_remove() to ensure this data stays live across namespace device removal Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:13:41 -07:00
Dan Williams	e5ae3b252c	libnvdimm, nfit: move flush hint mapping to region-device driver-data In preparation for triggering flushes of a DIMM's writes-posted-queue (WPQ) via the pmem driver move mapping of flush hint addresses to the region driver. Since this uses devm_nvdimm_memremap() the flush addresses will remain mapped while any region to which the dimm belongs is active. We need to communicate more information to the nvdimm core to facilitate this mapping, namely each dimm object now carries an array of flush hint address resources. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 15:09:26 -07:00
Dan Williams	a8a6d2e04c	libnvdimm, nfit: remove nfit_spa_map() infrastructure Now that all shared mappings are handled by devm_nvdimm_memremap() we no longer need nfit_spa_map() nor do we need to trigger a callback to the bus provider at region disable time. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 15:09:26 -07:00
Dan Williams	29b9aa0aa3	libnvdimm: introduce devm_nvdimm_memremap(), convert nfit_spa_map() users In preparation for generically mapping flush hint addresses for both the BLK and PMEM use case, provide a generic / reference counted mapping api. Given the fact that a dimm may belong to multiple regions (PMEM and BLK), the flush hint addresses need to be held valid as long as any region associated with the dimm is active. This is similar to the existing BLK-region case where multiple BLK-regions may share an aperture mapping. Up-level this shared / reference-counted mapping capability from the nfit driver to a core nvdimm capability. This eliminates the need for the nd_blk_region.disable() callback. Note that the removal of nfit_spa_map() and related infrastructure is deferred to a later patch. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-07 17:11:09 -07:00
Johannes Thumshirn	8729bdea82	libnvdimm: initialize struct blk_integrity with 0 Initialize struct blk_integrity with 0 as blk_integrity_register() takes the then unitialized struct blk_integrity::flags and ORs it to the resulting block integrity structure. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-06 10:34:42 -07:00
Dan Williams	52c44d93c2	block: remove ->driverfs_dev Now that all drivers that specify a ->driverfs_dev have been converted to device_add_disk(), the pointer can be removed from struct gendisk. Cc: Jens Axboe <axboe@fb.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-27 12:26:08 -07:00
Dan Williams	0d52c756a6	block: convert to device_add_disk() For block drivers that specify a parent device, convert them to use device_add_disk(). This conversion was done with the following semantic patch: @@ struct gendisk disk; expression E; @@ - disk->driverfs_dev = E; ... - add_disk(disk); + device_add_disk(E, disk); @@ struct gendisk disk; expression E1, E2; @@ - disk->driverfs_dev = E1; ... E2 = disk; ... - add_disk(E2); + device_add_disk(E1, E2); ...plus some manual fixups for a few missed conversions. Cc: Jens Axboe <axboe@fb.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-27 12:26:08 -07:00
Dan Williams	f295e53b60	libnvdimm, pmem: allow nfit_test to override pmem_direct_access() Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to override it and indicate that nfit_test-pmem is not device-mapped. Now, we want to enable nfit_test to operate without DMA_CMA and the pmem it provides will no longer be physically contiguous, i.e. won't be capable of supporting direct_access requests larger than a page. Make pmem_direct_access() a weak symbol so that it can be replaced by the tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static inline now that it no longer needs to be overridden. Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-24 11:39:29 -07:00
Dan Williams	1ee6667cd8	libnvdimm, pfn, dax: fix initialization vs autodetect for mode + alignment The updated ndctl unit tests discovered that if a pfn configuration with a 4K alignment is read from the namespace, that alignment will be ignored in favor of the default 2M alignment. The result is that the configuration will fail initialization with a message like: dax6.1: bad offset: 0x22000 dax disabled align: 0x200000 Fix this by allowing the alignment read from the info block to override the default which is 2M not 0 in the autodetect path. This also fixes a similar problem with the mode and alignment settings silently being overwritten by the kernel when userspace has changed it. We now will either overwrite the info block if userspace changes the uuid or fail and warn if a live setting disagrees with the info block. Cc: <stable@vger.kernel.org> Cc: Micah Parrish <micah.parrish@hpe.com> Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-23 17:50:39 -07:00
Dan Williams	4258895814	libnvdimm: IS_ERR() usage cleanup Prompted by commit `287980e49f` "remove lots of IS_ERR_VALUE abuses", I ran make coccicheck against drivers/nvdimm/ and found that: if (IS_ERR(x)) return PTR_ERR(x); return 0; ...can be replaced with PTR_ERR_OR_ZERO(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-17 16:23:23 -07:00
Dan Williams	f02716db95	libnvdimm: use devm_add_action_or_reset() Clean up needless calls to the action routine by letting devm_add_action_or_reset() call it automatically. This does cause the disk to registered and immediately unregistered when a memory allocation fails, but the block layer should be prepared for such an event. Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-15 14:59:17 -07:00
Linus Torvalds	315227f6da	DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo 9R6JdrAj0Sc+FBa+cGzH =v6Of -----END PGP SIGNATURE----- Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c	2016-05-26 19:34:26 -07:00
Dan Williams	36092ee8ba	Merge branch 'for-4.7/dax' into libnvdimm-for-next	2016-05-21 12:33:04 -07:00
Dan Williams	03dca343af	libnvdimm, dax: fix deletion The ndctl unit tests discovered that the dax enabling omitted updates to nd_detach_and_reset(). This routine clears device the configuration when the namespace is detached. Without this clearing userspace may assume that the device is in the process of being configured by another agent in the system. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-21 12:22:41 -07:00
Dan Williams	5e24c9fd36	libnvdimm, dax: fix alignment validation Testing the dax-device autodetect support revealed a probe failure with the following result: dax0.1: bad offset: 0x8200000 dax disabled The original pfn-device implementation inferred the alignment from ilog2(offset), now that the alignment is explicit the is_power_of_2() needs replacing with a real sanity check against the recorded alignment. Otherwise the alignment check is useless in the implicit case and only the minimum size of the offset matters. This self-consistency check is further validated by the probe path that will re-check that the offset is large enough to contain all the metadata required to enable the device. Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-21 11:01:41 -07:00
Dan Williams	c5ed926864	libnvdimm, dax: autodetect support For autodetecting a previously established dax configuration we need the info block to indicate block-device vs device-dax mode, and we need to have the default namespace probe hand-off the configuration to the dax_pmem driver. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-20 22:02:57 -07:00
Dan Williams	b354aba016	libnvdimm: release ida resources ida instances allocate some internal memory for ->free_bitmap in addition to the base 'struct ida'. Use ida_destroy() to release that memory at module_exit(). Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-20 22:02:56 -07:00
Dan Williams	0a70bd4305	dax: enable dax in the presence of known media errors (badblocks) 1/ If a mapping overlaps a bad sector fail the request. 2/ Do not opportunistically report more dax-capable capacity than is requested when errors present. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [vishal: fix a conflict with system RAM collision patches] [vishal: add a 'size' parameter to ->direct_access] [vishal: fix a conflict with DAX alignment check patches] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>	2016-05-18 12:16:56 -06:00
Dan Williams	1f716d05f8	Merge branch 'for-4.7/dsm' into libnvdimm-for-next	2016-05-18 10:06:59 -07:00
Dan Williams	2159669f58	Merge branch 'for-4.7/libnvdimm' into libnvdimm-for-next	2016-05-18 10:06:48 -07:00
Dan Williams	594d6d96ea	Merge branch 'for-4.7/dax' into libnvdimm-for-next	2016-05-18 09:59:34 -07:00
Dan Williams	6cf9c5babd	libnvdimm: stop requiring a driver ->remove() method The dax_pmem driver was implementing an empty ->remove() method to satisfy the nvdimm bus driver that unconditionally calls ->remove(). Teach the core bus driver to check if ->remove() is NULL to remove that requirement. Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-18 09:13:13 -07:00
Dan Williams	45a0dac045	libnvdimm, dax: record the specified alignment of a dax-device instance We want to use the alignment as the allocation and mapping unit. Previously this information was only useful for establishing the data offset, but now it is important to remember the granularity for the later use. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-09 15:35:42 -07:00
Dan Williams	52ac23b25e	libnvdimm, dax: reserve space to store labels for device-dax We may want to subdivide a device-dax range into multiple devices so that each can have separate permissions or naming. Reserve 128K of label space by default so we have the capability of making allocation decisions persistent. This reservation is not something we can add later since it would result in the default size of a device-dax range changing between kernel versions. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-09 15:35:42 -07:00
Dan Williams	cd03412a51	libnvdimm, dax: introduce device-dax infrastructure Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows persistent memory ranges to be allocated and mapped without need of an intervening file system. This initial infrastructure arranges for a libnvdimm pfn-device to be represented as a different device-type so that it can be attached to a driver other than the pmem driver. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-09 15:35:42 -07:00
Dan Williams	1b8d2afde5	libnvdimm, pfn: fix ARCH=alpha allmodconfig build failure I had relied on the kbuild robot for cross build coverage, however it only builds alpha_defconfig. Switch from HPAGE_SIZE to PMD_SIZE, which is more widely defined. Fixes: `658922e57b` ("libnvdimm, pfn: fix memmap reservation sizing") Cc: <stable@vger.kernel.org> Reported-by: Guenter Roeck <guenter@roeck-us.net> Tested-by: Guenter Roeck <guenter@roeck-us.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-06 10:20:10 -07:00
Dan Williams	658922e57b	libnvdimm, pfn: fix memmap reservation sizing When configuring a pfn-device instance to allocate the memmap array it needs to account for the fact that vmemmap_populate_hugepages() allocates struct page blocks in HPAGE_SIZE chunks. We need to align the reserved area size to 2MB otherwise arch_add_memory() runs out of memory while establishing the memmap: WARNING: CPU: 0 PID: 496 at arch/x86/mm/init_64.c:704 arch_add_memory+0xe7/0xf0 [..] Call Trace: [<ffffffff8148bdb3>] dump_stack+0x85/0xc2 [<ffffffff810a749b>] __warn+0xcb/0xf0 [<ffffffff810a75cd>] warn_slowpath_null+0x1d/0x20 [<ffffffff8106a497>] arch_add_memory+0xe7/0xf0 [<ffffffff811d2097>] devm_memremap_pages+0x287/0x450 [<ffffffff811d1ffa>] ? devm_memremap_pages+0x1ea/0x450 [<ffffffffa0000298>] __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap] [<ffffffffa0047a58>] pmem_attach_disk+0x318/0x420 [nd_pmem] [<ffffffffa0047bcf>] nd_pmem_probe+0x6f/0x90 [nd_pmem] [<ffffffffa0009469>] nvdimm_bus_probe+0x69/0x110 [libnvdimm] [..] ndbus0: nd_pmem.probe(pfn3.0) = -12 nd_pmem: probe of pfn3.0 failed with error -12 libndctl: ndctl_pfn_enable: pfn3.0: failed to enable Reported-by: Namratha Kothapalli <namratha.n.kothapalli@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-30 13:07:06 -07:00
Dan Williams	31eca76ba2	nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism There are currently 4 known similar but incompatible definitions of the command sets that can be sent to an NVDIMM through ACPI. It is also clear that future platform generations (ACPI or not) will continue to revise and extend the DIMM command set as new devices and use cases arrive. It is obviously untenable to continue to proliferate divergence of these command definitions, and to that end a standardization process has begun to provide for a unified specification. However, that leaves a problem about what to do with this first generation where vendors are already shipping divergence. The Linux kernel can support these initial diverged platforms without giving platform-firmware free reign to continue to diverge and compound kernel maintenance overhead. The kernel implementation can encourage standardization in two ways: 1/ Require that any function code that userspace wants to send be explicitly white-listed in the implementation. For ACPI this means function codes marked as supported by acpi_check_dsm() may only be invoked if they appear in the white-list. A function must be publicly documented before it is added to the white-list. 2/ The above restrictions can be trivially bypassed by using the "vendor-specific" payload command. However, since vendor-specific commands are by definition not publicly documented and have the potential to corrupt the kernel's view of the dimm state, we provide a toggle to disable vendor-specific operations. Enabling undefined behavior is a policy decision that can be made by the platform owner and encourages firmware implementations to choose public over private command implementations. Based on an initial patch from Jerry Hoemann Cc: Jerry Hoemann <jerry.hoemann@hpe.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-28 16:59:06 -07:00
Dan Williams	e3654eca70	nfit, libnvdimm: clarify "commands" vs "_DSMs" Clarify the distinction between "commands", the ioctls userspace calls to request the kernel take some action on a given dimm device, and "_DSMs", the actual function numbers used in the firmware interface to the DIMM. _DSMs are ACPI specific whereas commands are Linux kernel generic. This is in preparation for breaking the 1:1 implicit relationship between the kernel ioctl number space and the firmware specific function numbers. Cc: Jerry Hoemann <jerry.hoemann@hpe.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-28 16:23:16 -07:00
Dan Williams	0bfb8dd3ed	libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host' The 'host' variable can be killed as it is always the same as the passed in device. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:24 -07:00
Dan Williams	5a92289f41	libnvdimm, pmem: kill ->pmem_queue and ->pmem_disk The devm conversion obviates the need to continue to remember the queue and disk locally in the driver. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:24 -07:00
Dan Williams	ac515c084b	libnvdimm, pmem, pfn: move pfn setup to the core Now that pmem internals have been disentangled from pfn setup, that code can move to the core. This is in preparation for adding another user of the pfn-device capabilities. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	200c79da82	libnvdimm, pmem, pfn: make pmem_rw_bytes generic and refactor pfn setup In preparation for providing an alternative (to block device) access mechanism to persistent memory, convert pmem_rw_bytes() to nsio_rw_bytes(). This allows ->rw_bytes() functionality without requiring a 'struct pmem_device' to be instantiated. In other words, when ->rw_bytes() is in use i/o is driven through 'struct nd_namespace_io', otherwise it is driven through 'struct pmem_device' and the block layer. This consolidates the disjoint calls to devm_exit_badblocks() and devm_memunmap() into a common devm_nsio_disable() and cleans up the init path to use a unified pmem_attach_disk() implementation. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	947df02d25	libnvdimm, pmem: clean up resource print / request The leading '0x' in front of %pa is redundant, also we can just use %pR to simplify the print statement. The request parameters can be directly taken from the resource as well. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	030b99e39c	libnvdimm, pmem: use devm_add_action to release bdev resources Register a callback to clean up the request_queue and put the gendisk at driver disable time. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	9d90725ddc	libnvdimm, blk: move i/o infrastructure to nd_namespace_blk Consolidate the information for issuing i/o to a blk-namespace, and eliminate some pointer chasing. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	8378af17a4	libnvdimm, blk: quiet i/o error reporting I/O errors events have the potential to be a high frequency and a log message for each event can swamp the system. This message is also redundant with upper layer error reporting. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	bd842b8ca7	libnvdimm, pmem: use ->queuedata for driver private data Save a pointer chase by storing the driver private data in the request_queue rather than the gendisk. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	d44077a7cd	libnvdimm, blk: use ->queuedata for driver private data Save a pointer chase by storing the driver private data in the request_queue rather than the gendisk. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:22 -07:00
Dan Williams	d29cee120e	libnvdimm, blk: use devm_add_action to release bdev resources Register a callback to clean up the request_queue and put the gendisk at driver disable time. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:22 -07:00
Dan Williams	9dec4892ca	libnvdimm, btt: add btt startup debug Report the reason for btt probe failures when debug is enabled. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:05 -07:00
Dan Williams	e32bc729a3	libnvdimm, btt, convert nd_btt_probe() to devm Pass the device performing the probe so we can use a devm allocation for the btt superblock. Cc: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 10:59:54 -07:00
Dan Williams	bd032943b5	libnvdimm, pfn, convert nd_pfn_probe() to devm Pass the device performing the probe so we can use a devm allocation for the pfn superblock. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 10:59:54 -07:00
Dan Williams	298f2bc5db	libnvdimm, pmem: kill pmem->ndns We can derive the common namespace from other information. We also do not need to cache it because all the usages are in slow paths. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 10:59:54 -07:00
Dan Williams	0a370d261c	libnvdimm, pmem: clarify the write+clear_poison+write flow The ACPI specification does not specify the state of data after a clear poison operation. Potential future libnvdimm bus implementations for other architectures also might not specify or disagree on the state of data after clear poison. Clarify why we write twice. Reported-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>	2016-04-15 14:59:41 -06:00
Dan Williams	baa51277cf	libnvdimm, test: add mock SMART data payload Provide simulated SMART data to enable the ndctl implementation of SMART data retrieval and parsing. The payload is defined here, "Section 4.1 SMART and Health Info (Function Index 1)": http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-11 11:11:14 -07:00
Linus Torvalds	239467e852	Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "Three fixes, the first two are tagged for -stable: - The ndctl utility/library gained expanded unit tests illuminating a long standing bug in the libnvdimm SMART data retrieval implementation. It has been broken since its initial implementation, now fixed. - Another one line fix for the detection of stale info blocks. Without this change userspace can get into a situation where it is unable to reconfigure a namespace. - Fix the badblock initialization path in the presence of the new (in v4.6-rc1) section alignment workarounds. Without this change badblocks will be reported at the wrong offset. These have received a build success report from the kbuild robot and have appeared in -next with no reported issues" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment libnvdimm, pfn: fix uuid validation libnvdimm: fix smart data retrieval	2016-04-09 14:05:45 -07:00
Dan Williams	a390180291	libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment When section alignment padding is in effect we need to shift / truncate the range that is queried for poison by the 'start_pad' or 'end_trunc' reservations. It's easiest if we just pass in an adjusted resource range rather than deriving it from the passed in namespace. With the resource range resolution pushed out to the caller we can also push the namespace-to-region lookup to the caller and drop the implicit pmem-type assumption about the passed in namespace object. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-07 20:02:06 -07:00
Dan Williams	e5670563f5	libnvdimm, pfn: fix uuid validation If we detect a namespace has a stale info block in the init path, we should overwrite with the latest configuration. In fact, we already return -ENODEV when the parent uuid is invalid, the same should be done for the 'self' uuid. Otherwise we can get into a condition where userspace is unable to reconfigure the pfn-device without directly / manually invalidating the info block. Cc: <stable@vger.kernel.org> Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-07 19:59:27 -07:00
Dan Williams	2112911266	libnvdimm: fix smart data retrieval It appears that smart data retrieval has been broken the since the initial implementation. Fix the payload size to be 128-bytes per the specification. Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-07 19:58:44 -07:00
Kirill A. Shutemov	09cbfeaf1a	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced long time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Dan Williams	fc0c202813	x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem() Update the definition of memcpy_from_pmem() to return 0 or a negative error code. Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe(). Cc: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-28 17:19:31 -07:00
Linus Torvalds	de06dbfa78	Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm Pull ARM updates from Russell King: "Another mixture of changes this time around: - Split XIP linker file from main linker file to make it more maintainable, and various XIP fixes, and clean up a resulting macro. - Decompressor cleanups from Masahiro Yamada - Avoid printing an error for a missing L2 cache - Remove some duplicated symbols in System.map, and move vectors/stubs back into kernel VMA - Various low priority fixes from Arnd - Updates to allow bus match functions to return negative errno values, touching some drivers and the driver core. Greg has acked these changes. - Virtualisation platform udpates form Jean-Philippe Brucker. - Security enhancements from Kees Cook - Rework some Kconfig dependencies and move PSCI idle management code out of arch/arm into drivers/firmware/psci.c - ARM DMA mapping updates, touching media, acked by Mauro. - Fix places in ARM code which should be using virt_to_idmap() so that Keystone2 can work. - Fix Marvell Tauros2 to work again with non-DT boots. - Provide a delay timer for ARM Orion platforms" * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (45 commits) ARM: 8546/1: dma-mapping: refactor to fix coherent+cma+gfp=0 ARM: 8547/1: dma-mapping: store buffer information ARM: 8543/1: decompressor: rename suffix_y to compress-y ARM: 8542/1: decompressor: merge piggy.*.S and simplify Makefile ARM: 8541/1: decompressor: drop redundant FORCE in Makefile ARM: 8540/1: decompressor: use clean-files instead of extra-y to clean files ARM: 8539/1: decompressor: drop more unneeded assignments to "targets" ARM: 8538/1: decompressor: drop unneeded assignments to "targets" ARM: 8532/1: uncompress: mark putc as inline ARM: 8531/1: turn init_new_context into an inline function ARM: 8530/1: remove VIRT_TO_BUS ARM: 8537/1: drop unused DEBUG_RODATA from XIP_KERNEL ARM: 8536/1: mm: hide __start_rodata_section_aligned for non-debug builds ARM: 8535/1: mm: DEBUG_RODATA makes no sense with XIP_KERNEL ARM: 8534/1: virt: fix hyp-stub build for pre-ARMv7 CPUs ARM: make the physical-relative calculation more obvious ARM: 8512/1: proc-v7.S: Adjust stack address when XIP_KERNEL ARM: 8411/1: Add default SPARSEMEM settings ARM: 8503/1: clk_register_clkdev: remove format string interface ARM: 8529/1: remove 'i' and 'zi' targets ...	2016-03-19 16:31:54 -07:00
Dan Williams	489011652a	Merge branch 'for-4.6/pfn' into libnvdimm-for-next	2016-03-09 17:15:43 -08:00
Dan Williams	59e6473980	libnvdimm, pmem: clear poison on write If a write is directed at a known bad block perform the following: 1/ write the data 2/ send a clear poison command 3/ invalidate the poison out of the cache hierarchy Cc: <x86@kernel.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-09 15:15:32 -08:00
Dan Williams	b5ebc8ec69	libnvdimm, pmem: fix kmap_atomic() leak in error path When we enounter a bad block we need to kunmap_atomic() before returning. Cc: <stable@vger.kernel.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-09 15:12:41 -08:00
NeilBrown	ff8e92d5d9	nvdimm/btt: don't allocate unused major device number alloc_disk(0) does not require or use a ->major number, all devices are allocated with a major of BLOCK_EXT_MAJOR. So don't allocate btt_major. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-09 15:00:24 -08:00
NeilBrown	ec56151d38	nvdimm/blk: don't allocate unused major device number When alloc_disk(0) is used ->major is completely ignored, all devices are allocated with a "major" of BLOCK_EXT_MAJOR. So don't allocate nd_blk_major Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-09 14:59:41 -08:00
NeilBrown	55155291b3	pmem: don't allocate unused major device number When alloc_disk(0) or alloc_disk-node(0, XX) is used, the ->major number is completely ignored: all devices are allocated with a major of BLOCK_EXT_MAJOR. So there is no point allocating pmem_major. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-09 11:31:34 -08:00
Dan Williams	45f68802f2	libnvdimm, pmem: fix ia64 build, use PHYS_PFN drivers/nvdimm/pmem.c: In function 'nvdimm_namespace_attach_pfn': drivers/nvdimm/pmem.c:367:3: error: implicit declaration of function '__phys_to_pfn' [-Werror=implicit-function-declaration] .base_pfn = __phys_to_pfn(nsio->res.start), ia64 does not provide __phys_to_pfn(), just use the PHYS_PFN() alias. Cc: Guenter Roeck <linux@roeck-us.net> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-06 08:04:12 -08:00
Dan Williams	d4f323672a	nfit, libnvdimm: clear poison command support Add the boiler-plate for a 'clear error' command based on section 9.20.7.6 "Function Index 4 - Clear Uncorrectable Error" from the ACPI 6.1 specification, and add a reference implementation in nfit_test. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 18:06:14 -08:00
Dan Williams	f6ed58c70d	libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices Currenty with a raw mode pmem namespace the physical memory address range for the device can be obtained via /sys/block/pmemX/device/{resource\|size}. Add similar attributes for pfn instances that takes the struct page memmap and section padding into account. Reported-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:25:45 -08:00
Dan Williams	cfe30b8720	libnvdimm, pmem: adjust for section collisions with 'System RAM' On a platform where 'Persistent Memory' and 'System RAM' are mixed within a given sparsemem section, trim the namespace and notify about the sub-optimal alignment. Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:25:45 -08:00
Dan Williams	d9cbe09d39	libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces The altmap for a section-misaligned namespace needs to arrange for the base_pfn to be section-aligned. As a result the 'reserve' region (pfns from base that do not have a struct page) must be increased. Otherwise we trip the altmap validation check in __add_pages: if (altmap->base_pfn != phys_start_pfn \|\| vmem_altmap_offset(altmap) > nr_pages) { pr_warn_once("memory add fail, invalid altmap\n"); return -EINVAL; } Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:25:44 -08:00
Jerry Hoemann	07accfa9d1	libnvdimm: Fix security issue with DSM IOCTL. Code attempts to prevent certain IOCTL DSM from being called when device is opened read only. This security feature can be trivially overcome by changing the size portion of the ioctl_command which isn't used. Check only the _IOC_NR (i.e. the command). Cc: <stable@vger.kernel.org> Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Jerry Hoemann	4dc0e7be88	libnvdimm: Clean-up access mode check. Change nd_ioctl and nvdimm_ioctl access mode check to use O_RDONLY. Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Dan Williams	87bf572e19	nfit: disable userspace initiated ars during scrub While the nfit driver is issuing address range scrub commands and reaping the results do not permit an ars_start command issued from userspace. The scrub thread assumes that all ars completions are for scrubs initiated by platform firmware at boot, or by the nfit driver. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Dan Williams	7ae0fa439f	nfit, libnvdimm: async region scrub workqueue Introduce a workqueue that will be used to run address range scrub asynchronously with the rest of nvdimm device probing. Userspace still wants notification when probing operations complete, so introduce a new callback to flush this workqueue when userspace is awaiting probe completion. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Dan Williams	719994660c	libnvdimm: async notification support In preparation for asynchronous address range scrub support add an ability for the pmem driver to dynamically consume address range scrub results. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Dan Williams	5faecf4eb0	libnvdimm: protect nvdimm_{bus\|namespace}_add_poison() with nvdimm_bus_lock() In preparation for making poison list retrieval asynchronus to region registration, add protection for walking and mutating the bus-level poison list. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Dan Williams	aef2533822	libnvdimm, nfit: centralize command status translation The return value from an 'ndctl_fn' reports the command execution status, i.e. was the command properly formatted and was it successfully submitted to the bus provider. The new 'cmd_rc' parameter allows the bus provider to communicate command specific results, translated into common error codes. Convert the ARS commands to this scheme to: 1/ Consolidate status reporting 2/ Prepare for for expanding ars unit test cases 3/ Make the implementation more generic Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-03-05 12:24:06 -08:00
Russell King	1b3bf84797	Merge branches 'amba', 'fixes', 'misc' and 'tauros2' into for-next	2016-03-04 23:36:02 +00:00
Ingo Molnar	bc94b99636	Linux 4.5-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg= =tvkh -----END PGP SIGNATURE----- Merge tag 'v4.5-rc6' into core/resources, to resolve conflict Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-04 12:12:08 +01:00
Arnd Bergmann	c45442055d	nvdimm: use 'u64' for pfn flags A recent bugfix changed pfn_t to always be 64-bit wide, but did not change the code in pmem.c, which is now broken on 32-bit architectures as reported by gcc: In file included from ../drivers/nvdimm/pmem.c:28:0: drivers/nvdimm/pmem.c: In function 'pmem_alloc': include/linux/pfn_t.h:15:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow] #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3)) This changes the intermediate pfn_flags in struct pmem_device to be 64 bit wide as well, so they can store the flags correctly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `db78c22230` ("mm: fix pfn_t vs highmem") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-02-23 17:17:20 -08:00
Dan Williams	4577b06655	nfit: update address range scrub commands to the acpi 6.1 format The original format of these commands from the "NVDIMM DSM Interface Example" [1] are superseded by the ACPI 6.1 definition of the "NVDIMM Root Device _DSMs" [2]. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf [2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf "9.20.7 NVDIMM Root Device _DSMs" Changes include: 1/ New 'restart' fields in ars_status, unfortunately these are implemented in the middle of the existing definition so this change is not backwards compatible. The expectation is that shipping platforms will only ever support the ACPI 6.1 definition. 2/ New status values for ars_start ('busy') and ars_status ('overflow'). Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Linda Knippers <linda.knippers@hpe.com> Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-02-23 17:17:20 -08:00
Dan Williams	747ffe11b4	libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing Use the output length specified in the command to size the receive buffer rather than the arbitrary 4K limit. This bug was hiding the fact that the ndctl implementation of ndctl_bus_cmd_new_ars_status() was not specifying an output buffer size. Cc: <stable@vger.kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-02-19 15:21:52 -08:00
Dan Williams	82ec2ba2b1	ARM: 8522/1: drivers: nvdimm: ensure no negative value gets returned on positive match This patch ensures that existing bus match callbacks don't return negative values (which might be interpreted as potential errors in the future) in case of positive match. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2016-02-16 16:28:50 +00:00
Toshi Kani	f0f4711aa1	x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search Change the callers of walk_iomem_res() scanning for the following resources by name to use walk_iomem_res_desc() instead. "ACPI Tables" "ACPI Non-volatile Storage" "Persistent Memory (legacy)" "Crash kernel" Note, the caller of walk_iomem_res() with "GART" will be removed in a later patch. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chun-Yi <joeyli.kernel@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Don Zickus <dzickus@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Lee, Chun-Yi <joeyli.kernel@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Minfei Huang <mnfhuang@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Takao Indoh <indou.takao@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: kexec@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: linux-mm <linux-mm@kvack.org> Cc: linux-nvdimm@lists.01.org Link: http://lkml.kernel.org/r/1453841853-11383-15-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-01-30 09:49:59 +01:00
Dan Williams	45eb570a0d	libnvdimm, pfn: fix restoring memmap location This path was missed when turning on the memmap in pmem support. Permit 'pmem' as a valid location for the map. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-29 17:43:16 -08:00
Dan Williams	9c41242817	libnvdimm: fix mode determination for e820 devices Correctly display "safe" mode when a btt is established on a e820/memmap defined pmem namespace. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-26 09:40:32 -08:00
Dan Williams	5c2c2587b1	mm, dax, pmem: introduce {get\|put}_dev_pagemap() for dax-gup get_dev_page() enables paths like get_user_pages() to pin a dynamically mapped pfn-range (devm_memremap_pages()) while the resulting struct page objects are in use. Unlike get_page() it may fail if the device is, or is in the process of being, disabled. While the initial lookup of the range may be an expensive list walk, the result is cached to speed up subsequent lookups which are likely to be in the same mapped range. devm_memremap_pages() now requires a reference counter to be specified at init time. For pmem this means moving request_queue allocation into pmem_alloc() so the existing queue usage counter can track "device pages". ZONE_DEVICE pages always have an elevated count and will never be on an lru reclaim list. That space in 'struct page' can be redirected for other uses, but for safety introduce a poison value that will always trip __list_add() to assert. This allows half of the struct list_head storage to be reclaimed with some assurance to back up the assumption that the page count never goes to zero and a list_add() is never attempted. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Logan Gunthorpe <logang@deltatee.com> Cc: Dave Hansen <dave@sr71.net> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Dan Williams	468ded03c0	libnvdimm, pmem: move request_queue allocation earlier in probe Before the dynamically allocated struct pages from devm_memremap_pages() can be put to use outside the driver, we need a mechanism to track whether they are still in use at teardown. Towards that goal reorder the initialization sequence to allow the 'q_usage_counter' from the request_queue to be used by the devm_memremap_pages() implementation (in subsequent patches). Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Dan Williams	d2c0f041e1	libnvdimm, pfn, pmem: allocate memmap array in persistent memory Use the new vmem_altmap capability to enable the pmem driver to arrange for a struct page memmap to be established in persistent memory. [linux@roeck-us.net: mn10300: declare __pfn_to_phys() to fix build error] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Dan Williams	4b94ffdc41	x86, mm: introduce vmem_altmap to augment vmemmap_populate() In support of providing struct page for large persistent memory capacities, use struct vmem_altmap to change the default policy for allocating memory for the memmap array. The default vmemmap_populate() allocates page table storage area from the page allocator. Given persistent memory capacities relative to DRAM it may not be feasible to store the memmap in 'System Memory'. Instead vmem_altmap represents pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() requests. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: kbuild test robot <lkp@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Dan Williams	9476df7d80	mm: introduce find_dev_pagemap() There are several scenarios where we need to retrieve and update metadata associated with a given devm_memremap_pages() mapping, and the only lookup key available is a pfn in the range: 1/ We want to augment vmemmap_populate() (called via arch_add_memory()) to allocate memmap storage from pre-allocated pages reserved by the device driver. At vmemmap_alloc_block_buf() time it grabs device pages rather than page allocator pages. This is in support of devm_memremap_pages() mappings where the memmap is too large to fit in main memory (i.e. large persistent memory devices). 2/ Taking a reference against the mapping when inserting device pages into the address_space radix of a given inode. This facilitates unmap_mapping_range() and truncate_inode_pages() operations when the driver is tearing down the mapping. 3/ get_user_pages() operations on ZONE_DEVICE memory require taking a reference against the mapping so that the driver teardown path can revoke and drain usage of device pages. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Logan Gunthorpe <logang@deltatee.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Dan Williams	34c0fd540e	mm, dax, pmem: introduce pfn_t For the purpose of communicating the optional presence of a 'struct page' for the pfn returned from ->direct_access(), introduce a type that encapsulates a page-frame-number plus flags. These flags contain the historical "page_link" encoding for a scatterlist entry, but can also denote "device memory". Where "device memory" is a set of pfns that are not part of the kernel's linear mapping by default, but are accessed via the same memory controller as ram. The motivation for this new type is large capacity persistent memory that needs struct page entries in the 'memmap' to support 3rd party DMA (i.e. O_DIRECT I/O with a persistent memory source/target). However, we also need it in support of maintaining a list of mapped inodes which need to be unmapped at driver teardown or freeze_bdev() time. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave@sr71.net> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Dan Williams	8b63b6bfc1	Merge branch 'for-4.5/block-dax' into for-4.5/libnvdimm	2016-01-10 07:53:55 -08:00
Dan Williams	710d69cc99	libnvdimm, pmem: nvdimm_read_bytes() badblocks support Support badblock checking in all the pmem read paths that do not go through the block layer. This protects info block reads (btt or pfn) as well as data reads to a pmem namespace via a btt instance. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 22:42:31 -08:00
Dan Williams	57f7f317ab	pmem, dax: disable dax in the presence of bad blocks Longer term teach dax to punch "error" holes in mapping requests and deliver SIGBUS to applications that consume a bad pmem page. For now, simply disable the dax performance optimization in the presence of known errors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 22:42:31 -08:00
Dan Williams	e10624f8c0	pmem: fail io-requests to known bad blocks Check the sectors specified in a read bio to see if they hit a known bad block, and return an error code pmem_do_bvec(). Note that the ->rw_page() is not in a position to return errors. For now, copy the same layering violation present in zram_rw_page() to avoid crashes of the form: kernel BUG at mm/filemap.c:822! [..] Call Trace: [<ffffffff811c540e>] page_endio+0x1e/0x60 [<ffffffff81290d29>] mpage_end_io+0x39/0x60 [<ffffffff8141c4ef>] bio_endio+0x3f/0x60 [<ffffffffa005c491>] pmem_make_request+0x111/0x230 [nd_pmem] ...i.e. unlock a page that was already unlocked via pmem_rw_page() => page_endio(). Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:04 -08:00
Dan Williams	b95f5f4391	libnvdimm: convert to statically allocated badblocks If a device will ever have badblocks it should always have a badblocks instance available. So, similar to md, embed a badblocks instance in pmem_device. This reduces pointer chasing in the i/o fast path, and simplifies the init path. Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:04 -08:00
Dan Williams	87ba05dff3	libnvdimm: don't fail init for full badblocks list If the badblocks list runs out of space it simply means that software is unable to intercept all errors. This is no different than the latent discovery of new badblocks case and should not be an initialization failure condition. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:04 -08:00
Dan Williams	ad9a8bde2c	libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h nd-core.h is private to the libnvdimm core internals and should not be used by drivers. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:03 -08:00
Vishal Verma	0caeef63e6	libnvdimm: Add a poison list and export badblocks During region creation, perform Address Range Scrubs (ARS) for the SPA (System Physical Address) ranges to retrieve known poison locations from firmware. Add a new data structure 'nd_poison' which is used as a list in nvdimm_bus to store these poison locations. When creating a pmem namespace, if there is any known poison associated with its physical address space, convert the poison ranges to bad sectors that are exposed using the badblocks interface. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:03 -08:00
Dan Williams	e07ecd76d4	libnvdimm: fix namespace object confusion in is_uuid_busy() When btt devices were re-worked to be child devices of regions this routine was overlooked. It mistakenly attempts to_nd_namespace_pmem() or to_nd_namespace_blk() conversions on btt and pfn devices. By luck to date we have happened to be hitting valid memory leading to a uuid miscompare, but a recent change to struct nd_namespace_common causes: BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 IP: [<ffffffff814610dc>] memcmp+0xc/0x40 [..] Call Trace: [<ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm] [<ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm] [<ffffffff8158c9c0>] device_for_each_child+0x50/0x90 Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-05 18:37:23 -08:00
Dan Williams	0731de0dd9	libnvdimm, pfn: move 'memory mode' indication to sysfs 'Memory mode' is defined as the capability of a DAX mapping to be the source/target of DMA and other "direct I/O" scenarios. While it currently requires allocating 'struct page' for each page frame of persistent memory in the namespace it will not always be the case. Work continues on reducing the kernel's dependency on 'struct page'. Let's not maintain a suffix that is expected to lose meaning over time. In other words a future 'raw mode' pmem namespace may be as capable as today's 'memory mode' namespace. Undo the encoding of the mode in the device name and leave it to other tooling to determine the mode of the namespace from its attributes. Reported-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-24 12:20:20 -08:00
Dan Williams	3fa9626865	libnvdimm, pfn: fix nd_pfn_validate() return value handling The -ENODEV case indicates that the info-block needs to established. All other return codes cause nd_pfn_init() to abort. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-24 12:20:19 -08:00
Dan Williams	2dc43331e3	libnvdimm, pfn: fix pfn seed creation Similar to btt, plant a new pfn seed when the existing one is activated. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-13 11:41:36 -08:00
Dan Williams	a34d5e8a6a	libnvdimm, pfn: add parent uuid validation Track and check the uuid of the namespace hosting a pfn instance. This forces the pfn info block to be invalidated if the namespace is re-configured with a different uuid. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-13 11:40:20 -08:00
Dan Williams	315c562536	libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE When setting aside capacity for struct page it must be aligned to the largest mapping size that is to be made available via DAX. Make the alignment configurable to enable support for 1GiB page-size mappings. The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the offset check in nvdimm_namespace_attach_pfn(). Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-12 15:04:26 -08:00
Dan Williams	f7c6ab80fa	libnvdimm, pfn: clean up pfn create parameters In all cases __nd_pfn_create is called with default parameters which are then overridden by values in the info block. Clean up pfn creation by dropping the parameters and setting default values internal to __nd_pfn_create. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-10 16:11:59 -08:00
Dan Williams	9f1e8cee77	libnvdimm, pfn: kill ND_PFN_ALIGN The alignment constraint isn't necessary now that devm_memremap_pages() allows for unaligned mappings. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-10 15:14:20 -08:00
Dmitry Krivenok	6bb691ac08	nvdimm: do not show pfn_seed for non pmem regions This simple change hides pfn_seed attribute for non pmem regions because they don't support pfn anyway. Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-08 16:27:30 -08:00
Dmitry Krivenok	bd26d0d0ce	nvdimm: improve diagnosibility of namespaces In order to bind namespace to the driver user must first set all mandatory attributes in the following order: - uuid - size - sector_size (for blk namespace only) If the order is wrong, then user either won't be able to set the attribute or bind the namespace. This simple patch improves diagnosibility of common operations with namespaces by printing some details about the error instead of failing silently. Below are examples of error messages (assuming dyndbg is enabled for nvdimms): [/]# echo 4194304 > /sys/bus/nd/devices/region5/namespace5.0/size [ 288.372612] nd namespace5.0: __size_store: uuid not set [ 288.374839] nd namespace5.0: size_store: 400000 fail (-6) sh: write error: No such device or address [/]# [/]# echo namespace5.0 > /sys/bus/nd/drivers/nd_blk/bind [ 554.671648] nd_blk namespace5.0: nvdimm_namespace_common_probe: sector size not set [ 554.674688] ndbus1: nd_blk.probe(namespace5.0) = -19 sh: write error: No such device [/]# Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-12-08 16:27:30 -08:00
Dan Williams	589e75d157	libnvdimm, pmem: fix size trim in pmem_direct_access() This masking prevents access to the end of the device via dax_do_io(), and is unnecessary as arch_add_memory() would have rejected an unaligned allocation. Cc: <stable@vger.kernel.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-11-12 09:55:23 -08:00
Dan Williams	f7256dc0cd	libnvdimm, e820: fix numa node for e820-type-12 pmem ranges Rather than punt on the numa node for these e820 ranges try to find a better answer with memory_add_physaddr_to_nid() when it is available. Cc: <stable@vger.kernel.org> Reported-by: Boaz Harrosh <boaz@plexistor.com> Tested-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-11-12 09:21:18 -08:00
Linus Torvalds	3419b45039	Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block Pull block IO poll support from Jens Axboe: "Various groups have been doing experimentation around IO polling for (really) fast devices. The code has been reviewed and has been sitting on the side for a few releases, but this is now good enough for coordinated benchmarking and further experimentation. Currently O_DIRECT sync read/write are supported. A framework is in the works that allows scalable stats tracking so we can auto-tune this. And we'll add libaio support as well soon. Fow now, it's an opt-in feature for test purposes" * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block: direct-io: be sure to assign dio->bio_bdev for both paths directio: add block polling support NVMe: add blk polling support block: add block polling support blk-mq: return tag/queue combo in the make_request_fn handlers block: change ->make_request_fn() and users to return a queue cookie	2015-11-10 17:23:49 -08:00
Linus Torvalds	264015f8a8	libnvdimm for 4.4: 1/ Add support for the ACPI 6.0 NFIT hot add mechanism to process updates of the NFIT at runtime. 2/ Teach the coredump implementation how to filter out DAX mappings. 3/ Introduce NUMA hints for allocations made by the pmem driver, and as a side effect all devm allocations now hint their NUMA node by default. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWQX2sAAoJEB7SkWpmfYgCWsEQAK7w/xM9zClVY/DDlFJxFtYq DZJ4faPj+E3FMTiJIEDzjtRgQvOFE+wmJtntYsCqKH/QZmpnyk9jeT/CbJzEEL2k WsAk+qHGLcVUlSb36blwN1RFzYqC+IDYThewJqUvxDbOwL1AbiibbX7gplzZHLhW +rj3ScVlSNOPRDgGGpkAeLNNsttuKvsGo7nB/JZopm0tV6g14rSK09wQbVhv6S6T Lu7VGYqnJlkteL9YlzRiROf9hW2ZFCMGJz1YZydPTy3aX3hGTBX4w2qvmsPwBIKP kW/gCNisVJGk1cZCk4joSJ8i/b3x3fE0zdZ5waivJ5jDvYbUUfyk0KtJkfw207Rl 14yWitUC6aeVuCeOqXHgsjRi+1QVN9Pg7i49xgGiUN1igQiUYRTgQPWZxDv6Zo/s USrLFQBaRd+hJw+dl7A47lJ3mUF96tPCoQb4LCQ7DVsg5U4J2TvqXLH9Gek/CCZ4 QsMkZDTQlZw4+JEDlzBgg/L7xVty8DadplTADMdjaRhFU3y8zKNJ85Ileokt7KVt IsBT4+S5HeZLvinZY95932DwAmFp1DtsyENd1BUXL06ddyvlQrFJ6NQaXji4fuDc EVQmMoTAqDujZFupMAux9vkUBDFj/hmaVD5F7j3+MWP87OCritw/IZn+2LgTaKoX EmttaYrDr2jJwIaGyw+H =a2/L -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Outside of the new ACPI-NFIT hot-add support this pull request is more notable for what it does not contain, than what it does. There were a handful of development topics this cycle, dax get_user_pages, dax fsync, and raw block dax, that need more more iteration and will wait for 4.5. The patches to make devm and the pmem driver NUMA aware have been in -next for several weeks. The hot-add support has not, but is contained to the NFIT driver and is passing unit tests. The coredump support is straightforward and was looked over by Jeff. All of it has received a 0day build success notification across 107 configs. Summary: - Add support for the ACPI 6.0 NFIT hot add mechanism to process updates of the NFIT at runtime. - Teach the coredump implementation how to filter out DAX mappings. - Introduce NUMA hints for allocations made by the pmem driver, and as a side effect all devm allocations now hint their NUMA node by default" * tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: coredump: add DAX filtering for FDPIC ELF coredumps coredump: add DAX filtering for ELF coredumps acpi: nfit: Add support for hot-add nfit: in acpi_nfit_init, break on a 0-length table pmem, memremap: convert to numa aware allocations devm_memremap_pages: use numa_mem_id devm: make allocations numa aware by default devm_memremap: convert to return ERR_PTR devm_memunmap: use devres_release() pmem: kill memremap_pmem() x86, mm: quiet arch_add_memory()	2015-11-10 12:07:22 -08:00
Jens Axboe	dece16353e	block: change ->make_request_fn() and users to return a queue cookie No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>	2015-11-07 10:40:46 -07:00
Dan Williams	4125a09b0a	block, libnvdimm, nvme: provide a built-in blk_integrity nop profile The libnvidmm-btt and nvme drivers use blk_integrity to reserve space for per-sector metadata, but sometimes without protection checksums. This property is generically useful, so teach the block core to internally specify a nop profile if one is not provided at registration time. Cc: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> [hch: kill the local nvme nop profile as well] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:45 -06:00
Dan Williams	9609b9942b	md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:37 -06:00
Martin K. Petersen	25520d55cd	block: Inline blk_integrity in struct gendisk Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:42:42 -06:00
Martin K. Petersen	0f8087ecde	block: Consolidate static integrity profile properties We previously made a complete copy of a device's data integrity profile even though several of the fields inside the blk_integrity struct are pointers to fixed template entries in t10-pi.c. Split the static and per-device portions so that we can reference the template directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:42:38 -06:00
Dan Williams	538ea4aa44	pmem, memremap: convert to numa aware allocations Given that pmem ranges come with numa-locality hints, arrange for the resulting driver objects to be obtained from node-local memory. Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-10-09 17:00:33 -04:00
Dan Williams	b36f47617f	devm_memremap: convert to return ERR_PTR Make devm_memremap consistent with the error return scheme of devm_memremap_pages to remove special casing in the pmem driver. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-10-09 17:00:33 -04:00
Dan Williams	a639315d6c	pmem: kill memremap_pmem() Now that the pmem-api is defined as "a set of apis that enables access to WB mapped pmem", the mapping type is implied. Remove the wrapper and push the functionality down into the pmem driver in preparation for adding support for direct-mapped pmem. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-10-09 17:00:32 -04:00
Ross Zwisler	ba8fe0f85e	pmem: add proper fencing to pmem_rw_page() pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the newly written data is durable. This flow was added to pmem_rw_bytes() and pmem_make_request() with this commit: commit `61031952f4` ("arch, x86: pmem api for ensuring durability of persistent memory updates") ...the pmem_rw_page() path was missed. Cc: <stable@vger.kernel.org> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-09-17 11:49:28 -04:00
Axel Lin	4ca8b57a0a	libnvdimm: pfn_devs: Fix locking in namespace_store Always take device_lock() before nvdimm_bus_lock() to prevent deadlock. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-09-17 11:47:50 -04:00
Axel Lin	4be9c1fc3d	libnvdimm: btt_devs: Fix locking in namespace_store Always take device_lock() before nvdimm_bus_lock() to prevent deadlock. Cc: <stable@vger.kernel.org> Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-09-17 11:37:16 -04:00
Linus Torvalds	12f03ee606	libnvdimm for 4.3: 1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic mechanism for adding device-driver-discovered memory regions to the kernel's direct map. This facility is used by the pmem driver to enable pfn_to_page() operations on the page frames returned by DAX ('direct_access' in 'struct block_device_operations'). For now, the 'memmap' allocation for these "device" pages comes from "System RAM". Support for allocating the memmap from device memory will arrive in a later kernel. 2/ Introduce memremap() to replace usages of ioremap_cache() and ioremap_wt(). memremap() drops the __iomem annotation for these mappings to memory that do not have i/o side effects. The replacement of ioremap_cache() with memremap() is limited to the pmem driver to ease merging the api change in v4.3. Completion of the conversion is targeted for v4.4. 3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem driver, update the VFS DAX implementation and PMEM api to provide persistence guarantees for kernel operations on a DAX mapping. 4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as cacheable to improve performance. 5/ Miscellaneous updates and fixes to libnvdimm including support for issuing "address range scrub" commands, clarifying the optimal 'sector size' of pmem devices, a clarification of the usage of the ACPI '_STA' (status) property for DIMM devices, and other minor fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5 JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR 1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1 o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn slsw6DkrWT60CRE42nbK =o57/ -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This update has successfully completed a 0day-kbuild run and has appeared in a linux-next release. The changes outside of the typical drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and the introduction of ZONE_DEVICE + devm_memremap_pages(). Summary: - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic mechanism for adding device-driver-discovered memory regions to the kernel's direct map. This facility is used by the pmem driver to enable pfn_to_page() operations on the page frames returned by DAX ('direct_access' in 'struct block_device_operations'). For now, the 'memmap' allocation for these "device" pages comes from "System RAM". Support for allocating the memmap from device memory will arrive in a later kernel. - Introduce memremap() to replace usages of ioremap_cache() and ioremap_wt(). memremap() drops the __iomem annotation for these mappings to memory that do not have i/o side effects. The replacement of ioremap_cache() with memremap() is limited to the pmem driver to ease merging the api change in v4.3. Completion of the conversion is targeted for v4.4. - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem driver, update the VFS DAX implementation and PMEM api to provide persistence guarantees for kernel operations on a DAX mapping. - Convert the ACPI NFIT 'BLK' driver to map the block apertures as cacheable to improve performance. - Miscellaneous updates and fixes to libnvdimm including support for issuing "address range scrub" commands, clarifying the optimal 'sector size' of pmem devices, a clarification of the usage of the ACPI '_STA' (status) property for DIMM devices, and other minor fixes" * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits) libnvdimm, pmem: direct map legacy pmem by default libnvdimm, pmem: 'struct page' for pmem libnvdimm, pfn: 'struct page' provider infrastructure x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB add devm_memremap_pages mm: ZONE_DEVICE for "device memory" mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h dax: drop size parameter to ->direct_access() nd_blk: change aperture mapping from WC to WB nvdimm: change to use generic kvfree() pmem, dax: have direct_access use __pmem annotation dax: update I/O path to do proper PMEM flushing pmem: add copy_from_iter_pmem() and clear_pmem() pmem, x86: clean up conditional pmem includes pmem: remove layer when calling arch_has_wmb_pmem() pmem, x86: move x86 PMEM API to new pmem.h header libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option pmem: switch to devm_ allocations devres: add devm_memremap libnvdimm, btt: write and validate parent_uuid ...	2015-09-08 14:35:59 -07:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
Dan Williams	004f1afbe1	libnvdimm, pmem: direct map legacy pmem by default The expectation is that the legacy / non-standard pmem discovery method (e820 type-12) will only ever be used to describe small quantities of persistent memory. Larger capacities will be described via the ACPI NFIT. When "allocate struct page from pmem" support is added this default policy can be overridden by assigning a legacy pmem namespace to a pfn device, however this would be only be necessary if a platform used the legacy mechanism to define a very large range. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-28 23:40:05 -04:00
Dan Williams	32ab0a3f51	libnvdimm, pmem: 'struct page' for pmem Enable the pmem driver to handle PFN device instances. Attaching a pmem namespace to a pfn device triggers the driver to allocate and initialize struct page entries for pmem. Memory capacity for this allocation comes exclusively from RAM for now which is suitable for low PMEM to RAM ratios. This mechanism will be expanded later for setting an "allocate from PMEM" policy. Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-28 23:40:04 -04:00
Dan Williams	e1455744b2	libnvdimm, pfn: 'struct page' provider infrastructure Implement the base infrastructure for libnvdimm PFN devices. Similar to BTT devices they take a namespace as a backing device and layer functionality on top. In this case the functionality is reserving space for an array of 'struct page' entries to be handed out through pfn_to_page(). For now this is just the basic libnvdimm-device-model for configuring the base PFN device. As the namespace claiming mechanism for PFN devices is mostly identical to BTT devices drivers/nvdimm/claim.c is created to house the common bits. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-28 23:39:36 -04:00
Dan Williams	96601adb74	x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB Given that a write-back (WB) mapping plus non-temporal stores is expected to be the most efficient way to access PMEM, update the definition of ARCH_HAS_PMEM_API to imply arch support for WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to the direct map and mapping it with struct page. The above clarification for X86_64 means that memcpy_to_pmem() is permitted to use the non-temporal arch_memcpy_to_pmem() rather than needlessly fall back to default_memcpy_to_pmem() when the pcommit instruction is not available. When arch_memcpy_to_pmem() is not guaranteed to flush writes out of cache, i.e. on older X86_32 implementations where non-temporal stores may just dirty cache, ARCH_HAS_PMEM_API is simply disabled. The default fall back for persistent memory handling remains. Namely, map it with the WT (write-through) cache-type and hope for the best. arch_has_pmem_api() is updated to only indicate whether the arch provides the proper helpers to meet the minimum "writes are visible outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()". Code that cares whether wmb_pmem() actually flushes writes to pmem must now call arch_has_wmb_pmem() directly. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> [hch: set ARCH_HAS_PMEM_API=n on x86_32] Reviewed-by: Christoph Hellwig <hch@lst.de> [toshi: x86_32 compile fixes] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-27 19:40:59 -04:00
Dan Williams	cb389b9c0e	dax: drop size parameter to ->direct_access() None of the implementations currently use it. The common bdev_direct_access() entry point handles all the size checks before calling ->direct_access(). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-27 19:40:58 -04:00
Dan Williams	4a9bf88a5c	Merge branch 'pmem-api' into libnvdimm-for-next	2015-08-27 19:40:26 -04:00
yalin wang	a06a757652	nvdimm: change to use generic kvfree() Signed-off-by: yalin wang <yalin.wang2010@gmail.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-27 19:35:48 -04:00
Ross Zwisler	e2e05394e4	pmem, dax: have direct_access use __pmem annotation Update the annotation for the kaddr pointer returned by direct_access() so that it is a __pmem pointer. This is consistent with the PMEM driver and with how this direct_access() pointer is used in the DAX code. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-20 14:07:24 -04:00
Dan Williams	7a67832c7e	libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option We currently register a platform device for e820 type-12 memory and register a nvdimm bus beneath it. Registering the platform device triggers the device-core machinery to probe for a driver, but that search currently comes up empty. Building the nvdimm-bus registration into the e820_pmem platform device registration in this way forces libnvdimm to be built-in. Instead, convert the built-in portion of CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the rest of the logic to the driver for e820_pmem, for the following reasons: 1/ Letting e820_pmem support be a module allows building and testing libnvdimm.ko changes without rebooting 2/ All the normal policy around modules can be applied to e820_pmem (unbind to disable and/or blacklisting the module from loading by default) 3/ Moving the driver to a generic location and converting it to scan "iomem_resource" rather than "e820.map" means any other architecture can take advantage of this simple nvdimm resource discovery mechanism by registering a resource named "Persistent Memory (legacy)" Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-19 00:34:34 -04:00
Christoph Hellwig	708ab62bef	pmem: switch to devm_ allocations Signed-off-by: Christoph Hellwig <hch@lst.de> [djbw: tools/testing/nvdimm/ and memunmap_pmem support] Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-14 16:01:21 -04:00
Vishal Verma	6ec689542b	libnvdimm, btt: write and validate parent_uuid When a BTT is instantiated on a namespace it must validate the namespace uuid matches the 'parent_uuid' stored in the btt superblock. This property enforces that changing the namespace UUID invalidates all former BTT instances on that storage. For "IO namespaces" that don't have a label or UUID, the parent_uuid is set to zero, and this validation is skipped. For such cases, old BTTs have to be invalidated by forcing the namespace to raw mode, and overwriting the BTT info blocks. Based on a patch by Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-14 13:43:04 -04:00
Vishal Verma	ab45e76327	libnvdimm, btt: consolidate arena validation Use arena_is_valid as a common routine for checking the validity of an info block from both discover_arenas, and nd_btt_probe. As a result, don't check for validity of the BTT's UUID, and lbasize. The checksum in the BTT info block guarantees self-consistency, and when we're called from nd_btt_probe, we don't have a valid uuid or lbasize available to check against. Also cleanup to return a bool instead of an int. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-14 13:43:04 -04:00
Vishal Verma	fbde1414ac	libnvdimm, btt: clean up internal interfaces Consolidate the parameters passed to arena_is_valid into just nd_btt, and an info block to increase re-usability. Similarly, btt_arena_write_layout doesn't need to be passed a uuid, as it can be obtained from arena->nd_btt. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-08-14 13:43:04 -04:00
Randy Dunlap	f6ef5a2a50	nvdimm: fix inline function return type warning Fix multiple build warnings when CONFIG_BTT is not enabled: In file included from ../drivers/nvdimm/bus.c:29:0: ../drivers/nvdimm/nd.h:169:15: warning: return type defaults to 'int' [-Wreturn-type] static inline nd_btt_probe(struct nd_namespace_common ndns, void drvdata) ^ Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-nvdimm@lists.01.org Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-07-31 18:17:09 -04:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
Vishal Verma	6b47496a6f	libnvdimm, pmem: Change pmem physical sector size to PAGE_SIZE Based on a patch: `c8fa317` brd: Request from fdisk 4k alignment by Boaz Harrosh, allow fdisk to create properly aligned partitions for DAX. This will also cause mkfs.ext4 to emit a warning if using a file system block size of less than PAGE_SIZE. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Elliott, Robert <Elliott@hp.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Boaz Harrosh <boaz@plexistor.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-07-27 22:53:19 -04:00
Dan Williams	5e32940621	libnvdimm, btt: sparse fix Fix: drivers/nvdimm/btt.c:635:29: warning: restricted __le64 degrades to integer Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-07-27 22:53:19 -04:00
Dan Williams	8ca243536d	libnvdimm: fix namespace seed creation A new BLK namespace "seed" device is created whenever the current seed is successfully probed. However, if that namespace is assigned to a BTT it may never directly experience a successful probe as it is a subordinate device to a BTT configuration. The effect of the current code is that no new namespaces can be instantiated, after the seed namespace, to consume available BLK DPA capacity. Fix this by treating a successful BTT probe event as a successful probe event for the backing namespace. Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-07-25 09:57:56 -07:00
Axel Lin	daa1dee405	nvdimm: Fix return value of nvdimm_bus_init() if class_create() fails Return proper error if class_create() fails. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-30 14:30:34 -04:00
Dan Williams	af834d457d	libnvdimm: smatch cleanups in __nd_ioctl Drop use of access_ok() since we are already using copy_{to\|from}_user() which do their own access_ok(). Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-30 14:10:09 -04:00
Ross Zwisler	61031952f4	arch, x86: pmem api for ensuring durability of persistent memory updates Based on an original patch by Ross Zwisler [1]. Writes to persistent memory have the potential to be posted to cpu cache, cpu write buffers, and platform write buffers (memory controller) before being committed to persistent media. Provide apis, memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to pmem and assert that it is durable in PMEM (a persistent linear address range). A '__pmem' attribute is added so sparse can track proper usage of pointers to pmem. This continues the status quo of pmem being x86 only for 4.2, but reworks to ioremap, and wider implementation of memremap() will enable other archs in 4.3. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> [djbw: various reworks] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Toshi Kani	74ae66c3b1	libnvdimm: Add sysfs numa_node to NVDIMM devices Add support of sysfs 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x. An example of numa_node values on a 2-socket system with a single NVDIMM range on each socket is shown below. /sys/bus/nd/devices \|-- btt0.0/numa_node:0 \|-- btt1.0/numa_node:1 \|-- btt1.1/numa_node:1 \|-- namespace0.0/numa_node:0 \|-- namespace1.0/numa_node:1 \|-- region0/numa_node:0 \|-- region1/numa_node:1 These numa_node files are then linked under the block class of their device names. /sys/class/block/pmem0/device/numa_node:0 /sys/class/block/pmem1s/device/numa_node:1 This enables numactl(8) to accept 'block:' and 'file:' paths of pmem and btt devices as shown in the examples below. numactl --preferred block:pmem0 --show numactl --preferred file:/dev/pmem1s --show Signed-off-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Toshi Kani	41d7a6d637	libnvdimm: Set numa_node to NVDIMM devices ACPI NFIT table has System Physical Address Range Structure entries that describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is set in the flags. Change acpi_nfit_register_region() to map a proximity ID to its node ID, and set it to a new numa_node field of nd_region_desc, which is then conveyed to the nd_region device. The device core arranges for btt and namespace devices to inherit their node from their parent region. Signed-off-by: Toshi Kani <toshi.kani@hp.com> [djbw: move set_dev_node() from region.c to bus.c] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Dan Williams	5813882094	libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only Upon detection of an unarmed dimm in a region, arrange for descendant BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked "unarmed" via flags passed by platform firmware (NFIT). The flags in the NFIT memory device sub-structure indicate the state of the data on the nvdimm relative to its energy source or last "flush to persistence". For the most part there is nothing the driver can do but advertise the state of these flags in sysfs and emit a message if firmware indicates that the contents of the device may be corrupted. However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for the block devices incorporating that nvdimm to be marked read-only. This is a safe default as the data is still available and new writes are held off until the administrator either forces read-write mode, or the energy source becomes armed. A 'read_only' attribute is added to REGION devices to allow for overriding the default read-only policy of all descendant block devices. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Dan Williams	0f51c4fa7f	pmem: flag pmem block devices as non-rotational ...since they are effectively SSDs as far as userspace is concerned. Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Dan Williams	f0dc089ce2	libnvdimm: enable iostat This is disabled by default as the overhead is prohibitive, but if the user takes the action to turn it on we'll oblige. Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Dan Williams	edc870e546	pmem: make_request cleanups Various cleanups: 1/ Kill the BUG_ON since we've already told the block layer we don't support DISCARD on all these drivers. 2/ Kill the 'rw' variable, no need to cache it. 3/ Kill the local 'sector' variable. bio_for_each_segment() is already advancing the iterator's sector number by the bio_vec length. 4/ Kill the check for accessing past the end of device generic_make_request_checks() already does that. Suggested-by: Christoph Hellwig <hch@lst.de> [hch: kill access past end of the device check] Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Dan Williams	43d3fa3a04	libnvdimm, pmem: fix up max_hw_sectors There is no hardware limit to enforce on the size of the i/o that can be passed to an nvdimm block device, so set it to UINT_MAX. Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Vishal Verma	fcae695737	libnvdimm, blk: add support for blk integrity Support multiple block sizes (sector + metadata) for nd_blk in the same way as done for the BTT. Add the idea of an 'internal' lbasize, which is properly aligned and padded, and store metadata in this space. Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Vishal Verma	41cd8b70c3	libnvdimm, btt: add support for blk integrity Support multiple block sizes (sector + metadata) using the blk integrity framework. This registers a new integrity template that defines the protection information tuple size based on the configured metadata size, and simply acts as a passthrough for protection information generated by another layer. The metadata is written to the storage as-is, and read back with each sector. Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Ross Zwisler	047fc8a1f9	libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory The libnvdimm implementation handles allocating dimm address space (DPA) between PMEM and BLK mode interfaces. After DPA has been allocated from a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O as a struct bio based block device. Unlike PMEM, BLK is required to handle platform specific details like mmio register formats and memory controller interleave. For this reason the libnvdimm generic nd_blk driver calls back into the bus provider to carry out the I/O. This initial implementation handles the BLK interface defined by the ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from DCR (dimm control region), BDW (block data window), IDT (interleave descriptor) NFIT structures and the hardware register format. [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jens Axboe <axboe@fb.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Vishal Verma	5212e11fde	nd_btt: atomic sector updates BTT stands for Block Translation Table, and is a way to provide power fail sector atomicity semantics for block devices that have the ability to perform byte granularity IO. It relies on the capability of libnvdimm namespace devices to do byte aligned IO. The BTT works as a stacked blocked device, and reserves a chunk of space from the backing device for its accounting metadata. It is a bio-based driver because all IO is done synchronously, and there is no queuing or asynchronous completions at either the device or the driver level. The BTT uses 'lanes' to index into various 'on-disk' data structures, and lanes also act as a synchronization mechanism in case there are more CPUs than available lanes. We did a comparison between two lane lock strategies - first where we kept an atomic counter around that tracked which was the last lane that was used, and 'our' lane was determined by atomically incrementing that. That way, for the nr_cpus > nr_lanes case, theoretically, no CPU would be blocked waiting for a lane. The other strategy was to use the cpu number we're scheduled on to and hash it to a lane number. Theoretically, this could block an IO that could've otherwise run using a different, free lane. But some fio workloads showed that the direct cpu -> lane hash performed faster than tracking 'last lane' - my reasoning is the cache thrash caused by moving the atomic variable made that approach slower than simply waiting out the in-progress IO. This supports the conclusion that the driver can be a very simple bio-based one that does synchronous IOs instead of queuing. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jens Axboe <axboe@fb.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg KH <gregkh@linuxfoundation.org> [jmoyer: fix nmi watchdog timeout in btt_map_init] [jmoyer: move btt initialization to module load path] [jmoyer: fix memory leak in the btt initialization path] [jmoyer: Don't overwrite corrupted arenas] Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-26 11:23:38 -04:00
Dan Williams	8c2f7e8658	libnvdimm: infrastructure for btt devices NVDIMM namespaces, in addition to accepting "struct bio" based requests, also have the capability to perform byte-aligned accesses. By default only the bio/block interface is used. However, if another driver can make effective use of the byte-aligned capability it can claim namespace interface and use the byte-aligned ->rw_bytes() interface. The BTT driver is the initial first consumer of this mechanism to allow adding atomic sector update semantics to a pmem or blk namespace. This patch is the sysfs infrastructure to allow configuring a BTT instance for a namespace. Enabling that BTT and performing i/o is in a subsequent patch. Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-25 04:20:04 -04:00
Dan Williams	0ba1c63489	libnvdimm: write blk label set After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been set to valid values the labels on the dimm can be updated. The difference with the pmem case is that blk namespaces are limited to one dimm and can cover discontiguous ranges in dpa space. Also, after allocating label slots, it is useful for userspace to know how many slots are left. Export this information in sysfs. Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Neil Brown <neilb@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-24 21:24:10 -04:00
Dan Williams	f524bf271a	libnvdimm: write pmem label set After 'uuid', 'size', and optionally 'alt_name' have been set to valid values the labels on the dimms can be updated. Write procedure is: 1/ Allocate and write new labels in the "next" index 2/ Free the old labels in the working copy 3/ Write the bitmap and the label space on the dimm 4/ Write the index to make the update valid Label ranges directly mirror the dpa resource values for the given label_id of the namespace. Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Neil Brown <neilb@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-24 21:24:10 -04:00
Dan Williams	1b40e09a12	libnvdimm: blk labels and namespace instantiation A blk label set describes a namespace comprised of one or more discontiguous dpa ranges on a single dimm. They may alias with one or more pmem interleave sets that include the given dimm. This is the runtime/volatile configuration infrastructure for sysfs manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'. A later patch will make these settings persistent by writing back the label(s). Unlike pmem namespaces, multiple blk namespaces can be created per region. Once a blk namespace has been created a new seed device (unconfigured child of a parent blk region) is instantiated. As long as a region has 'available_size' != 0 new child namespaces may be created. Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Neil Brown <neilb@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2015-06-24 21:24:10 -04:00

... 4 5 6 7 8 ...

562 Commits