This feature bit restricts older clients from performing certain
maintenance operations against an image (e.g. clone, snap create).
krbd does not perform maintenance operations.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
If rbd_img_request_submit() fails, parent_request->obj_request is
NULLed out, triggering an assert in rbd_obj_request_put():
rbd_img_request_put(parent_request)
rbd_parent_request_destroy
rbd_obj_request_put(NULL)
Just remove it -- parent_request->obj_request will be put in
rbd_parent_request_destroy().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
smart1,2.h is unused since commit d436641439 ("cpqarray: remove it from the kernel")
Remove it from tree.
Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'struct frame' uses two variables to store the sent timestamp - 'struct
timeval' and jiffies. jiffies is used to avoid discrepancies caused by
updates to system time. 'struct timeval' is deprecated because it uses
32-bit representation for seconds which will overflow in year 2038.
This patch does the following:
- Replace the use of 'struct timeval' and jiffies with ktime_t, which
is the recommended type for timestamping
- ktime_t provides both long range (like jiffies) and high resolution
(like timeval). Using ktime_get (monotonic time) instead of wall-clock
time prevents any discprepancies caused by updates to system time.
[updates by Arnd below]
The original patch from Tina never went anywhere as we discussed how
to keep the impact on performance minimal. I've started over now but
arrived at basically the same patch that she had originally, except for
an slightly improved tsince_hr() function. I'm making it more robust
against overflows, and also optimize explicitly for the common case
in which a frame is less than 4.2 seconds old, using only a 32-bit
division in that case.
This should make the new version more efficient than the old code,
since we replace the existing two 32-bit division in do_gettimeofday()
plus one multiplication with a single single 32-bit division in
tsince_hr() and drop the double bookkeeping. It's also more efficient
than the ktime_get_us() API we discussed before, since that would
also rely on multiple divisions.
Link: https://lists.linaro.org/pipermail/y2038/2015-May/000276.html
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Cc: Ed Cashin <ed.cashin@acm.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull ceph fixes from Ilya Dryomov:
"Two rbd fixes for 4.12 and 4.2 issues respectively, marked for
stable"
* tag 'ceph-for-4.15-rc8' of git://github.com/ceph/ceph-client:
rbd: set max_segments to USHRT_MAX
rbd: reacquire lock should update lock owner client id
Selecting FAULT_INJECTION causes a Kconfig warning when CONFIG_DEBUG_KERNEL
is not set:
warning: (BLK_DEV_NULL_BLK && DRM_I915_SELFTEST) selects FAULT_INJECTION which has unmet direct dependencies (DEBUG_KERNEL)
The other drivers that use FAULT_INJECTION tend to have a separate
Kconfig symbol for turning on that feature, so let's do the same
thing here. This may add a bit more complexity than we like, but
it avoids the warning and is more consistent with the rest of the
kernel.
Fixes: 93b570464c ("null_blk: add option for managing IO timeouts")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the fault injection framework to provide a way for null_blk
to configure timeouts. This only works for queue_mode 1 and 2,
since the bio mode doesn't have code for tracking timeouts.
Let's say you want to have a 10% chance of timing out every
100,000 requests, and for 5 total timeouts, you could do:
modprobe null_blk timeout="100000,10,0,5"
This is useful for adding blktests to test that IO timeouts
are handled appropriately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is needed to ensure that we actually handle timeouts.
Without it, the queue_mode=1 path will never call blk_add_timer(),
and the queue_mode=2 path will continually just return
EH_RESET_TIMER and we never actually complete the offending request.
This was used to test the new timeout code, and the changes around
killing off REQ_ATOM_COMPLETE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit d3834fefcf ("rbd: bump queue_max_segments") bumped
max_segments (unsigned short) to max_hw_sectors (unsigned int).
max_hw_sectors is set to the number of 512-byte sectors in an object
and overflows unsigned short for 32M (largest possible) objects, making
the block layer resort to handing us single segment (i.e. single page
or even smaller) bios in that case.
Cc: stable@vger.kernel.org
Fixes: d3834fefcf ("rbd: bump queue_max_segments")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Otherwise, future operations on this RBD using exclusive-lock are
going to require the lock from a non-existent client id.
Cc: stable@vger.kernel.org
Fixes: 14bb211d32 ("rbd: support updating the lock cookie without releasing the lock")
Link: http://tracker.ceph.com/issues/19929
Signed-off-by: Florian Margaine <florian@platform.sh>
[idryomov@gmail.com: rbd_set_owner_cid() call, __rbd_lock() helper]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
范龙飞 reports that KASAN can report a use-after-free in __lock_acquire.
The reason is due to insufficient serialization in lo_release(), which
will continue to use the loop device even after it has decremented the
lo_refcnt to zero.
In the meantime, another process can come in, open the loop device
again as it is being shut down. Confusion ensues.
Reported-by: 范龙飞 <long7573@126.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When CONFIG_KASAN is set, all the local variables in this function are
allocated on the stack together, leading to a warning about possible
kernel stack overflow:
drivers/block/DAC960.c: In function 'DAC960_gam_ioctl':
drivers/block/DAC960.c:7061:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
By splitting up the function into smaller chunks, we can avoid that and
make the code slightly more readable at the same time. The coding style
in this file is completely nonstandard, and I chose to not touch that
at all, leaving the unconventional intendation unchanged to make it
easier to review the diff.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized
instead of just before that pointer has been initialized. This patch
avoids that the following command
pktsetup 1 /dev/sr0
triggers the following kernel crash:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
IP: pkt_setup_dev+0x2db/0x670 [pktcdvd]
CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1
Call Trace:
pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd]
do_vfs_ioctl+0x8e/0x670
SyS_ioctl+0x3c/0x70
entry_SYSCALL_64_fastpath+0x23/0x9a
Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: commit ca18d6f769 ("block: Make most scsi_req_init() calls implicit")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 523e1d399c ("block: make gendisk hold a reference to its queue")
modified add_disk() and disk_release() but did not update any of the
error paths that trigger a put_disk() call after disk->queue has been
assigned. That introduced the following behavior in the pktcdvd driver
if pkt_new_dev() fails:
Kernel BUG at 00000000e98fd882 [verbose debug info unavailable]
Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL,
fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev()
error path.
Fixes: commit 523e1d399c ("block: make gendisk hold a reference to its queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: <stable@vger.kernel.org> # v3.2
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With rrpc to be removed, the null_blk lightnvm support is no longer
functional. Remove the lightnvm implementation and maybe add it to
another module in the future if someone takes on the challenge.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 966a967116 randomly added alignment to this structure, but
it's actually detrimental to performance of null_blk. Test case:
Running on both the home and remote node shows a ~5% degradation
in performance.
While in there, move blk_status_t to the hole after the integer tag
in the nullb_cmd structure. After this patch, we shrink the size
from 192 to 152 bytes.
Fixes: 966a967116 ("smp: Avoid using two cache lines for struct call_single_data")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"A selection of fixes/changes that should make it into this series.
This contains:
- NVMe, two merges, containing:
- pci-e, rdma, and fc fixes
- Device quirks
- Fix for a badblocks leak in null_blk
- bcache fix from Rui Hua for a race condition regression where
-EINTR was returned to upper layers that didn't expect it.
- Regression fix for blktrace for a bug introduced in this series.
- blktrace cleanup for cgroup id.
- bdi registration error handling.
- Small series with cleanups for blk-wbt.
- Various little fixes for typos and the like.
Nothing earth shattering, most important are the NVMe and bcache fixes"
* 'for-linus' of git://git.kernel.dk/linux-block: (34 commits)
nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()
nvme-rdma: fix memory leak during queue allocation
blktrace: fix trace mutex deadlock
nvme-rdma: Use mr pool
nvme-rdma: Check remotely invalidated rkey matches our expected rkey
nvme-rdma: wait for local invalidation before completing a request
nvme-rdma: don't complete requests before a send work request has completed
nvme-rdma: don't suppress send completions
bcache: check return value of register_shrinker
bcache: recover data from backing when data is clean
bcache: Fix building error on MIPS
bcache: add a comment in journal bucket reading
nvme-fc: don't use bit masks for set/test_bit() numbers
blk-wbt: fix comments typo
blk-wbt: move wbt_clear_stat to common place in wbt_done
blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
blk-wbt: remove duplicated setting in wbt_init
nvme-pci: add quirk for delay before CHK RDY for WDC SN200
block: remove useless assignment in bio_split
null_blk: fix dev->badblocks leak
...
With all callbacks converted, and the timer callback prototype
switched over, the TIMER_FUNC_TYPE cast is no longer needed,
so remove it. Conversion was done with the following scripts:
perl -pi -e 's|\(TIMER_FUNC_TYPE\)||g' \
$(git grep TIMER_FUNC_TYPE | cut -d: -f1 | sort -u)
perl -pi -e 's|\(TIMER_DATA_TYPE\)||g' \
$(git grep TIMER_DATA_TYPE | cut -d: -f1 | sort -u)
The now unused macros are also dropped from include/linux/timer.h.
Signed-off-by: Kees Cook <keescook@chromium.org>
This mechanically converts all remaining cases of ancient open-coded timer
setup with the old setup_timer() API, which is the first step in timer
conversions. This has no behavioral changes, since it ultimately just
changes the order of assignment to fields of struct timer_list when
finding variations of:
init_timer(&t);
f.function = timer_callback;
t.data = timer_callback_arg;
to be converted into:
setup_timer(&t, timer_callback, timer_callback_arg);
The conversion is done with the following Coccinelle script, which
is an improved version of scripts/cocci/api/setup_timer.cocci, in the
following ways:
- assignments-before-init_timer() cases
- limit the .data case removal to the specific struct timer_list instance
- handling calls by dereference (timer->field vs timer.field)
spatch --very-quiet --all-includes --include-headers \
-I ./arch/x86/include -I ./arch/x86/include/generated \
-I ./include -I ./arch/x86/include/uapi \
-I ./arch/x86/include/generated/uapi -I ./include/uapi \
-I ./include/generated/uapi --include ./include/linux/kconfig.h \
--dir . \
--cocci-file ~/src/data/setup_timer.cocci
@fix_address_of@
expression e;
@@
init_timer(
-&(e)
+&e
, ...)
// Match the common cases first to avoid Coccinelle parsing loops with
// "... when" clauses.
@match_immediate_function_data_after_init_timer@
expression e, func, da;
@@
-init_timer
+setup_timer
( \(&e\|e\)
+, func, da
);
(
-\(e.function\|e->function\) = func;
-\(e.data\|e->data\) = da;
|
-\(e.data\|e->data\) = da;
-\(e.function\|e->function\) = func;
)
@match_immediate_function_data_before_init_timer@
expression e, func, da;
@@
(
-\(e.function\|e->function\) = func;
-\(e.data\|e->data\) = da;
|
-\(e.data\|e->data\) = da;
-\(e.function\|e->function\) = func;
)
-init_timer
+setup_timer
( \(&e\|e\)
+, func, da
);
@match_function_and_data_after_init_timer@
expression e, e2, e3, e4, e5, func, da;
@@
-init_timer
+setup_timer
( \(&e\|e\)
+, func, da
);
... when != func = e2
when != da = e3
(
-e.function = func;
... when != da = e4
-e.data = da;
|
-e->function = func;
... when != da = e4
-e->data = da;
|
-e.data = da;
... when != func = e5
-e.function = func;
|
-e->data = da;
... when != func = e5
-e->function = func;
)
@match_function_and_data_before_init_timer@
expression e, e2, e3, e4, e5, func, da;
@@
(
-e.function = func;
... when != da = e4
-e.data = da;
|
-e->function = func;
... when != da = e4
-e->data = da;
|
-e.data = da;
... when != func = e5
-e.function = func;
|
-e->data = da;
... when != func = e5
-e->function = func;
)
... when != func = e2
when != da = e3
-init_timer
+setup_timer
( \(&e\|e\)
+, func, da
);
@r1 exists@
expression t;
identifier f;
position p;
@@
f(...) { ... when any
init_timer@p(\(&t\|t\))
... when any
}
@r2 exists@
expression r1.t;
identifier g != r1.f;
expression e8;
@@
g(...) { ... when any
\(t.data\|t->data\) = e8
... when any
}
// It is dangerous to use setup_timer if data field is initialized
// in another function.
@script:python depends on r2@
p << r1.p;
@@
cocci.include_match(False)
@r3@
expression r1.t, func, e7;
position r1.p;
@@
(
-init_timer@p(&t);
+setup_timer(&t, func, 0UL);
... when != func = e7
-t.function = func;
|
-t.function = func;
... when != func = e7
-init_timer@p(&t);
+setup_timer(&t, func, 0UL);
|
-init_timer@p(t);
+setup_timer(t, func, 0UL);
... when != func = e7
-t->function = func;
|
-t->function = func;
... when != func = e7
-init_timer@p(t);
+setup_timer(t, func, 0UL);
)
Signed-off-by: Kees Cook <keescook@chromium.org>
This changes all DEFINE_TIMER() callbacks to use a struct timer_list
pointer instead of unsigned long. Since the data argument has already been
removed, none of these callbacks are using their argument currently, so
this renames the argument to "unused".
Done using the following semantic patch:
@match_define_timer@
declarer name DEFINE_TIMER;
identifier _timer, _callback;
@@
DEFINE_TIMER(_timer, _callback);
@change_callback depends on match_define_timer@
identifier match_define_timer._callback;
type _origtype;
identifier _origarg;
@@
void
-_callback(_origtype _origarg)
+_callback(struct timer_list *unused)
{ ... }
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull ceph updates from Ilya Dryomov:
"We have a set of file locking improvements from Zheng, rbd rw/ro state
handling code cleanup from myself and some assorted CephFS fixes from
Jeff.
rbd now defaults to single-major=Y, lifting the limit of ~240 rbd
images per host for everyone"
* tag 'ceph-for-4.15-rc1' of git://github.com/ceph/ceph-client:
rbd: default to single-major device number scheme
libceph: don't WARN() if user tries to add invalid key
rbd: set discard_alignment to zero
ceph: silence sparse endianness warning in encode_caps_cb
ceph: remove the bump of i_version
ceph: present consistent fsid, regardless of arch endianness
ceph: clean up spinlocking and list handling around cleanup_cap_releases()
rbd: get rid of rbd_mapping::read_only
rbd: fix and simplify rbd_ioctl_set_ro()
ceph: remove unused and redundant variable dropping
ceph: mark expected switch fall-throughs
ceph: -EINVAL on decoding failure in ceph_mdsc_handle_fsmap()
ceph: disable cached readdir after dropping positive dentry
ceph: fix bool initialization/comparison
ceph: handle 'session get evicted while there are file locks'
ceph: optimize flock encoding during reconnect
ceph: make lock_to_ceph_filelock() static
ceph: keep auth cap when inode has flocks or posix locks
Pull more block layer updates from Jens Axboe:
"A followup pull request, with some parts that either needed a bit more
testing before going in, merge sync, or just later arriving fixes.
This contains:
- Timer related updates from Kees. These were purposefully delayed
since I didn't want to pull in a later v4.14-rc tag to my block
tree.
- ide-cd prep sense buffer fix from Bart. Also delayed, as not to
clash with the late fix we put into 4.14-rc.
- Small BFQ updates series from Luca and Paolo.
- Single nvmet fix from James, fixing a non-functional case there.
- Bio fast clone fix from Michael, which made bcache return the wrong
data for some cases.
- Legacy IO path regression hang fix from Ming"
* 'for-linus' of git://git.kernel.dk/linux-block:
bio: ensure __bio_clone_fast copies bi_partno
nvmet_fc: fix better length checking
block: wake up all tasks blocked in get_request()
block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
block, bfq: update blkio stats outside the scheduler lock
block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
doc, block, bfq: update max IOPS sustainable with BFQ
ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
md: Convert timers to use timer_setup()
block: swim3: Convert timers to use timer_setup()
block/aoe: Convert timers to use timer_setup()
amifloppy: Convert timers to use timer_setup()
block/floppy: Convert callback to pass timer_list
Pull libnvdimm and dax updates from Dan Williams:
"Save for a few late fixes, all of these commits have shipped in -next
releases since before the merge window opened, and 0day has given a
build success notification.
The ext4 touches came from Jan, and the xfs touches have Darrick's
reviewed-by. An xfstest for the MAP_SYNC feature has been through
a few round of reviews and is on track to be merged.
- Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may
be required to satisfy a write fault to also be flushed ("on disk")
before the kernel returns to userspace from the fault handler.
Effectively every write-fault that dirties metadata completes an
fsync() before returning from the fault handler. The new
MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
is validated as supported by the filesystem's ->mmap() file
operation.
- Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
This enables interoperability with environments that only implement
the standardized methods.
- Add support for the ACPI 6.2 NVDIMM media error injection methods.
- Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
latch last shutdown status, firmware update, SMART error injection,
and SMART alarm threshold control.
- Cleanup physical address information disclosures to be root-only.
- Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
- Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
- 957ac8c421 ("dax: fix PMD faults on zero-length files"):
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
- a39e596baa ("xfs: support for synchronous DAX faults") and
7b565c9f96 ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"
* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
acpi, nfit: add 'Enable Latch System Shutdown Status' command support
dax: fix general protection fault in dax_alloc_inode
dax: fix PMD faults on zero-length files
dax: stop requiring a live device for dax_flush()
brd: remove dax support
dax: quiet bdev_dax_supported()
fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
tools/testing/nvdimm: unit test clear-error commands
acpi, nfit: validate commands against the device type
tools/testing/nvdimm: stricter bounds checking for error injection commands
xfs: support for synchronous DAX faults
xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
ext4: Support for synchronous DAX faults
ext4: Simplify error handling in ext4_dax_huge_fault()
dax: Implement dax_finish_sync_fault()
dax, iomap: Add support for synchronous faults
mm: Define MAP_SYNC and VM_SYNC flags
dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
dax: Allow dax_iomap_fault() to return pfn
dax: Fix comment describing dax_iomap_fault()
...
With fast swap storage, the platform wants to use swap more aggressively
and swap-in is crucial to application latency.
The rw_page() based synchronous devices like zram, pmem and btt are such
fast storage. When I profile swapin performance with zram lz4
decompress test, S/W overhead is more than 70%. Maybe, it would be
bigger in nvdimm.
This patchset reduces swap-in latency by skipping swapcache if the swap
device is a synchronous device like a rw_page() based device.
It enhances by 45% my swapin test (5G sequential swapin, no readahead)
from 2.41sec to 1.64sec.
This patch (of 4):
Commit 19b7ccf865 ("block: get rid of blk_integrity_revalidate()")
fixed a weird thing (i.e., reset BDI_CAP_STABLE_WRITES flag
unconditionally whenever revalidat_disk is called) so zram doesn't need
to reset the flag any more when revalidating the bdev. Instead, set the
flag just once when the zram device is created.
It shouldn't change any behavior.
Link: http://lkml.kernel.org/r/1505886205-9671-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts the amifloppy driver to pass the timer pointer to the
callback instead of the drive number (and flags). It eliminates the
decusagecounter flag, as it was unused, and drops the ininterrupt flag
which appeared to be a needless optimization. The drive can then be
calculated from the offset of the timer in the drive timer array.
Additionally moves to a static data variable instead of the
soon-to-be-gone timer->data field.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull dma-mapping updates from Christoph Hellwig:
- turn dma_cache_sync into a dma_map_ops instance and remove
implementation that purely are dead because the architecture doesn't
support noncoherent allocations
- add a flag for busses that need DMA configuration (Robin Murphy)
* tag 'dma-mapping-4.15' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: turn dma_cache_sync into a dma_map_ops method
sh: make dma_cache_sync a no-op
xtensa: make dma_cache_sync a no-op
unicore32: make dma_cache_sync a no-op
powerpc: make dma_cache_sync a no-op
mn10300: make dma_cache_sync a no-op
microblaze: make dma_cache_sync a no-op
ia64: make dma_cache_sync a no-op
frv: make dma_cache_sync a no-op
x86: make dma_cache_sync a no-op
floppy: consolidate the dummy fd_cacheflush definition
drivers: flag buses which demand DMA configuration
DAX support in brd is awkward because its backing page frames are
distinct from the ones provided by pmem, dcssblk, or axonram. We need
pfn_t_devmap() entries to fully support DAX, and the limited DAX support
for pfn_t_special() page frames is not interesting for brd when pmem is
already a superset of brd. Lastly, brd is the only dax capable driver
that may sleep in its ->direct_access() implementation. So it causes a
global burden with no net gain of kernel functionality.
For all these reasons, remove DAX support.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
Pull configfs updates from Christoph Hellwig:
"A couple of configfs cleanups:
- proper use of the bool type (Thomas Meyer)
- constification of struct config_item_type (Bhumika Goyal)"
* tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs:
RDMA/cma: make config_item_type const
stm class: make config_item_type const
ACPI: configfs: make config_item_type const
nvmet: make config_item_type const
usb: gadget: configfs: make config_item_type const
PCI: endpoint: make config_item_type const
iio: make function argument and some structures const
usb: gadget: make config_item_type structures const
dlm: make config_item_type const
netconsole: make config_item_type const
nullb: make config_item_type const
ocfs2/cluster: make config_item_type const
target: make config_item_type const
configfs: make ci_type field, some pointers and function arguments const
configfs: make config_item_type const
configfs: Fix bool initialization/comparison
Pull timer updates from Thomas Gleixner:
"Yet another big pile of changes:
- More year 2038 work from Arnd slowly reaching the point where we
need to think about the syscalls themself.
- A new timer function which allows to conditionally (re)arm a timer
only when it's either not running or the new expiry time is sooner
than the armed expiry time. This allows to use a single timer for
multiple timeout requirements w/o caring about the first expiry
time at the call site.
- A new NMI safe accessor to clock real time for the printk timestamp
work. Can be used by tracing, perf as well if required.
- A large number of timer setup conversions from Kees which got
collected here because either maintainers requested so or they
simply got ignored. As Kees pointed out already there are a few
trivial merge conflicts and some redundant commits which was
unavoidable due to the size of this conversion effort.
- Avoid a redundant iteration in the timer wheel softirq processing.
- Provide a mechanism to treat RTC implementations depending on their
hardware properties, i.e. don't inflict the write at the 0.5
seconds boundary which originates from the PC CMOS RTC to all RTCs.
No functional change as drivers need to be updated separately.
- The usual small updates to core code clocksource drivers. Nothing
really exciting"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits)
timers: Add a function to start/reduce a timer
pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()
timer: Prepare to change all DEFINE_TIMER() callbacks
netfilter: ipvs: Convert timers to use timer_setup()
scsi: qla2xxx: Convert timers to use timer_setup()
block/aoe: discover_timer: Convert timers to use timer_setup()
ide: Convert timers to use timer_setup()
drbd: Convert timers to use timer_setup()
mailbox: Convert timers to use timer_setup()
crypto: Convert timers to use timer_setup()
drivers/pcmcia: omap1: Fix error in automated timer conversion
ARM: footbridge: Fix typo in timer conversion
drivers/sgi-xp: Convert timers to use timer_setup()
drivers/pcmcia: Convert timers to use timer_setup()
drivers/memstick: Convert timers to use timer_setup()
drivers/macintosh: Convert timers to use timer_setup()
hwrng/xgene-rng: Convert timers to use timer_setup()
auxdisplay: Convert timers to use timer_setup()
sparc/led: Convert timers to use timer_setup()
mips: ip22/32: Convert timers to use timer_setup()
...
It's been 3.5 years, let's turn it on by default. Support in rbd(8)
utility goes back to pre-firefly, "rbd map" has been loading the module
with single_major=Y ever since. However, if the module is already
loaded (whether by hand or at boot time), we end up with single_major=N.
Also, some people don't install rbd(8) and use the sysfs interface
directly.
(With single-major=N, a major number is consumed for every mapping,
imposing a limit of ~240 rbd images per host. single-major=Y allows
mapping thousands of rbd images on a single machine.)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
RBD devices are currently incorrectly initialised with the block queue
discard_alignment set to the underlying RADOS object size.
As per Documentation/ABI/testing/sysfs-block:
The discard_alignment parameter indicates how many bytes the beginning
of the device is offset from the internal allocation unit's natural
alignment.
Correcting the discard_alignment parameter from the RADOS object size to
zero (the blk_set_default_limits() default) has no effect on how discard
requests are propagated through the block layer - @alignment in
__blkdev_issue_discard() remains zero. However, it does fix the UNMAP
granularity alignment value advertised to SCSI initiators via the Block
Limits VPD.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
->open_count/-EBUSY check is bogus and wrong: when an open device is
set read-only, blkdev_write_iter() refuses further writes with -EPERM.
This is standard behaviour and all other block devices allow this.
set_disk_ro() call is also problematic: we affect the entire device
when called on a single partition.
All rbd_ioctl_set_ro() needs to do is refuse ro -> rw transition for
mapped snapshots. Everything else can be handled by generic code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Remove unused mutex brd_mutex. It is unused since the commit ff26956875
("brd: remove support for BLKFLSBUF").
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>