Allow device initialization to finish gracefully when it is in
FTL rebuild failure state. Also, recover device out of this state
after successfully secure erasing it.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Vignesh Gunasekaran <vgunasekaran@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Flush inflight IOs using fsync_bdev() when the device is safely
removed. Also, block further IOs in device open function.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Rajesh Kumar Sambandam <rsambandam@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
When FTL rebuild is in progress, alloc_disk() initializes the disk
but device node will be created by add_disk() only after successful
completion of FTL rebuild. So, skip deletion of device node in
removal path when FTL rebuild is in progress.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Prevent standby immediate command from being issued in remove,
suspend and shutdown paths, while drive is in FTL rebuild process.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Vignesh Gunasekaran <vgunasekaran@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Print exact time when an internal command is interrupted.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Rajesh Kumar Sambandam <rsambandam@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove setting and clearing MTIP_PF_EH_ACTIVE_BIT flag in
mtip_handle_tfe() as they are redundant. Also avoid waking
up service thread from mtip_handle_tfe() because it is
already woken up in case of taskfile error.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Rajesh Kumar Sambandam <rsambandam@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Service thread does not detect the need for taskfile error hanlding. Fixed the
flag condition to process taskfile error.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Latest virtio spec says the feature bit name is VIRTIO_BLK_F_FLUSH,
VIRTIO_BLK_F_WCE is the legacy name. virtio blk header says exactly the
reverse - fix that and update driver code to match.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The userspace needs to know when nbd devices are ready for use.
Currently no events are created for the userspace which doesn't work for
systemd.
See the discussion here: https://github.com/systemd/systemd/pull/358
This patch uses a central point to setup the nbd-internal sizes. A ioctl
to set a size does not lead to a visible size change. The size of the
block device will be kept at 0 until nbd is connected. As soon as it
connects, the size will be changed to the real value and a uevent is
created. When disconnecting, the blockdevice is set to 0 size and
another uevent is generated.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
If the LightNVM subsystem is not compiled into the kernel, and the
null_blk device driver requests lightnvm to be initialized. The call to
nvm_register fails and the null_add_dev function cleans up the
initialization. However, at this point the null block device has
already been added to the nullb_list and thus a second cleanup will
occur when the function has returned, that leads to a double call to
blk_cleanup_queue.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
In case /dev/fdX is open with O_NDELAY / O_NONBLOCK, floppy_open() immediately
succeeds, without performing any further media / controller preparations.
That's "correct" wrt. the NODELAY flag, but is hardly correct wrt. the rest
of the floppy driver, that is not really O_NONBLOCK ready, at all. Therefore
it's not too surprising, that subsequent attempts to work with the
filedescriptor produce bad results. Namely, syzkaller tool has been able
to livelock mmap() on the returned fd to keep waiting on the page unlock
bit forever.
Quite frankly, I have trouble defining what non-blocking behavior would be for
floppies. Is waiting ages for the driver to actually succeed reading a sector
blocking operation? Is waiting for drive motor to start blocking operation? How
about in case of virtualized floppies?
One option would be returning EWOULDBLOCK in case O_NDLEAY / O_NONBLOCK is
being passed to open(). That has a theoretical potential of breaking some
arcane and archaic userspace though.
Let's take a more conservative aproach, and accept the O_NDLEAY flag, and let
the driver behave as usual.
While at it, clean up a bit handling of !(mode & (FMODE_READ|FMODE_WRITE))
case and return EINVAL instead of succeeding as well.
Spotted by syzkaller tool.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Make the "Attempted send on closed socket" error messages generated in
nbd_request_handler() ratelimited.
When the nbd socket is shutdown, the nbd_request_handler() function emits
an error message for every request remaining in its queue. If the queue
is large, this will spam a large amount of messages to the log. There's
no need for a separate error message for each request, so this patch
ratelimits it.
In the specific case this was found, the system was virtual and the error
messages were logged to the serial port, which overwhelmed it.
Fixes: 4d48a542b4 ("nbd: fix I/O hang on disconnected nbds")
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
nbd changes properties of the blockdevice depending on flags that were
received. This patch moves this flag parsing into a separate function
nbd_parse_flags().
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Group all variables that are reset after a disconnect into reset
functions. This patch adds two of these functions, nbd_reset() and
nbd_bdev_reset().
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
It may be useful to know in the client that a connection timed out. The
current code returns success for a timeout.
This patch reports the error code -ETIMEDOUT for a timeout.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
As discussed on the mailing list, the usage of signals for timeout
handling has a lot of potential issues. The nbd driver used for some
time signals for timeouts. These signals where able to get the threads
out of the blocking socket operations.
This patch removes all signal usage and uses a socket shutdown instead.
The socket descriptor itself is cleared later when the whole nbd device
is closed.
The tasks_lock is removed as we do not depend on this anymore. Instead
a new lock for the socket is introduced so we can safely work with the
socket in the timeout handler outside of the two main threads.
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
System block allows the device to initialize with its configured media
manager. The system blocks is written to disk, and read again when media
manager is determined. For this to work, the backend must store the
data. Device drivers, such as null_blk, does not have any backend
storage. This patch allows the media manager to be initialized without a
storage backend.
It also fix incorrect configuration of capabilities in null_blk, as it
does not support get/set bad block interface.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Static checker complains about the implemented error handling. It is
indeed wrong. We don't care about the return values of created debugfs
files.
We only have to check the return values of created dirs for NULL
pointer. If we use a null pointer as parent directory for files, this
may lead to debugfs files in wrong places.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
floppy_revalidate() doesn't perform any error handling on lock_fdc()
result. lock_fdc() might actually be interrupted by a signal (it waits for
fdc becoming non-busy interruptibly). In such case, floppy_revalidate()
proceeds as if it had claimed the lock, but it fact it doesn't.
In case of multiple threads trying to open("/dev/fdX"), this leads to
serious corruptions all over the place, because all of a sudden there is
no critical section protection (that'd otherwise be guaranteed by locked
fd) whatsoever.
While at this, fix the fact that the 'interruptible' parameter to
lock_fdc() doesn't make any sense whatsoever, because we always wait
interruptibly anyway.
Most of the lock_fdc() callsites do properly handle error (and propagate
EINTR), but floppy_revalidate() and floppy_check_events() don't. Fix this.
Spotted by 'syzkaller' tool.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Need to reallocate ring info in the resume path, because info->rinfo was freed
in blkif_free(). And 'multi-queue-max-queues' backend reports may have been
changed.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch replaces uses of the long obsolete hash interface with
either shash (for non-SG users) or ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull Ceph updates from Sage Weil:
"The two main changes are aio support in CephFS, and a series that
fixes several issues in the authentication key timeout/renewal code.
On top of that are a variety of cleanups and minor bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: remove outdated comment
libceph: kill off ceph_x_ticket_handler::validity
libceph: invalidate AUTH in addition to a service ticket
libceph: fix authorizer invalidation, take 2
libceph: clear messenger auth_retry flag if we fault
libceph: fix ceph_msg_revoke()
libceph: use list_for_each_entry_safe
ceph: use i_size_{read,write} to get/set i_size
ceph: re-send AIO write request when getting -EOLDSNAP error
ceph: Asynchronous IO support
ceph: Avoid to propagate the invalid page point
ceph: fix double page_unlock() in page_mkwrite()
rbd: delete an unnecessary check before rbd_dev_destroy()
libceph: use list_next_entry instead of list_entry_next
ceph: ceph_frag_contains_value can be boolean
ceph: remove unused functions in ceph_frag.h
Pull final vfs updates from Al Viro:
- The ->i_mutex wrappers (with small prereq in lustre)
- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)
- followup to dedupe stuff merged this cycle
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration
There are many locations that do
if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);
but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull lightnvm fixes and updates from Jens Axboe:
"This should have been part of the drivers branch, but it arrived a bit
late and wasn't based on the official core block driver branch. So
they got a small scolding, but got a pass since it's still new. Hence
it's in a separate branch.
This is mostly pure fixes, contained to lightnvm/, and minor feature
additions"
* 'for-4.5/lightnvm' of git://git.kernel.dk/linux-block: (26 commits)
lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVM
lightnvm: introduce factory reset
lightnvm: use system block for mm initialization
lightnvm: introduce ioctl to initialize device
lightnvm: core on-disk initialization
lightnvm: introduce mlc lower page table mappings
lightnvm: add mccap support
lightnvm: manage open and closed blocks separately
lightnvm: fix missing grown bad block type
lightnvm: reference rrpc lun in rrpc block
lightnvm: introduce nvm_submit_ppa
lightnvm: move rq->error to nvm_rq->error
lightnvm: support multiple ppas in nvm_erase_ppa
lightnvm: move the pages per block check out of the loop
lightnvm: sectors first in ppa list
lightnvm: fix locking and mempool in rrpc_lun_gc
lightnvm: put block back to gc list on its reclaim fail
lightnvm: check bi_error in gc
lightnvm: return the get_bb_tbl return value
lightnvm: refactor end_io functions for sync
...
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for 4.5, with the exception of
NVMe, which is in a separate branch and will be posted after this one.
This pull request contains:
- A set of bcache stability fixes, which have been acked by Kent.
These have been used and tested for more than a year by the
community, so it's about time that they got in.
- A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
and Markus Elfring, Oleg Drokin.
- A set of fixes for xen blkback/front from the usual suspects, (Bob,
Konrad) as well as community based fixes from Kiri, Julien, and
Peng.
- A 2038 time fix for sx8 from Shraddha, with a fix from me.
- A small mtip32xx cleanup from Zhu Yanjun.
- A null_blk division fix from Arnd"
* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
null_blk: use sector_div instead of do_div
mtip32xx: restrict variables visible in current code module
xen/blkfront: Fix crash if backend doesn't follow the right states.
xen/blkback: Fix two memory leaks.
xen/blkback: make st_ statistics per ring
xen/blkfront: Handle non-indirect grant with 64KB pages
xen-blkfront: Introduce blkif_ring_get_request
xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
xen/blkback: Free resources if connect_ring failed.
xen/blocks: Return -EXX instead of -1
xen/blkback: make pool of persistent grants and free pages per-queue
xen/blkback: get the number of hardware queues/rings from blkfront
xen/blkback: pseudo support for multi hardware queues/rings
xen/blkback: separate ring information out of struct xen_blkif
xen/blkfront: correct setting for xen_blkif_max_ring_order
xen/blkfront: make persistent grants pool per-queue
xen/blkfront: Remove duplicate setting of ->xbdev.
xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
xen/blkfront: negotiate number of queues/rings to be used with backend
xen/blkfront: split per device io_lock
...
The rbd_dev_destroy() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull core block updates from Jens Axboe:
"We don't have a lot of core changes this time around, it's mostly in
drivers, which will come in a subsequent pull.
The cores changes include:
- blk-mq
- Prep patch from Christoph, changing blk_mq_alloc_request() to
take flags instead of just using gfp_t for sleep/nosleep.
- Doc patch from me, clarifying the difference between legacy
and blk-mq for timer usage.
- Fixes from Raghavendra for memory-less numa nodes, and a reuse
of CPU masks.
- Cleanup from Geliang Tang, using offset_in_page() instead of open
coding it.
- From Ilya, rename request_queue slab to it reflects what it holds,
and a fix for proper use of bdgrab/put.
- A real fix for the split across stripe boundaries from Keith. We
yanked a broken version of this from 4.4-rc final, this one works.
- From Mike Krinkin, emit a trace message when we split.
- From Wei Tang, two small cleanups, not explicitly clearing memory
that is already cleared"
* 'for-4.5/core' of git://git.kernel.dk/linux-block:
block: use bd{grab,put}() instead of open-coding
block: split bios to max possible length
block: add call to split trace point
blk-mq: Avoid memoryless numa node encoded in hctx numa_node
blk-mq: Reuse hardware context cpumask for tags
blk-mq: add a flags parameter to blk_mq_alloc_request
Revert "blk-flush: Queue through IO scheduler when flush not required"
block: clarify blk_add_timer() use case for blk-mq
bio: use offset_in_page macro
block: do not initialise statics to 0 or NULL
block: do not initialise globals to 0 or NULL
block: rename request_queue slab cache
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags. These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory". Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.
The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target). However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The use of idr_remove() is forbidden in the callback functions of
idr_for_each(). It is therefore unsafe to call idr_remove in
zram_remove().
This patch moves the call to idr_remove() from zram_remove() to
hot_remove_store(). In the detroy_devices() path, idrs are removed by
idr_destroy(). This solves an use-after-free detected by KASan.
[akpm@linux-foundation.org: fix coding stype, per Sergey]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patch-bomb from Andrew Morton:
- A few hotfixes which missed 4.4 becasue I was asleep. cc'ed to
-stable
- A few misc fixes
- OCFS2 updates
- Part of MM. Including pretty large changes to page-flags handling
and to thp management which have been buffered up for 2-3 cycles now.
I have a lot of MM material this time.
[ It turns out the THP part wasn't quite ready, so that got dropped from
this series - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
zsmalloc: reorganize struct size_class to pack 4 bytes hole
mm/zbud.c: use list_last_entry() instead of list_tail_entry()
zram/zcomp: do not zero out zcomp private pages
zram: pass gfp from zcomp frontend to backend
zram: try vmalloc() after kmalloc()
zram/zcomp: use GFP_NOIO to allocate streams
mm: add tracepoint for scanning pages
drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
mm/page_isolation: use macro to judge the alignment
mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
mm: rework virtual memory accounting
include/linux/memblock.h: fix ordering of 'flags' argument in comments
mm: move lru_to_page to mm_inline.h
Documentation/filesystems: describe the shared memory usage/accounting
memory-hotplug: don't BUG() in register_memory_resource()
hugetlb: make mm and fs code explicitly non-modular
mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
mm: make sure isolate_lru_page() is never called for tail page
vmstat: make vmstat_updater deferrable again and shut down on idle
...
Do not __GFP_ZERO allocated zcomp ->private pages. We keep allocated
streams around and use them for read/write requests, so we supply a
zeroed out ->private to compression algorithm as a scratch buffer only
once -- the first time we use that stream. For the rest of IO requests
served by this stream ->private usually contains some temporarily data
from the previous requests.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each zcomp backend uses own gfp flag but it's pointless because the
context they could be called is driven by upper layer(ie, zcomp
frontend). As well, zcomp frondend could call them in different
context. One context(ie, zram init part) is it should be better to make
sure successful allocation other context(ie, further stream allocation
part for accelarating I/O speed) is just optional so let's pass gfp down
from driver (ie, zcomp frontend) like normal MM convention.
[sergey.senozhatsky@gmail.com: add missing __vmalloc zero and highmem gfps]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we're using LZ4 multi compression streams for zram swap, we found
out page allocation failure message in system running test. That was
not only once, but a few(2 - 5 times per test). Also, some failure
cases were continually occurring to try allocation order 3.
In order to make parallel compression private data, we should call
kzalloc() with order 2/3 in runtime(lzo/lz4). But if there is no order
2/3 size memory to allocate in that time, page allocation fails. This
patch makes to use vmalloc() as fallback of kmalloc(), this prevents
page alloc failure warning.
After using this, we never found warning message in running test, also
It could reduce process startup latency about 60-120ms in each case.
For reference a call trace :
Binder_1: page allocation failure: order:3, mode:0x10c0d0
CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
Call trace:
dump_backtrace+0x0/0x270
show_stack+0x10/0x1c
dump_stack+0x1c/0x28
warn_alloc_failed+0xfc/0x11c
__alloc_pages_nodemask+0x724/0x7f0
__get_free_pages+0x14/0x5c
kmalloc_order_trace+0x38/0xd8
zcomp_lz4_create+0x2c/0x38
zcomp_strm_alloc+0x34/0x78
zcomp_strm_multi_find+0x124/0x1ec
zcomp_strm_find+0xc/0x18
zram_bvec_rw+0x2fc/0x780
zram_make_request+0x25c/0x2d4
generic_make_request+0x80/0xbc
submit_bio+0xa4/0x15c
__swap_writepage+0x218/0x230
swap_writepage+0x3c/0x4c
shrink_page_list+0x51c/0x8d0
shrink_inactive_list+0x3f8/0x60c
shrink_lruvec+0x33c/0x4cc
shrink_zone+0x3c/0x100
try_to_free_pages+0x2b8/0x54c
__alloc_pages_nodemask+0x514/0x7f0
__get_free_pages+0x14/0x5c
proc_info_read+0x50/0xe4
vfs_read+0xa0/0x12c
SyS_read+0x44/0x74
DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB
[minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
[sergey.senozhatsky@gmail.com: tweak comments and styles]
Signed-off-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can end up allocating a new compression stream with GFP_KERNEL from
within the IO path, which may result is nested (recursive) IO
operations. That can introduce problems if the IO path in question is a
reclaimer, holding some locks that will deadlock nested IOs.
Allocate streams and working memory using GFP_NOIO flag, forbidding
recursive IO and FS operations.
An example:
inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes:
(jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555
{IN-RECLAIM_FS-W} state was registered at:
__lock_acquire+0x8da/0x117b
lock_acquire+0x10c/0x1a7
start_this_handle+0x52d/0x555
jbd2__journal_start+0xb4/0x237
__ext4_journal_start_sb+0x108/0x17e
ext4_dirty_inode+0x32/0x61
__mark_inode_dirty+0x16b/0x60c
iput+0x11e/0x274
__dentry_kill+0x148/0x1b8
shrink_dentry_list+0x274/0x44a
prune_dcache_sb+0x4a/0x55
super_cache_scan+0xfc/0x176
shrink_slab.part.14.constprop.25+0x2a2/0x4d3
shrink_zone+0x74/0x140
kswapd+0x6b7/0x930
kthread+0x107/0x10f
ret_from_fork+0x3f/0x70
irq event stamp: 138297
hardirqs last enabled at (138297): debug_check_no_locks_freed+0x113/0x12f
hardirqs last disabled at (138296): debug_check_no_locks_freed+0x33/0x12f
softirqs last enabled at (137818): __do_softirq+0x2d3/0x3e9
softirqs last disabled at (137813): irq_exit+0x41/0x95
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(jbd2_handle);
<Interrupt>
lock(jbd2_handle);
*** DEADLOCK ***
5 locks held by git/20158:
#0: (sb_writers#7){.+.+.+}, at: [<ffffffff81155411>] mnt_want_write+0x24/0x4b
#1: (&type->i_mutex_dir_key#2/1){+.+.+.}, at: [<ffffffff81145087>] lock_rename+0xd9/0xe3
#2: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8114f8e2>] lock_two_nondirectories+0x3f/0x6b
#3: (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: [<ffffffff8114f909>] lock_two_nondirectories+0x66/0x6b
#4: (jbd2_handle){+.+.?.}, at: [<ffffffff811e31db>] start_this_handle+0x4ca/0x555
stack backtrace:
CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211
Call Trace:
dump_stack+0x4c/0x6e
mark_lock+0x384/0x56d
mark_held_locks+0x5f/0x76
lockdep_trace_alloc+0xb2/0xb5
kmem_cache_alloc_trace+0x32/0x1e2
zcomp_strm_alloc+0x25/0x73 [zram]
zcomp_strm_multi_find+0xe7/0x173 [zram]
zcomp_strm_find+0xc/0xe [zram]
zram_bvec_rw+0x2ca/0x7e0 [zram]
zram_make_request+0x1fa/0x301 [zram]
generic_make_request+0x9c/0xdb
submit_bio+0xf7/0x120
ext4_io_submit+0x2e/0x43
ext4_bio_write_page+0x1b7/0x300
mpage_submit_page+0x60/0x77
mpage_map_and_submit_buffers+0x10f/0x21d
ext4_writepages+0xc8c/0xe1b
do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x84/0x8b
filemap_flush+0x1c/0x1e
ext4_alloc_da_blocks+0xb8/0x117
ext4_rename+0x132/0x6dc
? mark_held_locks+0x5f/0x76
ext4_rename2+0x29/0x2b
vfs_rename+0x540/0x636
SyS_renameat2+0x359/0x44d
SyS_rename+0x1e/0x20
entry_SYSCALL_64_fastpath+0x12/0x6f
[minchan@kernel.org: add stable mark]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
floppy: make local variable non-static
exynos: fixes an incorrect header guard
dt-bindings: fixes some incorrect header guards
cpufreq-dt: correct dead link in documentation
cpufreq: ARM big LITTLE: correct dead link in documentation
treewide: Fix typos in printk
Documentation: filesystem: Fix typo in fs/eventfd.c
fs/super.c: use && instead of & for warn_on condition
Documentation: fix sysfs-ptp
lib: scatterlist: fix Kconfig description
This pull includes driver updates from the usual suspects (bfa, arcmsr,
scsi_dh_alua, lpfc, storvsc, cxlflash). The major change is the addition of
the hisi_sas driver, which is an ARM platform device for SAS. The other
change of note is an enormous style transformation to the atp870u driver
(which is our worst written SCSI driver).
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJWluqqAAoJEDeqqVYsXL0McuwH/1oqvFOagsvoDcwDyNAUR/eW
VAH454ndIJ0eSXORNfA7ko3ZQKa53x1WN9eKr+RHI7lpGCjwBz2MjnvQsnKISvXp
K0owkJTcAAF+Wdq7rdNlm1VlQHuLvG8TMTnno+NY3CtxCR2yiRWlctkNkjr0rWUv
leXJkXZSThkkiY/rEDZZXee8Ajwac87QT+ELEqCT2HueGZD+J8s59JpsOtZdt6Bj
n94ydOuct8hF3Xt3pdu1oDRpWpoJIyjHtYhdrvzSiKKBHTWtuq1oN0Cwnp0qtEDD
X3K1Mr0yBmAjTOsK+y+bZnJ1y7qJLLt5ZHmVixkzFWujXPNbrIsyYkV5eI432XA=
=ggNi
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull first round of SCSI updates from James Bottomley:
"This includes driver updates from the usual suspects (bfa, arcmsr,
scsi_dh_alua, lpfc, storvsc, cxlflash).
The major change is the addition of the hisi_sas driver, which is an
ARM platform device for SAS. The other change of note is an enormous
style transformation to the atp870u driver (which is our worst written
SCSI driver)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (169 commits)
cxlflash: Enable device id for future IBM CXL adapter
cxlflash: Resolve oops in wait_port_offline
cxlflash: Fix to resolve cmd leak after host reset
cxlflash: Removed driver date print
cxlflash: Fix to avoid virtual LUN failover failure
cxlflash: Fix to escalate LINK_RESET also on port 1
storvsc: Tighten up the interrupt path
storvsc: Refactor the code in storvsc_channel_init()
storvsc: Properly support Fibre Channel devices
storvsc: Fix a bug in the layout of the hv_fc_wwn_packet
mvsas: Add SGPIO support to Marvell 94xx
mpt3sas: A correction in unmap_resources
hpsa: Add box and bay information for enclosure devices
hpsa: Change SAS transport devices to bus 0.
hpsa: fix path_info_show
cciss: print max outstanding commands as a hex value
scsi_debug: Increase the reported optimal transfer length
lpfc: Update version to 11.0.0.10 for upstream patch set
lpfc: Use kzalloc instead of kmalloc
lpfc: Delete unnecessary checks before the function call "mempool_destroy"
...
Dividing a sector_t number should be done using sector_div rather than do_div
to optimize the 32-bit sector_t case, and with the latest do_div optimizations,
we now get a compile-time warning for this:
arch/arm/include/asm/div64.h:32:95: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'sector_t * {aka long unsigned int *}'
drivers/block/null_blk.c:521:81: warning: comparison of distinct pointer types lacks a cast
This changes the newly added code to use sector_div. It is a simplified version
of the original patch, as Linus Torvalds pointed out that we should not be using
an expensive division function in the first place.
This version was suggested by Matias Bjorling.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matias Bjorling <m@bjorling.me>
Fixes: b2b7e00148 ("null_blk: register as a LightNVM device")
Signed-off-by: Jens Axboe <axboe@fb.com>
Konrad writes:
The pull is based on converting the backend driver into an multiqueue
driver and exposing more than one queue to the frontend. As such we had
to modify the frontend and also fix a bunch of bugs around this.
The original work is based on Arianna Avanzini's work as an OPW intern.
Bob took over the work and had been massaging it for quite some time.
Also included are are features to 64KB page support for ARM and various
bug-fixes.
To implement sync I/O support within the LightNVM core, the end_io
functions are refactored to take an end_io function pointer instead of
testing for initialized media manager, followed by calling its end_io
function.
Sync I/O can then be implemented using a callback that signal I/O
completion. This is similar to the logic found in blk_to_execute_io().
By implementing it this way, the underlying device I/Os submission logic
is abstracted away from core, targets, and media managers.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
The modified variables are only used in the file mtip32xx.c.
As such, the static keyword is inserted to define that object
to be only visible to the current code module during compilation.
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The max outstanding commands is being printed with a 0x prefix to
suggest it is a hex value, when in fact the integer decimal %d format
specifier is being used and this is a bit confusing. Use %x instead to
match the proceeding 0x prefix.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
We have split the setting up of all the resources in two steps:
1) talk_to_blkback - which figures out the num_ring_pages (from
the default value of zero), sets up shadow and so
2) blkfront_connect - does the real part of filling out the
internal structures.
The problem is if we bypass the 1) step and go straight to 2)
and call blkfront_setup_indirect where we use the macro
BLK_RING_SIZE - which returns an negative value (because
sz is zero - since num_ring_pages is zero - since it has never
been set).
We can fix this by making sure that we always have called
talk_to_blkback before going to blkfront_connect.
Or we could set in blkfront_probe info->nr_ring_pages = 1
to have a default value. But that looks odd - as we haven't
actually negotiated any ring size.
This patch changes XenbusStateConnected state to detect if
we haven't done the initial handshake - and if so continue
on as if were in XenbusStateInitWait state.
We also roll the error recovery (freeing the structure) into
talk_to_blkback error path - which is safe since that function
is only called from blkback_changed.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch fixs two memleaks:
backtrace:
[<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
[<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0
[<ffffffff81534028>] xen_blkbk_probe+0x58/0x230
[<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130
[<ffffffff81511716>] driver_probe_device+0x166/0x2c0
[<ffffffff815119bc>] __device_attach_driver+0xac/0xb0
[<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90
[<ffffffff81511ab7>] __device_attach+0xc7/0x120
[<ffffffff81511b23>] device_initial_probe+0x13/0x20
[<ffffffff8151059a>] bus_probe_device+0x9a/0xb0
[<ffffffff8150f0a1>] device_add+0x3b1/0x5c0
[<ffffffff8150f47e>] device_register+0x1e/0x30
[<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170
[<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0
[<ffffffff8146b1bb>] backend_changed+0x1b/0x20
[<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
unreferenced object 0xffff880007ba8ef8 (size 224):
backtrace:
[<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
[<ffffffff81205c73>] __kmalloc+0xd3/0x1e0
[<ffffffff81534d87>] frontend_changed+0x2c7/0x580
[<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0
[<ffffffff8146b2c0>] frontend_changed+0x10/0x20
[<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
[<ffffffff810d3e97>] kthread+0xd7/0xf0
[<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800048dcd38 (size 224):
The first leak is caused by not put() the be->blkif reference
which we had gotten in xen_blkif_alloc(), while the second is
us not freeing blkif->rings in the right place.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make st_* statistics per ring and the VBD sysfs would iterate over all the
rings.
Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
are torn down, so it's safe.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Aligned the variables on the same column.
The minimal size of request in the block framework is always PAGE_SIZE.
It means that when 64KB guest is support, the request will at least be
64KB.
Although, if the backend doesn't support indirect descriptor (such as QDISK
in QEMU), a ring request is only able to accommodate 11 segments of 4KB
(i.e 44KB).
The current frontend is assuming that an I/O request will always fit in
a ring request. This is not true any more when using 64KB page
granularity and will therefore crash during boot.
On ARM64, the ABI is completely neutral to the page granularity used by
the domU. The guest has the choice between different page granularity
supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
This can't be enforced by the hypervisor and therefore it's possible to
run guests using different page granularity.
So we can't mandate the block backend to support indirect descriptor
when the frontend is using 64KB page granularity and have to fix it
properly in the frontend.
The solution exposed below is based on modifying directly the frontend
guest rather than asking the block framework to support smaller size
(i.e < PAGE_SIZE). This is because the change is the block framework are
not trivial as everything seems to relying on a struct *page (see [1]).
Although, it may be possible that someone succeed to do it in the future
and we would therefore be able to use it.
Given that a block request may not fit in a single ring request, a
second request is introduced for the data that cannot fit in the first
one. This means that the second ring request should never be used on
Linux if the page size is smaller than 44KB.
To achieve the support of the extra ring request, the block queue size
is divided by two. Therefore, the ring will always contain enough space
to accommodate 2 ring requests. While this will reduce the overall
performance, it will make the implementation more contained. The way
forward to get better performance is to implement in the backend either
indirect descriptor or multiple grants ring.
Note that the parameters blk_queue_max_* helpers haven't been updated.
The block code will set the mimimum size supported and we may be able
to support directly any change in the block framework that lower down
the minimal size of a request.
[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The code to get a request is always the same. Therefore we can factorize
it in a single function.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen_blkif_schedule() kthread calls try_to_freeze() at the beginning of
every attempt to purge the LRU. This operation can't ever succeed though,
as the kthread hasn't marked itself as freezable.
Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark xen_blkif_schedule() freezable (as it can
generate I/O during suspend).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
With the multi-queue support we could fail at setting up
some of the rings and fail the connection. That meant that
all resources tied to rings[0..n-1] (where n is the ring
that failed to be setup). Eventually the frontend will switch
to the states and we will call xen_blkif_disconnect.
However we do not want to be at the mercy of the frontend
deciding when to change states. This allows us to do the
cleanup right away and freeing resources.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.
Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB
[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting
Results:
iops1: After patch "xen/blkfront: make persistent grants per-queue".
iops2: After this patch.
Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops1(k): 810 1230(~20%) 1024(~20%) 850(~20%)
Iops2(k): 810 1410(~35%) 1354(~75%) 1440(~100%)
With 4 queues after this commit we can get ~75% increase in IOPS, and
performance won't drop if increasing queue numbers.
Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Backend advertises "multi-queue-max-queues" to front, also get the negotiated
number from "multi-queue-num-queues" written by blkfront.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
"xen/blkback: get the number of hardware queues/rings from blkfront".
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align variables in the structures.
Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can be associated with one or more rings/hardware queues.
Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.
This patch is a preparation for supporting multi hardware queues/rings.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align the variables in the structure.
According to this piece code:
"
pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
"
if xen_blkif_max_ring_order is bigger that XENBUS_MAX_RING_GRANT_ORDER,
need to set xen_blkif_max_ring_order using XENBUS_MAX_RING_GRANT_ORDER,
but not 0.
Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.
Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB
[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting
Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops patched(k): 810 1230(~20%) 1024(~20%) 850(~20%)
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We do the same exact operations a bit earlier in the
function.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The max number of hardware queues for xen/blkfront is set by parameter
'max_queues'(default 4), while it is also capped by the max value that the
xen/blkback exposes through XenStore key 'multi-queue-max-queues'.
The negotiated number is the smaller one and would be written back to xenstore
as "multi-queue-num-queues", blkback needs to read this negotiated number.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
After patch "xen/blkfront: separate per ring information out of device
info", per-ring data is protected by a per-device lock ('io_lock').
This is not a good way and will effect the scalability, so introduce a
per-ring lock ('ring_lock').
The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and
->persistent_gnts_c which are shared by all rings.
Note that in 'blkfront_probe' the 'blkfront_info' is setup via kzalloc
so setting ->persistent_gnts_c to zero is not needed.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
patch "xen/blkfront: negotiate number of queues/rings to be used with backend"
so as to make review easier.
Note that blkfront_gather_backend_features does not call
blkfront_setup_indirect anymore (as that needs to be done per ring).
That means that in blkif_recover/blkif_connect we have to do it in a loop
(bounded by nr_rings).
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Split per ring information to a new structure "blkfront_ring_info".
A ring is the representation of a hardware queue, every vbd device can associate
with one or more rings depending on how many hardware queues/rings to be used.
This patch is a preparation for supporting real multi hardware queues/rings.
We also add a backpointer to 'struct blkfront_info' (dev_info) which
is not needed (we could use containers_of) but further patch
("xen/blkfront: pseudo support for multi hardware queues/rings")
will make allocation of 'blkfront_ring_info' dynamic.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If null_blk is run in NULL_IRQ_TIMER mode and with queue_mode NULL_Q_RQ,
we need to restart the queue from the hrtimer interrupt. We can't
directly invoke the request_fn from that context, so punt the queue run
to async kblockd context.
Tested-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 8182503df1 used monotonic time, but if the adapter is
using the seconds for logging entries, then we'll get duplicate
entries if the system is rebooted. Use real time instead.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 8182503df1 ("block: sx8.c: Replace timeval with ktime_t")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer fixes from Jens Axboe:
"Three small fixes for 4.4 final. Specifically:
- The segment issue fix from Junichi, where the old IO path does a
bio limit split before potentially bouncing the pages. We need to
do that in the right order, to ensure that limitations are met.
- A NVMe surprise removal IO hang fix from Keith.
- A use-after-free in null_blk, introduced by a previous patch in
this series. From Mike Krinkin"
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: fix use-after-free error
block: ensure to split after potentially bouncing a bio
NVMe: IO ending fixes on surprise removal
blk_end_request_all may free request, so we need to save
request_queue pointer before blk_end_request_all call.
The problem was introduced in commit cf8ecc5a84
("null_blk: guarantee device restart in all irq modes")
and causes general protection fault with slab poisoning
enabled.
Fixes: cf8ecc5a84 ("null_blk: guarantee device
restart in all irq modes")
Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
32-bit systems using 'struct timeval' will break in the year 2038,
in order to avoid that replace the code with more appropriate types.
This patch replaces timeval with 64 bit ktime_t which is y2038 safe.
Since st->timestamp is only interested in seconds, directly using
time64_t here. Function ktime_get_seconds is used since it uses
monotonic instead of real time and thus will not cause overflow.
Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWdDrXAAoJEFxbo/MsZsTR3N0H/0Lvz6MWBARCje7livbz7nqE
PS0Bea+2yAfNhCDDiDlpV0lor8qlyfWDF6lGhLjItldAzahag3ZDKDf1Z/lcQvhf
3MwFOcOVZE8lLtvLT6LGnPuehi1Mfdi1Qk1/zQhPhsq6+FLPLT2y+whmBihp8mMh
C12f7KRg5r3U7eZXNB6MEtGA0RFrOp0lBdvsiZx3qyVLpezj9mIe0NueQqwY3QCS
xQ0fILp/x2EnZNZuzgghFTPRxMAx5ReOezgn9Rzvq4aThD+irz1y6ghkYN4rG2s2
tyYOTqBnjJEJEQ+wmYMhnfCwVvDffztG+uI9hqN31QFJiNB0xsjSWFCkDAWchiU=
=Argz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
* tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-pciback: fix up cleanup path when alloc fails
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
xen/x86/pvh: Use HVM's flush_tlb_others op
xen: Resume PMU from non-atomic context
xen/events/fifo: Consume unprocessed events when a CPU dies
Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.
When parsing indirect descriptors, only read first_sect and last_sect
once.
This is part of XSA155.
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.
When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().
This is part of XSA155.
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series. This contains:
- A bunch of fixes for lightnvm, should be the last round for this
series. From Matias and Wenwei.
- A writeback detach inode fix from Ilya, also marked for stable.
- A block (though it says SCSI) fix for an OOPS in SCSI runtime power
management.
- Module init error path fixes for null_blk from Minfei"
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: Fix error path in module initialization
lightnvm: do not compile in debugging by default
lightnvm: prevent gennvm module unload on use
lightnvm: fix media mgr registration
lightnvm: replace req queue with nvmdev for lld
lightnvm: comments on constants
lightnvm: check mm before use
lightnvm: refactor spin_unlock in gennvm_get_blk
lightnvm: put blks when luns configure failed
lightnvm: use flags in rrpc_get_blk
block: detach bdev inode from its wb in __blkdev_put()
SCSI: Fix NULL pointer dereference in runtime PM
Module couldn't release resource properly during the initialization. To
fix this issue, we will clean up the proper resource before returning.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There's no reason for temparea to be static, since it's only used for
temporary sprintf output. It's not immediately obvious that the output
will always fit (in the worst case, the output including '\0' is
exactly 32 bytes), so save a future reader from worrying about that.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In the case where a request queue is passed to the low lever lightnvm
device drive integration, the device driver might pass its admin
commands through another queue. Instead pass nvm_dev, and let the
low level drive the appropriate queue.
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Ceph fix from Sage Weil:
"This addresses a refcounting bug that leads to a use-after-free"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: don't put snap_context twice in rbd_queue_workfn()
Commit 4e752f0ab0 ("rbd: access snapshot context and mapping size
safely") moved ceph_get_snap_context() out of rbd_img_request_create()
and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
error path in rbd_queue_workfn(). However, rbd_img_request_create()
consumes a ref on snapc, so calling ceph_put_snap_context() after
a successful rbd_img_request_create() leads to an extra put. Fix it.
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t. Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This commit at least doubles the maximum value for
completion_nsec. This helps in special cases where one wants/needs to
emulate an extremely slow I/O (for example to spot bugs).
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In single-queue (block layer) mode,the function null_rq_prep_fn stops
the device if alloc_cmd fails. Then, once stopped, the device must be
restarted on the next command completion, so that the request(s) for
which alloc_cmd failed can be requeued. Otherwise the device hangs.
Unfortunately, device restart is currently performed only for delayed
completions, i.e., in irqmode==2. This fact causes hangs, for the
above reasons, with the other irqmodes in combination with single-queue
block layer.
This commits addresses this issue by making sure that, if stopped, the
device is properly restarted for all irqmodes on completions.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna AVanzini <avanzini@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For the Timer IRQ mode (i.e., when command completions are delayed),
there is one timer for each CPU. Each of these timers
. has a completion queue associated with it, containing all the
command completions to be executed when the timer fires;
. is set, and a new completion-to-execute is inserted into its
completion queue, every time the dispatch code for a new command
happens to be executed on the CPU related to the timer.
This implies that, if the dispatch of a new command happens to be
executed on a CPU whose timer has already been set, but has not yet
fired, then the timer is set again, to the completion time of the
newly arrived command. When the timer eventually fires, all its queued
completions are executed.
This way of handling delayed command completions entails the following
problem: if more than one command completion is inserted into the
queue of a timer before the timer fires, then the expiration time for
the timer is moved forward every time each of these completions is
enqueued. As a consequence, only the last completion enqueued enjoys a
correct execution time, while all previous completions are unjustly
delayed until the last completion is executed (and at that time they
are executed all together).
Specifically, if all the above completions are enqueued almost at the
same time, then the problem is negligible. On the opposite end, if
every completion is enqueued a while after the previous completion was
enqueued (in the extreme case, it is enqueued only right before the
timer would have expired), then every enqueued completion, except for
the last one, experiences an inflated delay, proportional to the number
of completions enqueued after it. In the end, commands, and thus I/O
requests, may be completed at an arbitrarily lower rate than the
desired one.
This commit addresses this issue by replacing per-CPU timers with
per-command timers, i.e., by associating an individual timer with each
command.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In case the lower level device size changed, but some other internal
details of the resize did not work out, drbd_determine_dev_size() would
try to restore the previous settings, trusting
drbd_md_set_sector_offsets() to "do the right thing", but overlooked
that this internally may set the meta data base offset based on device size.
This could end up with incomplete on-disk meta data layout change, and
ultimately lead to data corruption (if the failure was not noticed or
ignored by the operator, and other things go wrong as well).
Just remember all meta data related offsets/sizes,
and on error restore them all.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
During handshake communication, we also reconsider our device size,
using drbd_determine_dev_size(). Just in case we need to change the
offsets or layout of our on-disk metadata, we lock out application
and other meta data IO, and wait for the activity log to be "idle"
(no more referenced extents).
If this handshake happens just after a connection loss, with a fencing
policy of "resource-and-stonith", we have frozen IO.
If, additionally, the activity log was "starving" (too many incoming
random writes at that point in time), it won't become idle, ever,
because of the frozen IO, and this would be a lockup of the receiver
thread, and consquentially of DRBD.
Previous logic (re-)initialized with a special "empty" transaction
block, which required the activity log to fully drain first.
Instead, write out some standard activity log transactions.
Using lc_try_lock_for_transaction() instead of lc_try_lock() does not
care about pending activity log references, avoiding the potential
deadlock.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To be able to "force out" an activity log transaction,
even if there are no pending updates.
This will be used to relocate the on-disk activity log,
if the on-disk offsets have to be changed,
without the need to empty the activity log first.
While at it, move the definition,
so we can drop the forward declaration of a static helper.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Avoid to prematurely resume application IO: don't set/clear a single
bit, but inc/dec an atomic counter.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't remember a DRBD request as ack_pending, if it is not.
In protocol A, we usually clear RQ_NET_PENDING at the same time we set
RQ_NET_SENT, so when deciding to remember it as ack_pending,
mod_rq_state needs to look at the current request state,
not at the previous state before the current modification was applied.
This should prevent advance_conn_req_ack_pending() from walking the full
transfer log just to find NULL in protocol A, which would cause serious
performance degradation with many "in-flight" requests, e.g. when
working via DRBD-proxy, or with a huge bandwidth-delay product.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
new_disk_conf could be leaked if the follow on checks fail,
so make sure to free it on error if it was not assigned yet.
Found with smatch.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Disconnect should wait for pending bitmap IO.
But if that bitmap IO is not happening, because it is waiting for
pending application IO, and there is no progress, because the fencing
policy suspended application IO because of the disconnect,
then we deadlock.
The bitmap writeout in this case does not care for concurrent
application IO, so there is no point waiting for it.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
lsblk should be able to pick up stacking device driver relations
involving DRBD conveniently.
Even though upstream kernel since 2011 says
"DON'T USE THIS UNLESS YOU'RE ALREADY USING IT."
a new user has been added since (bcache),
which sets the precedences for us to use it as well.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We cannot possibly support SECDISCARD, even if all backend devices would
support it: if our peer is currently unreachable, some instance of the
data may obviously still be recoverable.
We did not set discard_granularity at all. We don't really care (yet),
we only pass them on, so for now, set our granularity to one sector.
blkdev_stack_limits() takes care of the rest.
If we decide we cannot support discards,
not only clear the (not user visible) QUEUE_FLAG_DISCARD,
but set both (user visible) discard_granularity and max_discard_sectors
to zero, to avoid confusion with e.g. lsblk -D.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When accessing out meta data area on disk, we double check the
plausibility of the requested sector offsets, and are very noisy about
it if they look suspicious.
During initial read of our "superblock", for "external" meta data,
this triggered because the range estimate returned by
drbd_md_last_sector() was still wrong.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Suggested by Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Apparently we now implicitly get definitions for BITS_PER_PAGE and
BITS_PER_PAGE_MASK from the pid_namespace.h
Instead of renaming our defines, I chose to define only if not yet
defined, but to double check the value if already defined.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since kernel 3.3, we can use snprintf-style arguments
to create a workqueue.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The effective data generation ID may be interesting for debugging
purposes of scenarios involving diskless states.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In a multiple error scenario, we may end up with a "frozen" Primary,
that has no access to any data (no local disk, no replication link).
If we then resume-io, we try to generate a new data generation id,
which will fail if there is no longer a local disk.
Double check for available local data,
which prevents the NULL pointer deref.
If we are diskless, turn the resume-io in this situation
into the first stage of a "force down", by bumping the "effective" data
gen id, which will prevent later attach or connect to the former data
set without first being demoted (deconfigured).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The intention is to reduce CPU utilization. Recent measurements
unveiled that the current performance bottleneck is CPU utilization
on the receiving node. The asender thread became CPU limited.
One of the main points is to eliminate the idr_for_each_entry() loop
from the sending acks code path.
One exception in that is sending back ping_acks. These stay
in the ack-receiver thread. Otherwise the logic becomes too
complicated for no added value.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This prepares the next patch where the sending on the meta (or
control) socket is moved to a dedicated workqueue.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A D_FAILED disk transitions as quickly as possible to
D_DISKLESS. But in the "unresponsive local disk" case,
there remains a time window where a administrative detach command could
find the disk already failed, but some internal meta data IO against the
unresponsive local disk still pending.
In that case, drbd_md_get_buffer() will return NULL.
Don't unconditionally call drbd_md_put_buffer(), or it will cause
refcount imbalance, and prevent any further re-attach on this volume
(until it is deleted and re-created).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The recent (not yet released) backport of the extended state broadcasts
to support the "events2" subcommand of drbdsetup had some glitches.
remember_old_state() would first count all connections with a
net_conf != NULL, then allocate a suitable array, then populate that
array with all connections found to have net_conf != NULL.
This races with the state change to C_STANDALONE,
and the NULL assignment there.
remember_new_state() then iterates over said connection array,
assuming that it would be fully populated.
But rcu_lock() just makes sure the thing some pointer points to,
if any, won't go away. It does not make the pointer itself immutable.
In fact there is no need to "filter" connections based on whether or not
they have a currently valid configuration. Just record them always, if
they don't have a config, that's fine, there will be no change then.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't blame the peer for being unresponsive,
if we did not even ask the question yet.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The only way to make DRBD intentionally call panic is to
set a disk timeout, have that trigger, "abort" some request and complete
to upper layers, then have the backend IO subsystem later complete these
requests successfully regardless.
As the attached IO pages have been recycled for other purposes
meanwhile, this will cause unexpected random memory changes.
To prevent corruption, we rather panic in that case.
Make it obvious from stack traces that this was the case by introducing
drbd_panic_after_delayed_completion_of_aborted_request().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Even though we really want to get the state information about our bad
disk to the peer as soon as possible, it is useful to first call the
local-io-error handler.
People may chose to hard-reset the box from there.
If that looks and behaves exactly like a "regular node crash", without
bumping the data generation UUIDs on the peer in between, it makes it
easier to deal with.
If you intend to return from the local-io-error handler, then better
return as quickly as possible to avoid triggering other timeouts.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If for some reason the primary lost its disk *and* the replication link
before it is able to communicate the disk loss, probably blocked IO,
then later is able to re-establish the connection, the peer needs to
bump its UUIDs just like it does when peer only loses the disk
and is able to communicate this in time.
Otherwise, a later re-attach of the disk on the primary may start a
resync in the "wrong" direction.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When detaching, we make sure no application IO is in-flight
by internally suspending IO, then trigger the state change,
wait for the result, and finally internally resume IO again.
Once we triggered the stat change to "Failed",
we expect it to change from Failed to Diskless.
(To avoid races, we actually wait for it to leave "Failed").
On an unresponsive local IO backend, this may not happen, ever.
Don't have a "hung" detach block IO "forever", but resume IO
before waiting for the state change to Diskless.
We may well be able to continue IO to and from a healthy peer.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(You should not use disk-timeout anyways,
see the man page for why...)
We add incoming requests to the tail of some ring list.
On local completion, requests are removed from that list.
The timer looks only at the head of that ring list,
so is supposed to only see the oldest request.
All protected by a spinlock.
The request object is created with timestamps zeroed out.
The timestamp was only filled in just before the actual submit.
But to actually submit the request, we need to give up the spinlock.
If you are unlucky, there is no older still pending request, the timer
looks at a new request with timestamp still zero (before it even was
submitted), and 0 + timeout is most likely older than "now".
Better assign the timestamp right when we put the
request object on said ring list.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
GFP_NOWAIT has a value of 0. I.e. functionality not changed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The lc_destroy() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The status command originates the drbd9 code base. While for now we
keep the status information in /proc/drbd available, this commit
allows the user base to gracefully migrate their monitoring
infrastructure to the new status reporting interface.
In drbd9 no status information is exposed through /proc/drbd.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The events2 command originates from drbd-9 development. It features
more information but requires a incompatible change in output
format.
Therefore the previous events command continues to exist, the new
improved events2 command becomes available now.
This prepares the user-base for a later switch to the complete
drbd9 code base.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of using a rwlock for synchronizing state changes across
resources, take the request locks of all resources for global state
changes. Use resources_mutex to serialize global state changes.
This means that taking the request lock of a resource is now enough to
prevent changes of that resource. (Previously, a read lock on the
global state lock was needed as well.)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Also change the enum values to all-capital letters.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need to have these two as inline functions. In addition,
drbd_should_send_out_of_sync() is only used in a single place, anyway.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In drbd-8.4 there is always a single connection per resource,
and there is always exactly one peer_device for a device.
peer_device can not be NULL here.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
kthread_create_on_node takes format+args, so there's no need to do the
pretty-printing in advance. Moreover, "mtip_svc_thd_99" (including its
'\0') only just fits in 16 bytes, so if index could ever go above 99
we'd have a stack buffer overflow.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The gendisk structure has not been initialized when using lightnvm.
Make sure to not delete it upon exit. Also make sure that we use the
appropriate disk_name at unregistration.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
The linear addressing mode was removed in 7386af2. Make null_blk instead
expose the ppa format geometry and support the generic addressing mode.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of using a page pool, we can save memory by only allocating room
for 64 entries for the ppa command. Introduce a ppa_cache to allocate only
the required memory for the ppa list.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add support for registering as a LightNVM device. This allows us to
evaluate the performance of the LightNVM subsystem.
In /drivers/Makefile, LightNVM is moved above block device drivers
to make sure that the LightNVM media managers have been initialized
before drivers under /drivers/block are initialized.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Fix by Jens Axboe to remove unneeded slab cache and the following
memory leak.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Ceph updates from Sage Weil:
"There are several patches from Ilya fixing RBD allocation lifecycle
issues, a series adding a nocephx_sign_messages option (and associated
bug fixes/cleanups), several patches from Zheng improving the
(directory) fsync behavior, a big improvement in IO for direct-io
requests when striping is enabled from Caifeng, and several other
small fixes and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: clear msg->con in ceph_msg_release() only
libceph: add nocephx_sign_messages option
libceph: stop duplicating client fields in messenger
libceph: drop authorizer check from cephx msg signing routines
libceph: msg signing callouts don't need con argument
libceph: evaluate osd_req_op_data() arguments only once
ceph: make fsync() wait unsafe requests that created/modified inode
ceph: add request to i_unsafe_dirops when getting unsafe reply
libceph: introduce ceph_x_authorizer_cleanup()
ceph: don't invalidate page cache when inode is no longer used
rbd: remove duplicate calls to rbd_dev_mapping_clear()
rbd: set device_type::release instead of device::release
rbd: don't free rbd_dev outside of the release callback
rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
libceph: use local variable cursor instead of &msg->cursor
libceph: remove con argument in handle_reply()
ceph: combine as many iovec as possile into one OSD request
ceph: fix message length computation
ceph: fix a comment typo
rbd: drop null test before destroy functions
Currently when improperly aligned discard request is submitted, we just
silently discard more / less data which results in filesystem corruption
in some cases. Refuse such misaligned requests.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block IO poll support from Jens Axboe:
"Various groups have been doing experimentation around IO polling for
(really) fast devices. The code has been reviewed and has been
sitting on the side for a few releases, but this is now good enough
for coordinated benchmarking and further experimentation.
Currently O_DIRECT sync read/write are supported. A framework is in
the works that allows scalable stats tracking so we can auto-tune
this. And we'll add libaio support as well soon. Fow now, it's an
opt-in feature for test purposes"
* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
direct-io: be sure to assign dio->bio_bdev for both paths
directio: add block polling support
NVMe: add blk polling support
block: add block polling support
blk-mq: return tag/queue combo in the make_request_fn handlers
block: change ->make_request_fn() and users to return a queue cookie
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
Pull trivial updates from Jiri Kosina:
"Trivial stuff from trivial tree that can be trivially summed up as:
- treewide drop of spurious unlikely() before IS_ERR() from Viresh
Kumar
- cosmetic fixes (that don't really affect basic functionality of the
driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek
- various comment / printk fixes and updates all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
bcache: Really show state of work pending bit
hwmon: applesmc: fix comment typos
Kconfig: remove comment about scsi_wait_scan module
class_find_device: fix reference to argument "match"
debugfs: document that debugfs_remove*() accepts NULL and error values
net: Drop unlikely before IS_ERR(_OR_NULL)
mm: Drop unlikely before IS_ERR(_OR_NULL)
fs: Drop unlikely before IS_ERR(_OR_NULL)
drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
UBI: Update comments to reflect UBI_METAONLY flag
pktcdvd: drop null test before destroy functions
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
1. Rename dequeue_signal_lock() to kernel_dequeue_signal(). This
matches another "for kthreads only" kernel_sigaction() helper.
2. Remove the "tsk" and "mask" arguments, they are always current
and current->blocked. And it is simply wrong if tsk != current.
3. We could also remove the 3rd "siginfo_t *info" arg but it looks
potentially useful. However we can simplify the callers if we
change kernel_dequeue_signal() to accept info => NULL.
4. Remove _irqsave, it is never called from atomic context.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make is_partial_io()/valid_io_request()/page_zero_filled() return boolean,
since each function only uses either one or zero as its return value.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`mem_used_max' is designed to store the max amount of memory zram consumed
to store the data. However, it does not represent the actual
'overcommited' (max) value. The existing code goes to -ENOMEM
overcommited case before it updates `->stats.max_used_pages', which hides
the reason we went to -ENOMEM in the first place -- we actually used more
memory than `->limit_pages':
alloced_pages = zs_get_total_pages(meta->mem_pool);
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
zs_free(meta->mem_pool, handle);
ret = -ENOMEM;
goto out;
}
update_used_max(zram, alloced_pages);
Which is misleading. User will see -ENOMEM, check `->limit_pages', check
`->stats.max_used_pages', which will keep the value BEFORE zram passed
`->limit_pages', and see:
`->stats.max_used_pages' < `->limit_pages'
Move update_used_max() before we do `->limit_pages' check, so that
user will see:
`->stats.max_used_pages' > `->limit_pages'
should the overcommit and -ENOMEM happen.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the user supplies an unsupported compression algorithm, keep the
previously selected one (knowingly supported) or the default one (if the
compression algorithm hasn't been changed yet).
Note that previously this operation (i.e. setting an invalid algorithm)
would result in no algorithm being selected, which means that this
represents a small change in the default behaviour.
Minchan said:
For initializing zram, we need to set up 3 optional parameters in advance.
1. the number of compression streams
2. memory limitation
3. compression algorithm
Although user pass completely wrong value to set up for 1 and 2
parameters, it's okay because they have default value so zram will be
initialized with the default value (of course, when user passes a wrong
value via *echo*, sysfs returns -EINVAL so the user can notice it).
But 3 is not consistent with other optional parameters. IOW, if the
user passes a wrong value to set up 3 parameter, zram's initialization
would fail unlike other optional parameters.
So this patch makes them consistent.
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The asm-generic changes for 4.4 are mostly a series from Christoph Hellwig
to clean up various abuses of headers in there. The patch to rename the
io-64-nonatomic-*.h headers caused some conflicts with new users, so I
added a workaround that we can remove in the next merge window.
The only other patch is a warning fix from Marek Vasut
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVjzaf2CrR//JCVInAQImmhAA20fZ91sUlnA5skKNPT1phhF6Z7UF2Sx5
nPKcHQD3HA3lT1OKfPBYvCo+loYflvXFLaQThVylVcnE/8ecAEMtft4nnGW2nXvh
sZqHIZ8fszTB53cynAZKTjdobD1wu33Rq7XRzg0ugn1mdxFkOzCHW/xDRvWRR5TL
rdQjzzgvn2PNlqFfHlh6cZ5ykShM36AIKs3WGA0H0Y/aYsE9GmDOAUp41q1mLXnA
4lKQaIxoeOa+kmlsUB0wEHUecWWWJH4GAP+CtdKzTX9v12bGNhmiKUMCETG78BT3
uL8irSqaViNwSAS9tBxSpqvmVUsa5aCA5M3MYiO+fH9ifd7wbR65g/wq39D3Pc01
KnZ3BNVRW5XSA3c86pr8vbg/HOynUXK8TN0lzt6rEk8bjoPBNEDy5YWzy0t6reVe
wX65F+ver8upjOKe9yl2Jsg+5Kcmy79GyYjLUY3TU2mZ+dIdScy/jIWatXe/OTKZ
iB4Ctc4MDe9GDECmlPOWf98AXqsBUuKQiWKCN/OPxLtFOeWBvi4IzvFuO8QvnL9p
jZcRDmIlIWAcDX/2wMnLjV+Hqi3EeReIrYznxTGnO7HHVInF555GP51vFaG5k+SN
smJQAB0/sostmC1OCCqBKq5b6/li95/No7+0v0SUhJJ5o76AR5CcNsnolXesw1fu
vTUkB/I66Hk=
=dQKG
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanups from Arnd Bergmann:
"The asm-generic changes for 4.4 are mostly a series from Christoph
Hellwig to clean up various abuses of headers in there. The patch to
rename the io-64-nonatomic-*.h headers caused some conflicts with new
users, so I added a workaround that we can remove in the next merge
window.
The only other patch is a warning fix from Marek Vasut"
* tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: temporarily add back asm-generic/io-64-nonatomic*.h
asm-generic: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations
gpio-mxc: stop including <asm-generic/bug>
n_tracesink: stop including <asm-generic/bug>
n_tracerouter: stop including <asm-generic/bug>
mlx5: stop including <asm-generic/kmap_types.h>
hifn_795x: stop including <asm-generic/kmap_types.h>
drbd: stop including <asm-generic/kmap_types.h>
move count_zeroes.h out of asm-generic
move io-64-nonatomic*.h out of asm-generic
Pull block driver updates from Jens Axboe:
"Here are the block driver changes for 4.4. This pull request
contains:
- NVMe:
- Refactor and moving of code to prepare for proper target
support. From Christoph and Jay.
- 32-bit nvme warning fix from Arnd.
- Error initialization fix from me.
- Proper namespace removal and reference counting support from
Keith.
- Device resume fix on IO failure, also from Keith.
- Dependency fix from Keith, now that nvme isn't under the
umbrella of the block anymore.
- Target location and maintainers update from Jay.
- From Ming Lei, the long awaited DIO/AIO support for loop.
- Enable BD-RE writeable opens, from Georgios"
* 'for-4.4/drivers' of git://git.kernel.dk/linux-block: (24 commits)
Update target repo for nvme patch contributions
NVMe: initialize error to '0'
nvme: use an integer value to Linux errno values
nvme: fix 32-bit build warning
NVMe: Add explicit block config dependency
nvme: include <linux/types.ĥ> in <linux/nvme.h>
nvme: move to a new drivers/nvme/host directory
nvme.h: add missing nvme_id_ctrl endianess annotations
nvme: move hardware structures out of the uapi version of nvme.h
nvme: add a local nvme.h header
nvme: properly handle partially initialized queues in nvme_create_io_queues
nvme: merge nvme_dev_start, nvme_dev_resume and nvme_async_probe
nvme: factor reset code into a common helper
nvme: merge nvme_dev_reset into nvme_reset_failed_dev
nvme: delete dev from dev_list in nvme_reset
NVMe: Simplify device resume on io queue failure
NVMe: Namespace removal simplifications
NVMe: Reference count open namespaces
cdrom: Random writing support for BD-RE media
block: loop: support DIO & AIO
...
- Improve balloon driver memory hotplug placement.
- Use unpopulated hotplugged memory for foreign pages (if
supported/enabled).
- Support 64 KiB guest pages on arm64.
- CPU hotplug support on arm/arm64.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWOeSkAAoJEFxbo/MsZsTRph0H/0nE8Tx0GyGtOyCYfBdInTvI
WgjvL8VR1XrweZMVis3668MzhLSYg6b5lvJsoi+L3jlzYRyze43iHXsKfvp+8p0o
TVUhFnlHEHF8ASEtPydAi6HgS7Dn9OQ9LaZ45R1Gk0rHnwJjIQonhTn2jB0yS9Am
Hf4aZXP2NVZphjYcloqNsLH0G6mGLtgq8cS0uKcVO2YIrR4Dr3sfj9qfq9mflf8n
sA/5ifoHRfOUD1vJzYs4YmIBUv270jSsprWK/Mi2oXIxUTBpKRAV1RVCAPW6GFci
HIZjIJkjEPWLsvxWEs0dUFJQGp3jel5h8vFPkDWBYs3+9rILU2DnLWpKGNDHx3k=
=vUfa
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
- Improve balloon driver memory hotplug placement.
- Use unpopulated hotplugged memory for foreign pages (if
supported/enabled).
- Support 64 KiB guest pages on arm64.
- CPU hotplug support on arm/arm64.
* tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (44 commits)
xen: fix the check of e_pfn in xen_find_pfn_range
x86/xen: add reschedule point when mapping foreign GFNs
xen/arm: don't try to re-register vcpu_info on cpu_hotplug.
xen, cpu_hotplug: call device_offline instead of cpu_down
xen/arm: Enable cpu_hotplug.c
xenbus: Support multiple grants ring with 64KB
xen/grant-table: Add an helper to iterate over a specific number of grants
xen/xenbus: Rename *RING_PAGE* to *RING_GRANT*
xen/arm: correct comment in enlighten.c
xen/gntdev: use types from linux/types.h in userspace headers
xen/gntalloc: use types from linux/types.h in userspace headers
xen/balloon: Use the correct sizeof when declaring frame_list
xen/swiotlb: Add support for 64KB page granularity
xen/swiotlb: Pass addresses rather than frame numbers to xen_arch_need_swiotlb
arm/xen: Add support for 64KB page granularity
xen/privcmd: Add support for Linux 64KB page granularity
net/xen-netback: Make it running on 64KB page granularity
net/xen-netfront: Make it running on 64KB page granularity
block/xen-blkback: Make it running on 64KB page granularity
block/xen-blkfront: Make it running on 64KB page granularity
...
Commit d1cf578845 ("rbd: set mapping info earlier") defined
rbd_dev_mapping_clear(), but, just a few days after, commit
f35a4dee14 ("rbd: set the mapping size and features later") moved
rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear()
call instead of moving the old one. Around the same time, another
duplicate was introduced in rbd_dev_device_release() - kill both.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
No point in providing an empty device_type::release callback and then
setting device::release for each rbd_dev dynamically.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
struct rbd_device has struct device embedded in it, which means it's
part of kobject universe and has an unpredictable life cycle. Freeing
its memory outside of the release callback is flawed, yet commits
200a6a8be5 ("rbd: don't destroy rbd_dev in device release function")
and 8ad42cd0c0 ("rbd: don't have device release destroy rbd_dev")
moved rbd_dev_destroy() out to rbd_dev_image_release().
This commit reverts most of that, the key points are:
- rbd_dev->dev is initialized in rbd_dev_create(), making it possible
to use rbd_dev_destroy() - which is just a put_device() - both before
we register with device core and after.
- rbd_dev_release() (the release callback) is the only place we
kfree(rbd_dev). It's also where we do module_put(), keeping the
module unload race window as small as possible.
- We pin the module in rbd_dev_create(), but only for mapping
rbd_dev-s. Moving image related stuff out of struct rbd_device into
another struct which isn't tied with sysfs and device core is long
overdue, but until that happens, this will keep rbd module refcount
(which users can observe with lsmod) sane.
Fixes: http://tracker.ceph.com/issues/12697
Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes
userspace think it needs to retry the write. Fix it - it's a leftover
from the times when the equivalent of rbd_dev_create() was the first
action in rbd_add().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
rbd requires stable pages, as it performs a crc of the page data before
they are send to the OSDs.
But since kernel 3.9 (patch 1d1d1a7672
"mm: only enforce stable page writes if the backing device requires
it") it is not assumed anymore that block devices require stable pages.
This patch sets the necessary flag to get stable pages back for rbd.
In a ceph installation that provides multiple ext4 formatted rbd
devices "bad crc" messages appeared regularly (ca 1 message every 1-2
minutes on every OSD that provided the data for the rbd) in the
OSD-logs before this patch. After this patch this messages are pretty
much gone (only ca 1-2 / month / OSD).
Cc: stable@vger.kernel.org # 3.9+, needs backporting
Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>
[idryomov@gmail.com: require stable pages only in crc case, changelog]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull block layer fixes from Jens Axboe:
"A final set of fixes for 4.3.
It is (again) bigger than I would have liked, but it's all been
through the testing mill and has been carefully reviewed by multiple
parties. Each fix is either a regression fix for this cycle, or is
marked stable. You can scold me at KS. The pull request contains:
- Three simple fixes for NVMe, fixing regressions since 4.3. From
Arnd, Christoph, and Keith.
- A single xen-blkfront fix from Cathy, fixing a NULL dereference if
an error is returned through the staste change callback.
- Fixup for some bad/sloppy code in nbd that got introduced earlier
in this cycle. From Markus Pargmann.
- A blk-mq tagset use-after-free fix from Junichi.
- A backing device lifetime fix from Tejun, fixing a crash.
- And finally, a set of regression/stable fixes for cgroup writeback
from Tejun"
* 'for-linus' of git://git.kernel.dk/linux-block:
writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy()
NVMe: Fix memory leak on retried commands
block: don't release bdi while request_queue has live references
nvme: use an integer value to Linux errno values
blk-mq: fix use-after-free in blk_mq_free_tag_set()
nvme: fix 32-bit build warning
writeback: fix incorrect calculation of available memory for memcg domains
writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions
writeback: bdi_writeback iteration must not skip dying ones
writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
nbd: Add locking for tasks
xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
Mapping an image with a long parent chain (e.g. image foo, whose parent
is bar, whose parent is baz, etc) currently leads to a kernel stack
overflow, due to the following recursion in the reply path:
rbd_osd_req_callback()
rbd_obj_request_complete()
rbd_img_obj_callback()
rbd_img_parent_read_callback()
rbd_obj_request_complete()
...
Limit the parent chain to 16 images, which is ~5K worth of stack. When
the above recursion is eliminated, this limit can be lifted.
Fixes: http://tracker.ceph.com/issues/12538
Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Currently we leak parent_spec and trigger a "parent reference
underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
The problem is we take the !parent out_err branch and that only drops
refcounts; parent_spec that would've been freed had we called
rbd_dev_unparent() remains and triggers rbd_warn() in
rbd_dev_parent_put() - at that point we have parent_spec != NULL and
parent_ref == 0, so counter ends up being -1 after the decrement.
Redo rbd_dev_probe_parent() to fix this.
Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Linux may use a different page size than the size of grant. So make
clear that the order is actually in number of grant.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
The PV block protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity behaving as a
block backend on a non-modified Xen.
It's only necessary to adapt the ring size and the number of request per
indirect frames. The rest of the code is relying on the grant table
code.
Note that the grant table code is allocating a Linux page per grant
which will result to waste 6OKB for every grant when Linux is using 64KB
page granularity. This could be improved by sharing the page between
multiple grants.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
The PV block protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity using block
device on a non-modified Xen.
The block API is using segment which should at least be the size of a
Linux page. Therefore, the driver will have to break the page in chunk
of 4K before giving the page to the backend.
When breaking a 64KB segment in 4KB chunks, it is possible that some
chunks are empty. As the PV protocol always require to have data in the
chunk, we have to count the number of Xen page which will be in use and
avoid sending empty chunks.
Note that, a pre-defined number of grants are reserved before preparing
the request. This pre-defined number is based on the number and the
maximum size of the segments. If each segment contains a very small
amount of data, the driver may reserve too many grants (16 grants is
reserved per segment with 64KB page granularity).
Furthermore, in the case of persistent grants we allocate one Linux page
per grant although only the first 4KB of the page will be effectively
in use. This could be improved by sharing the page with multiple grants.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Prepare the code to support 64KB page granularity. The first
implementation will use a full Linux page per indirect and persistent
grant. When non-persistent grant is used, each page of a bio request
may be split in multiple grant.
Furthermore, the field page of the grant structure is only used to copy
data from persistent grant or indirect grant. Avoid to set it for other
use case as it will have no meaning given the page will be split in
multiple grant.
Provide 2 functions, to setup indirect grant, the other for bio page.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
All the usage of the field pfn are done using the same idiom:
pfn_to_page(grant->pfn)
This will return always the same page. Store directly the page in the
grant to clean up the code.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Currently, blkif_queue_request has 2 distinct execution path:
- Send a discard request
- Send a read/write request
The function is also allocating grants to use for generating the
request. Although, this is only used for read/write request.
Rather than having a function with 2 distinct execution path, separate
the function in 2. This will also remove one level of tabulation.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
This covers only the simplest case - an object size sized write, but
it's still useful in tiering setups when EC is used for the base tier
as writefull op can be proxied, saving an object promotion.
Even though updating ceph_osdc_new_request() to allow writefull should
just be a matter of fixing an assert, I didn't do it because its only
user is cephfs. All other sites were updated.
Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Commit 30e2bc08b2 ("Revert "block: remove artifical max_hw_sectors
cap"") restored a clamp on max_sectors. It's now 2560 sectors instead
of 1024, but it's not good enough: we set max_hw_sectors to rbd object
size because we don't want object sized I/Os to be split, and the
default object size is 4M.
So, set max_sectors to max_hw_sectors in rbd at queue init time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Resources are reallocated for requeued commands, so unmap and release
the iod for the failed command.
It's a pretty bad memory leak and causes a kernel hang if you remove a
drive because of a busy dma pool. You'll get messages spewing like this:
nvme 0000:xx:xx.x: dma_pool_destroy prp list 256, ffff880420dec000 busy
and lock up pci and the driver since removal never completes while
holding a lock.
Cc: stable@vger.kernel.org
Cc: <stable@vger.kernel.org> # 4.0.x-
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use a separate integer variable to hold the signed Linux errno
values we pass back to the block layer. Note that for pass through
commands those might still be NVMe values, but those fit into the
int as well.
Fixes: f4829a9b7a: ("blk-mq: fix racy updates of rq->errors")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
<linux/highmem.h> is the placace the get the kmap type flags, asm-generic
files are generic implementations only to be used by architecture code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
These are not implementations of default architecture code but helpers
for drivers. Move them to the place they belong to.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Compiling the nvme driver on 32-bit warns about a cast from a __u64
variable to a pointer:
drivers/block/nvme-core.c: In function 'nvme_submit_io':
drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
(void __user *)io.addr, length, NULL, 0);
The cast here is intentional and safe, so we can shut up the
gcc warning by adding an intermediate cast to 'uintptr_t'.
I had previously submitted a patch to fix this problem in the
nvme driver, but it was accepted on the same day that two new
warnings got added.
For clarification, I also change the third instance of this cast
to use uintptr_t instead of unsigned long now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: d29ec8241c ("nvme: submit internal commands through the block layer")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch moves the NVMe driver from drivers/block/ to its own new
drivers/nvme/host/ directory. This is in preparation of splitting the
current monolithic driver up and add support for the upcoming NVMe
over Fabrics standard. The drivers/nvme/host/ is chose to leave space
for a NVMe target implementation in addition to this host side driver.
Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
[hch: rebased, renamed core.c to pci.c, slight tweaks]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently all NVMe command and completion structures are exposed to userspace
through the uapi version of nvme.h. They are not an ABI between the kernel
and userspace, and will change in C-incompatible way for future versions of
the spec. Move them to the kernel version of the file and rename the uapi
header to nvme_ioctl.h so that userspace can easily detect the presence of
the new clean header. Nvme-cli already carries a local copy of the header,
so it won't be affected by this move.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a new drivers/block/nvme.h which contains all the driver internal
interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This avoids having to clean up later in a seemingly unrelated place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And give the resulting function a sensible name. This keeps all the
error handling in a single place and will allow for further improvements
to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And give the resulting function a more descriptive name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Device resets need to delete the device from the device list before
kicking of the reset an re-probe, otherwise we get the device added
to the list twice. nvme_reset is the only side missing this deletion
at the moment, and this patch adds it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Releasing IO queues and disks was done in a work queue outside the
controller resume context to delete namespaces if the controller failed
after a resume from suspend. This is unnecessary since we can resume
a device asynchronously.
This patch makes resume use probe_work so it can directly remove
namespaces if the device is manageable but not IO capable. Since the
deleting disks was the only reason we had the convoluted "reset_workfn",
this patch removes that unnecessary indirection.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This liberates namespace removal from the device, allowing gendisk
references to be closed independent of the nvme controller reference
count.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Dynamic namespace attachment means the namespace may be removed at any
time, so the namespace reference count can not be tied to the device
reference count. This fixes a NULL dereference if an opened namespace
is detached from a controller.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The timeout handling introduced in
7e2893a16d (nbd: Fix timeout detection)
introduces a race condition which may lead to killing of tasks that are
not in nbd context anymore. This was not observed or reproducable yet.
This patch adds locking to critical use of task_recv and task_send to
avoid killing tasks that already left the NBD thread functions. This
lock is only acquired if a timeout occures or the nbd device
starts/stops.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 7e2893a16d ("nbd: Fix timeout detection")
Signed-off-by: Jens Axboe <axboe@fb.com>
Konrad writes:
Please git pull an update branch to your 'for-4.3/drivers' branch (which
oddly I don't see does not have the previous pull?)
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-4.3
which has two fixes - one where we use the Xen blockfront EFI driver and
don't release all the requests, the other if the allocation of resources
for a particular state failed - we would go back 'Closing' and assume
that an structure would be allocated while in fact it may not be - and
crash.
xen-blkfront will crash if the check to talk_to_blkback()
in blkback_changed()(XenbusStateInitWait) returns an error.
The driver data is freed and info is set to NULL. Later during
the close process via talk_to_blkback's call to xenbus_dev_fatal()
the null pointer is passed to and dereference in blkfront_closing.
CC: stable@vger.kernel.org
Signed-off-by: Cathy Avery <cathy.avery@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
blk_mq_complete_request may be a no-op if the request has already
been completed by others means (e.g. a timeout or cancellation), but
currently drivers have to set rq->errors before calling
blk_mq_complete_request, which might leave us with the wrong error value.
Add an error parameter to blk_mq_complete_request so that we can
defer setting rq->errors until we known we won the race to complete the
request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The asynchronous namespace scanning caused affinity hints to be set before
its tagset initialized, so there was no cpu mask to set the hint. This
patch moves the affinity hint setting to after namespaces are scanned.
Reported-by: 김경산 <ks0204.kim@samsung.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There are at least 3 advantages to use direct I/O and AIO on
read/write loop's backing file:
1) double cache can be avoided, then memory usage gets
decreased a lot
2) not like user space direct I/O, there isn't cost of
pinning pages
3) avoid context switch for obtaining good throughput
- in buffered file read, random I/O top throughput is often obtained
only if they are submitted concurrently from lots of tasks; but for
sequential I/O, most of times they can be hit from page cache, so
concurrent submissions often introduce unnecessary context switch
and can't improve throughput much. There was such discussion[1]
to use non-blocking I/O to improve the problem for application.
- with direct I/O and AIO, concurrent submissions can be
avoided and random read throughput can't be affected meantime
xfstests(-g auto, ext4) is basically passed when running with
direct I/O(aio), one exception is generic/232, but it failed in
loop buffered I/O(4.2-rc6-next-20150814) too.
Follows the fio test result for performance purpose:
4 jobs fio test inside ext4 file system over loop block
1) How to run
- KVM: 4 VCPUs, 2G RAM
- linux kernel: 4.2-rc6-next-20150814(base) with the patchset
- the loop block is over one image on SSD.
- linux psync, 4 jobs, size 1500M, ext4 over loop block
- test result: IOPS from fio output
2) Throughput(IOPS) becomes a bit better with direct I/O(aio)
-------------------------------------------------------------
test cases |randread |read |randwrite |write |
-------------------------------------------------------------
base |8015 |113811 |67442 |106978
-------------------------------------------------------------
base+loop aio |8136 |125040 |67811 |111376
-------------------------------------------------------------
- somehow, it should be caused by more page cache avaiable for
application or one extra page copy is avoided in case of direct I/O
3) context switch
- context switch decreased by ~50% with loop direct I/O(aio)
compared with loop buffered I/O(4.2-rc6-next-20150814)
4) memory usage from /proc/meminfo
-------------------------------------------------------------
| Buffers | Cached
-------------------------------------------------------------
base | > 760MB | ~950MB
-------------------------------------------------------------
base+loop direct I/O(aio) | < 5MB | ~1.6GB
-------------------------------------------------------------
- so there are much more page caches available for application with
direct I/O
[1] https://lwn.net/Articles/612483/
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
If loop block is mounted via 'mount -o loop', it isn't easy
to pass file descriptor opened as O_DIRECT, so this patch
introduces a new command to support direct IO for this case.
Cc: linux-api@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patches provides one interface for enabling direct IO
from user space:
- userspace(such as losetup) can pass 'file' which is
opened/fcntl as O_DIRECT
Also __loop_update_dio() is introduced to check if direct I/O
can be used on current loop setting.
The last big change is to introduce LO_FLAGS_DIRECT_IO flag
for userspace to know if direct IO is used to access backing
file.
Cc: linux-api@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The following patch will use dio/aio to submit IO to backing file,
then it needn't to schedule IO concurrently from work, so
use kthread_work for decreasing context switch cost a lot.
For non-AIO case, single thread has been used for long long time,
and it was just converted to work in v4.0, which has caused performance
regression for fedora live booting already. In discussion[1], even
though submitting I/O via work concurrently can improve random read IO
throughput, meantime it might hurt sequential read IO performance, so
better to restore to single thread behaviour.
For the following AIO support, it is better to use multi hw-queue
with per-hwq kthread than current work approach suppose there is so
high performance requirement for loop.
[1] http://marc.info/?t=143082678400002&r=1&w=2
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
It doesn't make sense to enable merge because the I/O
submitted to backing file is handled page by page.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Konrad writes:
It has one fix that should go in and also be put in stable tree (I've
added the CC already).
It is a fix for a memory leak that can exposed via using UEFI
xen-blkfront driver.
This is due to commit 86839c56de
"xen/block: add multi-page ring support"
When using an guest under UEFI - after the domain is destroyed
the following warning comes from blkback.
------------[ cut here ]------------
WARNING: CPU: 2 PID: 95 at
/home/julien/works/linux/drivers/block/xen-blkback/xenbus.c:274
xen_blkif_deferred_free+0x1f4/0x1f8()
Modules linked in:
CPU: 2 PID: 95 Comm: kworker/2:1 Tainted: G W 4.2.0 #85
Hardware name: APM X-Gene Mustang board (DT)
Workqueue: events xen_blkif_deferred_free
Call trace:
[<ffff8000000890a8>] dump_backtrace+0x0/0x124
[<ffff8000000891dc>] show_stack+0x10/0x1c
[<ffff8000007653bc>] dump_stack+0x78/0x98
[<ffff800000097e88>] warn_slowpath_common+0x9c/0xd4
[<ffff800000097f80>] warn_slowpath_null+0x14/0x20
[<ffff800000557a0c>] xen_blkif_deferred_free+0x1f0/0x1f8
[<ffff8000000ad020>] process_one_work+0x160/0x3b4
[<ffff8000000ad3b4>] worker_thread+0x140/0x494
[<ffff8000000b2e34>] kthread+0xd8/0xf0
---[ end trace 6f859b7883c88cdd ]---
Request allocation has been moved to connect_ring, which is called every
time blkback connects to the frontend (this can happen multiple times during
a blkback instance life cycle). On the other hand, request freeing has not
been moved, so it's only called when destroying the backend instance. Due to
this mismatch, blkback can allocate the request pool multiple times, without
freeing it.
In order to fix it, move the freeing of requests to xen_blkif_disconnect to
restore the symmetry between request allocation and freeing.
Reported-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Julien Grall <julien.grall@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: xen-devel@lists.xenproject.org
CC: stable@vger.kernel.org # 4.2
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull block updates from Jens Axboe:
"This is a bit bigger than it should be, but I could (did) not want to
send it off last week due to both wanting extra testing, and expecting
a fix for the bounce regression as well. In any case, this contains:
- Fix for the blk-merge.c compilation warning on gcc 5.x from me.
- A set of back/front SG gap merge fixes, from me and from Sagi.
This ensures that we honor SG gapping for integrity payloads as
well.
- Two small fixes for null_blk from Matias, fixing a leak and a
capacity propagation issue.
- A blkcg fix from Tejun, fixing a NULL dereference.
- A fast clone optimization from Ming, fixing a performance
regression since the arbitrarily sized bio's were introduced.
- Also from Ming, a regression fix for bouncing IOs"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix bounce_end_io
block: blk-merge: fast-clone bio when splitting rw bios
block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
block: Copy a user iovec if it includes gaps
block: Refuse adding appending a gapped integrity page to a bio
block: Refuse request/bio merges with gaps in the integrity payload
block: Check for gaps on front and back merges
null_blk: fix wrong capacity when bs is not 512 bytes
null_blk: fix memory leak on cleanup
block: fix bogus compiler warnings in blk-merge.c
zcomp_create() verifies the success of zcomp_strm_{multi,single}_create()
through comp->stream, which can potentially be pointing to memory that
was freed if these functions returned an error.
While at it, replace a 'ERR_PTR(-ENOMEM)' by a more generic
'ERR_PTR(error)' as in the future zcomp_strm_{multi,siggle}_create()
could return other error codes. Function documentation updated
accordingly.
Fixes: beca3ec71f ("zram: add multi stream functionality")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph update from Sage Weil:
"There are a few fixes for snapshot behavior with CephFS and support
for the new keepalive protocol from Zheng, a libceph fix that affects
both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
and several small fixes and cleanups from Jianpeng and others"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: improve readahead for file holes
ceph: get inode size for each append write
libceph: check data_len in ->alloc_msg()
libceph: use keepalive2 to verify the mon session is alive
rbd: plug rbd_dev->header.object_prefix memory leak
rbd: fix double free on rbd_dev->header_name
libceph: set 'exists' flag for newly up osd
ceph: cleanup use of ceph_msg_get
ceph: no need to get parent inode in ceph_open
ceph: remove the useless judgement
ceph: remove redundant test of head->safe and silence static analysis warnings
ceph: fix queuing inode to mdsdir's snaprealm
libceph: rename con_work() to ceph_con_workfn()
libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
libceph: remove the unused macro AES_KEY_SIZE
ceph: invalidate dirty pages after forced umount
ceph: EIO all operations after forced umount
- Use the correct GFN/BFN terms more consistently.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV8VRMAAoJEFxbo/MsZsTRiGQH/i/jrAJUJfrFC2PINaA2gDwe
O0dlrkCiSgAYChGmxxxXZQSPM5Po5+EbT/dLjZ/uvSooeorM9RYY/mFo7ut/qLep
4pyQUuwGtebWGBZTrj9sygUVXVhgJnyoZxskNUbhj9zvP7hb9++IiI78mzne6cpj
lCh/7Z2dgpfRcKlNRu+qpzP79Uc7OqIfDK+IZLrQKlXa7IQDJTQYoRjbKpfCtmMV
BEG3kN9ESx5tLzYiAfxvaxVXl9WQFEoktqe9V8IgOQlVRLgJ2DQWS6vmraGrokWM
3HDOCHtRCXlPhu1Vnrp0R9OgqWbz8FJnmVAndXT8r3Nsjjmd0aLwhJx7YAReO/4=
=JDia
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.3-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen terminology fixes from David Vrabel:
"Use the correct GFN/BFN terms more consistently"
* tag 'for-linus-4.3-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/xenbus: Rename the variable xen_store_mfn to xen_store_gfn
xen/privcmd: Further s/MFN/GFN/ clean-up
hvc/xen: Further s/MFN/GFN clean-up
video/xen-fbfront: Further s/MFN/GFN clean-up
xen/tmem: Use xen_page_to_gfn rather than pfn_to_gfn
xen: Use correctly the Xen memory terminologies
arm/xen: implement correctly pfn_to_mfn
xen: Make clear that swiotlb and biomerge are dealing with DMA address
virtio-mmio can now be auto-loaded through acpi.
virtio blk supports extended partitions.
total memory is better reported when using virtio balloon with auto-deflate.
cache control is re-enabled when using virtio-blk in modern mode.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV7/iUAAoJECgfDbjSjVRpzRIIAMJz5enk2FbyiOsky5THASqK
rWAEx+Jz5Av3FVM99W+Fs8dmXrBGkCTyRgmX3mZx+pGNhOsLjc6xHElTekzwQBas
18oFsQDTWk0rPZkO7X6/OY3GgHyq8VSI/2kW+yKMYj4k/gAaGV+Azf0m5DGMp4wn
rngevvcDSkFEN2Gjxbi3MWT8sJ9Br/EFg6KzKUOiOycHh/rQZsaQK6tq4kpUZkil
2wvXspQY6ix2fD558gsiqUv6l8yZ9mXVgxtyRcnxjfM+8jr+qtqai8yVGCAC5krh
50RXrR22tN/6aLYuvtPj46+Y0dleCMEwW2rSmcw5h5yr3B/0ylnZ1YNp6vlFyv0=
=8xQ2
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Virtio fixes and features for 4.3:
- virtio-mmio can now be auto-loaded through acpi.
- virtio blk supports extended partitions.
- total memory is better reported when using virtio balloon with
auto-deflate.
- cache control is re-enabled when using virtio-blk in modern mode"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_balloon: do not change memory amount visible via /proc/meminfo
virtio_ballon: change stub of release_pages_by_pfn
virtio-blk: Allow extended partitions
virtio_mmio: add ACPI probing
virtio-blk: use VIRTIO_BLK_F_WCE and VIRTIO_BLK_F_CONFIG_WCE in virtio1
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...
Make zram syslog error reporting more consistent. We have random
error levels in some places. For example, critical errors like
"Error allocating memory for compressed page"
and
"Unable to allocate temp memory"
are reported as KERN_INFO messages.
a) Reassign error levels
Error messages that directly affect zram
functionality -- pr_err():
Error allocating zram address table
Error creating memory pool
Decompression failed! err=%d, page=%u
Unable to allocate temp memory
Compression failed! err=%d
Error allocating memory for compressed page: %u, size=%zu
Cannot initialise %s compressing backend
Error allocating disk queue for device %d
Error allocating disk structure for device %d
Error creating sysfs group for device %d
Unable to register zram-control class
Unable to get major number
Messages that do not affect functionality, but user
must be warned (because sysfs attrs will be removed in
this particular case) -- pr_warn():
%d (%s) Attribute %s (and others) will be removed. %s
Messages that do not affect functionality and mostly are
informative -- pr_info():
Cannot change max compression streams
Can't change algorithm for initialized device
Cannot change disksize for initialized device
Added device: %s
Removed device: %s
b) Update sysfs_create_group() error message
First, it lacks a trailing new line; add it. Second, every error message
in zram_add() has a "for device %d" part, which makes errors more
informative. Add missing part to "Error creating sysfs group" message.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction returns back to zram the number of migrated objects, which is
quite uninformative -- we have objects of different sizes so user space
cannot obtain any valuable data from that number. Change compaction to
operate in terms of pages and return back to compaction issuer the
number of pages that were freed during compaction. So from now on we
will export more meaningful value in zram<id>/mm_stat -- the number of
freed (compacted) pages.
This requires:
(a) a rename of `num_migrated' to 'pages_compacted'
(b) a internal API change -- return first_page's fullness_group from
putback_zspage(), so we know when putback_zspage() did
free_zspage(). It helps us to account compaction stats correctly.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`zs_compact_control' accounts the number of migrated objects but it has
a limited lifespan -- we lose it as soon as zs_compaction() returns back
to zram. It worked fine, because (a) zram had it's own counter of
migrated objects and (b) only zram could trigger compaction. However,
this does not work for automatic pool compaction (not issued by zram).
To account objects migrated during auto-compaction (issued by the
shrinker) we need to store this number in zs_pool.
Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
there. It provides only `num_migrated', as of this writing, but it
surely can be extended.
A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
caller.
Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map. This facility is used by the pmem driver to
enable pfn_to_page() operations on the page frames returned by DAX
('direct_access' in 'struct block_device_operations'). For now, the
'memmap' allocation for these "device" pages comes from "System
RAM". Support for allocating the memmap from device memory will
arrive in a later kernel.
2/ Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3. Completion of
the conversion is targeted for v4.4.
3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
5/ Miscellaneous updates and fixes to libnvdimm including support
for issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
slsw6DkrWT60CRE42nbK
=o57/
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
Need to free object_prefix when rbd_dev_v2_snap_context() fails, but
only if this is the first time we are reading in the header.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name
is freed twice: once in rbd_dev_probe_parent() and then in its caller
rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to
handle parent images).
rbd_dev_probe_parent() is responsible for probing the parent, so it
shouldn't muck with clone's fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
- Convert xen-blkfront to the multiqueue API
- [arm] Support binding event channels to different VCPUs.
- [x86] Support > 512 GiB in a PV guests (off by default as such a
guest cannot be migrated with the current toolstack).
- [x86] PMU support for PV dom0 (limited support for using perf with
Xen and other guests).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV7wIdAAoJEFxbo/MsZsTR0hEH/04HTKLKGnSJpZ5WbMPxqZxE
UqGlvhvVWNAmFocZmbPcEi9T1qtcFrX5pM55JQr6UmAp3ovYsT2q1Q1kKaOaawks
pSfc/YEH3oQW5VUQ9Lm9Ru5Z8Btox0WrzRREO92OF36UOgUOBOLkGsUfOwDinNIM
lSk2djbYwDYAsoeC3PHB32wwMI//Lz6B/9ZVXcyL6ULynt1ULdspETjGnptRPZa7
JTB5L4/soioKOn18HDwwOhKmvaFUPQv9Odnv7dc85XwZreajhM/KMu3qFbMDaF/d
WVB1NMeCBdQYgjOrUjrmpyr5uTMySiQEG54cplrEKinfeZgKlEyjKvjcAfJfiac=
=Ktjl
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.3-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Xen features and fixes for 4.3:
- Convert xen-blkfront to the multiqueue API
- [arm] Support binding event channels to different VCPUs.
- [x86] Support > 512 GiB in a PV guests (off by default as such a
guest cannot be migrated with the current toolstack).
- [x86] PMU support for PV dom0 (limited support for using perf with
Xen and other guests)"
* tag 'for-linus-4.3-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (33 commits)
xen: switch extra memory accounting to use pfns
xen: limit memory to architectural maximum
xen: avoid another early crash of memory limited dom0
xen: avoid early crash of memory limited dom0
arm/xen: Remove helpers which are PV specific
xen/x86: Don't try to set PCE bit in CR4
xen/PMU: PMU emulation code
xen/PMU: Intercept PMU-related MSR and APIC accesses
xen/PMU: Describe vendor-specific PMU registers
xen/PMU: Initialization code for Xen PMU
xen/PMU: Sysfs interface for setting Xen PMU mode
xen: xensyms support
xen: remove no longer needed p2m.h
xen: allow more than 512 GB of RAM for 64 bit pv-domains
xen: move p2m list if conflicting with e820 map
xen: add explicit memblock_reserve() calls for special pages
mm: provide early_memremap_ro to establish read-only mapping
xen: check for initrd conflicting with e820 map
xen: check pre-allocated page tables for conflict with memory map
xen: check for kernel memory conflicting with memory layout
...
Based on include/xen/mm.h [1], Linux is mistakenly using MFN when GFN
is meant, I suspect this is because the first support for Xen was for
PV. This resulted in some misimplementation of helpers on ARM and
confused developers about the expected behavior.
For instance, with pfn_to_mfn, we expect to get an MFN based on the name.
Although, if we look at the implementation on x86, it's returning a GFN.
For clarity and avoid new confusion, replace any reference to mfn with
gfn in any helpers used by PV drivers. The x86 code will still keep some
reference of pfn_to_mfn which may be used by all kind of guests
No changes as been made in the hypercall field, even
though they may be invalid, in order to keep the same as the defintion
in xen repo.
Note that page_to_mfn has been renamed to xen_page_to_gfn to avoid a
name to close to the KVM function gfn_to_page.
Take also the opportunity to simplify simple construction such
as pfn_to_mfn(page_to_pfn(page)) into xen_page_to_gfn. More complex clean up
will come in follow-up patches.
[1] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=e758ed14f390342513405dd766e874934573e6cb
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
This will allow up to DISK_MAX_PARTS (256) partitions, with for example
GPT in the guest. Otherwise, the partition scan code will only discover
the first 15 partitions.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance
(up to 2x, though more realistically +30-40%) in latency-bound workloads.
However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH.
It will be restored in the next revision of the virtio 1.0 standard, so
do the same in Linux.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
set_capacity() sets device's capacity using 512 bytes sectors.
null_blk calculates the number of sectors by size / bs, which
set_capacity is called with. This led to null_blk exposing the
wrong number of sectors when bs is not 512 bytes.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Driver was not freeing the memory allocated for internal nullb queues.
This patch frees the memory during driver unload.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block driver updates from Jens Axboe:
"On top of the 4.3 core block IO changes, here are the driver related
changes for 4.3. Basically just NVMe and nbd this time around:
- NVMe:
- PRACT PI improvement from Alok Pandey.
- Cleanups and improvements on submission queue doorbell and
writing, using CMB if available. From Jon Derrick.
- From Keith, support for setting queue maximum segments, and
reset support.
- Also from Jon, fixup of u64 division issue on 32-bit archs and
wiring up of the reset support through and ioctl.
- Two small cleanups from Matias and Sunad
- Various code cleanups and fixes from Markus Pargmann"
* 'for-4.3/drivers' of git://git.kernel.dk/linux-block:
NVMe: Using PRACT bit to generate and verify PI by controller
NVMe:Remove unreachable code in nvme_abort_req
NVMe: Add nvme subsystem reset IOCTL
NVMe: Add nvme subsystem reset support
NVMe: removed unused nn var from nvme_dev_add
NVMe: Set queue max segments
nbd: flags is a u32 variable
nbd: Rename functions for clearness of recv/send path
nbd: Change 'disconnect' to be boolean
nbd: Add debugfs entries
nbd: Remove variable 'pid'
nbd: Move clear queue debug message
nbd: Remove 'harderror' and propagate error properly
nbd: restructure sock_shutdown
nbd: sock_shutdown, remove conditional lock
nbd: Fix timeout detection
nvme: Fixes u64 division which breaks i386 builds
NVMe: Use CMB for the IO SQes if available
NVMe: Unify SQ entry writing and doorbell ringing
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
None of the implementations currently use it. The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch enables the PRCHK and reftag support when PRACT bit is set, and
block layer integrity is disabled.
Signed-off-by: Alok Pandey <pandey.alok@samsung.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Hi,
After commit f70ced0917 (blk-mq: support per-distpatch_queue flush
machinery), the mtip32xx driver may oops upon module load due to walking
off the end of an array in mtip_init_cmd. On initialization of the
flush_rq, init_request is called with request_index >= the maximum queue
depth the driver supports. For mtip32xx, this value is used to index
into an array. What this means is that the driver will walk off the end
of the array, and either oops or cause random memory corruption.
The problem is easily reproduced by doing modprobe/rmmod of the mtip32xx
driver in a loop. I can typically reproduce the problem in about 30
seconds.
Now, in the case of mtip32xx, it actually doesn't support flush/fua, so
I think we can simply return without doing anything. In addition, no
other mq-enabled driver does anything with the request_index passed into
init_request(), so no other driver is affected. However, I'm not really
sure what is expected of drivers. Ming, what did you envision drivers
would do when initializing the flush requests?
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer. This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Note: This patch is based on original work of Arianna's internship for
GNOME's Outreach Program for Women.
Only one hardware queue is used now, so there is no significant
performance change
The legacy non-mq code is deleted completely which is the same as other
drivers like virtio, mtip, and nvme.
Also dropped one unnecessary holding of info->io_lock when calling
blk_mq_stop_hw_queues().
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@fb.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Removing unreachable code from nvme_abort_req as nvme_submit_cmd has no
failure status to return.
Signed-off-by: Sunad Bhandary <sunad.s@samsung.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The SG_GAPS queue flag caused checks for bio vector alignment against
PAGE_SIZE, but the device may have different constraints. This patch
adds a queue limits so a driver with such constraints can set to allow
requests that would have been unnecessarily split. The new gaps check
takes the request_queue as a parameter to simplify the logic around
invoking this function.
This new limit makes the queue flag redundant, so removing it and
all usage. Device-mappers will inherit the correct settings through
blk_stack_limits().
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit 34b48db66e.
That commit caused performance regressions for streaming I/O
workloads on a number of different storage devices, from
SATA disks to external RAID arrays. It also managed to
trip up some buggy firmware in at least one drive, causing
data corruption.
The next patch will bump the default max_sectors_kb value to
1280, which will accommodate a 10-data-disk stripe write
with chunk size 128k. In the testing I've done using iozone,
fio, and aio-stress, a value of 1280 does not show a big
performance difference from 512. This will hopefully still
help the software RAID setup that Christoph saw the original
performance gains with while still not regressing other
storage configurations.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Controllers can perform optional subsystem resets as introduced in NVMe
1.1. This patch adds an IOCTL to trigger the subsystem reset by writing
"NVMe" to the NSSR register.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Controllers part of an NVMe subsystem may be reset by any other controller
in the subsystem. If the device is capable of subsystem resets, this
patch adds detection for such events and performs appropriate controller
initialization upon subsystem reset detection.
The register bit is a RW1C type, so the driver needs to write a 1 to the
status bit to clear the subsystem reset occured bit during initialization.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The logic in nvme_dev_add to enumerate namespaces was moved to
nvme_dev_scan. When moved, the nn variable is no longer used. This patch
removes it.
Fixes: a5768aai ("NVMe: Automatic namespace rescan")
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This sets the queue's max segment size to match the device's
capabilities. The default of 128 is usable until a device's transfer
capability exceeds 512k, assuming a device page size of 4k. Many nvme
devices exceed that transfer limit, so this lets the block layer know what
kind of commands it to allow to form rather than unnecessarily split them.
One additional segment is added to account for a transfer that may start
in the middle of a page.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The flags variable is used as u32 variable. This patch changes the type
to be u32.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch renames functions so that it is clear what the function does.
Otherwise it is not directly understandable what for example 'do_it' means.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add some debugfs files that help to understand the internal state of
NBD. This exports the different sizes, flags, tasks and so on.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch uses nbd->task_recv to determine the value of the previously
used variable 'pid' for sysfs.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This message was a warning without a reason. This patch moves it into
nbd_clear_que and transforms it to a debug message.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of a variable 'harderror' we can simply try to correctly
propagate errors to the userspace.
This patch removes the harderror variable and passes errors through
error pointers and nbd_do_it back to the userspace.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch restructures sock_shutdown to avoid having the main code path
in an if block.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Move the conditional lock from sock_shutdown into the surrounding code.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
At the moment the nbd timeout just detects hanging tcp operations. This
is not enough to detect a hanging or bad connection as expected of a
timeout.
This patch redesigns the timeout detection to include some more cases.
The timeout is now in relation to replies from the server. If the server
does not send replies within the timeout the connection will be shut
down.
The patch adds a continous timer 'timeout_timer' that is setup in one of
two cases:
- The request list is empty and we are sending the first request out to
the server. We want to have a reply within the given timeout,
otherwise we consider the connection to be dead.
- A server response was received. This means the server is still
communicating with us. The timer is reset to the timeout value.
The timer is not stopped if the list becomes empty. It will just trigger
a timeout which will directly leave the handling routine again as the
request list is empty.
The whole patch does not use any additional explicit locking. The
list_empty() calls are safe to be used concurrently. The timer is locked
internally as we just use mod_timer and del_timer_sync().
The patch is based on the idea of Michal Belczyk with a previous
different implementation.
Cc: Michal Belczyk <belczyk@bsd.krakow.pl>
Cc: Hermann Lauer <Hermann.Lauer@iwr.uni-heidelberg.de>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Tested-by: Hermann Lauer <Hermann.Lauer@iwr.uni-heidelberg.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
zram_meta_alloc() constructs a pool name for zs_create_pool() call as
snprintf(pool_name, sizeof(pool_name), "zram%d", device_id);
However, it defines pool name buffer to be only 8 bytes long (minus
trailing zero), which means that we can have only 1000 pool names: zram0
-- zram999.
With CONFIG_ZSMALLOC_STAT enabled an attempt to create a device zram1000
can fail if device zram100 already exists, because snprintf() will
truncate new pool name to zram100 and pass it debugfs_create_dir(),
causing:
debugfs dir <zram100> creation failed
zram: Error creating memory pool
... and so on.
Fix it by passing zram->disk->disk_name to zram_meta_alloc() instead of
divice_id. We construct zram%d name earlier and keep it as a ->disk_name,
no need to snprintf() it again.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull xen block driver fixes from Jens Axboe:
"A few small bug fixes for xen-blk{front,back} that have been sitting
over my vacation"
* 'for-linus' of git://git.kernel.dk/linux-block:
xen-blkback: replace work_pending with work_busy in purge_persistent_gnt()
xen-blkfront: don't add indirect pages to list when !feature_persistent
xen-blkfront: introduce blkfront_gather_backend_features()
As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.
But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.
We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.
Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:
* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns
Some others are almost certainly safe to remove now, but will be left
for future patches.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For write/discard obj_requests that involved a copyup method call, the
opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is
rbd_img_obj_copyup_callback(). The latter frees copyup pages, sets
->xferred and delegates to rbd_img_obj_callback(), the "normal" image
object callback, for reporting to block layer and putting refs.
rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op,
which means obj_request is marked done in rbd_osd_trivial_callback(),
*before* ->callback is invoked and rbd_img_obj_copyup_callback() has
a chance to run. Marking obj_request done essentially means giving
rbd_img_obj_callback() a license to end it at any moment, so if another
obj_request from the same img_request is being completed concurrently,
rbd_img_obj_end_request() may very well be called on such prematurally
marked done request:
<obj_request-1/2 reply>
handle_reply()
rbd_osd_req_callback()
rbd_osd_trivial_callback()
rbd_obj_request_complete()
rbd_img_obj_copyup_callback()
rbd_img_obj_callback()
<obj_request-2/2 reply>
handle_reply()
rbd_osd_req_callback()
rbd_osd_trivial_callback()
for_each_obj_request(obj_request->img_request) {
rbd_img_obj_end_request(obj_request-1/2)
rbd_img_obj_end_request(obj_request-2/2) <--
}
Calling rbd_img_obj_end_request() on such a request leads to trouble,
in particular because its ->xfferred is 0. We report 0 to the block
layer with blk_update_request(), get back 1 for "this request has more
data in flight" and then trip on
rbd_assert(more ^ (which == img_request->obj_request_count));
with rhs (which == ...) being 1 because rbd_img_obj_end_request() has
been called for both requests and lhs (more) being 1 because we haven't
got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet.
To fix this, leverage that rbd wants to call class methods in only two
cases: one is a generic method call wrapper (obj_request is standalone)
and the other is a copyup (obj_request is part of an img_request). So
make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke
rbd_img_obj_copyup_callback() from it if obj_request is part of an
img_request, similar to how CEPH_OSD_OP_READ handler invokes
rbd_img_obj_request_read_callback().
Since rbd_img_obj_copyup_callback() is now being called from the OSD
request callback (only), it is renamed to rbd_osd_copyup_callback().
Cc: Alex Elder <elder@linaro.org>
Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 3.18
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Konrad writes:
"There are three bugs that have been found in the xen-blkfront (and
backend). Two of them have the stable tree CC-ed. They have been found
where an guest is migrating to a host that is missing
'feature-persistent' support (from one that has it enabled). We end up
hitting an BUG() in the driver code."
The BUG_ON() in purge_persistent_gnt() will be triggered when previous purge
work haven't finished.
There is a work_pending() before this BUG_ON, but it doesn't account if the work
is still currently running.
CC: stable@vger.kernel.org
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We should consider info->feature_persistent when adding indirect page to list
info->indirect_pages, else the BUG_ON() in blkif_free() would be triggered.
When we are using persistent grants the indirect_pages list
should always be empty because blkfront has pre-allocated enough
persistent pages to fill all requests on the ring.
CC: stable@vger.kernel.org
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
There is a bug when migrate from !feature-persistent host to feature-persistent
host, because domU still thinks new host/backend doesn't support persistent.
Dmesg like:
backed has not unmapped grant: 839
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 839
The fix is to recheck feature-persistent of new backend in blkif_recover().
See: https://lkml.org/lkml/2015/5/25/469
As Roger suggested, we can split the part of blkfront_connect that checks for
optional features, like persistent grants, indirect descriptors and
flush/barrier features to a separate function and call it from both
blkfront_connect and blkif_recover
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
end_cmd finishes request associated with nullb_cmd struct, so we
should save pointer to request_queue in a local variable before
calling end_cmd.
The problem was causes general protection fault with slab poisoning
enabled.
Fixes: 8b70f45e2e ("null_blk: restart request processing on completion handler")
Tested-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Uses div_u64 for u64 division and round_down, a bitwise operation,
instead of rounddown, which uses a modulus.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some controllers have a controller-side memory buffer available for use
for submissions, completions, lists, or data.
If a CMB is available, the entire CMB will be ioremapped and it will
attempt to map the IO SQes onto the CMB. The queues will be shrunk as
needed. The CMB will not be used if the queue depth is shrunk below some
threshold where it may have reduced performance over a larger queue
in system memory.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch changes sq_cmd writers to instead create their command on
the stack. __nvme_submit_cmd copies the sq entry to the queue and writes
the doorbell.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>