Pull libnvdimm updates from Dan Williams:
"Primarily just the virtio_pmem driver:
- virtio_pmem
The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX
mechanisms to access a host-file with host-page-cache. It arranges
for MAP_SYNC to be disabled and instead triggers a host fsync()
when a 'write-cache flush' command is sent to the virtual disk
device.
- Miscellaneous small fixups"
* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
virtio_pmem: fix sparse warning
xfs: disable map_sync for async flush
ext4: disable map_sync for async flush
dax: check synchronous mapping is supported
dm: enable synchronous dax
libnvdimm: add dax_dev sync flag
virtio-pmem: Add virtio pmem driver
libnvdimm: nd_region flush callback support
libnvdimm, namespace: Drop uuid_t implementation detail
There is a hang issue while using fio to do some basic test. The issue
can be easily reproduced using the below script:
while true
do
fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \
-size=1G -iodepth=64 -name=uring --filename=/dev/zero
done
After several minutes (or more), fio would block at
io_uring_enter->io_cqring_wait in order to waiting for previously
committed sqes to be completed and can't return to user anymore until
we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at
io_ring_ctx_wait_and_kill with a backtrace like this:
[54133.243816] Call Trace:
[54133.243842] __schedule+0x3a0/0x790
[54133.243868] schedule+0x38/0xa0
[54133.243880] schedule_timeout+0x218/0x3b0
[54133.243891] ? sched_clock+0x9/0x10
[54133.243903] ? wait_for_completion+0xa3/0x130
[54133.243916] ? _raw_spin_unlock_irq+0x2c/0x40
[54133.243930] ? trace_hardirqs_on+0x3f/0xe0
[54133.243951] wait_for_completion+0xab/0x130
[54133.243962] ? wake_up_q+0x70/0x70
[54133.243984] io_ring_ctx_wait_and_kill+0xa0/0x1d0
[54133.243998] io_uring_release+0x20/0x30
[54133.244008] __fput+0xcf/0x270
[54133.244029] ____fput+0xe/0x10
[54133.244040] task_work_run+0x7f/0xa0
[54133.244056] do_exit+0x305/0xc40
[54133.244067] ? get_signal+0x13b/0xbd0
[54133.244088] do_group_exit+0x50/0xd0
[54133.244103] get_signal+0x18d/0xbd0
[54133.244112] ? _raw_spin_unlock_irqrestore+0x36/0x60
[54133.244142] do_signal+0x34/0x720
[54133.244171] ? exit_to_usermode_loop+0x7e/0x130
[54133.244190] exit_to_usermode_loop+0xc0/0x130
[54133.244209] do_syscall_64+0x16b/0x1d0
[54133.244221] entry_SYSCALL_64_after_hwframe+0x49/0xbe
The reason is that we had added a req to ctx->pending_async at the very
end, but it didn't get a chance to be processed. How could this happen?
fio#cpu0 wq#cpu1
io_add_to_prev_work io_sq_wq_submit_work
atomic_read() <<< 1
atomic_dec_return() << 1->0
list_empty(); <<< true;
list_add_tail()
atomic_read() << 0 or 1?
As atomic_ops.rst states, atomic_read does not guarantee that the
runtime modification by any other thread is visible yet, so we must take
care of that with a proper implicit or explicit memory barrier.
This issue was detected with the help of Jackie's <liuyun01@kylinos.cn>
Fixes: 31b5151064 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a per-transport maximum limit in the socket case, and add
helpers to allow the NFSv4 code to discover that limit.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge more updates from Andrew Morton:
"VM:
- z3fold fixes and enhancements by Henry Burns and Vitaly Wool
- more accurate reclaimed slab caches calculations by Yafang Shao
- fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
Christoph Hellwig
- !CONFIG_MMU fixes by Christoph Hellwig
- new novmcoredd parameter to omit device dumps from vmcore, by
Kairui Song
- new test_meminit module for testing heap and pagealloc
initialization, by Alexander Potapenko
- ioremap improvements for huge mappings, by Anshuman Khandual
- generalize kprobe page fault handling, by Anshuman Khandual
- device-dax hotplug fixes and improvements, by Pavel Tatashin
- enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V
- add pte_devmap() support for arm64, by Robin Murphy
- unify locked_vm accounting with a helper, by Daniel Jordan
- several misc fixes
core/lib:
- new typeof_member() macro including some users, by Alexey Dobriyan
- make BIT() and GENMASK() available in asm, by Masahiro Yamada
- changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
code generation, by Alexey Dobriyan
- rbtree code size optimizations, by Michel Lespinasse
- convert struct pid count to refcount_t, by Joel Fernandes
get_maintainer.pl:
- add --no-moderated switch to skip moderated ML's, by Joe Perches
misc:
- ptrace PTRACE_GET_SYSCALL_INFO interface
- coda updates
- gdb scripts, various"
[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
fs/select.c: use struct_size() in kmalloc()
mm: add account_locked_vm utility function
arm64: mm: implement pte_devmap support
mm: introduce ARCH_HAS_PTE_DEVMAP
mm: clean up is_device_*_page() definitions
mm/mmap: move common defines to mman-common.h
mm: move MAP_SYNC to asm-generic/mman-common.h
device-dax: "Hotremove" persistent memory that is used like normal RAM
mm/hotplug: make remove_memory() interface usable
device-dax: fix memory and resource leak if hotplug fails
include/linux/lz4.h: fix spelling and copy-paste errors in documentation
ipc/mqueue.c: only perform resource calculation if user valid
include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
scripts/gdb: add helpers to find and list devices
scripts/gdb: add lx-genpd-summary command
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
kernel/pid.c: convert struct pid count to refcount_t
drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
...
btrfs_get_io_geometry() calls btrfs_get_chunk_map() to acquire a reference
on a extent_map, but on normal operation it does not drop this reference
anymore.
This leads to excessive kmemleak reports.
Always call free_extent_map(), not just in the error case.
Fixes: 5f1411265e ("btrfs: Introduce btrfs_io_geometry infrastructure")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If CONFIG_BTRFS_FS is y and CONFIG_LIBCRC32C is m,
building fails:
fs/btrfs/super.o: In function `btrfs_mount_root':
super.c:(.text+0xb7f9): undefined reference to `crc32c_impl'
fs/btrfs/super.o: In function `init_btrfs_fs':
super.c:(.init.text+0x3465): undefined reference to `crc32c_impl'
fs/btrfs/extent-tree.o: In function `hash_extent_data_ref':
extent-tree.c:(.text+0xe60): undefined reference to `crc32c'
extent-tree.c:(.text+0xe78): undefined reference to `crc32c'
extent-tree.c:(.text+0xe8b): undefined reference to `crc32c'
fs/btrfs/dir-item.o: In function `btrfs_insert_xattr_item':
dir-item.c:(.text+0x291): undefined reference to `crc32c'
fs/btrfs/dir-item.o: In function `btrfs_insert_dir_item':
dir-item.c:(.text+0x429): undefined reference to `crc32c'
Select LIBCRC32C to fix it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: d5178578bc ("btrfs: directly call into crypto framework for checksumming")
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As btrfs(5) specified:
Note
If nodatacow or nodatasum are enabled, compression is disabled.
If NODATASUM or NODATACOW set, we should not compress the extent.
Normally NODATACOW is detected properly in run_delalloc_range() so
compression won't happen for NODATACOW.
However for NODATASUM we don't have any check, and it can cause
compressed extent without csum pretty easily, just by:
mkfs.btrfs -f $dev
mount $dev $mnt -o nodatasum
touch $mnt/foobar
mount -o remount,datasum,compress $mnt
xfs_io -f -c "pwrite 0 128K" $mnt/foobar
And in fact, we have a bug report about corrupted compressed extent
without proper data checksum so even RAID1 can't recover the corruption.
(https://bugzilla.kernel.org/show_bug.cgi?id=199707)
Running compression without proper checksum could cause more damage when
corruption happens, as compressed data could make the whole extent
unreadable, so there is no need to allow compression for
NODATACSUM.
The fix will refactor the inode compression check into two parts:
- inode_can_compress()
As the hard requirement, checked at btrfs_run_delalloc_range(), so no
compression will happen for NODATASUM inode at all.
- inode_need_compress()
As the soft requirement, checked at btrfs_run_delalloc_range() and
compress_file_range().
Reported-by: James Harvey <jamespharvey20@gmail.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move internal function declarations out of fs/internal.h into
include/linux/iomap.h so that our transition is complete.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the main iteration code into a separate file so that we can group
related functions in a single file instead of having a single enormous
source file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the buffered IO code into a separate file so that we can group
related functions in a single file instead of having a single enormous
source file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the direct IO code into a separate file so that we can group
related functions in a single file instead of having a single enormous
source file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the SEEK_HOLE/SEEK_DATA code into a separate file so that we can
group related functions in a single file instead of having a single
enormous source file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the file mapping reporting code (FIEMAP/FIBMAP) into a separate
file so that we can group related functions in a single file instead of
having a single enormous source file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the swapfile activation code into a separate file so that we can
group related functions in a single file instead of having a single
enormous source file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We used to need rather convoluted ordering trickery to guarantee
that dput() of ex-mountpoints happens before the final mntput()
of the same. Since we don't need that anymore, there's no point
playing with fs_pin for that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Lift getting the original mount (dentry is actually not needed at all)
of the mountpoint into the callers - to do_move_mount() and pivot_root()
level. That simplifies the cleanup in those and allows to get saner
arguments for attach_mnt_recursive().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using dput_to_list() to shift the contributing reference from ->mnt_mountpoint
to ->mnt_mp->m_dentry. Dentries are dropped (with dput_to_list()) as soon
as struct mountpoint is destroyed; in cases where we are under namespace_sem
we use the global list, shrinking it in namespace_unlock(). In case of
detaching stuck MNT_LOCKed children at final mntput_no_expire() we use a local
list and shrink it ourselves. ->mnt_ex_mountpoint crap is gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
RocksDB can hang indefinitely when using a DAX file. This is due to
a bug in the XArray conversion when handling a PMD fault and finding a
PTE entry. We use the wrong index in the hash and end up waiting on
the wrong waitqueue.
There's actually no need to wait; if we find a PTE entry while looking
for a PMD entry, we can return immediately as we know we should fall
back to a PTE fault (which may not conflict with the lock held).
We reuse the XA_RETRY_ENTRY to signal a conflicting entry was found.
This value can never be found in an XArray while holding its lock, so
it does not create an ambiguity.
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/CAPcyv4hwHpX-MkUEqxwdTj7wCCZCN4RV-L4jsnuwLGyL_UEG4A@mail.gmail.com
Fixes: b15cd80068 ("dax: Convert page fault handlers to XArray")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Robert Barror <robert.barror@intel.com>
Reported-by: Seema Pandit <seema.pandit@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kmalloc(size, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190604164226.GA13823@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
strncpy() was used to copy a fixed size buffer. Since NUL-terminating
string is not required here, prefer a memcpy function. The generated
code (ppc32) remains the same.
Silence the following warning triggered using W=1:
fs/hfsplus/xattr.c:410:3: warning: 'strncpy' output truncated before terminating nul copying 4 bytes from a string of the same length [-Wstringop-truncation]
Link: http://lkml.kernel.org/r/20190529113341.11972-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for partial file caching in Coda. Every read, write
and mmap informs the userspace cache manager about what part of a file
is about to be accessed so that the cache manager can ensure the
relevant parts are available before the operation is allowed to proceed.
When a read or write operation completes, this is also reported to allow
the cache manager to track when partially cached content can be
released.
If the cache manager does not support partial file caching, or when the
entire file has been fetched into the local cache, the cache manager may
return an EOPNOTSUPP error to indicate that intent upcalls are no longer
necessary until the file is closed.
[akpm@linux-foundation.org: little whitespace fixup]
Link: http://lkml.kernel.org/r/20190618181301.6960-1-jaharkes@cs.cmu.edu
Signed-off-by: Pedro Cuadra <pjcuadra@gmail.com>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We exchange file timestamps with user space using psdev device
read/write operations with a fixed but architecture specific binary
layout.
On 32-bit systems, this uses a 'timespec' structure that is defined by
the C library to contain two 32-bit values for seconds and nanoseconds.
As we get ready for the year 2038 overflow of the 32-bit signed seconds,
the kernel now uses 64-bit timestamps internally, and user space will do
the same change by changing the 'timespec' definition in the future.
Unfortunately, this breaks the layout of the coda_vattr structure, so we
need to redefine that in terms of something that does not change. I'm
introducing a new 'struct vtimespec' structure here that keeps the
existing layout, and the same change has to be done in the coda user
space copy of linux/coda.h before anyone can use that on a 32-bit
architecture with 64-bit time_t.
An open question is what should happen to actual times past y2038, as
they are now truncated to the last valid date when sent to user space,
and interpreted as pre-1970 times when a timestamp with the MSB set is
read back into the kernel. Alternatively, we could change the new
timespec64_to_coda()/coda_to_timespec64() functions to use a different
interpretation and extend the available range further to the future by
disallowing past timestamps. This would require more changes in the
user space side though.
Link: http://lkml.kernel.org/r/562b7324149461743e4fbe2fedbf7c242f7e274a.1558117389.git.jaharkes@cs.cmu.edu
Link: https://patchwork.kernel.org/patch/10474735/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Zhouyang Jia <jiazhouyang09@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
but this is not a correct behavior for proc. Since sysctl permission
check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
sense to use these values in u_[ug]id of proc inodes. In other words:
although uid/gid in the inode is not read during test_perm, the inode
logically belongs to the root of the namespace. I have confirmed this
with Eric Biederman at LPC and in this thread:
https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com
Consequences
============
Since the i_[ug]id values of proc nodes are not used for permissions
checks, this change usually makes no functional difference. However, it
causes an issue in a setup where:
* a namespace container is created without root user in container -
hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID
* container creator tries to configure it by writing /proc/sys files,
e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit
Kernel does not allow to open an inode for writing if its i_[ug]id are
invalid, making it impossible to write shmmax and thus - configure the
container.
Using a container with no root mapping is apparently rare, but we do use
this configuration at Google. Also, we use a generic tool to configure
the container limits, and the inability to write any of them causes a
failure.
History
=======
The invalid uids/gids in inodes first appeared due to 8175435777 (fs:
Update i_[ug]id_(read|write) to translate relative to s_user_ns).
However, AFAIK, this did not immediately cause any issues. The
inability to write to these "invalid" inodes was only caused by a later
commit 0bd23d09b8 (vfs: Don't modify inodes with a uid or gid unknown
to the vfs).
Tested: Used a repro program that creates a user namespace without any
mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
Before the change, it shows the overflow uid, with the change it's 0.
The overflow uid indicates that the uid in the inode is not correct and
thus it is not possible to open the file for writing.
Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com
Fixes: 0bd23d09b8 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs")
Signed-off-by: Radoslaw Burny <rburny@google.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: John Sperbeck <jsperbeck@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't repeat function signatures twice.
This is a kind-of-precursor for "struct proc_ops".
Note:
typeof(pde->proc_fops->...) ...;
can't be used because ->proc_fops is "const struct file_operations *".
"const" prevents assignment down the code and it can't be deleted in the
type system.
Link: http://lkml.kernel.org/r/20190529191110.GB5703@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2724273e8f ("vmcore: add API to collect hardware dump in
second kernel"), drivers are allowed to add device related dump data to
vmcore as they want by using the device dump API. This has a potential
issue, the data is stored in memory, drivers may append too much data
and use too much memory. The vmcore is typically used in a kdump kernel
which runs in a pre-reserved small chunk of memory. So as a result it
will make kdump unusable at all due to OOM issues.
So introduce new 'novmcoredd' command line option. User can disable
device dump to reduce memory usage. This is helpful if device dump is
using too much memory, disabling device dump could make sure a regular
vmcore without device dump data is still available.
[akpm@linux-foundation.org: tweak documentation]
[akpm@linux-foundation.org: vmcore.c needs moduleparam.h]
Link: http://lkml.kernel.org/r/20190528111856.7276-1-kasong@redhat.com
Signed-off-by: Kairui Song <kasong@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>