As Christoph said [1] "having this function seems
entirely pointless", let's kill those.
filesystem function name
ext2,f2fs,ext4,isofs,squashfs,cifs,... init_inodecache
In addition, add a necessary "rcu_barrier()" on exit_fs();
[1] https://lore.kernel.org/r/20190829101545.GC20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-9-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As Christoph said, "This looks like a really obsfucated
way to write:
return datamode == EROFS_INODE_FLAT_COMPRESSION ||
datamode == EROFS_INODE_FLAT_COMPRESSION_LEGACY; "
Although I had my own consideration, it's the right way for now.
[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-6-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As Christoph suggested "Please don't add __packed" [1],
remove all __packed except struct erofs_dirent here.
Note that all on-disk fields except struct erofs_dirent
(12 bytes with a 8-byte nid) in EROFS are naturally aligned.
[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-5-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make sure that attribute methods are not called after the item
has been removed from the tree. To do so, we
* at the point of no return in removals, grab ->frag_sem
exclusive and mark the fragment dead.
* call the methods of attributes with ->frag_sem taken
shared and only after having verified that the fragment is still
alive.
The main benefit is for method instances - they are
guaranteed that the objects they are accessing *and* all ancestors
are still there. Another win is that we don't need to bother
with extra refcount on config_item when opening a file -
the item will be alive for as long as it stays in the tree, and
we won't touch it/attributes/any associated data after it's
been removed from the tree.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
while filling the linux inode, using switch-case statement to check
the type of inode.
switch-case statement looks more clean here.
Signed-off-by: Pratik Shinde <pratikshinde320@gmail.com>
Reviewed-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190830095615.10995-1-pratikshinde320@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When performing rename operation with RENAME_WHITEOUT flag, we will
hold AGF lock to allocate or free extents in manipulating the dirents
firstly, and then doing the xfs_iunlink_remove() call last to hold
AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI. Hence
the ordering that is imposed by other parts of the code is AGI before
AGF. So we get an ABBA deadlock between the AGI and AGF here.
Process A:
Call trace:
? __schedule+0x2bd/0x620
schedule+0x33/0x90
schedule_timeout+0x17d/0x290
__down_common+0xef/0x125
? xfs_buf_find+0x215/0x6c0 [xfs]
down+0x3b/0x50
xfs_buf_lock+0x34/0xf0 [xfs]
xfs_buf_find+0x215/0x6c0 [xfs]
xfs_buf_get_map+0x37/0x230 [xfs]
xfs_buf_read_map+0x29/0x190 [xfs]
xfs_trans_read_buf_map+0x13d/0x520 [xfs]
xfs_read_agf+0xa6/0x180 [xfs]
? schedule_timeout+0x17d/0x290
xfs_alloc_read_agf+0x52/0x1f0 [xfs]
xfs_alloc_fix_freelist+0x432/0x590 [xfs]
? down+0x3b/0x50
? xfs_buf_lock+0x34/0xf0 [xfs]
? xfs_buf_find+0x215/0x6c0 [xfs]
xfs_alloc_vextent+0x301/0x6c0 [xfs]
xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
? _xfs_trans_bjoin+0x72/0xf0 [xfs]
xfs_dialloc+0x116/0x290 [xfs]
xfs_ialloc+0x6d/0x5e0 [xfs]
? xfs_log_reserve+0x165/0x280 [xfs]
xfs_dir_ialloc+0x8c/0x240 [xfs]
xfs_create+0x35a/0x610 [xfs]
xfs_generic_create+0x1f1/0x2f0 [xfs]
...
Process B:
Call trace:
? __schedule+0x2bd/0x620
? xfs_bmapi_allocate+0x245/0x380 [xfs]
schedule+0x33/0x90
schedule_timeout+0x17d/0x290
? xfs_buf_find+0x1fd/0x6c0 [xfs]
__down_common+0xef/0x125
? xfs_buf_get_map+0x37/0x230 [xfs]
? xfs_buf_find+0x215/0x6c0 [xfs]
down+0x3b/0x50
xfs_buf_lock+0x34/0xf0 [xfs]
xfs_buf_find+0x215/0x6c0 [xfs]
xfs_buf_get_map+0x37/0x230 [xfs]
xfs_buf_read_map+0x29/0x190 [xfs]
xfs_trans_read_buf_map+0x13d/0x520 [xfs]
xfs_read_agi+0xa8/0x160 [xfs]
xfs_iunlink_remove+0x6f/0x2a0 [xfs]
? current_time+0x46/0x80
? xfs_trans_ichgtime+0x39/0xb0 [xfs]
xfs_rename+0x57a/0xae0 [xfs]
xfs_vn_rename+0xe4/0x150 [xfs]
...
In this patch we move the xfs_iunlink_remove() call to
before acquiring the AGF lock to preserve correct AGI/AGF locking
order.
Signed-off-by: kaixuxia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Define a flags field for the AG geometry ioctl structure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a helper that validates the startblock is valid. This checks for a
non-zero block on the main device, but skips that check for blocks on
the realtime device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
we are not retaining dentries there anyway (simple_dentry_operations),
so d_delete()+dput() == d_drop()+dput()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rules for nd->root are messy:
* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
* if we have LOOKUP_RCU, it doesn't contribute to refcounts
* if nd->root.mnt is NULL, it doesn't contribute to refcounts
* otherwise it does contribute
terminate_walk() needs to drop the references if they are contributing.
So everything else should be careful not to confuse it, leading to
rather convoluted code.
It's easier to keep track of whether we'd grabbed the reference(s)
explicitly. Use a new flag for that. Don't bother with zeroing
nd->root.mnt on unlazy failures and in terminate_walk - it's not
needed anymore (terminate_walk() won't care and the next path_init()
will zero nd->root in !LOOKUP_ROOT case anyway).
Resulting rules for nd->root refcounts are much simpler: they are
contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Refcounted, hangs of configfs_dirent, created by operations that add
fragments to configfs tree (mkdir and configfs_register_{subsystem,group}).
Will be used in the next commit to provide exclusion between fragment
removal and ->show/->store calls.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
revert cc57c07343 "configfs: fix registered group removal"
It was an attempt to handle something that fundamentally doesn't
work - configfs_register_group() should never be done in a part
of tree that can be rmdir'ed. And in mainline it never had been,
so let's not borrow trouble; the fix was racy anyway, it would take
a lot more to make that work and desired semantics is not clear.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
simplifies the ->read()/->write()/->release() instances nicely
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We want to throw out the attrbute if it refers to the mounted on fileid,
and not the real fileid. However we do not want to block cache consistency
updates from NFSv4 writes.
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: 7e10cc25bf ("NFS: Don't refresh attributes with mounted-on-file...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Make afs_permission() and afs_d_revalidate() do initial checks in RCU-mode
pathwalk to reduce latency in pathwalk elements that get done multiple
times. We don't need to query the server unless we've received a
notification from it that something has changed or the callback has
expired.
This requires that we can request a key and check permits under RCU
conditions if we need to.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide an RCU-capable key lookup function. We don't want to call
afs_request_key() in RCU-mode pathwalk as request_key() might sleep, even if
we don't ask it to construct anything as it might find a key that is currently
undergoing construction.
Signed-off-by: David Howells <dhowells@redhat.com>
Use afs_extract_discard() rather than iov_iter_discard() as the former is a
wrapper for the latter, providing a place to put tracepoints.
Signed-off-by: David Howells <dhowells@redhat.com>
fs/afs/fsclient.c:18:29: warning:
afs_zero_fid defined but not used [-Wunused-const-variable=]
It is never used since commit 025db80c9e ("afs: Trace
the initiation and completion of client calls")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
fs/afs/volume.c:15:26: warning:
afs_voltypes defined but not used [-Wunused-const-variable=]
It is not used since commit d2ddc776a4 ("afs: Overhaul
volume and server record caching and fileserver rotation")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When getting fscrypt policy via EXT4_IOC_GET_ENCRYPTION_POLICY, if
encryption feature is off, it's better to return EOPNOTSUPP instead of
ENODATA, so let's add ext4_has_feature_encrypt() to do the check for
that.
This makes it so that all fscrypt ioctls consistently check for the
encryption feature, and makes ext4 consistent with f2fs in this regard.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[EB - removed unneeded braces, updated the documentation, and
added more explanation to commit message]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This function isn't a macro anymore, so remove various superflous braces,
and explicit cast that is done implicitly due to the return value, use
a normal if statement instead of trying to squeeze everything together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Setting the DAX flag on the directory of a file system that is not on a
DAX capable device makes as little sense as setting it on a regular file
on the same file system.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Hole puching currently evicts pages from page cache and then goes on to
remove blocks from the inode. This happens under both XFS_IOLOCK_EXCL
and XFS_MMAPLOCK_EXCL which provides appropriate serialization with
racing reads or page faults. However there is currently nothing that
prevents readahead triggered by fadvise() or madvise() from racing with
the hole punch and instantiating page cache page after hole punching has
evicted page cache in xfs_flush_unmap_range() but before it has removed
blocks from the inode. This page cache page will be mapping soon to be
freed block and that can lead to returning stale data to userspace or
even filesystem corruption.
Fix the problem by protecting handling of readahead requests by
XFS_IOLOCK_SHARED similarly as we protect reads.
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing file lookups and checking for permissions, we end up in
xfs_get_acl() to see if there are any ACLs on the inode. This
requires and xattr lookup, and to do that we have to supply a buffer
large enough to hold an maximum sized xattr.
On workloads were we are accessing a wide range of cache cold files
under memory pressure (e.g. NFS fileservers) we end up spending a
lot of time allocating the buffer. The buffer is 64k in length, so
is a contiguous multi-page allocation, and if that then fails we
fall back to vmalloc(). Hence the allocation here is /expensive/
when we are looking up hundreds of thousands of files a second.
Initial numbers from a bpf trace show average time in xfs_get_acl()
is ~32us, with ~19us of that in the memory allocation. Note these
are average times, so there are going to be affected by the worst
case allocations more than the common fast case...
To avoid this, we could just do a "null" lookup to see if the ACL
xattr exists and then only do the allocation if it exists. This,
however, optimises the path for the "no ACL present" case at the
expense of the "acl present" case. i.e. we can halve the time in
xfs_get_acl() for the no acl case (i.e down to ~10-15us), but that
then increases the ACL case by 30% (i.e. up to 40-45us).
To solve this and speed up both cases, drive the xattr buffer
allocation into the attribute code once we know what the actual
xattr length is. For the no-xattr case, we avoid the allocation
completely, speeding up that case. For the common ACL case, we'll
end up with a fast heap allocation (because it'll be smaller than a
page), and only for the rarer "we have a remote xattr" will we have
a multi-page allocation occur. Hence the common ACL case will be
much faster, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The same code is used to copy do the attribute copying in three
different places. Consolidate them into a single function in
preparation from on-demand buffer allocation.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Because we repeat exactly the same code to get the remote attribute
value after both calls to xfs_attr3_leaf_getvalue() if it's a remote
attr. Just do it in xfs_attr3_leaf_getvalue() so the callers don't
have to care about it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Shortform, leaf and remote value attr value retrieval return
different values for success. This makes it more complex to handle
actual errors xfs_attr_get() as some errors mean success and some
mean failure. Make the return values consistent for success and
failure consistent for all attribute formats.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When a directory is growing rapidly, new blocks tend to get added at
the end of the directory. These end up at the end of the freespace
index, and when the directory gets large finding these new
freespaces gets expensive. The code does a linear search across the
frespace index from the first block in the directory to the last,
hence meaning the newly added space is the last index searched.
Instead, do a reverse order index search, starting from the last
block and index in the freespace index. This makes most lookups for
free space on rapidly growing directories O(1) instead of O(N), but
should not have any impact on random insert workloads because the
average search length is the same regardless of which end of the
array we start at.
The result is a major improvement in large directory grow rates:
create time(sec) / rate (files/s)
File count vanilla Prev commit Patched
10k 0.41 / 24.3k 0.42 / 23.8k 0.41 / 24.3k
20k 0.74 / 27.0k 0.76 / 26.3k 0.75 / 26.7k
100k 3.81 / 26.4k 3.47 / 28.8k 3.27 / 30.6k
200k 8.58 / 23.3k 7.19 / 27.8k 6.71 / 29.8k
1M 85.69 / 11.7k 48.53 / 20.6k 37.67 / 26.5k
2M 280.31 / 7.1k 130.14 / 15.3k 79.55 / 25.2k
10M 3913.26 / 2.5k 552.89 / 18.1k
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When running a "create millions inodes in a directory" test
recently, I noticed we were spending a huge amount of time
converting freespace block headers from disk format to in-memory
format:
31.47% [kernel] [k] xfs_dir2_node_addname
17.86% [kernel] [k] xfs_dir3_free_hdr_from_disk
3.55% [kernel] [k] xfs_dir3_free_bests_p
We shouldn't be hitting the best free block scanning code so hard
when doing sequential directory creates, and it turns out there's
a highly suboptimal loop searching the the best free array in
the freespace block - it decodes the block header before checking
each entry inside a loop, instead of decoding the header once before
running the entry search loop.
This makes a massive difference to create rates. Profile now looks
like this:
13.15% [kernel] [k] xfs_dir2_node_addname
3.52% [kernel] [k] xfs_dir3_leaf_check_int
3.11% [kernel] [k] xfs_log_commit_cil
And the wall time/average file create rate differences are
just as stark:
create time(sec) / rate (files/s)
File count vanilla patched
10k 0.41 / 24.3k 0.42 / 23.8k
20k 0.74 / 27.0k 0.76 / 26.3k
100k 3.81 / 26.4k 3.47 / 28.8k
200k 8.58 / 23.3k 7.19 / 27.8k
1M 85.69 / 11.7k 48.53 / 20.6k
2M 280.31 / 7.1k 130.14 / 15.3k
The larger the directory, the bigger the performance improvement.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Simplify the logic in xfs_dir2_node_addname_int() by factoring out
the free block index lookup code that finds a block with enough free
space for the entry to be added. The code that is moved gets a major
cleanup at the same time, but there is no algorithm change here.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Factor out the code that adds a data block to a directory from
xfs_dir2_node_addname_int(). This makes the code flow cleaner and
more obvious and provides clear isolation of upcoming optimsations.
Signed-off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This gets rid of the need for a forward declaration of the static
function xfs_dir2_addname_int() and readies the code for factoring
of xfs_dir2_addname_int().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Iterator functions already use 0 to signal "continue iterating", so get
rid of the #defines and just do it directly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The former has no users left; the latter was only to get LOOKUP_...
values to remapper in audit_inode() and that's an ex-parrot now.
All places that use symbols from namei.h include it either directly
or (in a few cases) via a local header, like fs/autofs/autofs_i.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cgroup foreign inode handling has quite a bit of heuristics and
internal states which sometimes makes it difficult to understand
what's going on. Add tracepoints to improve visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As Joe Perches suggested [1],
err = bio_add_page(bio, page, PAGE_SIZE, 0);
- if (unlikely(err != PAGE_SIZE)) {
+ if (err != PAGE_SIZE) {
err = -EFAULT;
goto err_out;
}
The initial assignment to err is odd as it's not
actually an error value -E<FOO> but a int size
from a unsigned int len.
Here the return is either 0 or PAGE_SIZE.
This would be more legible to me as:
if (bio_add_page(bio, page, PAGE_SIZE, 0) != PAGE_SIZE) {
err = -EFAULT;
goto err_out;
}
[1] https://lore.kernel.org/r/74c4784319b40deabfbaea92468f7e3ef44f1c96.camel@perches.com/
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190829171741.225219-1-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use -ECANCELED to signal "stop iterating" instead of these magical
*_ITER_ABORT values, since it's duplicative.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl1oVAcACgkQiiy9cAdy
T1FANAv+LU2t966OYu3nEfuZWVLna50HvbmTzPLmL0ETN9FonUcc+Th+HDGNmDfs
m0fn0J86x2o4wHAzZnJZSgiqxIAy9O5VHpmObQSy6RWF1tNZXOsuhrRm09gHfdpq
MenMyP93WWpmeTFUVqKEfpdN2lGwcOfZ3B4eF2W962BBiezhyKwrTX16KD/VtdVE
MdyZOtL+ythx5zbQQLWPYbWbWuRPPE7Ic+056sepqpk3basawvcfH3LZgSkt2nFr
QgN11PBx242MHI8x6i40SekHN5qpqtlqYCTKfZd45TVE1tC/Y197+NIlrLm89hW3
6qDVf8OfDYUdufYI09uP0cpBrsJsNADLEEF2PJyh6ePjjWTSdgGc8BqOqgm8p4GS
LdKZOl6Qz8GFuXPqhLXdlgC7La4qFEO6I+9iExE4XmjA0tshv4Y4O79yBMapmCOL
U2V7I5kxvmx8dO60fZnovDa3DgwwMPGMPY8ug3+KOX1a5CfhYz1g00NtiWAA97A2
R9GQSLBb
=u7jL
-----END PGP SIGNATURE-----
Merge tag '5.3-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"A few small SMB3 fixes, and a larger one to fix various older string
handling functions"
* tag '5.3-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module number
cifs: replace various strncpy with strscpy and similar
cifs: Use kzfree() to zero out the password
cifs: set domainName when a domain-key is used in multiuser
xfs_trans_log_buf() takes a final argument of the last byte to
log in the buffer; b_length is in basic blocks, so this isn't
the correct last byte. Fix it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_rmap_irec_offset_unpack, we should always clear the contents of
rm_flags before we begin unpacking the encoded (ondisk) offset into the
incore rm_offset and incore rm_flags fields. Remove the open-coded
field zeroing as this encourages api misuse.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the return value from the functions that schedule deferred bmap
operations since they never fail and do not return status.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the return value from the functions that schedule deferred
refcount operations since they never fail and do not return status.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the return value from the functions that schedule deferred rmap
operations since they never fail and do not return status.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This function doesn't use the @state parameter, so get rid of it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In xfs_bmbt_diff_two_keys, we perform a signed int64_t subtraction with
two unsigned 64-bit quantities. If the second quantity is actually the
"maximum" key (all ones) as used in _query_all, the subtraction
effectively becomes addition of two positive numbers and the function
returns incorrect results. Fix this with explicit comparisons of the
unsigned values. Nobody needs this now, but the online repair patches
will need this to work properly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The xfs_rmap_has_other_keys helper aborts the iteration as soon as it
has an answer. Don't let this abort leak out to callers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In xfs_ialloc_setup_geometry, it's possible for a malicious/corrupt fs
image to set an unreasonably large value for sb_inopblog which will
cause ialloc_blks to be zero. If sb_imax_pct is also set, this results
in a division by zero error in the second do_div call. Therefore, force
maxicount to zero if ialloc_blks is zero.
Note that the kernel metadata verifiers will catch the garbage inopblog
value and abort the fs mount long before it tries to set up the inode
geometry; this is needed to avoid a crash in xfs_db while setting up the
xfs_mount structure.
Found by fuzzing sb_inopblog to 122 in xfs/350.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Using strscpy is cleaner, and avoids some problems with
handling maximum length strings. Linus noticed the
original problem and Aurelien pointed out some additional
problems. Fortunately most of this is SMB1 code (and
in particular the ASCII string handling older, which
is less common).
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
It's safer to zero out the password so that it can never be disclosed.
Fixes: 0c219f5799c7 ("cifs: set domainName when a domain-key is used in multiuser")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ: 1710429
When we use a domain-key to authenticate using multiuser we must also set
the domainnmame for the new volume as it will be used and passed to the server
in the NTLMSSP Domain-name.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Highlights include:
Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidat
ed"
Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort for
soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl1lgz4ACgkQZwvnipYK
APIsHhAApqaVaGzwfeR87zq+QaaVOzYzejyvFgs3wh/Lc5xPH+SlQ6NxLbs8ppdT
srrOHV9E2MA4JgqoHaIBMTqWacQ0UfQQ/6qLEFCrps9/0QHs7fg0CAHS5emmgk2v
rD6Mezr5Nx8h5/QJCBEZXfas5lxsICz1EYJ4Pk8QT6IoyeC+fvarGZKvzIQJ3KDN
8yrdv5kCVtN7noREf1KDIqIlYvFbIEoOoglNA40G49e1ffT9Oz6qzTcg19HFO50x
eAIxc9u4KCUY/ASCvcv9biQ5200l7QSCqmR7/Xlj/+4aClKp6Ay058j0awxtHHDy
NlZt6V3XGlm1/SVpvtU/XXWcyJmQwX7kOVIEYOFmt+lEqC7ZBzWEpAaJ8h4DMLLc
PIxIWBSmXNxp6LPNI0dZFf7O6UZ3ZMRacav+HHu7mjWolEB22f4jQJs+RxNhnfLU
fg180YWBMX4V/98S7iigxZkRd+qqQhddYtku+o+bp3h4m6mVrrYNm11J0o0GWQWf
Lio9nlkLq9hkYpdBwkH4PtIv3b+O5f9yhfEYn15eF27Ru0Bob0+DiBkzlflcrJve
W2VfNAj+jxP3Wg0QAI40BSqUB3b+zVtZW5FenAUEK7NxhhPi6jrIsVhhVgGFZIAd
i1xwYUg6fDjielhGOxMTF66ilvduA9uBCFAnTD3iSBoZmF63vew=
=YHhU
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was
invalidat ed"
Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort
for soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice"
* tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: remove set but not used variable 'mapping'
NFSv2: Fix write regression
NFSv2: Fix eof handling
NFS: Fix writepage(s) error handling to not report errors twice
NFS: Fix spurious EIO read errors
pNFS/flexfiles: Don't time out requests on hard mounts
SUNRPC: Handle connection breakages correctly in call_status()
Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated"
SUNRPC: Handle EADDRINUSE and ENOBUFS correctly
pNFS/flexfiles: Turn off soft RPC calls
SUNRPC: Don't handle errors if the bind/connect succeeded
NFS: On fatal writeback errors, we need to call nfs_inode_remove_request()
NFS: Fix initialisation of I/O result struct in nfs_pgio_rpcsetup
NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0
NFSv4/pnfs: Fix a page lock leak in nfs_pageio_resend()
NFSv4: Fix return value in nfs_finish_open()
NFSv4: Fix return values for nfs4_file_open()
NFS: Don't refresh attributes with mounted-on-file information
Both the sq and the cq rings have sizes just over a power of two, and
the sq ring is significantly smaller. By bundling them in a single
alllocation, we get the sq ring for free.
This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean
the same thing. If we indicate this to userspace, we can save a mmap
call.
Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().
This is part a tree-wide conversion, as described in commit fc1d8e7cca
("mm: introduce put_user_page*(), placeholder versions").
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement cgroup_writeback_by_id() which initiates cgroup writeback
from bdi and memcg IDs. This will be used by memcg foreign inode
flushing.
v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating
spurious wbs.
v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort
flushing while avoding possible livelocks.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wb_completion is used to track writeback completions. We want to use
it from memcg side for foreign inode flushes. This patch updates it
to remember the target waitq instead of assuming bdi->wb_waitq and
expose it outside of fs-writeback.c.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/nfs/write.c: In function nfs_page_async_flush:
fs/nfs/write.c:609:24: warning: variable mapping set but not used [-Wunused-but-set-variable]
It is not use since commit aefb623c422e ("NFS: Fix
writepage(s) error handling to not report errors twice")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure we update the write result count on success, since the
RPC call itself does not do so.
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
If we received a reply from the server with a zero length read and
no error, then that implies we are at eof.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The inode block mapping scrub function does more work for btree format
extent maps than is absolutely necessary -- first it will walk the bmbt
and check all the entries, and then it will load the incore tree and
check every entry in that tree, possibly for a second time.
Simplify the code and decrease check runtime by separating the two
responsibilities. The bmbt walk will make sure the incore extent
mappings are loaded, check the shape of the bmap btree (via xchk_btree)
and check that every bmbt record has a corresponding incore extent map;
and the incore extent map walk takes all the responsibility for checking
the mapping records and cross referencing them with other AG metadata.
This enables us to clean up some messy parameter handling and reduce
redundant code. Rename a few functions to make the split of
responsibilities clearer.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Fixes gcc warning:
fs/xfs/libxfs/xfs_btree.c:4475: warning: Excess function parameter 'max_recs' description in 'xfs_btree_sblock_v5hdr_verify'
fs/xfs/libxfs/xfs_btree.c:4475: warning: Excess function parameter 'pag_max_level' description in 'xfs_btree_sblock_v5hdr_verify'
Fixes: c5ab131ba0 ("libxfs: refactor short btree block verification")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Memory we use to submit for IO needs strict alignment to the
underlying driver contraints. Worst case, this is 512 bytes. Given
that all allocations for IO are always a power of 2 multiple of 512
bytes, the kernel heap provides natural alignment for objects of
these sizes and that suffices.
Until, of course, memory debugging of some kind is turned on (e.g.
red zones, poisoning, KASAN) and then the alignment of the heap
objects is thrown out the window. Then we get weird IO errors and
data corruption problems because drivers don't validate alignment
and do the wrong thing when passed unaligned memory buffers in bios.
TO fix this, introduce kmem_alloc_io(), which will guaranteeat least
512 byte alignment of buffers for IO, even if memory debugging
options are turned on. It is assumed that the minimum allocation
size will be 512 bytes, and that sizes will be power of 2 mulitples
of 512 bytes.
Use this everywhere we allocate buffers for IO.
This no longer fails with log recovery errors when KASAN is enabled
due to the brd driver not handling unaligned memory buffers:
# mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Needed to feed into the allocation routine to guarantee the memory
buffers we add to bios are correctly aligned to the underlying
device.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When trying to correlate XFS kernel allocations to memory reclaim
behaviour, it is useful to know what allocations XFS is actually
attempting. This information is not directly available from
tracepoints in the generic memory allocation and reclaim
tracepoints, so these new trace points provide a high level
indication of what the XFS memory demand actually is.
There is no per-filesystem context in this code, so we just trace
the type of allocation, the size and the allocation constraints.
The kmem code also doesn't include much of the common XFS headers,
so there are a few definitions that need to be added to the trace
headers and a couple of types that need to be made common to avoid
needing to include the whole world in the kmem code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If writepage()/writepages() saw an error, but handled it without
reporting it, we should not be re-reporting that error on exit.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the client attempts to read a page, but the read fails due to some
spurious error (e.g. an ACCESS error or a timeout, ...) then we need
to allow other processes to retry.
Also try to report errors correctly when doing a synchronous readpage.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the mount is hard, we should ignore the 'io_maxretrans' module
parameter so that we always keep retrying.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit a79f194aa4.
The mechanism for aborting I/O is racy, since we are not guaranteed that
the request is asleep while we're changing both task->tk_status and
task->tk_action.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v5.1
The pNFS/flexfiles I/O requests are sent with the SOFTCONN flag set, so
they automatically time out if the connection breaks. It should
therefore not be necessary to have the soft flag set in addition.
Fixes: 5f01d95394 ("nfs41: create NFSv3 DS connection if specified")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP,
we can remove KM_NOSLEEP and replace KM_SLEEP with 0.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
UBIFS:
- Don't block too long in writeback_inodes_sb()
- Fix for a possible overrun of the log head
- Fix double unlock in orphan_delete()
JFFS2:
- Remove C++ style from UAPI header and unbreak picky toolchains
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl1ik14WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wbP2D/4xVW7YP5Yyt6YrABJuclfoib30
2LI6eOz0+5OojQKUbOzXCN9N7Dv4TLJKrCjRc9qKYTIB1DiQXuBDqtYKg6CTBhHb
MjiftEDiBQ6j3jVmRxkQRXZEB9I3Uu9CkA8s65+UmL8peJfgNElpH34omsU1fzup
y0NhZhj77P5jsAG6r7yXvuaofCOTlZIZVPya9FX17J0Ra+3rMOCtVEqnaHk2E5RB
EQPAEByqXUIx7+9mOi1Krw7B7fesB7oOVbCykE5knX1pZQCTURP64yNr35WxN+7Z
crcpdEQtf54qWMCKf4ClIBHiPmmsDIHYJy3JXjgJKOwIYvrB3dZ5E170qPr3JixY
nS+l8x69IYZhWUzHg8gxDizk92iFYKbO1h5vBwI7NUFHkHLzylsgonBK0KdaUnol
OvI5oCO/rdJEMBPr5LEFpOjZJIEptPtXpDvQCpm5tWd5tuW+8edNpI38lDO9LThC
O0diZZUQfsuzD1XrvKRORPU+4lskzGV5b1UA0DWXdGKALqM5VrQZo1XftvA74Zkv
oZQcHNK5wdecQX81Oadfb/0a5SN7FGGtTUCKTpOyBIu0adarGIasC6TQr2aDiiNh
7jLjBoV2XEGhXZQrK2lm8G+6rJ7Mp11B6aoTFgDELzt+SB7htp6dARR2+4aGWXh9
iXgme0n9HXDDeuosag==
=Bsgx
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBIFS and JFFS2 fixes from Richard Weinberger:
"UBIFS:
- Don't block too long in writeback_inodes_sb()
- Fix for a possible overrun of the log head
- Fix double unlock in orphan_delete()
JFFS2:
- Remove C++ style from UAPI header and unbreak picky toolchains"
* tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: Limit the number of pages in shrink_liability
ubifs: Correctly initialize c->min_log_bytes
ubifs: Fix double unlock around orphan_delete()
jffs2: Remove C++ style comments from uapi header
userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if
mm->core_state != NULL.
Otherwise a page fault can see userfaultfd_missing() == T and use an
already freed userfaultfd_ctx.
Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com
Fixes: 04f5866e41 ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix a forgotten inode unlock when chown/chgrp fail due to quota.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1gnj0ACgkQ+H93GTRK
tOvAlA/8DE5Ff/itTrz7D+1JCGxZgLyD1osTn8ZFuqLn6gEOR36i/WD+7infM5Tr
yowKvHXT3qOzAGGAyJFcjYkKx+wcYd7URR3105RFGVpd5FzW60lA/Cbzi7ecY7vL
e2ukHeWBfOJGZsIuw/+E/sl6PeTmcq3NzHyLSHg2hYjcxTW6wxmvTbporC3Ns73L
48AI39g1++1vz9W/T0wXNVGlDKih8gZIXtSTVqdbX3/sZ6C3dMiNqKUQTce+u/Nh
KI6aELb8ClhWhBv8fBBlCRZ9Zl1iHKEB9Rj4vwotzK2Fm4jnYh1m0R6tuL8BK7jd
H50qpokQ51RmtdWdicQ290S+XZi4kWpUaQiPl5f8Hf9UYj+M3Vg3zrwyx9O2xdnk
Oj4LPG/gvkFtJM5A9hhmK2VvEUqmb04ikovdOy1cmUYJmfyX+78968uX7Fkq4kbR
Gqk2m8zSxwbBxn8Io8jA0PsrQjrAU98rNibhHpcseSsmK2z44M6Ch+uXW8j9a4ws
xllJ2R0wtm0o9phIaUiwhaBq8/j1m8fe+1haUSeeeByMOl3j/oHtk0T8p/zbMAvz
EmMcF3Poe6vFeSXNZTqKuTVg9J445fKZizgouEtNmuBU/mYq9TkHjN6MaqwGDaMn
n8zzzpgoW1YT9Yxf6u0CzBBVZgjapF9wg6Op4JuDdsl/DU//UI8=
=gRWY
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"A single patch that fixes a xfs lockup problem when a chown/chgrp
operation fails due to running out of quota. It has survived the usual
xfstests runs and merges cleanly with this morning's master:
- Fix a forgotten inode unlock when chown/chgrp fail due to quota"
* tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
EROFS filesystem has been merged into linux-staging for a year.
EROFS is designed to be a better solution of saving extra storage
space with guaranteed end-to-end performance for read-only files
with the help of reduced metadata, fixed-sized output compression
and decompression inplace technologies.
In the past year, EROFS was greatly improved by many people as
a staging driver, self-tested, betaed by a large number of our
internal users, successfully applied to almost all in-service
HUAWEI smartphones as the part of EMUI 9.1 and proven to be stable
enough to be moved out of staging.
EROFS is a self-contained filesystem driver. Although there are
still some TODOs to be more generic, we have a dedicated team
actively keeping on working on EROFS in order to make it better
with the evolution of Linux kernel as the other in-kernel filesystems.
As Pavel suggested, it's better to do as one commit since git
can do moves and all histories will be saved in this way.
Let's promote it from staging and enhance it more actively as
a "real" part of kernel for more wider scenarios!
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Pavel Machek <pavel@denx.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J . Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Richard Weinberger <richard@nod.at>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Li Guifu <bluce.liguifu@huawei.com>
Cc: Fang Wei <fangwei1@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190822213659.5501-1-hsiangkao@aol.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1gLIsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnNgD/9SVVtQ6DpSyPojSxVrcAfbH7n0Y+62Mfzs
yWeCpYvmxTd2APWAVtGeBh74uH58MYqwHBp6IKF1713WwENDpv5cDXtHCNi+d3xI
KulR9SQSC0wCIov7ak43TeKwuIUjn0cVz9VdrmaXLlp5f5nzEeNDixIlxaDXm1sf
PGksrXxnMnxKJU00uaW3J05E7GW/6kUDYq2IuG26cIkdA6c4TCj+y8uSnn2RNIsc
KeynzPx9UyX40weoLhb1HTi2HzZ+Cfz7t34kZZeluaJOiFkBdS5G/1sBf2MWdPwd
ZdpKCC86SmZF87pk9B455DALj3tqrvtym3nCn2HQ8jiNsgSqmUl+qTseH5OpLLbB
AL6OzSMh5HZ1g+hsBPgATVlb3GyJoSno3BZMAe+dTgu+wcv1sowajpm3p4rEQcbk
p6RmdmCz8mdCGuC0wWpVtQVk7nE0EKIBDMggM2T3dvRPkSTiep2Zdjg1iu/6HNlW
RSIWtcqo8H3CgOi7EcFjbHGLJ0kt98MUXcUHBTbwdGmRGhxbTUyKENL3FeWGiSZ/
Ojmnv4grdBch2rI4wmyenqnL/eQ37Mzr1nW5ZkHkcf27MP/v8HEhRDwS1a+YQr1x
acEsy7OC6nDyycsamWgSavm+x5t0zWWOjl6O92UbnZ3pvIkeoReXLbH9sjzzjj0c
VvBO9UArSg==
=uM7/
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190823' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here's a set of fixes that should go into this release. This contains:
- Three minor fixes for NVMe.
- Three minor tweaks for the io_uring polling logic.
- Officially mark Song as the MD maintainer, after he's been filling
that role sucessfully for the last 6 months or so"
* tag 'for-linus-20190823' of git://git.kernel.dk/linux-block:
io_uring: add need_resched() check in inner poll loop
md: update MAINTAINERS info
io_uring: don't enter poll loop if we have CQEs pending
nvme: Add quirk for LiteON CL1 devices running FW 22301111
nvme: Fix cntlid validation when not using NVMEoF
nvme-multipath: fix possible I/O hang when paths are updated
io_uring: fix potential hang with polled IO
- Fix missing compat ioctl handling for get/setlabel
- Fix missing ioctl pointer sanitization on s390
- Fix a page locking deadlock in the dedupe comparison code
- Fix inadequate locking in reflink code w.r.t. concurrent directio
- Fix broken error detection when breaking layouts
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1cEXsACgkQ+H93GTRK
tOsXlhAAiUowRArnwXnqR+5Z7e3nyFZOeL0DTJHVE3UpKABz/NBnevQgsy70Bqmk
mo27ANMY8y9i7zatuCvM9UX8PXnOdaUKwoey8j5BB44iaEAkz9afeOt09PuCe141
sNucDjq7yQWkhDNd38lujpcXMNqlVNDkDtpYGx8ArzdVaEJfudqgHFqR+lnL2LRH
xylaJprOxcE6tCFmCVsvQmlnIbuCMWF1e7B5IA0Aoh6dLTWdD8nRNbPi9PNp3nbK
c7UvsDcl2SrngXFbdgGCexmguKT29va8t/GkwRVPmhXgu/hslOIcZPhqIti/LG2w
7u6CuvTa22xIA0yX9utCSq04HSKRsDKygPpYuI3U10caKmvUsvXpMFZ3goktqAgd
8pUZpapMGORe2W+b5Wa1vi5/wv+MKMOxeeAoui38KyDJvFNOADT6hlQ//GfuJSph
/4d7BKcZFykWEl/NI2tzaoiCzHy3ObdBTi3eloNjFE/KxVKKuBbjX/j6YisyhUpW
i6/i4i1POp5E41tM3u17cC2DmgYiqFCzg799yrt1QBgqOCVZvGyOHR4X2B4AFWSh
RALHKS2hBdzDIIRwLJVzA428kRMRptRviELgluJLLvx7fIrhGJ3URNzFBVty+fJi
YG8d1WUHcxLamO3ayjydyWCgO7W8tWOP/jCOGe/2apU+hCNZFUk=
=50ZB
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a few more bug fixes that trickled in since the last pull.
They've survived the usual xfstests runs and merge cleanly with this
morning's master.
I expect there to be one more pull request tomorrow for the fix to
that quota related inode unlock bug that we were reviewing last night,
but it will continue to soak in the testing machine for several more
hours.
- Fix missing compat ioctl handling for get/setlabel
- Fix missing ioctl pointer sanitization on s390
- Fix a page locking deadlock in the dedupe comparison code
- Fix inadequate locking in reflink code w.r.t. concurrent directio
- Fix broken error detection when breaking layouts"
* tag 'xfs-5.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs/xfs: Fix return code of xfs_break_leased_layouts()
xfs: fix reflink source file racing with directio writes
vfs: fix page locking deadlocks when deduping files
xfs: compat_ioctl: use compat_ptr()
xfs: fall back to native ioctls for unhandled compat ones
an assert and a NULL pointer dereference) plus a small series from Luis
fixing instances of vfree() under spinlock.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl1f2fITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi83fB/0a+TnNY8Q2aEeB9Y/0sckSpRCsMGMV
syt2krwKC0EYM1f2dkJdgCjlSjMzMcHPseP3g5odRXgyPKJt5O9oE7l3vGDC4Oyt
chqhEh86UzG6Kcptx6tIzsAGYS9S4NzxR5sfXF6oRu8m1bwk1n5IhKxYjQDTvAMd
RxwvpdguNA9xvHeUvLMTpy2R3qE3uQ2dxierutW67GeyeCPkvyBmazzi72Q36hlL
y1w8DWaPBemBk5QEM9vmz5i2xQeLO4h4ejhP4LcXyVjJtfvAPl0JWOsHMK4uWRJf
6XjbGDaGYvID0hTQLlEw/k73976HmRxSbaXRtCZN+IG3yWGTL8ID6GqI
=kaFB
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Three important fixes tagged for stable (an indefinite hang, a crash
on an assert and a NULL pointer dereference) plus a small series from
Luis fixing instances of vfree() under spinlock"
* tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client:
libceph: fix PG split vs OSD (re)connect race
ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply
ceph: clear page dirty before invalidate page
ceph: fix buffer free while holding i_ceph_lock in fill_inode()
ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()
ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
Benjamin Moody reported to Debian that XFS partially wedges when a chgrp
fails on account of being out of disk quota. I ran his reproducer
script:
# adduser dummy
# adduser dummy plugdev
# dd if=/dev/zero bs=1M count=100 of=test.img
# mkfs.xfs test.img
# mount -t xfs -o gquota test.img /mnt
# mkdir -p /mnt/dummy
# chown -c dummy /mnt/dummy
# xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt
(and then as user dummy)
$ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo
$ chgrp plugdev /mnt/dummy/foo
and saw:
================================================
WARNING: lock held when returning to user space!
5.3.0-rc5 #rc5 Tainted: G W
------------------------------------------------
chgrp/47006 is leaving the kernel with locks still held!
1 lock held by chgrp/47006:
#0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs]
...which is clearly caused by xfs_setattr_nonsize failing to unlock the
ILOCK after the xfs_qm_vop_chown_reserve call fails. Add the missing
unlock.
Reported-by: benjamin.moody@gmail.com
Fixes: 253f4911f2 ("xfs: better xfs_trans_alloc interface")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
The outer poll loop checks for whether we need to reschedule, and
returns to userspace if we do. However, it's possible to get stuck
in the inner loop as well, if the CPU we are running on needs to
reschedule to finish the IO work.
Add the need_resched() check in the inner loop as well. This fixes
a potential hang if the kernel is configured with
CONFIG_PREEMPT_VOLUNTARY=y.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl1ekN4ACgkQ+7dXa6fL
C2v5YA//WpHrecLwBiBfd4UE1QndDVC7bC1aVvmUsPYsMNTnc1wqD7zwPSVAkXt9
u7WVa0XsOK4Ks9PpNwwmtlFk2nSXvFbb1WsPiyUX/QWC+tB0jdHEvkymEonVPn85
UuNMcCx2Yzv7Mxw9aESWDziEN5PzsOChZC1M8fpVuEBDcqqbkkdSTM1LPzfHkRn5
4/OFnlaC/4D4qEfv+0gFZjf6zBEPicHRfgSWYgzyBxsEwZ5eGzTcpVSYPEJRsuYF
Ndqp0ei/65wUihk2gyoNG5PkC/9oouQV9ko17QG1uhiqrFpECiAkbyf8YmkUTDSc
WvNtKN3HnLKJhCPoJ1SpE1qFs0Iw10y2BySO2XLoj7N7421aSIU+nemQ9yZ1mQgc
GGwpBx1jIPMsN0IDXG8HIJCW3aUNU+Ygg2X7gvpF2gOvB29LVPN48/6kahpeQpAR
vzLRUod9+H4wD3kLqpOjDOCPmokZNktn+8rtqlctyCvwp41JBbmQ9/r68aoFhpe9
fFN4zhd3E365tgX63ooUQVa4thc09ltcYTAAhEz1Ma8kRsigwZ6pY5xSrpZ0dehW
4SEykEsqQDlSmFV0G/063F66M621o69VvETe8lhOsVVK3XVWzGkDdIXS1iGlFrNx
A/hXcr2rwau5qomo00blyPyeh2DcQhsAPI3SJyq7JL2bK4JEQD4=
=1/ML
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20190822' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
- Fix a cell record leak due to the default error not being cleared.
- Fix an oops in tracepoint due to a pointer that may contain an error.
- Fix the ACL storage op for YFS where the wrong op definition is being
used. By luck, this only actually affects the information appearing
in traces.
* tag 'afs-fixes-20190822' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: use correct afs_call_type in yfs_fs_store_opaque_acl2
afs: Fix possible oops in afs_lookup trace event
afs: Fix leak in afs_lookup_cell_rcu()
If the number of dirty pages to be written back is large,
then writeback_inodes_sb will block waiting for a long time,
causing hung task detection alarm. Therefore, we should limit
the maximum number of pages written back this time, which let
the budget be completed faster. The remaining dirty pages
tend to rely on the writeback mechanism to complete the
synchronization.
Fixes: b6e51316da ("writeback: separate starting of sync vs opportunistic writeback")
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
Currently on a freshly mounted UBIFS, c->min_log_bytes is 0.
This can lead to a log overrun and make commits fail.
Recent kernels will report the following assert:
UBIFS assert failed: c->lhead_lnum != c->ltail_lnum, in fs/ubifs/log.c:412
c->min_log_bytes can have two states, 0 and c->leb_size.
It controls how much bytes of the log area are reserved for non-bud
nodes such as commit nodes.
After a commit it has to be set to c->leb_size such that we have always
enough space for a commit. While a commit runs it can be 0 to make the
remaining bytes of the log available to writers.
Having it set to 0 right after mount is wrong since no space for commits
is reserved.
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-and-tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
We unlock after orphan_delete(), so no need to unlock
in the function too.
Reported-by: Han Xu <han.xu@nxp.com>
Fixes: 8009ce956c ("ubifs: Don't leak orphans on memory during commit")
Signed-off-by: Richard Weinberger <richard@nod.at>
It seems that 'yfs_RXYFSStoreOpaqueACL2' should be use in
yfs_fs_store_opaque_acl2().
Fixes: f5e4546347 ("afs: Implement YFS ACL setting")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
The afs_lookup trace event can cause the following:
[ 216.576777] BUG: kernel NULL pointer dereference, address: 000000000000023b
[ 216.576803] #PF: supervisor read access in kernel mode
[ 216.576813] #PF: error_code(0x0000) - not-present page
...
[ 216.576913] RIP: 0010:trace_event_raw_event_afs_lookup+0x9e/0x1c0 [kafs]
If the inode from afs_do_lookup() is an error other than ENOENT, or if it
is ENOENT and afs_try_auto_mntpt() returns an error, the trace event will
try to dereference the error pointer as a valid pointer.
Use IS_ERR_OR_NULL to only pass a valid pointer for the trace, or NULL.
Ideally the trace would include the error value, but for now just avoid
the oops.
Fixes: 80548b0399 ("afs: Add more tracepoints")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fix a leak on the cell refcount in afs_lookup_cell_rcu() due to
non-clearance of the default error in the case a NULL cell name is passed
and the workstation default cell is used.
Also put a bit at the end to make sure we don't leak a cell ref if we're
going to be returning an error.
This leak results in an assertion like the following when the kafs module is
unloaded:
AFS: Assertion failed
2 == 1 is false
0x2 == 0x1 is false
------------[ cut here ]------------
kernel BUG at fs/afs/cell.c:770!
...
RIP: 0010:afs_manage_cells+0x220/0x42f [kafs]
...
process_one_work+0x4c2/0x82c
? pool_mayday_timeout+0x1e1/0x1e1
? do_raw_spin_lock+0x134/0x175
worker_thread+0x336/0x4a6
? rescuer_thread+0x4af/0x4af
kthread+0x1de/0x1ee
? kthread_park+0xd4/0xd4
ret_from_fork+0x24/0x30
Fixes: 989782dcdc ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
When ceph_mdsc_do_request returns an error, we can't assume that the
filelock_reply pointer will be set. Only try to fetch fields out of
the r_reply_info when it returns success.
Cc: stable@vger.kernel.org
Reported-by: Hector Martin <hector@marcansoft.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
clear_page_dirty_for_io(page) before mapping->a_ops->invalidatepage().
invalidatepage() clears page's private flag, if dirty flag is not
cleared, the page may cause BUG_ON failure in ceph_set_page_dirty().
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/40862
Signed-off-by: Erqi Chen <chenerqi@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Calling ceph_buffer_put() in fill_inode() may result in freeing the
i_xattrs.blob buffer while holding the i_ceph_lock. This can be fixed by
postponing the call until later, when the lock is released.
The following backtrace was triggered by fstests generic/070.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
6 locks held by kworker/0:4/3852:
#0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
#1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
#2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476
#3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476
#4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
#5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
fill_inode.isra.0+0xa9b/0xf70
ceph_fill_trace+0x13b/0xc70
? dispatch+0x2eb/0x1476
dispatch+0x320/0x1476
? __mutex_unlock_slowpath+0x4d/0x2a0
ceph_con_workfn+0xc97/0x2ec0
? process_one_work+0x1b8/0x5f0
process_one_work+0x244/0x5f0
worker_thread+0x4d/0x3e0
kthread+0x105/0x140
? process_one_work+0x5f0/0x5f0
? kthread_park+0x90/0x90
ret_from_fork+0x3a/0x50
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
freeing the i_xattrs.blob buffer while holding the i_ceph_lock. This can
be fixed by having this function returning the old blob buffer and have
the callers of this function freeing it when the lock is released.
The following backtrace was triggered by fstests generic/117.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
4 locks held by fsstress/649:
#0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
#1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
#2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
#3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
__ceph_build_xattrs_blob+0x12b/0x170
__send_cap+0x302/0x540
? __lock_acquire+0x23c/0x1e40
? __mark_caps_flushing+0x15c/0x280
? _raw_spin_unlock+0x24/0x30
ceph_check_caps+0x5f0/0xc60
ceph_flush_dirty_caps+0x7c/0x150
? __ia32_sys_fdatasync+0x20/0x20
ceph_sync_fs+0x5a/0x130
iterate_supers+0x8f/0xf0
ksys_sync+0x4f/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fc6409ab617
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
i_xattrs.prealloc_blob buffer while holding the i_ceph_lock. This can be
fixed by postponing the call until later, when the lock is released.
The following backtrace was triggered by fstests generic/117.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
3 locks held by fsstress/650:
#0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
#1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
#2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
__ceph_setxattr+0x2b4/0x810
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x59/0xf0
vfs_setxattr+0x81/0xa0
setxattr+0x115/0x230
? filename_lookup+0xc9/0x140
? rcu_read_lock_sched_held+0x74/0x80
? rcu_sync_lockdep_assert+0x2e/0x60
? __sb_start_write+0x142/0x1a0
? mnt_want_write+0x20/0x50
path_setxattr+0xba/0xd0
__x64_sys_lsetxattr+0x24/0x30
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7ff23514359a
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Some legacy code in the CIFS driver uses single DES to calculate
some password hash, and uses the crypto cipher API to do so. Given
that there is no point in invoking an accelerated cipher for doing
56-bit symmetric encryption on a single 8-byte block of input, the
flexibility of the crypto cipher API does not add much value here,
and so we're much better off using a library call into the generic
C implementation.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
cache containerization.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl1dXZcVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+aeoQALGy/uioZv5H4gdkHrcbdqSGmsTl
CNvIL6k94HTkkF8K3+30g000o0zzdkFrX/1nyguaJqftytwmCtocmZP/QdqzWQVT
UOel9LFJJn4GwQ7J7JIbpmc0YWK8oey1s1AYTMZG00s6ORVk5J0HfQqrryoGaG+o
9IySxbRKklGCm1J/0mMuPKKMaumiZPMf0GYrnmlMoW4KHg+ROP8e5Xp7VspqUCcv
KfpCqO7mcwBKgPez2hIIFBWh+CdoC/8ztymfN+15EBjTfS4Jl8D6v/+XTKs+IW+V
YwGiTt1pPBjrMy4nZMqIrSghS2owRoVuXiK/X6n38SQnpcmZEaeHFRYKAq8gZGzl
cvHtacVTp75n3gRUwTyE5hDLIpVOAe54doEQKR4rBUQZB6iul3DIkwhoHocQAf52
n+nmOK09CSP8M4uLVBNsfGGn/eU5jcqGY9M+4qoAxdv1/N/VCeQV9Lfx8FBRea2p
5fbo2f+g1nqKOhX6PTYMEg83lWvDwWUKghHNpVai4QQN/z+RRlKiNhKldK2y3Tvm
4ND3bCy++yiPIUlZpsSc6FFCdi+JHVCM1arqM1Irm92J1PzF9CELqhHEe9MAMiWt
iZgO/42uWZ2i0aqb3y94/6o9rvaYDABI21vu3yZmyxlHgZuyVT+GY1ggbHn9oSEc
sqwLMDwejclsCrIx
=irn8
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.3-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Fix nfsd bugs: three in the new nfsd/clients/ code, one in the reply
cache containerization"
* tag 'nfsd-5.3-1' of git://linux-nfs.org/~bfields/linux:
nfsd4: Fix kernel crash when reading proc file reply_cache_stats
nfsd: initialize i_private before d_add
nfsd: use i_wrlock instead of rcu for nfsdfs i_private
nfsd: fix dentry leak upon mkdir failure.
Add an ID and a device pointer to 'struct wakeup_source'. Use them to to
expose wakeup sources statistics in sysfs under
/sys/class/wakeup/wakeup<ID>/*.
Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We need to check if we have CQEs pending before starting a poll loop,
as those could be the events we will be spinning for (and hence we'll
find none). This can happen if a CQE triggers an error, or if it is
found by eg an IRQ before we get a chance to find it through polling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a request issue ends up being punted to async context to avoid
blocking, we can get into a situation where the original application
enters the poll loop for that very request before it has been issued.
This should not be an issue, except that the polling will hold the
io_uring uring_ctx mutex for the duration of the poll. When the async
worker has actually issued the request, it needs to acquire this mutex
to add the request to the poll issued list. Since the application
polling is already holding this mutex, the workqueue sleeps on the
mutex forever, and the application thus never gets a chance to poll for
the very request it was interested in.
Fix this by ensuring that the polling drops the uring_ctx occasionally
if it's not making any progress.
Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In __break_lease(), the file lock 'new_fl' is allocated in lease_alloc().
However, it is not deallocated in the following execution if
smp_load_acquire() fails, leading to a memory leak bug. To fix this issue,
free 'new_fl' before returning the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
The parens used in the while loop would result in error being assigned
the value 1 rather than the intended errno value.
This is required to return -ETXTBSY from follow on break_layout()
changes.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull kernel thread signal handling fix from Eric Biederman:
"I overlooked the fact that kernel threads are created with all signals
set to SIG_IGN, and accidentally caused a regression in cifs and drbd
when replacing force_sig with send_sig.
This is my fix for that regression. I add a new function
allow_kernel_signal which allows kernel threads to receive signals
sent from the kernel, but continues to ignore all signals sent from
userspace. This ensures the user space interface for cifs and drbd
remain the same.
These kernel threads depend on blocking networking calls which block
until something is received or a signal is pending. Making receiving
of signals somewhat necessary for these kernel threads.
Perhaps someday we can cleanup those interfaces and remove
allow_kernel_signal. If not allow_kernel_signal is pretty trivial and
clearly documents what is going on so I don't think we will mind
carrying it"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Allow cifs and drbd to receive their terminating signals
If the writeback error is fatal, we need to remove the tracking structures
(i.e. the nfs_page) from the inode.
Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Initialise the result count to 0 rather than initialising it to the
argument count. The reason is that we want to ensure we record the
I/O stats correctly in the case where an error is returned (for
instance in the layoutstats).
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the attempt to resend the I/O results in no bytes being read/written,
we must ensure that we report the error.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: 0a00b77b33 ("nfs: mirroring support for direct io")
Cc: stable@vger.kernel.org # v3.20+
If the attempt to resend the pages fails, we need to ensure that we
clean up those pages that were not transmitted.
Fixes: d600ad1f2b ("NFS41: pop some layoutget errors to application")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.5+
If the file turns out to be of the wrong type after opening, we want
to revalidate the path and retry, so return EOPENSTALE rather than
ESTALE.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently, we are translating RPC level errors such as timeouts,
as well as interrupts etc into EOPENSTALE, which forces a single
replay of the open attempt. What we actually want to do is
force the replay only in the cases where the returned error
indicates that the file may have changed on the server.
So the fix is to spell out the exact set of errors where we want
to return EOPENSTALE.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we've been given the attributes of the mounted-on-file, then do not
use those to check or update the attributes on the application-visible
inode.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
My recent to change to only use force_sig for a synchronous events
wound up breaking signal reception cifs and drbd. I had overlooked
the fact that by default kthreads start out with all signals set to
SIG_IGN. So a change I thought was safe turned out to have made it
impossible for those kernel thread to catch their signals.
Reverting the work on force_sig is a bad idea because what the code
was doing was very much a misuse of force_sig. As the way force_sig
ultimately allowed the signal to happen was to change the signal
handler to SIG_DFL. Which after the first signal will allow userspace
to send signals to these kernel threads. At least for
wake_ack_receiver in drbd that does not appear actively wrong.
So correct this problem by adding allow_kernel_signal that will allow
signals whose siginfo reports they were sent by the kernel through,
but will not allow userspace generated signals, and update cifs and
drbd to call allow_kernel_signal in an appropriate place so that their
thread can receive this signal.
Fixing things this way ensures that userspace won't be able to send
signals and cause problems, that it is clear which signals the
threads are expecting to receive, and it guarantees that nothing
else in the system will be affected.
This change was partly inspired by similar cifs and drbd patches that
added allow_signal.
Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf0 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
While trawling through the dedupe file comparison code trying to fix
page deadlocking problems, Dave Chinner noticed that the reflink code
only takes shared IOLOCK/MMAPLOCKs on the source file. Because
page_mkwrite and directio writes do not take the EXCL versions of those
locks, this means that reflink can race with writer processes.
For pure remapping this can lead to undefined behavior and file
corruption; for dedupe this means that we cannot be sure that the
contents are identical when we decide to go ahead with the remapping.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1ZOygACgkQxWXV+ddt
WDvokw/8CRjbN5Bjlk/JzmikH+mU28Cd7qQgahWw7Afkyh5Gzb4IJJbNXzapy988
dMMKYF2H6lxY46EZG8cF4MFjCv8L4L1eZ9CqwfZyf7MfzPL2pnLSN77QJYYebWp3
y9rc8xv+qUdIcumQP25yxXmtN0YxT5bIJiuCmpHJFNfyijHVRnXV2CxQ8nwe63/1
G25a03x6BNKtSrU3ZP0fW2VjARKzIF7i+bEy4Ew3n3dNKqAiROYecswcBedhCBkF
D9hL62uHcxQSeHi/6lAZYKpsp4g4pKEO4c92MxsI8PJtd4zgHQG7MttXsHC1F1FQ
lHcP3vxKICOOMa13W6QdytSX6uSpjeLyMTDfmvaahQmwG6I5dBN56pkGlWdDMKNn
NUCNBmV063D7Shed4W2uIafLfo3BLEwKr2pd6pOfOZHwOKPWblJ0v4KzVDLoKC7v
CA5HB9BBz4KfSyp3fkm9B5VGJW938vRHVfx55IoakL+tfs67cKDYo8QEFHDuz0P5
DrYNwKvp4a5llGQ6vdRKfeiN7vBea10795MI5vFJiGUHfY3pN99R4UUed7kaul58
n6gqYukcPtIBqm8auxK037nngA+V0N2y0ceM1/aKUGaZVlCtSmEKXseYjbaiH6fP
MEZSgZn4W5qnu2oQuKohfQsNHtY5WP9489GpByy+DS5QsbDaueg=
=ovM7
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two fixes that popped up during testing:
- fix for sysfs-related code that adds/removes block groups, warnings
appear during several fstests in connection with sysfs updates in
5.3, the fix essentially replaces a workaround with scope NOFS and
applies to 5.2-based branch too
- add sanity check of trim range"
* tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: trim: Check the range passed into to prevent overflow
Btrfs: fix sysfs warning and missing raid sysfs directories
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1Ygq4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjUnD/47XM1xHO2L/66+XiRnXwNc0Zmc0cNrtIsp
09EqZFeSwaqJKfoodcDphuHSYf6jkLk9xRKiFEr+KrUfzwvmgN3fFlbk6Azz0A4f
ACv4DtTPML40QUGzo5c+UfnJTwK8z4An/lG4aNDB3UucFwoWhfvLkzQAN+FNilAQ
sMTWPvhjiV/E7BMa7p108/NL9sv0UzUzgIOI7cZCq/Z5/jJBiohl4DzPPyfCKzOK
shKP3Q7zId2q7d9zJZZuwy/iYLTCEHzy+37hxS7343vXeTlxKbrxY6xKZ8Rhsccx
E3ep+SYuRxl7BHXPTwFh5HSMRXq5JEVBy3dnoMzFEu6pqzAPyLjQS0tG/lR9T1uS
E8t0QnsdjVvedF/c3a3b2pVcYCbRjHsB/5DLBPxRfgAPkvVm180XKQM95HkLQYle
A74mgU1WODnXsdyJeAacf1b+3t8p+5x5vpRq6Ls6dgmGn74qhfvs4tQnqypFFnyR
Le0O7ICWKotQFo0CzbPB029xORepq9HLkFEeh73K6uG+Bn5iGMC1oUmxB0uSuHpz
h9lPDjjj6RSLpzp1jCqswSHsQi00sLSKMtytjEt83/EWPcyN2oiPP1BTq+9jITYR
SvoKF6R+CLwQscb/DEXb/vRSehg5aivXHg1AOo0l6/OiPQmNdmSZEU4/McPoB2xW
1MC5Y4wV1g==
=f6DS
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes that should go into this series. This contains:
- Revert of the REQ_NOWAIT_INLINE and associated dio changes. There
were still corner cases there, and even though I had a solution for
it, it's too involved for this stage. (me)
- Set of NVMe fixes (via Sagi)
- io_uring fix for fixed buffers (Anthony)
- io_uring defer issue fix (Jackie)
- Regression fix for queue sync at exit time (zhengbin)
- xen blk-back memory leak fix (Wenwen)"
* tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block:
io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list
block: remove REQ_NOWAIT_INLINE
io_uring: fix manual setup of iov_iter for fixed buffers
xen/blkback: fix memory leaks
blk-mq: move cancel of requeue_work to the front of blk_exit_queue
nvme-pci: Fix async probe remove race
nvme: fix controller removal race with scan work
nvme-rdma: fix possible use-after-free in connect error flow
nvme: fix a possible deadlock when passthru commands sent to a multipath device
nvme-core: Fix extra device_put() call on error path
nvmet-file: fix nvmet_file_flush() always returning an error
nvmet-loop: Flush nvme_delete_wq when removing the port
nvmet: Fix use-after-free bug when a port is removed
nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns
When dedupe wants to use the page cache to compare parts of two files
for dedupe, we must be very careful to handle locking correctly. The
current code doesn't do this. It must lock and unlock the page only
once if the two pages are the same, since the overlapping range check
doesn't catch this when blocksize < pagesize. If the pages are distinct
but from the same file, we must observe page locking order and lock them
in order of increasing offset to avoid clashing with writeback locking.
Fixes: 876bec6f9b ("vfs: refactor clone/dedupe_file_range common functions")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
For 31-bit s390 user space, we have to pass pointer arguments through
compat_ptr() in the compat_ioctl handler.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Always try the native ioctl if we don't have a compat handler. This
removes a lot of boilerplate code as 'modern' ioctls should generally
be compat clean, and fixes the missing entries for the recently added
FS_IOC_GETFSLABEL/FS_IOC_SETFSLABEL ioctls.
Fixes: f7664b3197 ("xfs: implement online get/set fs label")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since 9e8925b67a ("locks: Allow disabling mandatory locking at compile
time"), attempts to mount filesystems with "-o mand" will fail.
Unfortunately, there is no other indiciation of the reason for the
failure.
Change how the function is defined for better readability. When
CONFIG_MANDATORY_FILE_LOCKING is disabled, printk a warning when
someone attempts to mount with -o mand.
Also, add a blurb to the mandatory-locking.txt file to explain about
the "mand" option, and the behavior one should expect when it is
disabled.
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
A process could race in an open and attempt to read one of these files
before i_private is initialized, and get a spurious error.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As inode wb switching may make sync(2) miss some inodes, they're
synchronized using wb_switch_rwsem so that no wb switching happens
while sync(2) is in progress. In addition to synchronizing the actual
switching, the rwsem is also used to prevent queueing new switch
attempts while sync(2) is in progress. This is to avoid queueing too
many instances while the rwsem is held by sync(2). Unfortunately,
this is too agressive and can block wb switching for a long time if
sync(2) is frequent.
The goal is avoiding expolding the number of scheduled switches, not
avoiding scheduling anything. Let's use wb_switch_rwsem only for
synchronizing the actual switching and sync(2) and use
isw_nr_in_flight instead for limiting the maximum number of scheduled
switches. The limit is set to 1024 which should be more than enough
while still avoiding extreme situations.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
to ignore short writeback rounds to prevent getting confused by a
burst of short writebacks. The parameter is currently 2 meaning that
anything smaller than half of the running average writback duration
will be ignored.
This is unnecessarily aggressive. The detection logic uses 16 history
slots and is already reasonably protected against some short bursts
confusing it and the current parameter can lead to tens of seconds of
missed detection depending on the writeback pattern.
Let's change the parameter to 8, so that it only ignores writeback
with are smaller than 12.5% of the current running average.
v2: Add comment explaining what's going on with the foreign detection
parameters.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix crashes when the attr fork isn't present due to errors but inode
inactivation tries to zap the attr data anyway.
- Convert more directory corruption debugging asserts to actual
EFSCORRUPTED returns instead of blowing up later on.
- Don't fail writeback just because we ran out of memory allocating
metadata log data.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1RlXoACgkQ+H93GTRK
tOtc7A/+JIidhI/MHQLs7Ab9GW+PsHHBMSbVTV4Ge+SlfZtPNI38zrC1MC5LWvvV
bndOpRjLm4nOJcB7fsoEWufTs1dKOIUjk2yQi8x47ZvE+B/RcA4b6IDhwpbAI8GW
kt1RLNec9kpzhxCFFPzsXT9MwjEvvOvTeXfxaXTmiuB2kbJkR5dTlCUS2nUDnqsG
FGdmOUDjy1uVfFcSrp75KT/iYaqW08cG+uY/eUHRm+YMUKI8hF1t+n8cDnSg96VX
IN2DT1d3dTWiiF+JUZnMhVwJvPgV95DOf+yYy/F7qOcJUEmQ9tD6+0Ml/cI/AeLG
zERxHXM9A9Jy8S+2xkvf0J/+HStwfviWNToK3pbMIM1ZsoMTi9q8VgbB3AaFiijf
C4Q4T3W0jC44om8X/Ta/c+G/64Tj8yenzLDeTHvtQkoq77QPBam/aYjBc79oYvHi
r+R61kHNto+YjJsRbkwgF/S+bzru1qY9Ccr0LJZrUkSzh4d6p94fbQc+NX4L2sv7
WzAc+kOR/7qgVgy4gVr3ju0d89kP/Xn/0e0Ma0V8CSZlX5yg1dMLew5TJq693UYX
xjLGD2ltOoFEN8e7/WXI0/ktvvSCAQalmz+sPgJvTlosUhpGXky85ced1PSrKiEV
l0tREpmawDo9WVvC/06yBj97Op6PDdb4CovDcyLT6Yt3v1aBZT0=
=ivN3
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.3-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix crashes when the attr fork isn't present due to errors but inode
inactivation tries to zap the attr data anyway.
- Convert more directory corruption debugging asserts to actual
EFSCORRUPTED returns instead of blowing up later on.
- Don't fail writeback just because we ran out of memory allocating
metadata log data.
* tag 'xfs-5.3-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't crash on null attr fork xfs_bmapi_read
xfs: remove more ondisk directory corruption asserts
fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
synchronize_rcu() gets called multiple times each time a client is
destroyed. If the laundromat thread has a lot of clients to destroy,
the delay can be noticeable. This was causing pynfs test RENEW3 to
fail.
We could embed an rcu_head in each inode and do the kref_put in an rcu
callback. But simplest is just to take a lock here.
(I also wonder if the laundromat thread would be better replaced by a
bunch of scheduled work or timers or something.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
syzbot is reporting that nfsd_mkdir() forgot to remove dentry created by
d_alloc_name() when __nfsd_mkdir() failed (due to memory allocation fault
injection) [1].
[1] https://syzkaller.appspot.com/bug?id=ce41a1f769ea4637ebffedf004a803e8405b4674
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+2c95195d5d433f6ed6cb@syzkaller.appspotmail.com>
Fixes: e8a79fb14f ("nfsd: add nfsd/clients directory")
[bfields: clean up in nfsd_mkdir instead of __nfsd_mkdir]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch may fix two issues:
First, when IOSQE_IO_DRAIN set, the next IOs need to be inserted into
defer list to delay execution, but link io will be actively scheduled to
run by calling io_queue_sqe.
Second, when multiple LINK_IOs are inserted together with defer_list,
the LINK_IO is no longer keep order.
|-------------|
| LINK_IO | ----> insert to defer_list -----------
|-------------| |
| LINK_IO | ----> insert to defer_list ----------|
|-------------| |
| LINK_IO | ----> insert to defer_list ----------|
|-------------| |
| NORMAL_IO | ----> insert to defer_list ----------|
|-------------| |
|
queue_work at same time <-----|
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We had a few issues with this code, and there's still a problem around
how we deal with error handling for chained/split bios. For now, just
revert the code and we'll try again with a thoroug solution. This
reverts commits:
e15c2ffa10 ("block: fix O_DIRECT error handling for bio fragments")
0eb6ddfb86 ("block: Fix __blkdev_direct_IO() for bio fragments")
6a43074e2f ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
893a1c9720 ("blk-mq: allow REQ_NOWAIT to return an error inline")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit bd11b3a391 ("io_uring: don't use iov_iter_advance() for fixed
buffers") introduced an optimization to avoid using the slow
iov_iter_advance by manually populating the iov_iter iterator in some
cases.
However, the computation of the iterator count field was erroneous: The
first bvec was always accounted for an extent of page size even if the
bvec length was smaller.
In consequence, some I/O operations on fixed buffers were unable to
operate on the full extent of the buffer, consistently skipping some
bytes at the end of it.
Fixes: bd11b3a391 ("io_uring: don't use iov_iter_advance() for fixed buffers")
Cc: stable@vger.kernel.org
Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl1UELcACgkQ+7dXa6fL
C2viWw//eDLvElBjaQsabfpMOVGkf02t+zCzNfKSC0KdM1+GUZ1FyXQ0UAtgwICY
Sp01h/RZa0V07sfYP7R5kL4/KIMdODmhrP0iiHDpoMjKCL7qR9tFbJDAcHtH8xz2
52UV2dmdDBI/wdw/i5dn6M02SoYAQMl1XT49SkzhFSELVchkpraGsf1vf4yITeVe
eI1TaOxI+TUaeH5f6+KWp6c8K8q70p3KfrR2VmCWkBrD7PNg9lp19pVnz8tdofYu
xURHQbJulSqM+mY7pcNBOi2iWy3dCLjBTkVJIwIhZcZqLThACY38SSaPtmdhgif4
wcyyZUtd8EGPzPPqbfCx7ycTIIDtL/r98XtGyiTJBKrCK+flZONdu0g/oIzvJ/Wu
hV4+ButxCuMakbLOe+Hew3lhHFOy7m9XZtOURzxzZSm9uazHDMxnw4ocxIOs24F1
qus1sG0+rlVDcMYjo2tKEAzOl/ZejJ/NUTd60ANIWKTHply2/2/5dH94B0yLwDnp
tfifBrBkyqFB4XUKGvqvvJczl0d7+zsEScs4VQLVO/WhATjj6jNnrYKgwvBS5pCM
890qUzj3TRW7ciZLi0THMEHBlEfbEWhNCaggAqieIvbKv7t4Kh2cUBaIsxo4IYqU
PBZZhFXRul5ocTJrV9pScl4RbzxE5V0j9cwSiiWnzZL1sQucIgQ=
=zivP
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs fixes from David Howells:
- Fix the CB.ProbeUuid handler to generate its reply correctly.
- Fix a mix up in indices when parsing a Volume Location entry record.
- Fix a potential NULL-pointer deref when cleaning up a read request.
- Fix the expected data version of the destination directory in
afs_rename().
- Fix afs_d_revalidate() to only update d_fsdata if it's not the same
as the directory data version to reduce the likelihood of overwriting
the result of a competing operation. (d_fsdata carries the directory
DV or the least-significant word thereof).
- Fix the tracking of the data-version on a directory and make sure
that dentry objects get properly initialised, updated and
revalidated.
Also fix rename to update d_fsdata to match the new directory's DV if
the dentry gets moved over and unhash the dentry to stop
afs_d_revalidate() from interfering.
* tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix missing dentry data version updating
afs: Only update d_fsdata if different in afs_d_revalidate()
afs: Fix off-by-one in afs_rename() expected data version calculation
fs: afs: Fix a possible null-pointer dereference in afs_put_read()
afs: Fix loop index mixup in afs_deliver_vl_get_entry_by_name_u()
afs: Fix the CB.ProbeUuid service handler to reply correctly
If you use lseek or similar (e.g. pread) to access a location in a
seq_file file that is within a record, rather than at a record boundary,
then the first read will return the remainder of the record, and the
second read will return the whole of that same record (instead of the
next record). When seeking to a record boundary, the next record is
correctly returned.
This bug was introduced by a recent patch (identified below). Before
that patch, seq_read() would increment m->index when the last of the
buffer was returned (m->count == 0). After that patch, we rely on
->next to increment m->index after filling the buffer - but there was
one place where that didn't happen.
Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Sergei Turchanov <turchanov@farpost.com>
Tested-by: Sergei Turchanov <turchanov@farpost.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Markus Elfring <Markus.Elfring@web.de>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add fs-verity support to f2fs. fs-verity is a filesystem feature that
enables transparent integrity protection and authentication of read-only
files. It uses a dm-verity like mechanism at the file level: a Merkle
tree is used to verify any block in the file in log(filesize) time. It
is implemented mainly by helper functions in fs/verity/. See
Documentation/filesystems/fsverity.rst for the full documentation.
The f2fs support for fs-verity consists of:
- Adding a filesystem feature flag and an inode flag for fs-verity.
- Implementing the fsverity_operations to support enabling verity on an
inode and reading/writing the verity metadata.
- Updating ->readpages() to verify data as it's read from verity files
and to support reading verity metadata pages.
- Updating ->write_begin(), ->write_end(), and ->writepages() to support
writing verity metadata pages.
- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().
Like ext4, f2fs stores the verity metadata (Merkle tree and
fsverity_descriptor) past the end of the file, starting at the first 64K
boundary beyond i_size. This approach works because (a) verity files
are readonly, and (b) pages fully beyond i_size aren't visible to
userspace but can be read/written internally by f2fs with only some
relatively small changes to f2fs. Extended attributes cannot be used
because (a) f2fs limits the total size of an inode's xattr entries to
4096 bytes, which wouldn't be enough for even a single Merkle tree
block, and (b) f2fs encryption doesn't encrypt xattrs, yet the verity
metadata *must* be encrypted when the file is because it contains hashes
of the plaintext data.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Make ext4_mpage_readpages() verify data as it is read from fs-verity
files, using the helper functions from fs/verity/.
To support both encryption and verity simultaneously, this required
refactoring the decryption workflow into a generic "post-read
processing" workflow which can do decryption, verification, or both.
The case where the ext4 block size is not equal to the PAGE_SIZE is not
supported yet, since in that case ext4_mpage_readpages() sometimes falls
back to block_read_full_page(), which does not support fs-verity yet.
Co-developed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add most of fs-verity support to ext4. fs-verity is a filesystem
feature that enables transparent integrity protection and authentication
of read-only files. It uses a dm-verity like mechanism at the file
level: a Merkle tree is used to verify any block in the file in
log(filesize) time. It is implemented mainly by helper functions in
fs/verity/. See Documentation/filesystems/fsverity.rst for the full
documentation.
This commit adds all of ext4 fs-verity support except for the actual
data verification, including:
- Adding a filesystem feature flag and an inode flag for fs-verity.
- Implementing the fsverity_operations to support enabling verity on an
inode and reading/writing the verity metadata.
- Updating ->write_begin(), ->write_end(), and ->writepages() to support
writing verity metadata pages.
- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().
ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
i_size. This approach works because (a) verity files are readonly, and
(b) pages fully beyond i_size aren't visible to userspace but can be
read/written internally by ext4 with only some relatively small changes
to ext4. This approach avoids having to depend on the EA_INODE feature
and on rearchitecturing ext4's xattr support to support paging
multi-gigabyte xattrs into memory, and to support encrypting xattrs.
Note that the verity metadata *must* be encrypted when the file is,
since it contains hashes of the plaintext data.
This patch incorporates work by Theodore Ts'o and Chandan Rajendra.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
To meet some users' needs, add optional support for having fs-verity
handle a portion of the authentication policy in the kernel. An
".fs-verity" keyring is created to which X.509 certificates can be
added; then a sysctl 'fs.verity.require_signatures' can be set to cause
the kernel to enforce that all fs-verity files contain a signature of
their file measurement by a key in this keyring.
See the "Built-in signature verification" section of
Documentation/filesystems/fsverity.rst for the full documentation.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add SHA-512 support to fs-verity. This is primarily a demonstration of
the trivial changes needed to support a new hash algorithm in fs-verity;
most users will still use SHA-256, due to the smaller space required to
store the hashes. But some users may prefer SHA-512.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a function for filesystems to call to implement the
FS_IOC_MEASURE_VERITY ioctl. This ioctl retrieves the file measurement
that fs-verity calculated for the given file and is enforcing for reads;
i.e., reads that don't match this hash will fail. This ioctl can be
used for authentication or logging of file measurements in userspace.
See the "FS_IOC_MEASURE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a function for filesystems to call to implement the
FS_IOC_ENABLE_VERITY ioctl. This ioctl enables fs-verity on a file.
See the "FS_IOC_ENABLE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Wire up the new ioctls for adding and removing fscrypt keys to/from the
filesystem, and the new ioctl for retrieving v2 encryption policies.
The key removal ioctls also required making UBIFS use
fscrypt_drop_inode().
For more details see Documentation/filesystems/fscrypt.rst and the
fscrypt patches that added the implementation of these ioctls.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Wire up the new ioctls for adding and removing fscrypt keys to/from the
filesystem, and the new ioctl for retrieving v2 encryption policies.
The key removal ioctls also required making f2fs_drop_inode() call
fscrypt_drop_inode().
For more details see Documentation/filesystems/fscrypt.rst and the
fscrypt patches that added the implementation of these ioctls.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Wire up the new ioctls for adding and removing fscrypt keys to/from the
filesystem, and the new ioctl for retrieving v2 encryption policies.
The key removal ioctls also required making ext4_drop_inode() call
fscrypt_drop_inode().
For more details see Documentation/filesystems/fscrypt.rst and the
fscrypt patches that added the implementation of these ioctls.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
By looking up the master keys in a filesystem-level keyring rather than
in the calling processes' key hierarchy, it becomes possible for a user
to set an encryption policy which refers to some key they don't actually
know, then encrypt their files using that key. Cryptographically this
isn't much of a problem, but the semantics of this would be a bit weird.
Thus, enforce that a v2 encryption policy can only be set if the user
has previously added the key, or has capable(CAP_FOWNER).
We tolerate that this problem will continue to exist for v1 encryption
policies, however; there is no way around that.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a root-only variant of the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl which
removes all users' claims of the key, not just the current user's claim.
I.e., it always removes the key itself, no matter how many users have
added it.
This is useful for forcing a directory to be locked, without having to
figure out which user ID(s) the key was added under. This is planned to
be used by a command like 'sudo fscrypt lock DIR --all-users' in the
fscrypt userspace tool (http://github.com/google/fscrypt).
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Allow the FS_IOC_ADD_ENCRYPTION_KEY and FS_IOC_REMOVE_ENCRYPTION_KEY
ioctls to be used by non-root users to add and remove encryption keys
from the filesystem-level crypto keyrings, subject to limitations.
Motivation: while privileged fscrypt key management is sufficient for
some users (e.g. Android and Chromium OS, where a privileged process
manages all keys), the old API by design also allows non-root users to
set up and use encrypted directories, and we don't want to regress on
that. Especially, we don't want to force users to continue using the
old API, running into the visibility mismatch between files and keyrings
and being unable to "lock" encrypted directories.
Intuitively, the ioctls have to be privileged since they manipulate
filesystem-level state. However, it's actually safe to make them
unprivileged if we very carefully enforce some specific limitations.
First, each key must be identified by a cryptographic hash so that a
user can't add the wrong key for another user's files. For v2
encryption policies, we use the key_identifier for this. v1 policies
don't have this, so managing keys for them remains privileged.
Second, each key a user adds is charged to their quota for the keyrings
service. Thus, a user can't exhaust memory by adding a huge number of
keys. By default each non-root user is allowed up to 200 keys; this can
be changed using the existing sysctl 'kernel.keys.maxkeys'.
Third, if multiple users add the same key, we keep track of those users
of the key (of which there remains a single copy), and won't really
remove the key, i.e. "lock" the encrypted files, until all those users
have removed it. This prevents denial of service attacks that would be
possible under simpler schemes, such allowing the first user who added a
key to remove it -- since that could be a malicious user who has
compromised the key. Of course, encryption keys should be kept secret,
but the idea is that using encryption should never be *less* secure than
not using encryption, even if your key was compromised.
We tolerate that a user will be unable to really remove a key, i.e.
unable to "lock" their encrypted files, if another user has added the
same key. But in a sense, this is actually a good thing because it will
avoid providing a false notion of security where a key appears to have
been removed when actually it's still in memory, available to any
attacker who compromises the operating system kernel.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a new fscrypt policy version, "v2". It has the following changes
from the original policy version, which we call "v1" (*):
- Master keys (the user-provided encryption keys) are only ever used as
input to HKDF-SHA512. This is more flexible and less error-prone, and
it avoids the quirks and limitations of the AES-128-ECB based KDF.
Three classes of cryptographically isolated subkeys are defined:
- Per-file keys, like used in v1 policies except for the new KDF.
- Per-mode keys. These implement the semantics of the DIRECT_KEY
flag, which for v1 policies made the master key be used directly.
These are also planned to be used for inline encryption when
support for it is added.
- Key identifiers (see below).
- Each master key is identified by a 16-byte master_key_identifier,
which is derived from the key itself using HKDF-SHA512. This prevents
users from associating the wrong key with an encrypted file or
directory. This was easily possible with v1 policies, which
identified the key by an arbitrary 8-byte master_key_descriptor.
- The key must be provided in the filesystem-level keyring, not in a
process-subscribed keyring.
The following UAPI additions are made:
- The existing ioctl FS_IOC_SET_ENCRYPTION_POLICY can now be passed a
fscrypt_policy_v2 to set a v2 encryption policy. It's disambiguated
from fscrypt_policy/fscrypt_policy_v1 by the version code prefix.
- A new ioctl FS_IOC_GET_ENCRYPTION_POLICY_EX is added. It allows
getting the v1 or v2 encryption policy of an encrypted file or
directory. The existing FS_IOC_GET_ENCRYPTION_POLICY ioctl could not
be used because it did not have a way for userspace to indicate which
policy structure is expected. The new ioctl includes a size field, so
it is extensible to future fscrypt policy versions.
- The ioctls FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY,
and FS_IOC_GET_ENCRYPTION_KEY_STATUS now support managing keys for v2
encryption policies. Such keys are kept logically separate from keys
for v1 encryption policies, and are identified by 'identifier' rather
than by 'descriptor'. The 'identifier' need not be provided when
adding a key, since the kernel will calculate it anyway.
This patch temporarily keeps adding/removing v2 policy keys behind the
same permission check done for adding/removing v1 policy keys:
capable(CAP_SYS_ADMIN). However, the next patch will carefully take
advantage of the cryptographically secure master_key_identifier to allow
non-root users to add/remove v2 policy keys, thus providing a full
replacement for v1 policies.
(*) Actually, in the API fscrypt_policy::version is 0 while on-disk
fscrypt_context::format is 1. But I believe it makes the most sense
to advance both to '2' to have them be in sync, and to consider the
numbering to start at 1 except for the API quirk.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add an implementation of HKDF (RFC 5869) to fscrypt, for the purpose of
deriving additional key material from the fscrypt master keys for v2
encryption policies. HKDF is a key derivation function built on top of
HMAC. We choose SHA-512 for the underlying unkeyed hash, and use an
"hmac(sha512)" transform allocated from the crypto API.
We'll be using this to replace the AES-ECB based KDF currently used to
derive the per-file encryption keys. While the AES-ECB based KDF is
believed to meet the original security requirements, it is nonstandard
and has problems that don't exist in modern KDFs such as HKDF:
1. It's reversible. Given a derived key and nonce, an attacker can
easily compute the master key. This is okay if the master key and
derived keys are equally hard to compromise, but now we'd like to be
more robust against threats such as a derived key being compromised
through a timing attack, or a derived key for an in-use file being
compromised after the master key has already been removed.
2. It doesn't evenly distribute the entropy from the master key; each 16
input bytes only affects the corresponding 16 output bytes.
3. It isn't easily extensible to deriving other values or keys, such as
a public hash for securely identifying the key, or per-mode keys.
Per-mode keys will be immediately useful for Adiantum encryption, for
which fscrypt currently uses the master key directly, introducing
unnecessary usage constraints. Per-mode keys will also be useful for
hardware inline encryption, which is currently being worked on.
HKDF solves all the above problems.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a new fscrypt ioctl, FS_IOC_GET_ENCRYPTION_KEY_STATUS. Given a key
specified by 'struct fscrypt_key_specifier' (the same way a key is
specified for the other fscrypt key management ioctls), it returns
status information in a 'struct fscrypt_get_key_status_arg'.
The main motivation for this is that applications need to be able to
check whether an encrypted directory is "unlocked" or not, so that they
can add the key if it is not, and avoid adding the key (which may
involve prompting the user for a passphrase) if it already is.
It's possible to use some workarounds such as checking whether opening a
regular file fails with ENOKEY, or checking whether the filenames "look
like gibberish" or not. However, no workaround is usable in all cases.
Like the other key management ioctls, the keyrings syscalls may seem at
first to be a good fit for this. Unfortunately, they are not. Even if
we exposed the keyring ID of the ->s_master_keys keyring and gave
everyone Search permission on it (note: currently the keyrings
permission system would also allow everyone to "invalidate" the keyring
too), the fscrypt keys have an additional state that doesn't map cleanly
to the keyrings API: the secret can be removed, but we can be still
tracking the files that were using the key, and the removal can be
re-attempted or the secret added again.
After later patches, some applications will also need a way to determine
whether a key was added by the current user vs. by some other user.
Reserved fields are included in fscrypt_get_key_status_arg for this and
other future extensions.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a new fscrypt ioctl, FS_IOC_REMOVE_ENCRYPTION_KEY. This ioctl
removes an encryption key that was added by FS_IOC_ADD_ENCRYPTION_KEY.
It wipes the secret key itself, then "locks" the encrypted files and
directories that had been unlocked using that key -- implemented by
evicting the relevant dentries and inodes from the VFS caches.
The problem this solves is that many fscrypt users want the ability to
remove encryption keys, causing the corresponding encrypted directories
to appear "locked" (presented in ciphertext form) again. Moreover,
users want removing an encryption key to *really* remove it, in the
sense that the removed keys cannot be recovered even if kernel memory is
compromised, e.g. by the exploit of a kernel security vulnerability or
by a physical attack. This is desirable after a user logs out of the
system, for example. In many cases users even already assume this to be
the case and are surprised to hear when it's not.
It is not sufficient to simply unlink the master key from the keyring
(or to revoke or invalidate it), since the actual encryption transform
objects are still pinned in memory by their inodes. Therefore, to
really remove a key we must also evict the relevant inodes.
Currently one workaround is to run 'sync && echo 2 >
/proc/sys/vm/drop_caches'. But, that evicts all unused inodes in the
system rather than just the inodes associated with the key being
removed, causing severe performance problems. Moreover, it requires
root privileges, so regular users can't "lock" their encrypted files.
Another workaround, used in Chromium OS kernels, is to add a new
VFS-level ioctl FS_IOC_DROP_CACHE which is a more restricted version of
drop_caches that operates on a single super_block. It does:
shrink_dcache_sb(sb);
invalidate_inodes(sb, false);
But it's still a hack. Yet, the major users of filesystem encryption
want this feature badly enough that they are actually using these hacks.
To properly solve the problem, start maintaining a list of the inodes
which have been "unlocked" using each master key. Originally this
wasn't possible because the kernel didn't keep track of in-use master
keys at all. But, with the ->s_master_keys keyring it is now possible.
Then, add an ioctl FS_IOC_REMOVE_ENCRYPTION_KEY. It finds the specified
master key in ->s_master_keys, then wipes the secret key itself, which
prevents any additional inodes from being unlocked with the key. Then,
it syncs the filesystem and evicts the inodes in the key's list. The
normal inode eviction code will free and wipe the per-file keys (in
->i_crypt_info). Note that freeing ->i_crypt_info without evicting the
inodes was also considered, but would have been racy.
Some inodes may still be in use when a master key is removed, and we
can't simply revoke random file descriptors, mmap's, etc. Thus, the
ioctl simply skips in-use inodes, and returns -EBUSY to indicate that
some inodes weren't evicted. The master key *secret* is still removed,
but the fscrypt_master_key struct remains to keep track of the remaining
inodes. Userspace can then retry the ioctl to evict the remaining
inodes. Alternatively, if userspace adds the key again, the refreshed
secret will be associated with the existing list of inodes so they
remain correctly tracked for future key removals.
The ioctl doesn't wipe pagecache pages. Thus, we tolerate that after a
kernel compromise some portions of plaintext file contents may still be
recoverable from memory. This can be solved by enabling page poisoning
system-wide, which security conscious users may choose to do. But it's
very difficult to solve otherwise, e.g. note that plaintext file
contents may have been read in other places than pagecache pages.
Like FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY is
initially restricted to privileged users only. This is sufficient for
some use cases, but not all. A later patch will relax this restriction,
but it will require introducing key hashes, among other changes.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".
Why we need this
~~~~~~~~~~~~~~~~
The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.
Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions. But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic. This is because it depends on
whether the files are currently present in the inode cache.
Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work. Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons. First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.
Second, almost all users of fscrypt actually do need the keys to be
global. The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process. This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.
On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2]. This is ugly and raises security concerns. Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).
By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.
Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:
- Supporting key removal with the semantics such that the secret is
removed immediately and any unused inodes using the key are evicted;
also, the eviction of any in-use inodes can be retried.
- Calculating a key-dependent cryptographic identifier and returning it
to userspace.
- Allowing keys to be added and removed by non-root users, but only keys
for v2 encryption policies; and to prevent denial-of-service attacks,
users can only remove keys they themselves have added, and a key is
only really removed after all users who added it have removed it.
Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.
However, to reuse code the implementation still uses the keyrings
service internally. Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.
References:
[1] https://github.com/google/fscrypt
[2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
[3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
[4] https://github.com/google/fscrypt/issues/116
[5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
[6] https://github.com/google/fscrypt/issues/128
[7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Rename keyinfo.c to keysetup.c since this better describes what the file
does (sets up the key), and it matches the new file keysetup_v1.c.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
In preparation for introducing v2 encryption policies which will find
and derive encryption keys differently from the current v1 encryption
policies, move the v1 policy-specific key setup code from keyinfo.c into
keysetup_v1.c.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Do some more refactoring of the key setup code, in preparation for
introducing a filesystem-level keyring and v2 encryption policies:
- Now that ci_inode exists, don't pass around the inode unnecessarily.
- Define a function setup_file_encryption_key() which handles the crypto
key setup given an under-construction fscrypt_info. Don't pass the
fscrypt_context, since everything is in the fscrypt_info.
[This will be extended for v2 policies and the fs-level keyring.]
- Define a function fscrypt_set_derived_key() which sets the per-file
key, without depending on anything specific to v1 policies.
[This will also be used for v2 policies.]
- Define a function fscrypt_setup_v1_file_key() which takes the raw
master key, thus separating finding the key from using it.
[This will also be used if the key is found in the fs-level keyring.]
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
In preparation for introducing a filesystem-level keyring which will
contain fscrypt master keys, rename the existing 'struct
fscrypt_master_key' to 'struct fscrypt_direct_key'. This is the
structure in the existing table of master keys that's maintained to
deduplicate the crypto transforms for v1 DIRECT_KEY policies.
I've chosen to keep this table as-is rather than make it automagically
add/remove the keys to/from the filesystem-level keyring, since that
would add a lot of extra complexity to the filesystem-level keyring.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add an inode back-pointer to 'struct fscrypt_info', such that
inode->i_crypt_info->ci_inode == inode.
This will be useful for:
1. Evicting the inodes when a fscrypt key is removed, since we'll track
the inodes using a given key by linking their fscrypt_infos together,
rather than the inodes directly. This avoids bloating 'struct inode'
with a new list_head.
2. Simplifying the per-file key setup, since the inode pointer won't
have to be passed around everywhere just in case something goes wrong
and it's needed for fscrypt_warn().
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Update fs/crypto/ to use the new names for the UAPI constants rather
than the old names, then make the old definitions conditional on
!__KERNEL__.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Return ENOPKG rather than ENOENT when trying to open a file that's
encrypted using algorithms not available in the kernel's crypto API.
This avoids an ambiguity, since ENOENT is also returned when the file
doesn't exist.
Note: this is the same approach I'm taking for fs-verity.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Users of fscrypt with non-default algorithms will encounter an error
like the following if they fail to include the needed algorithms into
the crypto API when configuring the kernel (as per the documentation):
Error allocating 'adiantum(xchacha12,aes)' transform: -2
This requires that the user figure out what the "-2" error means.
Make it more friendly by printing a warning like the following instead:
Missing crypto API support for Adiantum (API name: "adiantum(xchacha12,aes)")
Also upgrade the log level for *other* errors to KERN_ERR.
Signed-off-by: Eric Biggers <ebiggers@google.com>
When fs/crypto/ encounters an inode with an invalid encryption context,
currently it prints a warning if the pair of encryption modes are
unrecognized, but it's silent if there are other problems such as
unsupported context size, format, or flags. To help people debug such
situations, add more warning messages.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Most of the warning and error messages in fs/crypto/ are for situations
related to a specific inode, not merely to a super_block. So to make
things easier, make fscrypt_msg() take an inode rather than a
super_block, and make it print the inode number.
Note: This is the same approach I'm taking for fsverity_msg().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Some minor cleanups for the code that base64 encodes and decodes
encrypted filenames and long name digests:
- Rename "digest_{encode,decode}()" => "base64_{encode,decode}()" since
they are used for filenames too, not just for long name digests.
- Replace 'while' loops with more conventional 'for' loops.
- Use 'u8' for binary data. Keep 'char' for string data.
- Fully constify the lookup table (pointer was not const).
- Improve comment.
No actual change in behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Since commit 643fa9612b ("fscrypt: remove filesystem specific build
config option"), fs/crypto/ can no longer be built as a loadable module.
Thus it no longer needs a module_exit function, nor a MODULE_LICENSE.
So remove them, and change module_init to late_initcall.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Zorro Lang reported a crash in generic/475 if we try to inactivate a
corrupt inode with a NULL attr fork (stack trace shortened somewhat):
RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs]
RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51
RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012
RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef
R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004
R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001
FS: 00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0
Call Trace:
xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs]
xfs_da_read_buf+0xf5/0x2c0 [xfs]
xfs_da3_node_read+0x1d/0x230 [xfs]
xfs_attr_inactive+0x3cc/0x5e0 [xfs]
xfs_inactive+0x4c8/0x5b0 [xfs]
xfs_fs_destroy_inode+0x31b/0x8e0 [xfs]
destroy_inode+0xbc/0x190
xfs_bulkstat_one_int+0xa8c/0x1200 [xfs]
xfs_bulkstat_one+0x16/0x20 [xfs]
xfs_bulkstat+0x6fa/0xf20 [xfs]
xfs_ioc_bulkstat+0x182/0x2b0 [xfs]
xfs_file_ioctl+0xee0/0x12a0 [xfs]
do_vfs_ioctl+0x193/0x1000
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x6f/0xb0
do_syscall_64+0x9f/0x4d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f11d39a3e5b
The "obvious" cause is that the attr ifork is null despite the inode
claiming an attr fork having at least one extent, but it's not so
obvious why we ended up with an inode in that state.
Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Continue our game of replacing ASSERTs for corrupt ondisk metadata with
EFSCORRUPTED returns.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
- Fix dax_layout_busy_page() to not discard private cow pages of fs/dax
private mappings.
- Update the memremap_pages core to properly cleanup on behalf of
internal reference-count users like device-dax.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdUGLxAAoJEB7SkWpmfYgC62IP/3aHwBbdedlXves4NQ0QhN5z
3jooOqpayfgkdPZp4U2XnNolsbKDME6h7m7Mn+GzzRSCfNGsdcnLHDWwXt4tTWE8
rvDNPat+22oWSVqnZOTb5GfnZKGmAbk5eC2HI9HT2VPf/BWMIoU6/QhvIzeLCEf7
g72XTEouitGgRk2Cn8Wi3+y+fvbMdur/0qofBH9rxfQEgTWiDtCJvxZ5KyH18hlt
qbvJR0CG2mxIxEbM1qx/1/HysXgs3UeTJVzHioF5SLdGcyQP14Djp0MCuqTGiB6l
aEgMYSJcca7nilJNMcc2gNEsNNuga6UDWaF52FJuAoy+3vs837iexP6L+VbthqTT
70vAvEOnDyzgKe/jll8INjzSc+RDCbMXoFdSmGXTt9KHVMbZ+taGOeEIaY+UkYKk
g1BAWiZZEedJZXZmltGPDOXuPdzmK1uMR13gjz/FS298ffmznsqfEAKkSuzxWsKH
vheQnQjbEQoRNk4uI/mjxz/XYCEgnVqXX/9OlQ3WrJHWdOtBLEfykAem2RqeDGc7
QF3EZ+IGCD4xPRJWbWXtJdUK7EtGTPrzyiKHEPnXsj8xYDo6oAYK+erychiSI7y5
9htSyCWNZ5ccE8hqewKD5qN1XNqX6XmmmxgMX4i2w9fSdWFgzAkWPw/xTvw6BkfC
Bn5b/9LcFIfTX9/oAlwh
=Se2X
-----END PGP SIGNATURE-----
Merge tag 'dax-fixes-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A filesystem-dax and device-dax fix for v5.3.
The filesystem-dax fix is tagged for stable as the implementation has
been mistakenly throwing away all cow pages on any truncate or hole
punch operation as part of the solution to coordinate device-dma vs
truncate to dax pages.
The device-dax change fixes up a regression this cycle from the
introduction of a common 'internal per-cpu-ref' implementation.
Summary:
- Fix dax_layout_busy_page() to not discard private cow pages of
fs/dax private mappings.
- Update the memremap_pages core to properly cleanup on behalf of
internal reference-count users like device-dax"
* tag 'dax-fixes-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
mm/memremap: Fix reuse of pgmap instances with internal references
dax: dax_layout_busy_page() should not unmap cow pages
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1NmNIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpouEEADA1e2HH9AO8QGWBVSeFi5ZdRnEXsJUfNGD
d576M9empGI/hkbui0yY2tuodnMwmhUj641GTRC2dzl7RRyGcTMmqtGYPcyczlpU
+6Yil3+XVNYTXRpUsKWs32H9aubNZl/L3rCcJImGvMgLW5YtEjAZJFIFIzXWWwDJ
aZpTtnOH1+D3HH6HT35xk+aytSYZ7LsZ7X3LI9ZumKOUd2HJZGUkfWLXxgSuuUh/
/WaBEQ9xzDNARmfx9Qb/2wSAE7XOInupPr86fI9dnmXHZ8rwhsvHRIZEvNIIqF5Z
KzbF+rGJZ2bizKFpVFdlrIIyfBleFlQGFzYnGrs/+47zAl/CGicqmuAhGcOdGQXd
wyj6G66iBcBIQC9hsYhPtglyQSk95tzsPZLZa7/1TlhEKl3mpor4w30NrLz/P5oy
gdIivDhKP7aRFzBWw/2O4TOn3HhGxnZWVni6icOHE/pBQLBW12Ulc1nT0SiJ8RAt
PYhMCFwz0Xc7pYuEfGZKwNSCIrx4i7f+spWEAt0Fget/KaB993zm+K4OzGbfBv6r
FrZ5g/2MTVBJITmJVN2LDysF4EVEOzTULX+bCQOOWC7wmE/poCyNWj8gCpunIZsY
V5DAB+cbEw/OLfpdiogXvCcxrbukKHV8AVdMdLYB5yEAuPN0JMqcunMjECxpNAha
1wsv86ljvw==
=Au30
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Revert of a bcache patch that caused an oops for some (Coly)
- ata rb532 unused warning fix (Gustavo)
- AoE kernel crash fix (He)
- Error handling fixup for blkdev_get() (Jan)
- libata read/write translation and SFF PIO fix (me)
- Use after free and error handling fix for O_DIRECT fragments. There's
still a nowait + sync oddity in there, we'll nail that start next
week. If all else fails, I'll queue a revert of the NOWAIT change.
(me)
- Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)
- Two BFQ regression fixes that caused crashes (Paolo)
* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
loop: set PF_MEMALLOC_NOIO for the worker thread
bdev: Fixup error handling in blkdev_get()
block, bfq: handle NULL return value by bfq_init_rq()
block, bfq: move update of waker and woken list to queue freeing
block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
block: aoe: Fix kernel crash due to atomic sleep when exiting
libata: add SG safety checks in SFF pio transfers
libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
block: fix O_DIRECT error handling for bio fragments
ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe
It turns out that the current version of gfs2_metadata_walker suffers
from multiple problems that can cause gfs2_hole_size to report an
incorrect size. This will confuse fiemap as well as lseek with the
SEEK_DATA flag.
Fix that by changing gfs2_hole_walker to compute the metapath to the
first data block after the hole (if any), and compute the hole size
based on that.
Fixes xfstest generic/490.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Highlights include:
Stable fixes:
- NFSv4: Ensure we check the return value of update_open_stateid() so we
correctly track active open state.
- NFSv4: Fix for delegation state recovery to ensure we recover all open
modes that are active.
- NFSv4: Fix an Oops in nfs4_do_setattr
Bugfixes:
- NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
- NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
- NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
- pNFS: Report errors from the call to nfs4_select_rw_stateid()
- NFSv4: Various other delegation and open stateid recovery fixes
- NFSv4: Fix state recovery behaviour when server connection times out
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl1LKhEACgkQZwvnipYK
APIAuw/9HnKwnJYKvkAv/Pg2eBQZAgwjchc/uPsfteSPr8PMFS889rsqvDoGrAI4
VjZRh7Jsp/FPAlLzZKCnLF/fxKE83qxgS3MP14of9IoRv2gznsW7jexy48AhU/5t
Ae4Wgu3GJJ0IIrr8hbrkJRkBUoYUMLguCNNaZC7LDLzEVQ0wNDAVpdsZ+gdnCcrw
zhnFnz72p2h95tfL5QkJ+OYrAu4ikdlSjx2oOdLsUGFEAnTehpUPd3DPDiCQbctx
zPHSGukj+8tsPJ+EUVuj7ouDJqTDyMFVe1eKRJWMIq22bUAM1GBtVVMw8uFXJi5i
9WFUJIezHhkh3Hdx82ptUmt3u1hRuSolaKDICeIR2Kob0gUArqk0KR7upgVMS3Fn
INm/c4Zsqsa1ABevQTLWqz+nVPUPRFGmEZfjvBwkmYlkKnqbjWxXQRkROt8UJS3O
3vfK1hUEIUyt4uI2yHusru5nIQ3pv/h1WAwpfuSQFw+nEvC6YcssECz8uOhKEnEr
UWnUxRP66YVL4L+VAsajzAArBDQ8cUU3bv6q0x2IEWA8CHHy0BH1MGM6cumVW8YQ
rwj0KqR4+aH0u5g8EpOkODAJfuxo10oJi6MNlr2+OSn60tdmi/P1K8eg4ERODqya
vGqgftjTfmViYfoaWpt2NDqAbKOu4wXovb6jWqbP2lUsh+ttojg=
=IXqc
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- NFSv4: Ensure we check the return value of update_open_stateid() so
we correctly track active open state.
- NFSv4: Fix for delegation state recovery to ensure we recover all
open modes that are active.
- NFSv4: Fix an Oops in nfs4_do_setattr
Fixes:
- NFS: Fix regression whereby fscache errors are appearing on 'nofsc'
mounts
- NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
- NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
- pNFS: Report errors from the call to nfs4_select_rw_stateid()
- NFSv4: Various other delegation and open stateid recovery fixes
- NFSv4: Fix state recovery behaviour when server connection times
out"
* tag 'nfs-for-5.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Ensure state recovery handles ETIMEDOUT correctly
NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
NFSv4: Fix an Oops in nfs4_do_setattr
NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
NFSv4: Check the return value of update_open_stateid()
NFSv4.1: Only reap expired delegations
NFSv4.1: Fix open stateid recovery
NFSv4: Report the error from nfs4_select_rw_stateid()
NFSv4: When recovering state fails with EAGAIN, retry the same recovery
NFSv4: Print an error in the syslog when state is marked as irrecoverable
NFSv4: Fix delegation state recovery
NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
Pull cifs fixes from Steve French:
"Six small SMB3 fixes, two for stable"
* tag '5.3-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Kernel oops mounting a encryptData share with CONFIG_DEBUG_VIRTUAL
smb3: update TODO list of missing features
smb3: send CAP_DFS capability during session setup
SMB3: Fix potential memory leak when processing compound chain
SMB3: Fix deadlock in validate negotiate hits reconnect
cifs: fix rmmod regression in cifs.ko caused by force_sig changes
Commit 89e524c04f ("loop: Fix mount(2) failure due to race with
LOOP_SET_FD") converted blkdev_get() to use the new helpers for
finishing claiming of a block device. However the conversion botched the
error handling in blkdev_get() and thus the bdev has been marked as held
even in case __blkdev_get() returned error. This led to occasional
warnings with block/001 test from blktests like:
kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
Correct the error handling.
CC: stable@vger.kernel.org
Fixes: 89e524c04f ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
0eb6ddfb86 tried to fix this up, but introduced a use-after-free
of dio. Additionally, we still had an issue with error handling,
as reported by Darrick:
"I noticed a regression in xfs/747 (an unreleased xfstest for the
xfs_scrub media scanning feature) on 5.3-rc3. I'll condense that down
to a simpler reproducer:
error-test: 0 209 linear 8:48 0
error-test: 209 1 error
error-test: 210 6446894 linear 8:48 210
Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
for sector 209 and to pass the io to the scsi device everywhere else.
On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
(in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
EIO like you'd expect:
pread64(3, 0x7f880e1c7000, 1048576, 0) = -1 EIO (Input/output error)
pread: Input/output error
+++ exited with 0 +++
But doing it with a larger buffer succeeds(!):
pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
read 1146880/1146880 bytes at offset 0
1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
+++ exited with 0 +++
(Note that the part of the buffer corresponding to the dm-error area is
uninitialized)
On 5.3-rc2, both commands would fail with EIO like you'd expect. The
only change between rc2 and rc3 is commit 0eb6ddfb86 ("block: Fix
__blkdev_direct_IO() for bio fragments").
AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
and a second one for the 96k after that."
Fix this by noting that it's always safe to dereference dio if we get
BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
we can safely increment the dio size before calling submit_bio(), and
then decrement it on failure (not that it really matters, as the bio
and dio are going away).
For error handling, return to the original method of just using 'ret'
for tracking the error, and the size tracking in dio->size.
Fixes: 0eb6ddfb86 ("block: Fix __blkdev_direct_IO() for bio fragments")
Fixes: 6a43074e2f ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that the state recovery code handles ETIMEDOUT correctly,
and also that we set RPC_TASK_TIMEOUT when recovering open state.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Normally the range->len is set to default value (U64_MAX), but when it's
not default value, we should check if the range overflows.
And if it overflows, return -EINVAL before doing anything.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the 5.3 merge window, commit 7c7e301406 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:
[116648.059212] kernfs: can not remove 'total_bytes', no directory
[116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
[116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
[116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.080585] Call Trace:
[116648.081262] remove_files+0x31/0x70
[116648.081929] sysfs_remove_group+0x38/0x80
[116648.082596] sysfs_remove_groups+0x34/0x70
[116648.083258] kobject_del+0x20/0x60
[116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.084608] close_ctree+0x19a/0x380 [btrfs]
[116648.085278] generic_shutdown_super+0x6c/0x110
[116648.085951] kill_anon_super+0xe/0x30
[116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.087289] deactivate_locked_super+0x3a/0x70
[116648.087956] cleanup_mnt+0xb4/0x160
[116648.088620] task_work_run+0x7e/0xc0
[116648.089285] exit_to_usermode_loop+0xfa/0x100
[116648.089933] do_syscall_64+0x1cb/0x220
[116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.091197] RIP: 0033:0x7f9cdc073b37
(...)
[116648.100046] ---[ end trace 22e24db328ccadf8 ]---
[116648.100618] ------------[ cut here ]------------
[116648.101175] kernfs: can not remove 'used_bytes', no directory
[116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
[116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
[116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.116649] Call Trace:
[116648.117326] remove_files+0x31/0x70
[116648.117997] sysfs_remove_group+0x38/0x80
[116648.118671] sysfs_remove_groups+0x34/0x70
[116648.119342] kobject_del+0x20/0x60
[116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.120707] close_ctree+0x19a/0x380 [btrfs]
[116648.121396] generic_shutdown_super+0x6c/0x110
[116648.122057] kill_anon_super+0xe/0x30
[116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.123335] deactivate_locked_super+0x3a/0x70
[116648.123961] cleanup_mnt+0xb4/0x160
[116648.124586] task_work_run+0x7e/0xc0
[116648.125210] exit_to_usermode_loop+0xfa/0x100
[116648.125830] do_syscall_64+0x1cb/0x220
[116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.127080] RIP: 0033:0x7f9cdc073b37
(...)
[116648.135923] ---[ end trace 22e24db328ccadf9 ]---
These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().
We have this split raid kobject setup since commit 75cb379d26
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:
1) During device replace (or scrub) after adding a device a to the
filesystem. The replace procedure (and scrub) do calls to
btrfs_inc_block_group_ro() which can allocate a new block group
with a new raid profile (because we now have more devices). This
can be triggered by test cases btrfs/027 and btrfs/176.
2) During a degraded mount trough any write path. This can be triggered
by test case btrfs/124.
Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:
1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
anywhere in the replace/scrub path will cause a deadlock with reclaim
because if reclaim happens and a transaction commit is triggered,
the transaction commit path will block at btrfs_scrub_pause().
2) During degraded mounts it is essentially impossible to figure out where
to add extra calls to btrfs_add_raid_kobjects(), because allocation of
a block group with a new raid profile can happen anywhere, which means
we can't safely figure out which contextes are safe for reclaim, as
we can either hold a transaction handle or some lock needed by the
transaction commit path.
So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().
Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
Fixes: 75cb379d26 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull networking fixes from David Miller:
"Yeah I should have sent a pull request last week, so there is a lot
more here than usual:
1) Fix memory leak in ebtables compat code, from Wenwen Wang.
2) Several kTLS bug fixes from Jakub Kicinski (circular close on
disconnect etc.)
3) Force slave speed check on link state recovery in bonding 802.3ad
mode, from Thomas Falcon.
4) Clear RX descriptor bits before assigning buffers to them in
stmmac, from Jose Abreu.
5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF
loops, from Nishka Dasgupta.
6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean.
7) Need to hold sock across skb->destructor invocation, from Cong
Wang.
8) IP header length needs to be validated in ipip tunnel xmit, from
Haishuang Yan.
9) Use after free in ip6 tunnel driver, also from Haishuang Yan.
10) Do not use MSI interrupts on r8169 chips before RTL8168d, from
Heiner Kallweit.
11) Upon bridge device init failure, we need to delete the local fdb.
From Nikolay Aleksandrov.
12) Handle erros from of_get_mac_address() properly in stmmac, from
Martin Blumenstingl.
13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef
Kadlecsik.
14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with
some devices, so revert. From Johannes Berg.
15) Fix deadlock in rxrpc, from David Howells.
16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue
Haibing.
17) Fix mvpp2 crash on module removal, from Matteo Croce.
18) Fix race in genphy_update_link, from Heiner Kallweit.
19) bpf_xdp_adjust_head() stopped working with generic XDP when we
fixes generic XDP to support stacked devices properly, fix from
Jesper Dangaard Brouer.
20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from
David Ahern.
21) Several memory leaks in new sja1105 driver, from Vladimir Oltean"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits)
net: dsa: sja1105: Fix memory leak on meta state machine error path
net: dsa: sja1105: Fix memory leak on meta state machine normal path
net: dsa: sja1105: Really fix panic on unregistering PTP clock
net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
net: dsa: sja1105: Fix broken learning with vlan_filtering disabled
net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus()
net: sched: sample: allow accessing psample_group with rtnl
net: sched: police: allow accessing police->params with rtnl
net: hisilicon: Fix dma_map_single failed on arm64
net: hisilicon: fix hip04-xmit never return TX_BUSY
net: hisilicon: make hip04_tx_reclaim non-reentrant
tc-testing: updated vlan action tests with batch create/delete
net sched: update vlan action for batched events operations
net: stmmac: tc: Do not return a fragment entry
net: stmmac: Fix issues when number of Queues >= 4
net: stmmac: xgmac: Fix XGMAC selftests
be2net: disable bh with spin_lock in be_process_mcc
net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs
net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
...
Fix kernel oops when mounting a encryptData CIFS share with
CONFIG_DEBUG_VIRTUAL
Signed-off-by: Sebastien Tisserant <stisserant@wallix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We had a report of a server which did not do a DFS referral
because the session setup Capabilities field was set to 0
(unlike negotiate protocol where we set CAP_DFS). Better to
send it session setup in the capabilities as well (this also
more closely matches Windows client behavior).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
When a reconnect happens in the middle of processing a compound chain
the code leaks a buffer from the memory pool. Fix this by properly
checking for a return code and freeing buffers in case of error.
Also maintain a buf variable to be equal to either smallbuf or bigbuf
depending on a response buffer size while parsing a chain and when
returning to the caller.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently we skip SMB2_TREE_CONNECT command when checking during
reconnect because Tree Connect happens when establishing
an SMB session. For SMB 3.0 protocol version the code also calls
validate negotiate which results in SMB2_IOCL command being sent
over the wire. This may deadlock on trying to acquire a mutex when
checking for reconnect. Fix this by skipping SMB2_IOCL command
when doing the reconnect check.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Vivek:
"As of now dax_layout_busy_page() calls unmap_mapping_range() with last
argument as 1, which says even unmap cow pages. I am wondering who needs
to get rid of cow pages as well.
I noticed one interesting side affect of this. I mount xfs with -o dax and
mmaped a file with MAP_PRIVATE and wrote some data to a page which created
cow page. Then I called fallocate() on that file to zero a page of file.
fallocate() called dax_layout_busy_page() which unmapped cow pages as well
and then I tried to read back the data I wrote and what I get is old
data from persistent memory. I lost the data I had written. This
read basically resulted in new fault and read back the data from
persistent memory.
This sounds wrong. Are there any users which need to unmap cow pages
as well? If not, I am proposing changing it to not unmap cow pages.
I noticed this while while writing virtio_fs code where when I tried
to reclaim a memory range and that corrupted the executable and I
was running from virtio-fs and program got segment violation."
Dan:
"In fact the unmap_mapping_range() in this path is only to synchronize
against get_user_pages_fast() and force it to call back into the
filesystem to re-establish the mapping. COW pages should be left
untouched by dax_layout_busy_page()."
Cc: <stable@vger.kernel.org>
Fixes: 5fac7408d8 ("mm, fs, dax: handle layout changes to pinned dax mappings")
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20190802192956.GA3032@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 72abe3bcf0 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
The global change from force_sig caused module unloading of cifs.ko
to fail (since the cifsd process could not be killed, "rmmod cifs"
now would always fail)
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
People are reporing seeing fscache errors being reported concerning
duplicate cookies even in cases where they are not setting up fscache
at all. The rule needs to be that if fscache is not enabled, then it
should have no side effects at all.
To ensure this is the case, we disable fscache completely on all superblocks
for which the 'fsc' mount option was not set. In order to avoid issues
with '-oremount', we also disable the ability to turn fscache on via
remount.
Fixes: f1fe29b4a0 ("NFS: Use i_writecount to control whether...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200145
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Steve Dickson <steved@redhat.com>
Cc: David Howells <dhowells@redhat.com>
If the user specifies an open mode of 3, then we don't have a NFSv4 state
attached to the context, and so we Oops when we try to dereference it.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 29b59f9416 ("NFSv4: change nfs4_do_setattr to take...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.10: 991eedb137: NFSv4: Only pass the...
Cc: stable@vger.kernel.org # v4.10+
John Hubbard reports seeing the following stack trace:
nfs4_do_reclaim
rcu_read_lock /* we are now in_atomic() and must not sleep */
nfs4_purge_state_owners
nfs4_free_state_owner
nfs4_destroy_seqid_counter
rpc_destroy_wait_queue
cancel_delayed_work_sync
__cancel_work_timer
__flush_work
start_flush_work
might_sleep:
(kernel/workqueue.c:2975: BUG)
The solution is to separate out the freeing of the state owners
from nfs4_purge_state_owners(), and perform that outside the atomic
context.
Reported-by: John Hubbard <jhubbard@nvidia.com>
Fixes: 0aaaf5c424 ("NFS: Cache state owners after files are closed")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that we always check the return value of update_open_stateid()
so that we can retry if the update of local state failed. This fixes
infinite looping on state recovery.
Fixes: e23008ec81 ("NFSv4 reduce attribute requests for open reclaim")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v3.7+
Fix nfs_reap_expired_delegations() to ensure that we only reap delegations
that are actually expired, rather than triggering on random errors.
Fixes: 45870d6909 ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The logic for checking in nfs41_check_open_stateid() whether the state
is supported by a delegation is inverted. In addition, it makes more
sense to perform that check before we check for expired locks.
Fixes: 8a64c4ef10 ("NFSv4.1: Even if the stateid is OK,...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In pnfs_update_layout() ensure that we do report any fatal errors from
nfs4_select_rw_stateid().
Fixes: d9aba2b40d ("NFSv4: Don't use the zero stateid with layoutget")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the server returns with EAGAIN when we're trying to recover from
a server reboot, we currently delay for 1 second, but then mark the
stateid as needing recovery after the grace period has expired.
Instead, we should just retry the same recovery process immediately
after the 1 second delay. Break out of the loop after 10 retries.
Fixes: 35a61606a6 ("NFS: Reduce indentation of the switch statement...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When error recovery fails due to a fatal error on the server, ensure
we log it in the syslog.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Once we clear the NFS_DELEGATED_STATE flag, we're telling
nfs_delegation_claim_opens() that we're done recovering all open state
for that stateid, so we really need to ensure that we test for all
open modes that are currently cached and recover them before exiting
nfs4_open_delegation_recall().
Fixes: 24311f8841 ("NFSv4: Recovery of recalled read delegations...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.3+
It is unsafe to dereference delegation outside the rcu lock, and in
any case, the refcount is guaranteed held if cred is non-zero.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull xfs fixes from Darrick Wong:
- Avoid leaking kernel stack contents to userspace
- Fix a potential null pointer dereference in the dabtree scrub code
* tag 'xfs-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix possible null-pointer dereferences in xchk_da_btree_block_check_sibling()
xfs: fix stack contents leakage in the v1 inumber ioctls
When the system is close-to-OOM, fsync() may fail due to -ENOMEM because
xfs_log_reserve() is using KM_MAYFAIL. It is a bad thing to fail writeback
operation due to user-triggerable OOM condition. Since we are not using
KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(), let's
use the same flags at xfs_log_reserve().
oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
Call Trace:
dump_stack+0x67/0x95
warn_alloc+0xa9/0x140
__alloc_pages_slowpath+0x9a8/0xbce
__alloc_pages_nodemask+0x372/0x3b0
alloc_slab_page+0x3a/0x8d0
new_slab+0x330/0x420
___slab_alloc.constprop.94+0x879/0xb00
__slab_alloc.isra.89.constprop.93+0x43/0x6f
kmem_cache_alloc+0x331/0x390
kmem_zone_alloc+0x9f/0x110 [xfs]
kmem_zone_alloc+0x9f/0x110 [xfs]
xlog_ticket_alloc+0x33/0xd0 [xfs]
xfs_log_reserve+0xb4/0x410 [xfs]
xfs_trans_reserve+0x1d1/0x2b0 [xfs]
xfs_trans_alloc+0xc9/0x250 [xfs]
xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
xfs_vm_writepages+0x76/0xa0 [xfs]
do_writepages+0x17/0x80
__filemap_fdatawrite_range+0xc1/0xf0
file_write_and_wait_range+0x53/0xa0
xfs_file_fsync+0x87/0x290 [xfs]
vfs_fsync_range+0x37/0x80
do_fsync+0x38/0x60
__x64_sys_fsync+0xf/0x20
do_syscall_64+0x4a/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: eb01c9cd87 ("[XFS] Remove the xlog_ticket allocator")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers/acpi/scan.c: document why we don't need the device_hotplug_lock
memremap: move from kernel/ to mm/
lib/test_meminit.c: use GFP_ATOMIC in RCU critical section
asm-generic: fix -Wtype-limits compiler warnings
cgroup: kselftest: relax fs_spec checks
mm/memory_hotplug.c: remove unneeded return for void function
mm/migrate.c: initialize pud_entry in migrate_vma()
coredump: split pipe command whitespace before expanding template
page flags: prioritize kasan bits over last-cpuid
ubsan: build ubsan.c more conservatively
kasan: remove clang version check for KASAN_STACK
mm: compaction: avoid 100% CPU usage during compaction when a task is killed
mm: migrate: fix reference check race between __find_get_block() and migration
mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker
ocfs2: remove set but not used variable 'last_hash'
Revert "kmemleak: allow to coexist with fault injection"
kernel/signal.c: fix a kernel-doc markup
Save the offsets of the start of each argument to avoid having to update
pointers to each argument after every corename krealloc and to avoid
having to duplicate the memory for the dump command.
Executable names containing spaces were previously being expanded from
%e or %E and then split in the middle of the filename. This is
incorrect behaviour since an argument list can represent arguments with
spaces.
The splitting could lead to extra arguments being passed to the core
dump handler that it might have interpreted as options or ignored
completely.
Core dump handlers that are not aware of this Linux kernel issue will be
using %e or %E without considering that it may be split and so they will
be vulnerable to processes with spaces in their names breaking their
argument list. If their internals are otherwise well written, such as
if they are written in shell but quote arguments, they will work better
after this change than before. If they are not well written, then there
is a slight chance of breakage depending on the details of the code but
they will already be fairly broken by the split filenames.
Core dump handlers that are aware of this Linux kernel issue will be
placing %e or %E as the last item in their core_pattern and then
aggregating all of the remaining arguments into one, separated by
spaces. Alternatively they will be obtaining the filename via other
methods. Both of these will be compatible with the new arrangement.
A side effect from this change is that unknown template types (for
example %z) result in an empty argument to the dump handler instead of
the argument being dropped. This is a desired change as:
It is easier for dump handlers to process empty arguments than dropped
ones, especially if they are written in shell or don't pass each
template item with a preceding command-line option in order to
differentiate between individual template types. Most core_patterns in
the wild do not use options so they can confuse different template types
(especially numeric ones) if an earlier one gets dropped in old kernels.
If the kernel introduces a new template type and a core_pattern uses it,
the core dump handler might not expect that the argument can be dropped
in old kernels.
For example, this can result in security issues when %d is dropped in
old kernels. This happened with the corekeeper package in Debian and
resulted in the interface between corekeeper and Linux having to be
rewritten to use command-line options to differentiate between template
types.
The core_pattern for most core dump handlers is written by the handler
author who would generally not insert unknown template types so this
change should be compatible with all the core dump handlers that exist.
Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net
Fixes: 74aadce986 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe")
Signed-off-by: Paul Wise <pabs3@bonedaddy.net>
Reported-by: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
Reported-by: Paul Wise <pabs3@bonedaddy.net> [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/]
Suggested-by: Jakub Wilk <jwilk@jwilk.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find:
fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable]
It's never used and can be removed.
Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1ERCMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjr7D/0U8SMu1T9JOge91zXQQUc7XtCX9RvHYhhj
vbwwN9RwpIfrTwuLZUCvt2vEz8WPOVfZbwYGkfFcdI+N5I/dOfT8Swiwy7Zabpi2
KTedn2EdELTizEuWQ3QhaBHWuTGvE04aAzZTBRCQ0tCOYTPpXGRavxhG6UHcQi+z
lohB5Pr/cyX8/jWJj4kq7381QYUUH2bm9uY7qutBsQOt2CsN5prjWxX3JM6EO1wb
VyyI25fWLaS+bZW+crVutcARxccuav4e+LEJbb9Z7+19vjmkc2qE+22F3MBxYCzo
tOjU0RP0IvvVR9t0Hahw/3MnDTDfuSqlqrT12zNtn7FrzOKpkygMyRa+u8YygI6k
2iAp92HkNWpjBxUFNGoRCRfJpApG3vT6/VkI8tixFSw/Re3F1H9Bc9IRZxc3uU4H
5DMRmjZXGg+8Nw+93XzwWnD1paCJcDsHRHUpWFNJvRfJYQzDaziPUBV9a9TZ+HMF
BnCJBCW641tcA5yCRwBF6OpoowtmxOtWce7Lr9wAjU+cYHMEzOQoG+J6gPH3q8Jh
aD9U2FcnE6kReL+MsGj42q1U1n60xngcdzo8Ca4bWfWNpqb4lJatjumkDAiI6U4q
DFDs9bRbB4LLgwkRQ+n1biwAK626KJOp5lGXrEu7XHXSTlO/BiJytISwASjlzKsZ
4uGHc/uUdA==
=P5E/
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190802' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here's a small collection of fixes that should go into this series.
This contains:
- io_uring potential use-after-free fix (Jackie)
- loop regression fix (Jan)
- O_DIRECT fragmented bio regression fix (Damien)
- Mark Denis as the new floppy maintainer (Denis)
- ataflop switch fall-through annotation (Gustavo)
- libata zpodd overflow fix (Kees)
- libata ahci deferred probe fix (Miquel)
- nbd invalidation BUG_ON() fix (Munehisa)
- dasd endless loop fix (Stefan)"
* tag 'for-linus-20190802' of git://git.kernel.dk/linux-block:
s390/dasd: fix endless loop after read unit address configuration
block: Fix __blkdev_direct_IO() for bio fragments
MAINTAINERS: floppy: take over maintainership
nbd: replace kill_bdev() with __invalidate_device() again
ata: libahci: do not complain in case of deferred probe
io_uring: fix KASAN use after free in io_sq_wq_submit_work
loop: Fix mount(2) failure due to race with LOOP_SET_FD
libata: zpodd: Fix small read overflow in zpodd_get_mech_type()
ataflop: Mark expected switch fall-through
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1ETYkACgkQxWXV+ddt
WDupLA/+LXPJvxRG/Ob655sxOAqpG4+XUW7aaI5jSnlr958TtDlNFkpzrQ28NFzk
XlEe/HjgI6CH7DhNwe1IlHLqFt874fiCeIZ2jvuSVVbPAO+9T8a5BIbeUY6V4h5v
QUmbaUpVshsh2IDArU8Vc/UjycCuwrccfGkS5ydTVI6Eni4BWsOY2PMiB5ojdOTE
bpclws3ca/CMbNpGKnn3cH5aXJpENTXedXxBbp99DproZVY3YL1ZyDxc37R6+5Ua
kbZznSMZFoLjHt3vktHmiozF7W1Zdi6ExOp9NHApshlw/n08ICkY1oNfd9KSYgQs
8gTl9mZ8jbJnSfZTuHNL6LwfSZjDyT5LCTbXSj0UnaVOYwJVwopsDrrdO6AESd8S
1eFeEFQ62lEV+31xa0nsYBz5j6ejEfcvxiq14W8LawWpkxtAIsEyXockWtmXPods
29BYkstQADlCOu2WYdQ8ugseBtfEpEAvD+Rf+qqyZ6XlHMIjL1J1S+BIXQ868z1f
KZCVsepJ0dVm0siVNMFBUaM2z1PcHusxNgYd37YcJRX2t4Gql9Ku/00PJ+tRhsBa
80a5ue9p6nuZfR6XoM6Cpe3uG8qyFyPccC3ohwKEOAnHnmk2LEJYASudV4pk/P8G
Vy3o9Uv/NMMznop24r7UB3mklBQCIchKpaij3RT0Jx+3RGdR4Yo=
=jr4T
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- tiny race window during 2 transactions aborting at the same time can
accidentally lead to a commit
- regression fix, possible deadlock during fiemap
- fix for an old bug when incremental send can fail on a file that has
been deduplicated in a special way
* tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock between fiemap and transaction commits
Btrfs: fix race leading to fs corruption after transaction abort
Btrfs: fix incremental send failure after deduplication
The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
(patch 6a43074e2f) introduced two problems with BIO fragment handling
for direct IOs:
1) The dio size processed is calculated by incrementing the ret variable
by the size of the bio fragment issued for the dio. However, this size
is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
which may result in referencing the bi_size value after the bio
completed, resulting in an incorrect value use.
2) The ret variable is not incremented by the size of the last bio
fragment issued for the bio, leading to an invalid IO size being
returned to the user.
Fix both problem by using dio->size (which is incremented before the bio
submission) to update the value of ret after bio submissions, including
for the last bio fragment issued.
Fixes: 6a43074e2f ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.
[ tglx: Split out of combo patch ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.600085866@linutronix.de
Those are due to recent changes. Most of the issues
can be automatically fixed with:
$ ./scripts/documentation-file-ref-check --fix
The only exception was the sound binding with required
manual work.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This file has its own proper style, except that, after a while,
the coding style gets violated and whitespaces are placed on
different ways.
As Sphinx and ReST are very sentitive to whitespace differences,
I had to opt if each entry after required/mandatory/... fields
should start with zero spaces or with a tab. I opted to start them
all from the zero position, in order to avoid needing to break lines
with more than 80 columns, with would make harder for review.
Most of the other changes at porting.rst were made to use an unified
notation with works nice as a text file while also produce a good html
output after being parsed.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
There are 3 remaining files without an extension inside the fs docs
dir.
Manually convert them to ReST.
In the case of the nfs/exporting.rst file, as the nfs docs
aren't ported yet, I opted to convert and add a :orphan: there,
with should be removed when it gets added into a nfs-specific
part of the fs documentation.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
With the recent iomap write page reclaim deadlock fix, it turns out that the
GLF_DIRTY flag isn't always set when it needs to be anymore: previously, this
happened as a side effect of always adding the inode buffer head to the current
transaction with gfs2_trans_add_meta, but this isn't happening consistently
anymore. Fix by removing an additional unnecessary gfs2_trans_add_meta call
and by setting the GLF_DIRTY flag in gfs2_iomap_end.
(The GLF_DIRTY flag causes inode_go_sync to flush the transaction log when
syncing out the glock of that inode. When the flag isn't set, inode_go_sync
will skip inodes, including ones with an i_state of I_DIRTY_PAGES, which will
lead to cluster incoherency.)
In addition, in gfs2_iomap_page_done, if the metadata has changed, mark the
inode as I_DIRTY_DATASYNC to have the inode added to the current transaction:
we don't expect metadata to change here, but let's err on the safe side.
Fixes: d0a22a4b03 ("gfs2: Fix iomap write page reclaim deadlock");
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In "consolidate the capability checks in sget_{fc,userns}())" the
wrong argument had been passed to mount_capable() by sget_fc().
That mistake had been further obscured later, when switching
mount_capable() to fs_context has moved the calculation of
bogus argument from sget_fc() to mount_capable() itself. It
should've been fc->user_ns all along.
Screwed-up-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Christian Brauner <christian@brauner.io>
Tested-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull dax fix from Dan Williams:
"Fix a botched manual patch update that got dropped between testing and
application"
* 'dax-fix-5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix missed wakeup in put_unlocked_entry()
Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
linux-2.5.69 along with hundreds of other commands, but was always broken
sincen only the structure is compatible, but the command number is not,
due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
sockaddr_pppox)), which is different on 64-bit architectures.
Guillaume Nault adds:
And the implementation was broken until 2016 (see 29e73269aa ("pppoe:
fix reference counting in PPPoE proxy")), and nobody ever noticed. I
should probably have removed this ioctl entirely instead of fixing it.
Clearly, it has never been used.
Fix it by adding a compat_ioctl handler for all pppoe variants that
translates the command number and then calls the regular ioctl function.
All other ioctl commands handled by pppoe are compatible between 32-bit
and 64-bit, and require compat_ptr() conversion.
This should apply to all stable kernels.
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This set of patches adjust to follow recent setflags changes and fix two
regression introduced since 5.4-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl1AgaYACgkQQBSofoJI
UNJzPA/+NXLuNcjdP2HdiTj2fZc5saFDZkMKmjd6559rsQI9799qGIqZbhoynV15
sWNToo04LL1nNSTj/+/Kemwx3jnKUst4HCeUodb88Fey/phzbX4Di59aOv9cb5nT
g1Fozf7pt4TI67/zOxrP32s5hAUe1c7kkdp4XC8lK4Sv7KqXxvw6dz7QOWCaWd/z
DfEww1miNL0zRKSHUxj+74nLU6Q3Ot9/lEg4Af+8hJfURuCtkqck3yUXuyYnPmA0
4E15IeayqUEwqLHsGb3ysWsTQz1gRPRJncMz9FHsaW2pPIGiQtvJMzlhmo8vdrnP
8vRXGL9U/kuzod/MR0nOl16Wmq4u75IbK4FGEHF032An0bzYspwej5GDWbD7fiIW
gjGF6MoSmtYdngSxED2Cf81MM5oCKGMMxkMc9C+NFIMR9UNtagTMjzLtEICP/Q0h
7u6A5vqbND7Oh8O2b+Xk4kvVtJ2Czu0Jq7p6kdfCPTXV3oJR9Wq7fdXyGBtm6AWx
sjbt6d0RvuR8G+4jPI8Wa8gZ9cPJIQW+Iw917Q/mDqCfK7mXarrpy8tsNzNduLzl
80XjIA1BknNPFOHKPndSAhkXrpb269nPVfdtPTv0GAris3DANG71tKMp9pijFOo0
rn6SKzVVmYo41FWgYpQc2b5V8uQZXK/LMY4TF6bgRHAfmPPgOxk=
=TSfD
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
"This set of patches adjust to follow recent setflags changes and fix
two regressions"
* tag 'f2fs-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: use EINVAL for superblock with invalid magic
f2fs: fix to read source block before invalidating it
f2fs: remove redundant check from f2fs_setflags_common()
f2fs: use generic checking function for FS_IOC_FSSETXATTR
f2fs: use generic checking and prep function for FS_IOC_SETFLAGS
Commit 33ec3e53e7 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:
for i in {a..z}{a..z}; do
dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
mkfs.ext2 $i.image
mkdir mnt$i
done
echo "Run"
for i in {a..z}{a..z}; do
mount -o loop -t ext2 $i.image mnt$i &
done
Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.
Fixes: 33ec3e53e7 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In xchk_da_btree_block_check_sibling(), there is an if statement on
line 274 to check whether ds->state->altpath.blk[level].bp is NULL:
if (ds->state->altpath.blk[level].bp)
When ds->state->altpath.blk[level].bp is NULL, it is used on line 281:
xfs_trans_brelse(..., ds->state->altpath.blk[level].bp);
struct xfs_buf_log_item *bip = bp->b_log_item;
ASSERT(bp->b_transp == tp);
Thus, possible null-pointer dereferences may occur.
To fix these bugs, ds->state->altpath.blk[level].bp is checked before
being used.
These bugs are found by a static analysis tool STCheck written by us.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The fiemap handler locks a file range that can have unflushed delalloc,
and after locking the range, it tries to attach to a running transaction.
If the running transaction started its commit, that is, it is in state
TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
flushoncommit option or the transaction is creating a snapshot for the
subvolume that contains the file that fiemap is operating on, we end up
deadlocking. This happens because fiemap is blocked on the transaction,
waiting for it to complete, and the transaction is waiting for the flushed
dealloc to complete, which requires locking the file range that the fiemap
task already locked. The following stack traces serve as an example of
when this deadlock happens:
(...)
[404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[404571.515956] Call Trace:
[404571.516360] ? __schedule+0x3ae/0x7b0
[404571.516730] schedule+0x3a/0xb0
[404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs]
[404571.517465] ? remove_wait_queue+0x60/0x60
[404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs]
[404571.518202] normal_work_helper+0xea/0x530 [btrfs]
[404571.518566] process_one_work+0x21e/0x5c0
[404571.518990] worker_thread+0x4f/0x3b0
[404571.519413] ? process_one_work+0x5c0/0x5c0
[404571.519829] kthread+0x103/0x140
[404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70
[404571.520565] ret_from_fork+0x3a/0x50
[404571.520915] kworker/u8:6 D 0 31651 2 0x80004000
[404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
(...)
[404571.537000] fsstress D 0 13117 13115 0x00004000
[404571.537263] Call Trace:
[404571.537524] ? __schedule+0x3ae/0x7b0
[404571.537788] schedule+0x3a/0xb0
[404571.538066] wait_current_trans+0xc8/0x100 [btrfs]
[404571.538349] ? remove_wait_queue+0x60/0x60
[404571.538680] start_transaction+0x33c/0x500 [btrfs]
[404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs]
[404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs]
[404571.539866] extent_fiemap+0x2ce/0x650 [btrfs]
[404571.540170] do_vfs_ioctl+0x526/0x6f0
[404571.540436] ksys_ioctl+0x70/0x80
[404571.540734] __x64_sys_ioctl+0x16/0x20
[404571.540997] do_syscall_64+0x60/0x1d0
[404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[404571.543729] btrfs D 0 14210 14208 0x00004000
[404571.544023] Call Trace:
[404571.544275] ? __schedule+0x3ae/0x7b0
[404571.544526] ? wait_for_completion+0x112/0x1a0
[404571.544795] schedule+0x3a/0xb0
[404571.545064] schedule_timeout+0x1ff/0x390
[404571.545351] ? lock_acquire+0xa6/0x190
[404571.545638] ? wait_for_completion+0x49/0x1a0
[404571.545890] ? wait_for_completion+0x112/0x1a0
[404571.546228] wait_for_completion+0x131/0x1a0
[404571.546503] ? wake_up_q+0x70/0x70
[404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
[404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
[404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
[404571.547703] ? remove_wait_queue+0x60/0x60
[404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs]
[404571.548226] ? __sb_start_write+0xd4/0x1c0
[404571.548512] ? mnt_want_write_file+0x24/0x50
[404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
[404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
[404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs]
[404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0
[404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0
[404571.550064] ? __handle_mm_fault+0xe3e/0x11f0
[404571.550306] ? do_raw_spin_unlock+0x49/0xc0
[404571.550608] ? _raw_spin_unlock+0x24/0x30
[404571.550976] ? __handle_mm_fault+0xedf/0x11f0
[404571.551319] ? do_vfs_ioctl+0xa2/0x6f0
[404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[404571.552087] do_vfs_ioctl+0xa2/0x6f0
[404571.552355] ksys_ioctl+0x70/0x80
[404571.552621] __x64_sys_ioctl+0x16/0x20
[404571.552864] do_syscall_64+0x60/0x1d0
[404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
If we were joining the transaction instead of attaching to it, we would
not risk a deadlock because a join only blocks if the transaction is in a
state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
flush performed by a transaction is done before it reaches that state,
when it is in the state TRANS_STATE_COMMIT_START. However a transaction
join is intended for use cases where we do modify the filesystem, and
fiemap only needs to peek at delayed references from the current
transaction in order to determine if extents are shared, and, besides
that, when there is no current transaction or when it blocks to wait for
a current committing transaction to complete, it creates a new transaction
without reserving any space. Such unnecessary transactions, besides doing
unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
rotation of the precious backup roots.
So fix this by adding a new transaction join variant, named join_nostart,
which behaves like the regular join, but it does not create a transaction
when none currently exists or after waiting for a committing transaction
to complete.
Fixes: 03628cdbc6 ("Btrfs: do not start a transaction during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When one transaction is finishing its commit, it is possible for another
transaction to start and enter its initial commit phase as well. If the
first ends up getting aborted, we have a small time window where the second
transaction commit does not notice that the previous transaction aborted
and ends up committing, writing a superblock that points to btrees that
reference extent buffers (nodes and leafs) that were not persisted to disk.
The consequence is that after mounting the filesystem again, we will be
unable to load some btree nodes/leafs, either because the content on disk
is either garbage (or just zeroes) or corresponds to the old content of a
previouly COWed or deleted node/leaf, resulting in the well known error
messages "parent transid verify failed on ...".
The following sequence diagram illustrates how this can happen.
CPU 1 CPU 2
<at transaction N>
btrfs_commit_transaction()
(...)
--> sets transaction state to
TRANS_STATE_UNBLOCKED
--> sets fs_info->running_transaction
to NULL
(...)
btrfs_start_transaction()
start_transaction()
wait_current_trans()
--> returns immediately
because
fs_info->running_transaction
is NULL
join_transaction()
--> creates transaction N + 1
--> sets
fs_info->running_transaction
to transaction N + 1
--> adds transaction N + 1 to
the fs_info->trans_list list
--> returns transaction handle
pointing to the new
transaction N + 1
(...)
btrfs_sync_file()
btrfs_start_transaction()
--> returns handle to
transaction N + 1
(...)
btrfs_write_and_wait_transaction()
--> writeback of some extent
buffer fails, returns an
error
btrfs_handle_fs_error()
--> sets BTRFS_FS_STATE_ERROR in
fs_info->fs_state
--> jumps to label "scrub_continue"
cleanup_transaction()
btrfs_abort_transaction(N)
--> sets BTRFS_FS_STATE_TRANS_ABORTED
flag in fs_info->fs_state
--> sets aborted field in the
transaction and transaction
handle structures, for
transaction N only
--> removes transaction from the
list fs_info->trans_list
btrfs_commit_transaction(N + 1)
--> transaction N + 1 was not
aborted, so it proceeds
(...)
--> sets the transaction's state
to TRANS_STATE_COMMIT_START
--> does not find the previous
transaction (N) in the
fs_info->trans_list, so it
doesn't know that transaction
was aborted, and the commit
of transaction N + 1 proceeds
(...)
--> sets transaction N + 1 state
to TRANS_STATE_UNBLOCKED
btrfs_write_and_wait_transaction()
--> succeeds writing all extent
buffers created in the
transaction N + 1
write_all_supers()
--> succeeds
--> we now have a superblock on
disk that points to trees
that refer to at least one
extent buffer that was
never persisted
So fix this by updating the transaction commit path to check if the flag
BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
transaction in the fs_info->trans_list. If the flag is set, just fail the
transaction commit with -EROFS, as we do in other places. The exact error
code for the previous transaction abort was already logged and reported.
Fixes: 49b25e0540 ("btrfs: enhance transaction abort infrastructure")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send operation we can fail if we previously did
deduplication operations against a file that exists in both snapshots. In
that case we will fail the send operation with -EIO and print a message
to dmesg/syslog like the following:
BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
extent for inode 257 without updated inode item, send root is 258, \
parent root is 257
This requires that we deduplicate to the same file in both snapshots for
the same amount of times on each snapshot. The issue happens because a
deduplication only updates the iversion of an inode and does not update
any other field of the inode, therefore if we deduplicate the file on
each snapshot for the same amount of time, the inode will have the same
iversion value (stored as the "sequence" field on the inode item) on both
snapshots, therefore it will be seen as unchanged between in the send
snapshot while there are new/updated/deleted extent items when comparing
to the parent snapshot. This makes the send operation return -EIO and
print an error message.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our first file. The first half of the file has several 64Kb
# extents while the second half as a single 512Kb extent.
$ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
$ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
# Create the base snapshot and the parent send stream from it.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ btrfs send -f /tmp/1.snap /mnt/mysnap1
# Create our second file, that has exactly the same data as the first
# file.
$ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
# Create the second snapshot, used for the incremental send, before
# doing the file deduplication.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
# Now before creating the incremental send stream:
#
# 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
# will drop several extent items and add a new one, also updating
# the inode's iversion (sequence field in inode item) by 1, but not
# any other field of the inode;
#
# 2) Deduplicate into a different subrange of file foo in snapshot
# mysnap2. This will replace an extent item with a new one, also
# updating the inode's iversion by 1 but not any other field of the
# inode.
#
# After these two deduplication operations, the inode items, for file
# foo, are identical in both snapshots, but we have different extent
# items for this inode in both snapshots. We want to check this doesn't
# cause send to fail with an error or produce an incorrect stream.
$ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
$ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
# Create the incremental send stream.
$ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
ERROR: send ioctl failed with -5: Input/output error
This issue started happening back in 2015 when deduplication was updated
to not update the inode's ctime and mtime and update only the iversion.
Back then we would hit a BUG_ON() in send, but later in 2016 send was
updated to return -EIO and print the error message instead of doing the
BUG_ON().
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
Fixes: 1c919a5e13 ("btrfs: don't update mtime/ctime on deduped inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the in-kernel afs filesystem, the d_fsdata dentry field is used to hold
the data version of the parent directory when it was created or when
d_revalidate() last caused it to be updated. This is compared to the
->invalid_before field in the directory inode, rather than the actual data
version number, thereby allowing changes due to local edits to be ignored.
Only if the server data version gets bumped unexpectedly (eg. by a
competing client), do we need to revalidate stuff.
However, the d_fsdata field should also be updated if an rpc op is
performed that modifies that particular dentry. Such ops return the
revised data version of the directory(ies) involved, so we should use that.
This is particularly problematic for rename, since a dentry from one
directory may be moved directly into another directory (ie. mv a/x b/x).
It would then be sporting the wrong data version - and if this is in the
future, for the destination directory, revalidations would be missed,
leading to foreign renames and hard-link deletion being missed.
Fix this by the following means:
(1) Return the data version number from operations that read the directory
contents - if they issue the read. This starts in afs_dir_iterate()
and is used, ignored or passed back by its callers.
(2) In afs_lookup*(), set the dentry version to the version returned by
(1) before d_splice_alias() is called and the dentry published.
(3) In afs_d_revalidate(), set the dentry version to that returned from
(1) if an rpc call was issued. This means that if a parallel
procedure, such as mkdir(), modifies the directory, we won't
accidentally use the data version from that.
(4) In afs_{mkdir,create,link,symlink}(), set the new dentry's version to
the directory data version before d_instantiate() is called.
(5) In afs_{rmdir,unlink}, update the target dentry's version to the
directory data version as soon as we've updated the directory inode.
(6) In afs_rename(), we need to unhash the old dentry before we start so
that we don't get afs_d_revalidate() reverting the version change in
cross-directory renames.
We then need to set both the old and the new dentry versions the data
version of the new directory before we call d_move() as d_move() will
rehash them.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: David Howells <dhowells@redhat.com>
In the in-kernel afs filesystem, d_fsdata is set with the data version of
the parent directory. afs_d_revalidate() will update this to the current
directory version, but it shouldn't do this if it the value it read from
d_fsdata is the same as no lock is held and cmpxchg() is not used.
Fix the code to only change the value if it is different from the current
directory version.
Fixes: 260a980317 ("[AFS]: Add "directory write" support.")
Signed-off-by: David Howells <dhowells@redhat.com>
When afs_rename() calculates the expected data version of the target
directory in a cross-directory rename, it doesn't increment it as it
should, so it always thinks that the target inode is unexpectedly modified
on the server.
Fixes: a58823ac45 ("afs: Fix application of status and callback to be under same lock")
Signed-off-by: David Howells <dhowells@redhat.com>
In afs_read_dir(), there is an if statement on line 255 to check whether
req->pages is NULL:
if (!req->pages)
goto error;
If req->pages is NULL, afs_put_read() on line 337 is executed.
In afs_put_read(), req->pages[i] is used on line 195.
Thus, a possible null-pointer dereference may occur in this case.
To fix this possible bug, an if statement is added in afs_put_read() to
check req->pages.
This bug is found by a static analysis tool STCheck written by us.
Fixes: f3ddee8dc4 ("afs: Fix directory handling")
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
afs_deliver_vl_get_entry_by_name_u() scans through the vl entry
received from the volume location server and builds a return list
containing the sites that are currently valid. When assigning
values for the return list, the index into the vl entry (i) is used
rather than the one for the new list (entry->nr_server). If all
sites are usable, this works out fine as the indices will match.
If some sites are not valid, for example if AFS_VLSF_DONTUSE is
set, fs_mask and the uuid will be set for the wrong return site.
Fix this by using entry->nr_server as the index into the arrays
being filled in rather than i.
This can lead to EDESTADDRREQ errors if none of the returned sites
have a valid fs_mask.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Fix the service handler function for the CB.ProbeUuid RPC call so that it
replies in the correct manner - that is an empty reply for success and an
abort of 1 for failure.
Putting 0 or 1 in an integer in the body of the reply should result in the
fileserver throwing an RX_PROTOCOL_ERROR abort and discarding its record of
the client; older servers, however, don't necessarily check that all the
data got consumed, and so might incorrectly think that they got a positive
response and associate the client with the wrong host record.
If the client is incorrectly associated, this will result in callbacks
intended for a different client being delivered to this one and then, when
the other client connects and responds positively, all of the callback
promises meant for the client that issued the improper response will be
lost and it won't receive any further change notifications.
Fixes: 9396d496d7 ("afs: support the CB.ProbeUuid RPC op")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
The condition checking whether put_unlocked_entry() needs to wake up
following waiter got broken by commit 23c84eb783 ("dax: Fix missed
wakeup with PMD faults"). We need to wake the waiter whenever the passed
entry is valid (i.e., non-NULL and not special conflict entry). This
could lead to processes never being woken up when waiting for entry
lock. Fix the condition.
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz
Fixes: 23c84eb783 ("dax: Fix missed wakeup with PMD faults")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The kernel mount_block_root() function expects -EACESS or -EINVAL for a
unmountable filesystem when trying to mount the root with different
filesystem types.
However, in 5.3-rc1 the behavior when F2FS code cannot find valid block
changed to return -EFSCORRUPTED(-EUCLEAN), and this error code makes
mount_block_root() fail when trying to probe F2FS.
When the magic number of the superblock mismatches, it has a high
probability that it's just not a F2FS. In this case return -EINVAL seems
to be a better result, and this return value can make mount_block_root()
probing work again.
Return -EINVAL when the superblock has magic mismatch, -EFSCORRUPTED in
other cases (the magic matches but the superblock cannot be recognized).
Fixes: 10f966bbf5 ("f2fs: use generic EFSBADCRC/EFSCORRUPTED")
Signed-off-by: Icenowy Zheng <icenowy@aosc.io>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Explicitly initialize the onstack structures to zero so we don't leak
kernel memory into userspace when converting the in-core inumbers
structure to the v1 inogrp ioctl structure. Add a comment about why we
have to use memset to ensure that the padding holes in the structures
are set to zero.
Fixes: 5f19c7fc68 ("xfs: introduce v5 inode group structure")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Add functions that verify data pages that have been read from a
fs-verity file, against that file's Merkle tree. These will be called
from filesystems' ->readpage() and ->readpages() methods.
Since data verification can block, a workqueue is provided for these
methods to enqueue verification work from their bio completion callback.
See the "Verifying data" section of
Documentation/filesystems/fsverity.rst for more information.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a function fsverity_prepare_setattr() which filesystems that support
fs-verity must call to deny truncates of verity files.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add the fsverity_file_open() function, which prepares an fs-verity file
to be read from. If not already done, it loads the fs-verity descriptor
from the filesystem and sets up an fsverity_info structure for the inode
which describes the Merkle tree and contains the file measurement. It
also denies all attempts to open verity files for writing.
This commit also begins the include/linux/fsverity.h header, which
declares the interface between fs/verity/ and filesystems.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add the beginnings of the fs/verity/ support layer, including the
Kconfig option and various helper functions for hashing. To start, only
SHA-256 is supported, but other hash algorithms can easily be added.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Here are some small SPDX fixes for 5.3-rc2 for things that came in
during the 5.3-rc1 merge window that we previously missed.
Only 3 small patches here:
- 2 uapi patches to resolve some SPDX tags that were not correct
- fix an invalid SPDX tag in the iomap Makefile file
All have been properly reviewed on the public mailing lists.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXT2N9w8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylY9wCeJIYfs/eNf3tsjLQXxUBMYAJNqnsAn2IaMiTt
cv2mck7JZm5KyHpP3f5N
=RSZa
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX fixes from Greg KH:
"Here are some small SPDX fixes for 5.3-rc2 for things that came in
during the 5.3-rc1 merge window that we previously missed.
Only three small patches here:
- two uapi patches to resolve some SPDX tags that were not correct
- fix an invalid SPDX tag in the iomap Makefile file
All have been properly reviewed on the public mailing lists"
* tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
iomap: fix Invalid License ID
treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers again
treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers
Pull scheduler fixes from Thomas Gleixner:
"Two fixes for the fair scheduling class:
- Prevent freeing memory which is accessible by concurrent readers
- Make the RCU annotations for numa groups consistent"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Use RCU accessors consistently for ->numa_group
sched/fair: Don't free p->numa_faults with concurrent readers
Hi Linus,
Please, pull the following patches that mark switch cases where we are
expecting to fall through. These patches are part of the ongoing efforts
to enable -Wimplicit-fallthrough. Most of them have been baking in linux-next
for a whole development cycle.
Also, pull the Makefile patch that globally enables the
-Wimplicit-fallthrough option.
Finally, some missing-break fixes that have been tagged for -stable:
- drm/amdkfd: Fix missing break in switch statement
- drm/amdgpu/gfx10: Fix missing break in switch statement
Notice that with these changes, we completely get rid of all the
fall-through warnings in the kernel.
Thanks
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl06XP0ACgkQRwW0y0cG
2zHPXhAAsatJGNIg7vuSicVipJBIlwgRLwtbE2rV+MCneUAZj37O9NtBgbxHNoJ+
5TBB8sLpNCw3rEvPKwm6vicRgntMclEY1vaplPOHt3RiY4lqVSIkpvFSCw1hGur3
+34+O++n/6HbtY96T+DN5WGMXrU3JSe46xnBLIt0BOoUwyKMaNdntiyd79GrrnyB
BwPkHrmB+9FEVq+dHmPnRIhc9WUfIdily3+j1CIncaM4eXKWLjoUlOIw1VcSEc4X
SSdFh2co+3nm/6O54ZxUEC1PyuQZWedsgFmeiTrOanG3AEWQ/jX7GwXvPlgraa2E
F5MzGUOQwXYL/IzXl1vGjQc4+FV4nhIjEBTDdZseOp7FP5xkHyyOwzGDEmaMXpXT
XThb+k7Q6EbiSPLFcz8zkCwNB8ngNJMNOsGP3JPFD06MfquqdzP7T1BsHcNtiMVo
FNwQg9KtvbK9F7a9lXDqfcxfuEScbqSvnKWWwNLImCqIBATYirJzeTeLvTbVWpA2
iyXF3ylIclsSMOvCtaz41iuoqNh12eqw2UIFPBmkU2wpVkTt+ZIn0Apgom90xe1K
tSUqMBbBMoHVtsBsILdkdz/m6FJpK+YmmwpRrJSDvPxA5c9caD7cj+62amW/gP4z
Hi6mC4hw1H5bKMsC2PP/QgJcmS/pgo1zwA9J2a2NNJbYPATHfKI=
=J2sO
-----END PGP SIGNATURE-----
Merge tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull Wimplicit-fallthrough enablement from Gustavo A. R. Silva:
"This marks switch cases where we are expecting to fall through, and
globally enables the -Wimplicit-fallthrough option in the main
Makefile.
Finally, some missing-break fixes that have been tagged for -stable:
- drm/amdkfd: Fix missing break in switch statement
- drm/amdgpu/gfx10: Fix missing break in switch statement
With these changes, we completely get rid of all the fall-through
warnings in the kernel"
* tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
Makefile: Globally enable fall-through warning
drm/i915: Mark expected switch fall-throughs
drm/amd/display: Mark expected switch fall-throughs
drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning
drm/amdgpu/gfx10: Fix missing break in switch statement
drm/amdkfd: Fix missing break in switch statement
perf/x86/intel: Mark expected switch fall-throughs
mtd: onenand_base: Mark expected switch fall-through
afs: fsclient: Mark expected switch fall-throughs
afs: yfsclient: Mark expected switch fall-throughs
can: mark expected switch fall-throughs
firewire: mark expected switch fall-throughs
f2fs_allocate_data_block() invalidates old block address and enable new block
address. Then, if we try to read old block by f2fs_submit_page_bio(), it will
give WARN due to reading invalid blocks.
Let's make the order sanely back.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl07KzUACgkQxWXV+ddt
WDs6Ow/+LQ4He5jW+xpgZ+5fSnSjervfSsYWxKARc4hYZog5l8e2ONASzrniKR0H
aU2SWUyLF0Os1w++vp1FYYwa1xaPJJRTvlqTI6DSCFgEpu9DefXkCAFuP/+fXbhC
Fx+ZDAbqgY2gUCEWcs4jQusSgz1sO0m97SJR7K4Vz+ZaBa+eneE5Hkm4kZd8AXg6
ufKkjOD0+Vy+CUVBZdPg8RlU5KRJ+yAE7B2CRyuSeq6cF0LxTxoNeAtWz5KWd6Pn
aNWDqa07na4kQz/p5s7Bw6qPC6i4hTdnNKVcc8N79iyz2TAetbjN8VXfbFFVCLZ9
3+9O/kFCh+ZolgV1f9U8iLwSbg7V/8wu1uUdR3645cq2PQljluMJEHPSWOmm+LVw
nOhm8VQiGsILbEYLRJbM3wI8dT4B0PtgP9IqYuk7TLV3qho6Pg8J2nyfpo5jehs5
IP0gDanCBHYgsoCUwZwnJaKH3hIP2+5IJw6AqX0Lt6ge9aYA35+kWR4KE5459Pxw
n0Eoiu+5y9QcfpgUuzb5EQIxHdRNlCjik70+n1yXBwwPtV9b6wDN9TlP3eI7p/Vx
j+FUYcgrjwjfAP7Eh5a/ne/XEcq1R5UghM+2TWEWpqGh11NPnMDJS6d4OYyeCvA6
yPLAhVvvqdbYM+PpALaNjvLxspxd6MSuKNLNk3pkJmkd69IXnXw=
=RT/D
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes:
- hangs caused by a missing barrier in the locking code
- memory leaks of extent_state due to bad handling of a cached
pointer"
* tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_range
btrfs: Fix deadlock caused by missing memory barrier
Pull vfs umount_tree() leak fix from Al Viro:
"Fix braino introduced in 'switch the remnants of releasing the
mountpoint away from fs_pin'.
The most visible result is leaking struct mount when mounting btrfs,
making it impossible to shut down"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the struct mount leak in umount_tree()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z
0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH
d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB
0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc
vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B
p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j
uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4
1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn
/Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3
oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L
saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071
BphJd2RamQ==
=HIzH
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Several io_uring fixes/improvements:
- Blocking fix for O_DIRECT (me)
- Latter page slowness for registered buffers (me)
- Fix poll hang under certain conditions (me)
- Defer sequence check fix for wrapped rings (Zhengyuan)
- Mismatch in async inc/dec accounting (Zhengyuan)
- Memory ordering issue that could cause stall (Zhengyuan)
- Track sequential defer in bytes, not pages (Zhengyuan)
- NVMe pull request from Christoph
- Set of hang fixes for wbt (Josef)
- Redundant error message kill for libahci (Ding)
- Remove unused blk_mq_sched_started_request() and related ops (Marcos)
- drbd dynamic alloc shash descriptor to reduce stack use (Arnd)
- blkcg ->pd_stat() non-debug print (Tejun)
- bcache memory leak fix (Wei)
- Comment fix (Akinobu)
- BFQ perf regression fix (Paolo)
* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
io_uring: ensure ->list is initialized for poll commands
Revert "nvme-pci: don't create a read hctx mapping without read queues"
nvme: fix multipath crash when ANA is deactivated
nvme: fix memory leak caused by incorrect subsystem free
nvme: ignore subnqn for ADATA SX6000LNP
drbd: dynamically allocate shash descriptor
block: blk-mq: Remove blk_mq_sched_started_request and started_request
bcache: fix possible memory leak in bch_cached_dev_run()
io_uring: track io length in async_list based on bytes
io_uring: don't use iov_iter_advance() for fixed buffers
block: properly handle IOCB_NOWAIT for async O_DIRECT IO
blk-mq: allow REQ_NOWAIT to return an error inline
io_uring: add a memory barrier before atomic_read
rq-qos: use a mb for got_token
rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
rq-qos: don't reset has_sleepers on spurious wakeups
rq-qos: fix missed wake-ups in rq_qos_throttle
wait: add wq_has_single_sleeper helper
block, bfq: check also in-flight I/O in dispatch plugging
block: fix sysfs module parameters directory path in comment
...
We need to drop everything we remove from the tree, whether
mnt_has_parent() is true or not. Usually the bug manifests as a slow
memory leak (leaked struct mount for initramfs); it becomes much more
visible in mount_subtree() users, such as btrfs. There we leak
a struct mount for btrfs superblock being mounted, which prevents
fs shutdown on subsequent umount.
Fixes: 56cbb429d9 ("switch the remnants of releasing the mountpoint away from fs_pin")
Reported-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
btrfs_lock_and_flush_ordered_range() loads given "*cached_state" into
cachedp, which, in general, is NULL. Then, lock_extent_bits() updates
"cachedp", but it never goes backs to the caller. Thus the caller still
see its "cached_state" to be NULL and never free the state allocated
under btrfs_lock_and_flush_ordered_range(). As a result, we will
see massive state leak with e.g. fstests btrfs/005. Fix this bug by
properly handling the pointers.
Fixes: bd80d94efb ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.
This patch fixes the following warnings:
Warning level 3 was used: -Wimplicit-fallthrough=3
fs/afs/fsclient.c: In function ‘afs_deliver_fs_fetch_acl’:
fs/afs/fsclient.c:2199:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/fsclient.c:2202:2: note: here
case 1:
^~~~
fs/afs/fsclient.c:2216:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/fsclient.c:2219:2: note: here
case 2:
^~~~
fs/afs/fsclient.c:2225:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/fsclient.c:2228:2: note: here
case 3:
^~~~
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.
This patch fixes the following warnings:
fs/afs/yfsclient.c: In function ‘yfs_deliver_fs_fetch_opaque_acl’:
fs/afs/yfsclient.c:1984:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:1987:2: note: here
case 1:
^~~~
fs/afs/yfsclient.c:2005:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2008:2: note: here
case 2:
^~~~
fs/afs/yfsclient.c:2014:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2017:2: note: here
case 3:
^~~~
fs/afs/yfsclient.c:2035:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2038:2: note: here
case 4:
^~~~
fs/afs/yfsclient.c:2047:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2050:2: note: here
case 5:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
Also, fix some commenting style issues.
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>