Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Richard Weinberger <richard@nod.at>
CC: Artem Bityutskiy <dedekind1@gmail.com>
CC: Adrian Hunter <adrian.hunter@intel.com>
CC: linux-mtd@lists.infradead.org
Acked-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Anna Schumaker <anna.schumaker@netapp.com>
CC: linux-nfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Petr Vandrovec <petr@vandrovec.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similarly to set_bdev_super() NILFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.
CC: linux-nilfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Bob Peterson <rpeterso@redhat.com>
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is not needed anymore since bdi is initialized whenever superblock
exists.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Boaz Harrosh <ooo@electrozaur.com>
CC: Benny Halevy <bhalevy@primarydata.com>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Jan Harkes <jaharkes@cs.cmu.edu>
CC: coda@cs.cmu.edu
CC: codalist@coda.cs.cmu.edu
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: David Howells <dhowells@redhat.com>
CC: linux-afs@lists.infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Tyler Hicks <tyhicks@canonical.com>
CC: ecryptfs@vger.kernel.org
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.
CC: Steve French <sfrench@samba.org>
CC: linux-cifs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.
CC: Ilya Dryomov <idryomov@gmail.com>
CC: "Yan, Zheng" <zyan@redhat.com>
CC: Sage Weil <sage@redhat.com>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside session. This unifies handling of bdi among users.
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
So far we just relied on block device to hold a bdi reference for us
while the filesystem is mounted. While that works perfectly fine, it is
a bit awkward that we have a pointer to a refcounted structure in the
superblock without proper reference. So make s_bdi hold a proper
reference to block device's BDI. No filesystem using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull CIFS fix from Steve French:
"One more cifs fix for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not send echoes before Negotiate is complete
Andrey reported a use-after-free in __ns_get_path():
spin_lock include/linux/spinlock.h:299 [inline]
lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
__ns_get_path+0x197/0x860 fs/nsfs.c:66
open_related_ns+0xda/0x200 fs/nsfs.c:143
sock_ioctl+0x39d/0x440 net/socket.c:1001
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
We are under rcu read lock protection at that point:
rcu_read_lock();
d = atomic_long_read(&ns->stashed);
if (!d)
goto slow;
dentry = (struct dentry *)d;
if (!lockref_get_not_dead(&dentry->d_lockref))
goto slow;
rcu_read_unlock();
but don't use a proper RCU API on the free path, therefore a parallel
__d_free() could free it at the same time. We need to mark the stashed
dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
readers leave RCU.
Fixes: e149ed2b80 ("take the targets of /proc/*/ns/* symlinks to separate fs")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces an ASYNC IPU policy.
Under senario of large # of async updating(e.g. log writing in Android),
disk would be seriously fragmented, and higher frequent gc would be triggered.
This patch uses IPU to rewrite the async update writting, since async is
NOT sensitive to io latency.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
For IPU writes, there won't be any udpates in dnode page since we
will reuse old block address instead of allocating new one, so we
don't need to lock cp_rwsem during IPU IO submitting.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Introduce __check_rb_tree_consistence to check consistence of rb-tree
based discard cache in runtime.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add an even class f2fs_discard for introducing f2fs_queue_discard, then
use f2fs_{queue,issue}_discard to trace __{queue,submit}_discard_cmd.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Keep issuing big size discard in prior instead of the one with random
size, so that we expect that it will help to:
- be quick to recycle unused large space in flash storage device.
- give a chance for
a) wait to merge small piece discards into bigger one, or
b) avoid issuing discards while they have being reallocated by SSR.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Avoid long variable name in discard_cmd_control structure, no logic
change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce rb-tree based discard cache infrastructure to speed up lookup and
merge operation of discard entry.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: initialize dc to avoid build warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Implement truncate/delete as a non-recursive algorithm. The older
algorithm was implemented with recursion to strip off each layer
at a time (going by height, starting with the maximum height.
This version tries to do the same thing but without recursion,
and without needing to allocate new structures or lists in memory.
For example, say you want to truncate a very large file to 1 byte,
and its end-of-file metapath is: 0.505.463.428. The starting
metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
needs to preserve that byte, and all metadata pointing to it.
So it would start at 0.0.0.0, look up all its metadata buffers,
then free all data blocks pointed to at the highest level.
After that buffer is "swept", it moves on to 0.0.0.1, then
0.0.0.2, etc., reading in buffers and sweeping them clean.
When it gets to the end of the 0.0.0 metadata buffer (for 4K
blocks the last valid one is 0.0.0.508), it backs up to the
previous height and starts working on 0.0.1.0, then 0.0.1.1,
and so forth. After it reaches the end and sweeps 0.0.1.508,
it continues with 0.0.2.0, and so on. When that height is
exhausted, and it reaches 0.0.508.508 it backs up another level,
to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
marching backwards and forwards through the metadata until it's
all swept clean. Once it has all the data blocks freed, it
lowers the strip height, and begins the process all over again,
but with one less height. This time it sweeps 0.0.0 through
0.505.463. When that's clean, it lowers the strip height again
and works to free 0.505. Eventually it strips the lowest height, 0.
For a delete or truncate to 0, all metadata for all heights of
0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
be preserved.
This isn't much different from normal integer incrementing,
where an integer gets incremented from 0000 (0.0.0.0) to 3021
(3.0.2.1). So 0000 gets increments to 0001, 0002, up to 0009,
then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
just that each "digit" goes from 0 to 508 (for a total of 509
pointers) rather than from 0 to 9.
Note that the dinode will only have 483 pointers due to the
dinode structure itself.
Also note: this is just an example. These numbers (509 and 483)
are based on a standard 4K block size. Smaller block sizes will
yield smaller numbers of indirect pointers accordingly.
The truncation process is accomplished with the help of two
major functions and a few helper functions.
Functions do_strip and recursive_scan are obsolete, so removed.
New function sweep_bh_for_rgrps cleans a buffer_head pointed to
by the given metapath and height. By cleaning, I mean it frees
all blocks starting at the offset passed in metapath. It starts
at the first block in the buffer pointed to by the metapath and
identifies its resource group (rgrp). From there it frees all
subsequent block pointers that lie within that rgrp. If it's
already inside a transaction, it stays within it as long as it
can. In other words, it doesn't close a transaction until it knows
it's freed what it can from the resource group. In this way,
multiple buffers may be cleaned in a single transaction, as long
as those blocks in the buffer all lie within the same rgrp.
If it's not in a transaction, it starts one. If the buffer_head
has references to blocks within multiple rgrps, it frees all the
blocks inside the first rgrp it finds, then closes the
transaction. Then it repeats the cycle: identifies the next
unfreed block, uses it to find its rgrp, then starts a new
transaction for that set. It repeats this process repeatedly
until the buffer_head contains no more references to any blocks
past the given metapath.
Function trunc_dealloc has been reworked into a finite state
automaton. It has basically 3 active states:
DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:
The DEALLOC_MP_FULL state implies the metapath has a full set
of buffers out to the "shrink height", and therefore, it can
call function sweep_bh_for_rgrps to free the blocks within the
highest height of the metapath. If it's just swept the lowest
level (or an error has occurred) the state machine is ended.
Otherwise it proceeds to the DEALLOC_MP_LOWER state.
The DEALLOC_MP_LOWER state implies we are finished with a given
buffer_head, which may now be released, and therefore we are
then missing some buffer information from the metapath. So we
need to find more buffers to read in. In most cases, this is
just a matter of releasing the buffer_head and moving to the
next pointer from the previous height, so it may be read in and
swept as well. If it can't find another non-null pointer to
process, it checks whether it's reached the end of a height
and needs to lower the strip height, or whether it still needs
move forward through the previous height's metadata. In this
state, all zero-pointers are skipped. From this state, it can
only loop around (once more backing up another height) or,
once a valid metapath is found (one that has non-zero
pointers), proceed to state DEALLOC_FILL_MP.
The DEALLOC_FILL_MP state implies that we have a metapath
but not all its buffers are read in. So we must proceed to read
in buffer_heads until the metapath has a valid buffer for every
height. If the previous state backed us up 3 heights, we may
need to read in a buffer, increment the height, then repeat the
process until buffers have been read in for all required heights.
If it's successful reading a buffer, and it's at the highest
height we need, it proceeds back to the DEALLOC_MP_FULL state.
If it's unable to fill in a buffer, (encounters a hole, etc.)
it tries to find another non-zero block pointer. If they're all
zero, it lowers the height and returns to the DEALLOC_MP_LOWER
state. If it finds a good non-null pointer, it loops around and
reads it in, while keeping the metapath in lock-step with the
pointers it examines.
The state machine runs until the truncation request is
satisfied. Then any transactions are ended, the quota and
statfs data are updated, and the function is complete.
Helper function metaptr1 was introduced to be an easy way to
determine the start of a buffer_head's indirect pointers.
Helper function lookup_mp_height was introduced to find a
metapath index and read in the buffer that corresponds to it.
In this way, function lookup_metapath becomes a simple loop to
call it for every height.
Helper function fillup_metapath is similar to lookup_metapath
except it can do partial lookups. If the state machine
backed up multiple levels (like 2999 wrapping to 3000) it
needs to find out the next starting point and start issuing
metadata reads at that point.
Helper function hptrs is a shortcut to determine how many
pointers should be expected in a buffer. Height 0 is the dinode
which has fewer pointers than the others.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove i_attrs_to_sd_attrs() call.
Signed-off-by: Jan Kara <jack@suse.cz>
reiserfs_new_inode() clears IMMUTABLE and APPEND flags from a symlink
i_flags however a few lines below in sd_attrs_to_i_attrs() we will
happily overwrite i_flags with whatever we inherited from the directory.
Since this behavior is there for ages just remove the useless setting of
i_flags.
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove jfs_get_inode_flags() call.
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext2_get_inode_flags() call.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext4_get_inode_flags() call.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we set IMMUTABLE and NOATIME flags on quota files to stop
userspace from messing with them. Now that all filesystems set these
flags in their quota_on handlers, we can stop setting the flags in
generic quota code. This will allow filesystems to stop copying i_flags
to their on-disk flags on various occasions.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version
of quota flags in GETFLAGS ioctl and copy_to_dinode(). Move to
setting / clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and __ext2_write_inode(). Move to setting
/ clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and when writing stat item. Move to
setting / clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and ext4_do_update_inode(). Move to
setting / clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
The WARN_ON and warning from report_reserved_underflow can become very
noisy and is visible unconditionally although this is namely for
debugging. The patch "btrfs: Add WARN_ON for qgroup reserved underflow"
(18dc22c19b) went to 4.11-rc1 and the plan
was to get the fix as well, but this hasn't happened.
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is perfectly fine to link a tmpfile back using linkat().
Since tmpfiles are created with a link count of 0 they appear
on the orphan list, upon re-linking the inode has to be removed
from the orphan list again.
Ralph faced a filesystem corruption in combination with overlayfs
due to this bug.
Cc: <stable@vger.kernel.org>
Cc: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 474b93704f ("ubifs: Implement O_TMPFILE")
Signed-off-by: Richard Weinberger <richard@nod.at>
commit 4fcd1813e6 ("Fix reconnect to not defer smb3 session reconnect
long after socket reconnect") added support for Negotiate requests to
be initiated by echo calls.
To avoid delays in calling echo after a reconnect, I added the patch
introduced by the commit b8c600120f ("Call echo service immediately
after socket reconnect").
This has however caused a regression with cifs shares which do not have
support for echo calls to trigger Negotiate requests. On connections
which need to call Negotiation, the echo calls trigger an error which
triggers a reconnect which in turn triggers another echo call. This
results in a loop which is only broken when an operation is performed on
the cifs share. For an idle share, it can DOS a server.
The patch uses the smb_operation can_echo() for cifs so that it is
called only if connection has been already been setup.
kernel bz: 194531
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Jonathan Liu <net147@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
It leaves the iterator advanced by the amount of IO it has requested
instead of the amount actually transferred. Among other things,
that confuses the hell out of generic_file_splice_read().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
short copy here should mean instant EFAULT, not "move to the
next page and hope it fails there, this time with nothing
copied"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unlike normal compat syscall variants, it is needed only for
biarch architectures that have different alignement requirements for
u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
a store of 64bit value at 32bit-aligned address. We used to have one
such (ia64), but its biarch support has been gone since 2010 (after
being broken in 2008, which went unnoticed since nobody had been using
it).
It had escaped removal at the same time only because back in 2004
a patch that switched several syscalls on amd64 from private wrappers to
generic compat ones had switched to use of compat_sys_getdents64(), which
hadn't needed (or used) a compat wrapper on amd64.
Let's bury it - it's at least 7 years overdue.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function no longer does anything. The is only a single caller of
register_sysctl_root when semantically there should be two. Remove
this function so that if someone decides this functionality is needed
again it will be obvious all of the callers of setup_sysctl_set need
to be audited and modified appropriately.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise lockdep says:
[ 1337.483798] ================================================
[ 1337.483999] [ BUG: lock held when returning to user space! ]
[ 1337.484252] 4.11.0-rc6 #19 Not tainted
[ 1337.484423] ------------------------------------------------
[ 1337.484626] mount/14766 is leaving the kernel with locks still held!
[ 1337.484841] 1 lock held by mount/14766:
[ 1337.485017] #0: (&type->s_umount_key#33/1){+.+.+.}, at: [<ffffffff8124171f>] sget_userns+0x2af/0x520
Caught by xfstests generic/413 which tried to mount with the unsupported
mount option dax. Then xfstests generic/422 ran sync which deadlocks.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.
And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.
Why? An empth path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.
Found by syzkaller and KASAN.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"Dave Sterba collected a few more fixes for the last rc.
These aren't marked for stable, but I'm putting them in with a batch
were testing/sending by hand for this release"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix potential use-after-free for cloned bio
Btrfs: fix segmentation fault when doing dio read
Btrfs: fix invalid dereference in btrfs_retry_endio
btrfs: drop the nossd flag when remounting with -o ssd
Pull more CIFS fixes from Steve French:
"As promised, here is the remaining set of cifs/smb3 fixes for stable
(and a fix for one regression) now that they have had additional
review and testing"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix SMB3 mount without specifying a security mechanism
CIFS: store results of cifs_reopen_file to avoid infinite wait
CIFS: remove bad_network_name flag
CIFS: reconnect thread reschedule itself
CIFS: handle guest access errors to Windows shares
CIFS: Fix null pointer deref during read resp processing
If mmap() maps a file, it can be passed an offset into the file at which
the mapping is to start. Offset could be a negative value when
represented as a loff_t. The offset plus length will be used to update
the file size (i_size) which is also a loff_t.
Validate the value of offset and offset + length to make sure they do
not overflow and appear as negative.
Found by syzcaller with commit ff8c0c53c4 ("mm/hugetlb.c: don't call
region_abort if region_chg fails") applied. Prior to this commit, the
overflow would still occur but we would luckily return ENOMEM.
To reproduce:
mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
Resulted in,
kernel BUG at mm/hugetlb.c:742!
Call Trace:
hugetlbfs_evict_inode+0x80/0xa0
evict+0x24a/0x620
iput+0x48f/0x8c0
dentry_unlink_inode+0x31f/0x4d0
__dentry_kill+0x292/0x5e0
dput+0x730/0x830
__fput+0x438/0x720
____fput+0x1a/0x20
task_work_run+0xfe/0x180
exit_to_usermode_loop+0x133/0x150
syscall_return_slowpath+0x184/0x1c0
entry_SYSCALL_64_fastpath+0xab/0xad
Fixes: ff8c0c53c4 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yet another instance of the same race.
Fix is identical to change_huge_pmd().
See "thp: fix MADV_DONTNEED vs. numa balancing race" for more details.
Link: http://lkml.kernel.org/r/20170302151034.27829-5-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm hitting the BUG in nfsd4_max_reply() at fs/nfsd/nfs4proc.c:2495 when
client sends an operation the server doesn't support.
in nfsd4_max_reply() it checks for NULL rsize_bop but a non-supported
operation wouldn't have that set.
Cc: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 2282cd2c05 "NFSD: Get response size before operation..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit ef65aaede2 ("smb2: Enforce sec= mount option") changed the
behavior of a mount command to enforce a specified security mechanism
during mounting. On another hand according to the spec if SMB3 server
doesn't respond with a security context it implies that it supports
NTLMSSP. The current code doesn't keep it in mind and fails a mount
for such servers if no security mechanism is specified. Fix this by
indicating that a server supports NTLMSSP if a security context isn't
returned during negotiate phase. This allows the code to use NTLMSSP
by default for SMB3 mounts.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Otherwise, we can see stale fsync/dentry mark given by previous calls, resulting
in giving up roll-forward recovery due to wrong dentry mark.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't use blk plug covering area where there won't be any IOs being issued.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Lockdep complains about use of the iolock in inode reclaim context
because it doesn't understand that reclaim has the last reference to
the inode, and thus an iolock->reclaim->iolock deadlock is not
possible.
The iolock is technically not necessary in xfs_inactive() and was
only added to appease an assert in xfs_free_eofblocks(), which can
be called from other non-reclaim contexts. Therefore, just kill the
assert and drop the use of the iolock from reclaim context to quiet
lockdep.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Long ago, all this gunk was added with a lament about problems
with gcc's do_div, and a fun recommendation in the changelog:
egcs-2.91.66 is the recommended compiler version for building XFS.
All this special stuff was needed to work around an old gcc bug,
apparently, and it's been there ever since.
There should be no need for this anymore, so remove it.
Remove the special 32-bit xfs_do_mod as well; just let the
kernel's do_div() handle all this.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
ndquots is a 32-bit value, and we don't care
about the remainder; there is no reason to use do_div
here, it seems to be the result of a decade+ historical
accident.
Worse, the do_div implementation in userspace breaks
when fed a 32-bit dividend, so we commented it out there
in any case.
Change to simple division, and then we can change
userspace to match, and mandate a 64-bit dividend in
the do_div() in userspace as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
rb-tree lookup/update functions are deeply coupled into extent cache
codes, it's very hard to reuse these basic functions, this patch
extracts common rb-tree operation infrastructure for latter reusing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we're doing SSR aggressively more than ever before, so once we reach to
the reserved_segment, f2fs_balance_fs will call f2fs_gc, which triggers
checkpoint everytime. We actually must avoid that.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
KASAN reports that there is a use-after-free case of bio in btrfs_map_bio.
If we need to submit IOs to several disks at a time, the original bio
would get cloned and mapped to the destination disk, but we really should
use the original bio instead of a cloned bio to do the sanity check
because cloned bios are likely to be freed by its endio.
Reported-by: Diego <diegocg@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2dabb32484 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug during iterating bio pages in dio read's endio hook,
and it could end up with segment fault of the dio reading task.
So the reason is 'if (nr_sectors--)', and it makes the code assume that
there is one more block in the same page, so page offset is increased and
the bio which is created to repair the bad block then has an incorrect
bvec.bv_offset, and a later access of the page content would throw a
segmentation fault.
This also adds ASSERT to check page offset against page size.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing directIO repair, we have this oops:
[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958] ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374] bio_endio+0xed/0x100
[ 1458.544750] end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257] normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762] btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224] process_one_work+0x34d/0xb70
[ 1458.546570] ? process_one_work+0x29e/0xb70
[ 1458.546938] worker_thread+0x1cf/0x960
[ 1458.547263] ? process_one_work+0xb70/0xb70
[ 1458.547624] kthread+0x17d/0x180
[ 1458.547909] ? kthread_create_on_node+0x70/0x70
[ 1458.548300] ret_from_fork+0x31/0x40
It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.
This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
Also cleanup unused @start (which is too trivial for a separate patch).
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The opposite case was already handled right in the very next switch entry.
And also when turning on nossd, drop ssd_spread.
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.
So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.
All current users pass "current" as the task, so no developers have
accepted that invitation. It would be best to ensure it remains
that way.
So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer. Always operate on current->flags.
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This fixes Continuous Availability when errors during
file reopen are encountered.
cifs_user_readv and cifs_user_writev would wait for ever if
results of cifs_reopen_file are not stored and for later inspection.
In fact, results are checked and, in case of errors, a chain
of function calls leading to reads and writes to be scheduled in
a separate thread is skipped.
These threads will wake up the corresponding waiters once reads
and writes are done.
However, given the return value is not stored, when rc is checked
for errors a previous one (always zero) is inspected instead.
This leads to pending reads/writes added to the list, making
cifs_user_readv and cifs_user_writev wait for ever.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
STATUS_BAD_NETWORK_NAME can be received during node failover,
causing the flag to be set and making the reconnect thread
always unsuccessful, thereafter.
Once the only place where it is set is removed, the remaining
bits are rendered moot.
Removing it does not prevent "mount" from failing when a non
existent share is passed.
What happens when the share really ceases to exist while the
share is mounted is undefined now as much as it was before.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
In case of error, smb2_reconnect_server reschedule itself
with a delay, to avoid being too aggressive.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Commit 1a967d6c9b ("correctly to
anonymous authentication for the NTLM(v2) authentication") introduces
a regression in handling errors related to attempting a guest
connection to a Windows share which requires authentication. This
should result in a permission denied error but actually causes the
kernel module to enter a never-ending loop trying to follow a DFS
referal which doesn't exist.
The base cause of this is the failure now occurs later in the process
during tree connect and not at the session setup setup and all errors
in tree connect are interpreted as needing to follow the DFS paths
which isn't in this case correct. So, check the returned error against
EACCES and fail if this is returned error.
Feedback from Aurelien:
PS> net user guest /activate:no
PS> mkdir C:\guestshare
PS> icacls C:\guestshare /grant 'Everyone:(OI)(CI)F'
PS> new-smbshare -name guestshare -path C:\guestshare -fullaccess Everyone
I've tested v3.10, v4.4, master, master+your patch using default options
(empty or no user "NU") and user=abc (U).
NT_LOGON_FAILURE in session setup: LF
This is what you seem to have in 3.10.
NT_ACCESS_DENIED in tree connect to the share: AD
This is what you get before your infinite loop.
| NU U
--------------------------------
3.10 | LF LF
4.4 | LF LF
master | AD LF
master+patch | AD LF
No infinite DFS loop :(
All these issues result in mount failing very fast with permission denied.
I guess it could be from either the Windows version or the share/folder
ACL. A deeper analysis of the packets might reveal more.
In any case I did not notice any issues for on a basic DFS setup with
the patch so I don't think it introduced any regressions, which is
probably all that matters. It still bothers me a little I couldn't hit
the bug.
I've included kernel output w/ debugging output and network capture of
my tests if anyone want to have a look at it. (master+patch = ml-guestfix).
Signed-off-by: Mark Syms <mark.syms@citrix.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently during receiving a read response mid->resp_buf can be
NULL when it is being passed to cifs_discard_remaining_data() from
cifs_readv_discard(). Fix it by always passing server->smallbuf
instead and initializing mid->resp_buf at the end of read response
processing.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Add braces around variables used within macros for those make sense
to do it. Many of the macros in f2fs already do this. What this commit
doesn't do is anything that changes line# as a result of adding braces,
which usually affects the binary via __LINE__.
Confirmed no diff in fs/f2fs/f2fs.ko before/after this commit on x86_64,
to make sure this has no functional change as well as there's been no
unexpected side effect due to callers' arithmetics within the existing
code.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Callers are to unlock the page on failure after 86531d6b.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_submit_discard_endio, we will wake up waiter before setting
discard command states, so waiter may use incorrect states. Change
the order between complete() and states setting to fix this issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split f2fs_wait_discard_bios from f2fs_wait_discard_bio, just for cleanup,
no logic change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split discard_cmd_list to discard_{pend,wait}_list, so while sending/waiting
discard command, we can avoid traversing unneeded entries in original list.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 3436c4bdb3.
This makes a leak to register dirty segments. I reproduced the issue by
modified postmark which injects a lot of file create/delete/update and
finally triggers huge number of SSR allocations.
Cc: <stable@vger.kernel.org> # v4.10+
[Jaegeuk Kim: Change missing incorrect comment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pointer to ->free_mark callback unnecessarily occupies one long in each
fsnotify_mark although they are the same for all marks from one
notification group. Move the callback pointer to fsnotify_ops.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we initialize mark->group only in fsnotify_add_mark_lock().
However we will need to access fsnotify_ops of corresponding group from
fsnotify_put_mark() so we need mark->group initialized earlier. Do that
in fsnotify_init_mark() which has a consequence that once
fsnotify_init_mark() is called on a mark, the mark has to be destroyed
by fsnotify_put_mark().
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
inode_mark.c now contains only a single function. Move it to
fs/notify/fsnotify.c and remove inode_mark.c.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These are very thin wrappers, just remove them. Drop
fs/notify/vfsmount_mark.c as it is empty now.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The function is already mostly contained in what
fsnotify_clear_marks_by_group() does. Just update that function to not
select marks when all of them should be destroyed and remove
fsnotify_detach_group_marks().
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The _flags() suffix in the function name was more confusing than
explaining so just remove it. Also rename the argument from 'flags' to
'type' to better explain what the function expects.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Inline these helpers as they are very thin. We still keep them as we
don't want to expose details about how list type is determined.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These helpers are just very thin wrappers now. Remove them.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These helpers are now only a simple assignment and just obfuscate
what is going on. Remove them.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When userspace task processing fanotify permission events screws up and
does not respond, fsnotify_mark_srcu SRCU is held indefinitely which
causes further hangs in the whole notification subsystem. Although we
cannot easily solve the problem of operations blocked waiting for
response from userspace, we can at least somewhat localize the damage by
dropping SRCU lock before waiting for userspace response and reacquiring
it when userspace responds.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Pass fsnotify_iter_info into ->handle_event() handler so that it can
release and reacquire SRCU lock via fsnotify_prepare_user_wait() and
fsnotify_finish_user_wait() functions. These functions also make sure
current marks are appropriately pinned so that iteration protected by
srcu in fsnotify() stays safe.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fanotify wants to drop fsnotify_mark_srcu lock when waiting for response
from userspace so that the whole notification subsystem is not blocked
during that time. This patch provides a framework for safely getting
mark reference for a mark found in the object list which pins the mark
in that list. We can then drop fsnotify_mark_srcu, wait for userspace
response and then safely continue iteration of the object list once we
reaquire fsnotify_mark_srcu.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we queue all marks for destruction on group shutdown and then
destroy them from fsnotify_destroy_group() instead from a worker thread
which is the usual path. However worker can already be processing some
list of marks to destroy so this does not make 100% all marks are really
destroyed by the time group is shut down. This isn't a big problem as
each mark holds group reference and thus group stays partially alive
until all marks are really freed but there's no point in complicating
our lives - just wait for the delayed work to be finished instead.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of removing mark from object list from fsnotify_detach_mark(),
remove the mark when last reference to the mark is dropped. This will
allow fanotify to wait for userspace response to event without having to
hold onto fsnotify_mark_srcu.
To avoid pinning inodes by elevated refcount (and thus e.g. delaying
file deletion) while someone holds mark reference, we detach connector
from the object also from fsnotify_destroy_marks() and not only after
removing last mark from the list as it was now.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we queue mark into a list of marks for destruction in
__fsnotify_free_mark() and keep the last mark reference dangling. After the
worker waits for SRCU period, it drops the last reference to the mark
which frees it. This scheme has the disadvantage that if we hold
reference to a mark and drop and reacquire SRCU lock, the mark can get
freed immediately which is slightly inconvenient and we will need to
avoid this in the future.
Move to a scheme where queueing of mark into a list of marks for
destruction happens when the last reference to the mark is dropped. Also
drop reference to the mark held by group list already when mark is
removed from that list instead of dropping it only from the destruction
worker.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Dropping mark reference can result in mark being freed. Although it
should not happen in inotify_remove_from_idr() since caller should hold
another reference, just don't risk lock up just after WARN_ON
unnecessarily. Also fold do_inotify_remove_from_idr() into the single
callsite as that function really is just two lines of real code.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
So far list of marks attached to an object (inode / vfsmount) was
protected by i_lock or mnt_root->d_lock. This dictates that the list
must be empty before the object can be destroyed although the list is
now anchored in the fsnotify_mark_connector structure. Protect the list
by a spinlock in the fsnotify_mark_connector structure to decouple
lifetime of a list of marks from a lifetime of the object. This also
simplifies the code quite a bit since we don't have to differentiate
between inode and vfsmount lists in quite a few places anymore.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
After removing all the indirection it is clear that
hlist_del_init_rcu(&mark->obj_list);
in fsnotify_destroy_marks() is not needed as the mark gets removed from
the list shortly afterwards in fsnotify_destroy_mark() ->
fsnotify_detach_mark() -> fsnotify_detach_from_object(). Also there is
no problem with mark being visible on object list while we call
fsnotify_destroy_mark() as parallel destruction of marks from several
places is properly handled (as mentioned in the comment in
fsnotify_destroy_marks(). So just remove the list removal and also the
stale comment.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We lock object list lock in fsnotify_detach_from_object() twice - once
to detach mark and second time to recalculate mask. That is unnecessary
and later it will become problematic as we will free the connector as
soon as there is no mark in it. So move recalculation of fsnotify mask
into the same critical section that is detaching mark.
This also removes recalculation of child dentry flags from
fsnotify_detach_from_object(). That is however fine. Those marks will
get recalculated once some event happens on a child.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fsnotify_detach_mark() calls fsnotify_destroy_inode_mark() or
fsnotify_destroy_vfsmount_mark() to remove mark from object list. These
two functions are however very similar and differ only in the lock they
use to protect the object list of marks. Simplify the code by removing
the indirection and removing mark from the object list in a common
function.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of passing spinlock into fsnotify_destroy_marks() determine it
directly in that function from the connector type. This will reduce code
churn when changing lock protecting list of marks.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move locking of a mark list into fsnotify_find_mark(). This reduces code
churn in the following patch changing lock protecting the list.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move locking of locks protecting a list of marks into
fsnotify_recalc_mask(). This reduces code churn in the following patch
which changes the lock protecting the list of marks.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move fsnotify_destroy_marks() to be later in the fs/notify/mark.c. It
will need some functions that are declared after its current
declaration. No functional change.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Adding notification mark to object list has been currently done through
fsnotify_add_{inode|vfsmount}_mark() helpers from
fsnotify_add_mark_locked() which call fsnotify_add_mark_list(). Remove
this unnecessary indirection to simplify the code.
Pushing all the locking to fsnotify_add_mark_list() also allows us to
allocate the connector structure with GFP_KERNEL mode.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently inode reference is held by fsnotify marks. Change the rules so
that inode reference is held by fsnotify_mark_connector structure
whenever the list is non-empty. This simplifies the code and is more
logical.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move pointer to inode / vfsmount from mark itself to the
fsnotify_mark_connector structure. This is another step on the path
towards decoupling inode / vfsmount lifetime from notification mark
lifetime.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.
So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.
This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add a comment that lifetime of a notification mark is protected by SRCU
and remove a comment about clearing of marks attached to the inode. It
is stale and more uptodate version is at fsnotify_destroy_marks() which
is the function handling this case.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull CIFS fixes from Steve French:
"This is a set of CIFS/SMB3 fixes for stable.
There is another set of four SMB3 reconnect fixes for stable in
progress but they are still being reviewed/tested, so didn't want to
wait any longer to send these five below"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
Reset TreeId to zero on SMB2 TREE_CONNECT
CIFS: Fix build failure with smb2
Introduce cifs_copy_file_range()
SMB3: Rename clone_range to copychunk_range
Handle mismatched open calls
Here are 3 small fixes for 4.11-rc6. One resolves a reported issue with
sysfs files that NeilBrown found, one is a documenatation fix for the
stable kernel rules, and the last is a small MAINTAINERS file update for
kernfs.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWOnrMw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk/JQCfQKjOpGDAR9Hs6u4YQ4hJrAHFneYAn1F4MLDW
3b0ZMnlZHkDq834UwKnB
=iiei
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are 3 small fixes for 4.11-rc6.
One resolves a reported issue with sysfs files that NeilBrown found,
one is a documenatation fix for the stable kernel rules, and the last
is a small MAINTAINERS file update for kernfs"
* tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
MAINTAINERS: separate out kernfs maintainership
sysfs: be careful of error returns from ops->show()
Documentation: stable-kernel-rules: fix stable-tag format
Pull VFS fixes from Al Viro:
"statx followup fixes and a fix for stack-smashing on alpha"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
alpha: fix stack smashing in old_adjtimex(2)
statx: Include a mask for stx_attributes in struct statx
statx: Reserve the top bit of the mask for future struct expansion
xfs: report crtime and attribute flags to statx
ext4: Add statx support
statx: optimize copy of struct statx to userspace
statx: remove incorrect part of vfs_statx() comment
statx: reject unknown flags when using NULL path
Documentation/filesystems: fix documentation for ->getattr()
This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.
Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
ops->show() can return a negative error code.
Commit 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
(in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
would look like large numbers.
As a result, if an error is returned, sysfs_kf_read() will return the
value of 'count', typically 4096.
Commit 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
(in v4.8) extended this error to use the unsigned large 'len' as a size for
memmove().
Consequently, if ->show returns an error, then the first read() on the
sysfs file will return 4096 and could return uninitialized memory to
user-space.
If the application performs a subsequent read, this will trigger a memmove()
with extremely large count, and is likely to crash the machine is bizarre ways.
This bug can currently only be triggered by reading from an md
sysfs attribute declared with __ATTR_PREALLOC() during the
brief period between when mddev_put() deletes an mddev from
the ->all_mddevs list, and when mddev_delayed_delete() - which is
scheduled on a workqueue - completes.
Before this, an error won't be returned by the ->show()
After this, the ->show() won't be called.
I can reproduce it reliably only by putting delay like
usleep_range(500000,700000);
early in mddev_delayed_delete(). Then after creating an
md device md0 run
echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state
The bug can be triggered without the usleep.
Fixes: 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
Fixes: 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: move pcp and lru-pcp draining into single wq
mailmap: update Yakir Yang email address
mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
dax: fix radix tree insertion race
mm, thp: fix setting of defer+madvise thp defrag mode
ptrace: fix PTRACE_LISTEN race corrupting task->state
vmlinux.lds: add missing VMLINUX_SYMBOL macros
mm/page_alloc.c: fix print order in show_free_areas()
userfaultfd: report actual registered features in fdinfo
mm: fix page_vma_mapped_walk() for ksm pages
While running generic/340 in my test setup I hit the following race. It
can happen with kernels that support FS DAX PMDs, so v4.10 thru
v4.11-rc5.
Thread 1 Thread 2
-------- --------
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
'entry' is NULL, can't call lock_slot()
spin_unlock_irq()
radix_tree_preload()
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
...
lock_slot()
spin_unlock_irq()
dax_pmd_insert_mapping()
<inserts a PMD mapping>
spin_lock_irq()
__radix_tree_insert() fails with -EEXIST
<fall back to 4k fault, and die horribly
when inserting a 4k entry where a PMD exists>
The issue is that we have to drop mapping->tree_lock while calling
radix_tree_preload(), but since we didn't have a radix tree entry to
lock (unlike in the pmd_downgrade case) we have no protection against
Thread 2 coming along and inserting a PMD at the same index. For 4k
entries we handled this with a special-case response to -EEXIST coming
from the __radix_tree_insert(), but this doesn't save us for PMDs
because the -EEXIST case can also mean that we collided with a 4k entry
in the radix tree at a different index, but one that is covered by our
PMD range.
So, correctly handle both the 4k and 2M collision cases by explicitly
re-checking the radix tree for an entry at our index once we reacquire
mapping->tree_lock.
This patch has made it through a clean xfstests run with the current
v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
loop. It used to fail within the first 10 iterations.
Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org> [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fdinfo for userfault file descriptor reports UFFD_API_FEATURES. Up
until recently, the UFFD_API_FEATURES was defined as 0, therefore
corresponding field in fdinfo always contained zero. Now, with
introduction of several additional features, UFFD_API_FEATURES is not
longer 0 and it seems better to report actual features requested for the
userfaultfd object described by the fdinfo.
First, the applications that were using userfault will still see zero at
the features field in fdinfo. Next, reporting actual features rather
than available features, gives clear indication of what userfault
features are used by an application.
Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without this fix (and another to the userspace component itself
described later), the kernel will be unable to process any OrangeFS
requests after the userspace component is restarted (due to a crash or
at the administrator's behest).
The bug here is that inside orangefs_remount, the orangefs_request_mutex
is locked. When the userspace component restarts while the filesystem
is mounted, it sends a ORANGEFS_DEV_REMOUNT_ALL ioctl to the device,
which causes the kernel to send it a few requests aimed at synchronizing
the state between the two. While this is happening the
orangefs_request_mutex is locked to prevent any other requests going
through.
This is only half of the bugfix. The other half is in the userspace
component which outright ignores(!) requests made before it considers
the filesystem remounted, which is after the ioctl returns. Of course
the ioctl doesn't return until after the userspace component responds to
the request it ignores. The userspace component has been changed to
allow ORANGEFS_VFS_OP_FEATURES regardless of the mount status.
Mike Marshall says:
"I've tested this patch against the fixed userspace part. This patch is
real important, I hope it can make it into 4.11...
Here's what happens when the userspace daemon is restarted, without
the patch:
=============================================
[ INFO: possible recursive locking detected ]
[ 4.10.0-00007-ge98bdb3 #1 Not tainted ]
---------------------------------------------
pvfs2-client-co/29032 is trying to acquire lock:
(orangefs_request_mutex){+.+.+.}, at: service_operation+0x3c7/0x7b0 [orangefs]
but task is already holding lock:
(orangefs_request_mutex){+.+.+.}, at: dispatch_ioctl_command+0x1bf/0x330 [orangefs]
CPU: 0 PID: 29032 Comm: pvfs2-client-co Not tainted 4.10.0-00007-ge98bdb3 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
Call Trace:
__lock_acquire+0x7eb/0x1290
lock_acquire+0xe8/0x1d0
mutex_lock_killable_nested+0x6f/0x6e0
service_operation+0x3c7/0x7b0 [orangefs]
orangefs_remount+0xea/0x150 [orangefs]
dispatch_ioctl_command+0x227/0x330 [orangefs]
orangefs_devreq_ioctl+0x29/0x70 [orangefs]
do_vfs_ioctl+0xa3/0x6e0
SyS_ioctl+0x79/0x90"
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") introduced the proc_douintvec helper function, but it forgot to
add the related sanity check when doing register_sysctl_table. So add
it now.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the cifs module breaks the CIFS specs on reconnect as
described in http://msdn.microsoft.com/en-us/library/cc246529.aspx:
"TreeId (4 bytes): Uniquely identifies the tree connect for the
command. This MUST be 0 for the SMB2 TREE_CONNECT Request."
Signed-off-by: Jan-Marek Glogowski <glogow@fbihome.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
I saw the following build error during a randconfig build:
fs/cifs/smb2ops.c: In function 'smb2_new_lease_key':
fs/cifs/smb2ops.c:1104:2: error: implicit declaration of function 'generate_random_uuid' [-Werror=implicit-function-declaration]
Explicit include the right header to fix this issue.
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
The earlier changes to copy range for cifs unintentionally disabled the more
common form of server side copy.
The patch introduces the file_operations helper cifs_copy_file_range()
which is used by the syscall copy_file_range. The new file operations
helper allows us to perform server side copies for SMB2.0 and 2.1
servers as well as SMB 3.0+ servers which do not support the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE.
The new helper uses the ioctl FSCTL_SRV_COPYCHUNK_WRITE to perform
server side copies. The helper is called by vfs_copy_file_range() only
once an attempt to clone the file using the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE has failed.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Server side copy is one of the most important mechanisms smb2/smb3
supports and it was unintentionally disabled for most use cases.
Renaming calls to reflect the underlying smb2 ioctl called. This is
similar to the name duplicate_extents used for a similar ioctl which is
also used to duplicate files by reusing fs blocks. The name change is to
avoid confusion.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
A signal can interrupt a SendReceive call which result in incoming
responses to the call being ignored. This is a problem for calls such as
open which results in the successful response being ignored. This
results in an open file resource on the server.
The patch looks into responses which were cancelled after being sent and
in case of successful open closes the open fids.
For this patch, the check is only done in SendReceive2()
RH-bz: 1403319
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: Stable <stable@vger.kernel.org>
Apparently FIEMAP for xattrs has been broken since we switched to
the iomap backend because of an incorrect check for xattr presence.
Also fix the broken locking.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
No one cares about the low-level helper anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Rework the inline directory verifier to avoid crashes on disk corruption
- Don't change file size when punching holes w/ KEEP_SIZE
- Close a kernel memory exposure bug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJY4qGYAAoJEPh/dxk0SrTr3mwQAIDqc1RlZThYETn5Mru9BeQ0
NmiDbgO1394OSxSxpBVZCVW1jU23j3eXbOgO0oD8iEOySdTOwAoCxb78aYUkHsbS
wKKvix2kprIfsAYfGW264MZjX2JYiBUTjhV8dYw9UB5Kot3TPvc+IgC7ICJtcb08
y+/ycjbgMwlWxOQwkReaPhVYhKXa3/vqLBNN0E9oacSdpT10Mkb95RDxwFpe5zX2
RjPYode8RHsoI9+OdvVwvbmPrvylxEt3jKZdnStRMjmAX/X2aUNVDYSCOSpNg+sz
9kXec96kkoJ4oQVfiaEo0k4LWlTZnDSFe25CX8Gxu7yPgGt8XsnDADkieGBdVcYq
2q3AfEpmQrYgnLFRu4D3A8KEX8IrP8MLAh/zY/heVp9fTx9uhDvnjF+f+F6mMYXF
wzEf27RN+DeaXRLbHp35NOKPriT63JpRB/Hsgew6ibEiVPdy8J1+VYoJSohixr4d
yryDiwtph25spMYxeH6HIiPCeWa6fMr08vwuasWJTEZirxYmF/ZEHTK/bU+ydlDp
R9/UDTmiP0V10xO09bjLgFiREIO/b1fiOoniAMhhrT4tW2FL31hRpzl8jPZ++2X/
W2xLMJ1iyVw/Cf1NDWGFIOFX6Y+lnN6dZHftNUno6MFlmdmajNaTbRTBiSGnPH4U
URH7Dxpmkgp7xuNs8xqv
=QdXx
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Darrick Wong:
"Here are three more fixes for 4.11.
The first one reworks the inline directory verifier to check the
working copy of the directory metadata and to avoid triggering a
periodic crash in xfs/348. The second patch fixes a regression in hole
punching at EOF that corrupts files; and the third patch closes a
kernel memory disclosure bug.
Summary:
- rework the inline directory verifier to avoid crashes on disk
corruption
- don't change file size when punching holes w/ KEEP_SIZE
- close a kernel memory exposure bug"
* tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix kernel memory exposure problems
xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
xfs: rework the inline directory verifiers
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWOYUWfSw1s6N8H32AQIqdg/9Hi+47eues/TBbogP8eRrqVEoNHFy75e/
MMTFe0/Qio7ps78VuOSThbqh96dzIX5K5/7JdiHZyQk2QCTaJ2BvheCUISQovhFl
yuAJcBhkO5iiQkR0agYdHVjIQGRth3usNIEyD1rm1DS/lr8ec9/iyjoipKpsZmxt
WlRF3eGgqA+cLpH4K+k4x/LJwIl8868MBz58p6XXW2yZFRygQzYHmMobhDwgLoC2
C2lHPEyllK7qcIaZD7SI/a2/bMwh7QTx1tJuQK3DgtJrAHigx96uxH3jqECk7fLg
EhjLqIFmWVCUcrBbUqjlNtcuevzxCZTCCB0LAZgmOTyEyCFJzgoQmQo97VhMPbG5
JF9bKg+JE6P5iwqtTBEW9p+LoyM7VAt6SzeuKH/vNAVGHc0ULDMB8XPYF3Nvqa3L
RGIwcxWCAItZFdDCUvWgTyEuZtVXu6LbuvnU1HkaXlXsLLi4041MQgcekY58k6kv
z4YnXojy0+mciJ4WV/7CLfNMyP36G0gwLugjLAsJwigxJoOTtsfphkGcpWlP9hBm
IyTFJw0qbbuQD7fprfw//e+IgfjDQbxYMQKxQaJZflXzYDCab8PQkFTl1kWPrJSR
yR0rk8wKb7Z1fyl/zpUNbw7KdFganqhkZ6jbOrr9G8Hp8jyubnwMCI5B2rSZfe1V
CSOtCbEt8Is=
=JnXV
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20170406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Miscellany
Here's a set of patches that make some minor changes to AF_RXRPC:
(1) Store error codes in struct rxrpc_call::error as negative codes and
only convert to positive in recvmsg() to avoid confusion inside the
kernel.
(2) Note the result of trying to abort a call (this fails if the call is
already 'completed').
(3) Don't abort on temporary errors whilst processing challenge and
response packets, but rather drop the packet and wait for
retransmission.
And also adds some more tracing:
(4) Protocol errors.
(5) Received abort packets.
(6) Changes in the Rx window size due to ACK packet information.
(7) Client call initiation (to allow the rxrpc_call struct pointer, the
wire call ID and the user ID/afs_call pointer to be cross-referenced).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).
Signed-off-by: David S. Miller <davem@davemloft.net>
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent. We only
turn them positive when transcribing into a cmsg for userspace recvmsg.
Signed-off-by: David Howells <dhowells@redhat.com>
Since callers statically know which type to use, make_dentry_ptr()
can simply be splitted into two inline functions. This way, the code
has less inlined, fewer arguments, and no cast.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to split in-place-update bios from sequential bios.
Suggested-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The variable 'i' has been defined before, so here we can
use it directly.
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If two threads try to flush dirty pages in different inodes respectively,
f2fs_write_data_pages() will produce WRITE and WRITE_SYNC one at a time,
resulting in a lot of 4KB seperated IOs.
So, this patch gives higher priority to WB_SYNC_ALL IOs and gathers write
IOs with a big WRITE_SYNC'ed bio.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It would better split small and large IOs separately in order to get more
consecutive big writes.
The default threshold is set to 64KB, but configurable by sysfs/min_hot_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes to use bitmap instead of extent in struct discard_entry
to indicate discard range in one segment, for fragmented space, this
implementation can save memory footprint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adds to count discard command entry and show the number in debugfs,
also fix to add cost of discard command cache into total comsumed
memory footprint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Show historical count of flush command and discard command.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 86066914ed "gfs2: Don't support
fallocate on jdata files" removed the ability of gfs2_grow to reserve
space at the end of the rindex, which could prevent a second gfs2_grow
from succeeding if the fs is full. Allow fallocate to work on the rindex
once again.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
dquot_writeback_dquots() expects s_umount semaphore to be held to
protect it from other concurrent quota operations. reiserfs_sync_fs()
can call dquot_writeback_dquots() without holding s_umount semaphore
when called from flush_old_commits().
Fix the problem by grabbing s_umount in flush_old_commits(). However we
have to be careful and use only trylock since reiserfs_cancel_old_sync()
can be waiting for flush_old_commits() to complete while holding
s_umount semaphore. Possible postponing of sync work is not a big deal
though as that is only an opportunistic flush.
Fixes: 9d1ccbe70e
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently canceling of delayed work that flushes old data using
cancel_old_flush() does not prevent work from being requeued. Thus
in theory new work can be queued after cancel_old_flush() from
reiserfs_freeze() has run. This will become larger problem once
flush_old_commits() can requeue the work itself.
Fix the problem by recording in sbi->work_queue that flushing work is
canceled and should not be requeued.
Signed-off-by: Jan Kara <jack@suse.cz>
ext2_sync_fs() could be called without s_umount semaphore held when
called through ext2_write_super() from __ext2_write_inode(). This
function then calls dquot_writeback_dquots() which relies on s_umount to
be held for protection against other quota operations.
In fact __ext2_write_inode() does not need all the functionality
ext2_write_super() provides. It is enough to just write the superblock.
So use ext2_sync_super() instead.
Fixes: 9d1ccbe70e
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use the realtime bitmap to return free space information via getfsmap.
Eventually this will be superseded by the realtime rmapbt code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If the reverse-mapping btree isn't available, fall back to the
free space btrees to provide partial reverse mapping information.
The online scrub tool can make use of even partial information to
speed up the data block scan.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add _query_range and _query_all functions to the realtime bitmap
allocator. These two functions are similar in usage to the btree
functions with the same name and will be used for getfsmap and scrub.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a helper function that will query all records in a btree.
This will be used by the online repair functions to examine every
record in a btree to rebuild a second btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Implement a query_range function for the bnobt and cntbt. This will
be used for getfsmap fallback if there is no rmapbt and by the online
scrub and repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Plumb in the pieces (init_high_key, diff_two_keys) necessary to call
query_range on the free space btrees. Remove the debugging asserts
so that we can make queries starting from block 0.
While we're at it, merge the redundant "if (btnum ==" hunks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
from userspace, or else we end up copying random stack contents into the
kernel. struct getbmap is a strict subset of getbmapx, so a partial
structure copy should work fine.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This function has been removed ever since at least 3.12-era. No need to
keep its declaration in the header so nuke it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
"xfs_iread: validation failed for inode 96 failed"
One "failed" seems like enough.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Opencoding the trivial checks makes it much easier to read (and grep..).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This checks for all the non-normal extent types, including handling both
encodings of delayed allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The log covering background task used to be part of the xfssyncd
workqueue. That workqueue was removed as of commit 5889608df ("xfs:
syncd workqueue is no more") and the associated work item scheduled
to the xfs-log wq. The latter is used for log buffer I/O completion.
Since xfs_log_worker() can invoke a log flush, a deadlock is
possible between the xfs-log and xfs-cil workqueues. Consider the
following codepath from xfs_log_worker():
xfs_log_worker()
xfs_log_force()
_xfs_log_force()
xlog_cil_force()
xlog_cil_force_lsn()
xlog_cil_push_now()
flush_work()
The above is in xfs-log wq context and blocked waiting on the
completion of an xfs-cil work item. Concurrently, the cil push in
progress can end up blocked here:
xlog_cil_push_work()
xlog_cil_push()
xlog_write()
xlog_state_get_iclog_space()
xlog_wait(&log->l_flush_wait, ...)
The above is in xfs-cil context waiting on log buffer I/O
completion, which executes in xfs-log wq context. In this scenario
both workqueues are deadlocked waiting on eachother.
Add a new workqueue specifically for the high level log covering and
ail pushing worker, as was the case prior to commit 5889608df.
Diagnosed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fix a memory exposure problems in inumbers where we allocate an array of
structures with holes, fail to zero the holes, then blindly copy the
kernel memory contents (junk and all) into userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:
calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
calvinow@vm-disks/generic-xfs-1 ~$ stat test
Size: 2048 Blocks: 8 IO Block: 4096 regular file
calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
calvinow@vm-disks/generic-xfs-1 ~$ stat test
Size: 4096 Blocks: 8 IO Block: 4096 regular file
Commit 3c2bdc912a ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.
Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().
Reported-by: Aaron Gao <gzh@fb.com>
Fixes: 3c2bdc912a ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fix a memory exposure problems in inumbers where we allocate an array of
structures with holes, fail to zero the holes, then blindly copy the
kernel memory contents (junk and all) into userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:
calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
calvinow@vm-disks/generic-xfs-1 ~$ stat test
Size: 2048 Blocks: 8 IO Block: 4096 regular file
calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
calvinow@vm-disks/generic-xfs-1 ~$ stat test
Size: 4096 Blocks: 8 IO Block: 4096 regular file
Commit 3c2bdc912a ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.
Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().
Reported-by: Aaron Gao <gzh@fb.com>
Fixes: 3c2bdc912a ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inline directory verifiers should be called on the inode fork data,
which means after iformat_local on the read side, and prior to
ifork_flush on the write side. This makes the fork verifier more
consistent with the way buffer verifiers work -- i.e. they will operate
on the memory buffer that the code will be reading and writing directly.
Furthermore, revise the verifier function to return -EFSCORRUPTED so
that we don't flood the logs with corruption messages and assert
notices. This has been a particular problem with xfs/348, which
triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
kernel when CONFIG_XFS_DEBUG=y. Disk corruption isn't supposed to do
that, at least not in a verifier.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move recalculation of inode / vfsmount notification mask under
group->mark_mutex of the mark which was modified. These are the only
places where mask recalculation happens without mark being protected
from detaching from inode / vfsmount which will cause issues with the
following patches.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Printing inode pointers in warnings has dubious value and with future
changes we won't be able to easily get them without either locking or
chances we oops along the way. So just remove inode pointers from the
warning messages.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
show_fdinfo() iterates group's list of marks. All marks found there are
guaranteed to be alive and they stay so until we release
group->mark_mutex. So remove uncecessary tests whether mark is alive.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Revert commit 86d067a797: it turns out
that waiting for iopen glock dequeues here isn't needed anymore because
the bugs that commit was meant to fix have been fixed otherwise.
In addition, we want to avoid waiting on glocks in gfs2_evict_inode in
shrinker context because the shrinker may be invoked on behalf of DLM,
in which case calling into DLM again would deadlock. This commit makes
the described scenario less likely without completely avoiding it; it's
still a step in the right direction, though.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Switch from rhashtable_lookup_insert_fast + rhashtable_lookup_fast to
rhashtable_lookup_get_insert_fast, which is cleaner and avoids an extra
rhashtable lookup.
At the same time, turn the retry loop in gfs2_glock_get into an infinite
loop. The lookup or insert will eventually succeed, usually very fast,
but there is no reason to give up trying at a fixed number of
iterations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Include a mask in struct stat to indicate which bits of stx_attributes the
filesystem actually supports.
This would also be useful if we add another system call that allows you to
do a 'bulk attribute set' and pass in a statx struct with the masks
appropriately set to say what you want to set.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reserve the top bit of the mask for future expansion of the statx struct
and give an error if statx() sees it set. All the other bits are ignored
if we see them set but don't support the bit; we just clear the bit in the
returned mask.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
statx has the ability to report inode creation times and inode flags, so
hook up di_crtime and di_flags to that functionality.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Return enhanced file attributes from the Ext4 filesystem. This includes
the following:
(1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.
(2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.
This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.
Example output:
[root@andromeda ~]# touch foo
[root@andromeda ~]# chattr +ai foo
[root@andromeda ~]# /tmp/test-statx foo
statx(foo) = 0
results=fff
Size: 0 Blocks: 0 IO Block: 4096 regular file
Device: 08:12 Inode: 2101950 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2016-02-11 17:08:29.031795451+0000
Modify: 2016-02-11 17:08:29.031795451+0000
Change: 2016-02-11 17:11:11.987790114+0000
Birth: 2016-02-11 17:08:29.031795451+0000
Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I found that statx() was significantly slower than stat(). As a
microbenchmark, I compared 10,000,000 invocations of fstat() on a tmpfs
file to the same with statx() passed a NULL path:
$ time ./stat_benchmark
real 0m1.464s
user 0m0.275s
sys 0m1.187s
$ time ./statx_benchmark
real 0m5.530s
user 0m0.281s
sys 0m5.247s
statx is expected to be a little slower than stat because struct statx
is larger than struct stat, but not by *that* much. It turns out that
most of the overhead was in copying struct statx to userspace, mostly in
all the stac/clac instructions that got generated for each __put_user()
call. (This was on x86_64, but some other architectures, e.g. arm64,
have something similar now too.)
stat() instead initializes its struct on the stack and copies it to
userspace with a single call to copy_to_user(). This turns out to be
much faster, and changing statx to do this makes it almost as fast as
stat:
$ time ./statx_benchmark
real 0m1.624s
user 0m0.270s
sys 0m1.354s
For zeroing the reserved fields, start by zeroing the full struct with
memset. This makes it clear that every byte copied to userspace is
initialized, even implicit padding bytes (though there are none
currently). In the scenarios I tested, it also performed the same as a
designated initializer. Manually initializing each field was still
slightly faster, but would have been more error-prone and less
verifiable.
Also rename statx_set_result() to cp_statx() for consistency with
cp_old_stat() et al., and make it noinline so that struct statx doesn't
add to the stack usage during the main portion of the syscall execution.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
request_mask and query_flags are function arguments, not passed in
struct kstat. So remove the part of the comment which claims otherwise.
This was apparently left over from an earlier version of the statx
patch.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The statx() system call currently accepts unknown flags when called with
a NULL path to operate on a file descriptor. Left unchanged, this could
make it hard to introduce new query flags in the future, since
applications may not be able to tell whether a given flag is supported.
Fix this by failing the system call with EINVAL if any flags other than
KSTAT_QUERY_FLAGS are specified in combination with a NULL path.
Arguably, we could still permit known lookup-related flags such as
AT_SYMLINK_NOFOLLOW. However, that would be inconsistent with how
sys_utimensat() behaves when passed a NULL path, which seems to be the
closest precedent. And given that the NULL path case is (I believe)
mainly intended to be used to implement a wrapper function like fstatx()
that doesn't have a path argument, I think rejecting lookup-related
flags too is probably the best choice.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kasan: do not sanitize kexec purgatory
drivers/rapidio/devices/tsi721.c: make module parameter variable name unique
mm/hugetlb.c: don't call region_abort if region_chg fails
kasan: report only the first error by default
hugetlbfs: initialize shared policy as part of inode allocation
mm: fix section name for .data..ro_after_init
mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
mm: workingset: fix premature shadow node shrinking with cgroups
mm: rmap: fix huge file mmap accounting in the memcg stats
mm: move mm_percpu_wq initialization earlier
mm: migrate: fix remove_migration_pte() for ksm pages
backchannel; fix. Also some minor refinements to the nfsd
version-setting interface that we'd like to get fixed before release.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
Za7EtdB84M1ZGgjO3JhY
=oi0+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.
Also some minor refinements to the nfsd version-setting interface that
we'd like to get fixed before release"
* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: set XPT_CONG_CTRL flag for bc xprt
NFSD: fix nfsd_reset_versions for NFSv4.
NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
NFSD: further refinement of content of /proc/fs/nfsd/versions
nfsd: map the ENOKEY to nfserr_perm for avoiding warning
SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
Pull btrfs fixes from Chris Mason:
"We have three small fixes queued up in my for-linus-4.11 branch"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix an integer overflow check
btrfs: Change qgroup_meta_rsv to 64bit
Btrfs: bring back repair during read
Any time after inode allocation, destroy_inode can be called. The
hugetlbfs inode contains a shared_policy structure, and
mpol_free_shared_policy is unconditionally called as part of
hugetlbfs_destroy_inode. Initialize the policy as part of inode
allocation so that any quick (error path) calls to destroy_inode will be
handed an initialized policy.
syzkaller fuzzer found this bug, that resulted in the following:
BUG: KASAN: user-memory-access in atomic_inc
include/asm-generic/atomic-instrumented.h:87 [inline] at addr
000000131730bd7a
BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
Write of size 4 by task syz-executor6/14086
CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
__lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
__raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
_raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
alloc_inode+0x10d/0x180 fs/inode.c:216
new_inode_pseudo+0x69/0x190 fs/inode.c:889
new_inode+0x1c/0x40 fs/inode.c:918
hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
newseg+0x422/0xd30 ipc/shm.c:575
ipcget_new ipc/util.c:285 [inline]
ipcget+0x21e/0x580 ipc/util.c:639
SYSC_shmget ipc/shm.c:673 [inline]
SyS_shmget+0x158/0x230 ipc/shm.c:657
entry_SYSCALL_64_fastpath+0x1f/0xc2
Analysis provided by Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable Bugfixes:
- Fix infinite loop on BAD_STATEID error
Other Bugfixes:
- Fix old dentry rehash after move
- Fix pnfs GETDEVINFO hangs
- Fix pnfs fallback to MDS on commit errors
- Fix flexfiles kernel oops
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAljeko0ACgkQ18tUv7Cl
QOsZbw/+MXYEZCaILgaAjzoWO4qJhNuAzlblh2jSX2nitY6NAva2MORZoAnxqS/G
2qVWdYLvfQ2rKkIazInktaqBgsnl5Got9EbrD2hdV5LvM953U3KJkeXZD67ncvV7
YYmxaFipLfTfmLZDXlQ1h5wTKXXXw0VA2v2YL+sRZhAzVhcTyMyf1n89lT6H9Fqx
UPitMuBbRskCPFMOZ6xP+T6MsOpeIGVHHOvYWSSydCT6IoujnTjNRaZ9+VFSm5iU
rubL/qokg2VIJ60xbmv/toq61FkhI2xtJTtBFxHZo47En3RdwB53zOez/OmvTFLZ
lKSCh/Xkk/DDhUQWrYDaydPyGE6mWR+E/18/BZh0tx0n+yvHPg0ax56CTSJwpvg7
f6pydxQo0RAH3S/IWb0JB3jyi++EKznWK2OokllOdmT3DrYyKD3LiTY4T2MO8Wlu
+miFq6yk1iQHZR4R4JDkCWzpD7JeeR6lkg19kRZlBJm2Pv+8Rzg4qjBgqAo5OPxV
RiQ0CuyKoR2rtMz/4pepBai4fM42S/Nc59QJI/45mTUJ40whyoygJEov7s3SdTZi
H6cL7Ewe8m++MNW2aAsM2M0CVzoyv0D4mnPxgE8t8lNbkcT14bdi1WFwAHo/ICMd
KpyhsDVltakqUbelD3WWJtTXbAPCgwlOzuPdfbqtBMQJNP1aONs=
=1Xgv
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.11-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Here are a few more bugfixes that came in over the last couple of
weeks. Most of these fix various hangs and loops that people found,
but we also had a few error handling fixes.
Stable Bugfixes:
- fix infinite loop on BAD_STATEID error
Other Bugfixes:
- fix old dentry rehash after move
- fix pnfs GETDEVINFO hangs
- fix pnfs fallback to MDS on commit errors
- fix flexfiles kernel oops"
* tag 'nfs-for-4.11-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: flexfiles: fix kernel OOPS if MDS returns unsupported DS type
NFSv4.1 fix infinite loop on IO BAD_STATEID error
PNFS fix fallback to MDS if got error on commit to DS
NFS filelayout:call GETDEVICEINFO after pnfs_layout_process completes
NFS store nfs4_deviceid in struct nfs4_filelayout_segment
NFS cleanup struct nfs4_filelayout_segment
NFS: Fix old dentry rehash after move
Commit 63d63cbf5e "NFSv4.1: Don't recheck delegations that
have already been checked" introduced a regression where when a
client received BAD_STATEID error it would not send any TEST_STATEID
and instead go into an infinite loop of resending the IO that caused
the BAD_STATEID.
Fixes: 63d63cbf5e ("NFSv4.1: Don't recheck delegations that have already been checked")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # 4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Upong receiving some errors (EACCES) on commit to the DS the code
doesn't fallback to MDS and intead retrieds to the same DS again.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Remove faulty leftover check in do_rename(), apparently introduced in a
merge that combined whiteout support changes with commit f03b8ad8d3
("fs: support RENAME_NOREPLACE for local filesystems")
Fixes: f03b8ad8d3 ("fs: support RENAME_NOREPLACE for local
filesystems")
Fixes: 9e0a1fff8d ("ubifs: Implement RENAME_WHITEOUT")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Richard Weinberger <richard@nod.at>
instead of filenames, print inode numbers, file types, and length.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
if a character is not printable, print '?' instead of that.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
if filename is encrypted, filename could have no printable characters.
so remove it.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
- has_not_enough_free_secs
node_secs: 0 dent_secs: 0 freed:0 free_segments:103 reserved:104
- f2fs_gc
- get_victim_by_default
alloc_mode 0, gc_mode 1, max_search 2672, offset 4654, ofs_unit 1
- do_garbage_collect
start_segno 3976, end_segno 3977 type 0
- is_alive
nid 22797, blkaddr 2131882, ofs_in_node 0, version 0x8/0x0
- gc_data_segment 766, segno 3976, block 512/426 not alive
So, this patch fixes subtle corrupted case where node version does not match
to summary version which results in infinite loop by gc.
Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to give more spatial locality, this patch changes the block allocation
policy which assigns beginning of partition for small and hot data/node blocks.
In order to do this, we set noheap allocation by default and introduce another
mount option, heap, to reset it back.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
generic_permission() presently checks CAP_DAC_OVERRIDE prior to
CAP_DAC_READ_SEARCH. This can cause misleading audit messages when
using a LSM such as SELinux or AppArmor, since CAP_DAC_OVERRIDE
may not be required for the operation. Flip the order of the
tests so that CAP_DAC_OVERRIDE is only checked when required for
the operation.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This isn't super serious because you need CAP_ADMIN to run this code.
I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks... There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size. The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():
alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
Fixes: f5ecec3ce2 ("btrfs: send: silence an integer overflow warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.
This affects exclusive qgroups only.
TEST CASE:
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp
umount $SUBVOL
umount $MOUNTPOINT
mkfs.btrfs -f $DEVICE
mount /dev/vdb $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount /dev/vdb $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL
btrfs quota rescan /mnt -w
for i in `seq 1 44000`; do
dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
if [[ $? > 0 ]]; then
btrfs qgroup show -pcref $SUBVOL
exit 1
fi
done
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.
This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
There is an include loop between netdevice.h, dsa.h, devlink.h because
of NETDEV_ALIGN, making it impossible to use devlink structures in
dsa.h.
Break this loop by taking dsa.h out of netdevice.h, add a forward
declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
function, which is what netdevice.h requires.
No longer having dsa.h in netdevice.h means the includes in dsa.h no
longer get included. This breaks a few other files which depend on
these includes. Add these directly in the affected file.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes missing increased max cost caused by a patch that we increased
cose of data segments in greedy algorithm.
Cc: <stable@vger.kernel.org> # v4.10+
Fixes: b9cd20619 "f2fs: node segment is prior to data segment selected victim"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The inline directory verifiers should be called on the inode fork data,
which means after iformat_local on the read side, and prior to
ifork_flush on the write side. This makes the fork verifier more
consistent with the way buffer verifiers work -- i.e. they will operate
on the memory buffer that the code will be reading and writing directly.
Furthermore, revise the verifier function to return -EFSCORRUPTED so
that we don't flood the logs with corruption messages and assert
notices. This has been a particular problem with xfs/348, which
triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
kernel when CONFIG_XFS_DEBUG=y. Disk corruption isn't supposed to do
that, at least not in a verifier.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: get the inode d_ops the proper way
v3: describe the bug that this patch fixes; no code changes
Fix a filelayout GETDEVICEINFO call hang triggered from the LAYOUTGET
pnfs_layout_process where the GETDEVICEINFO call is waiting for a
session slot, and the LAYOUGET call is waiting for pnfs_layout_process
to complete before freeing the slot GETDEVICEINFO is waiting for..
This occurs in testing against the pynfs pNFS server where the
the on-wire reply highest_slotid and slot id are zero, and the
target high slot id is 8 (negotiated in CREATE_SESSION).
The internal fore channel slot table max_slotid, the maximum allowed
table slotid value, has been reduced via nfs41_set_max_slotid_locked
from 8 to 1. Thus there is one slot (slotid 0) available for use but
it has not been freed by LAYOUTGET proir to the GETDEVICEINFO request.
In order to ensure that layoutrecall callbacks are processed in the
correct order, nfs4_proc_layoutget processing needs to be finished
e.g. pnfs_layout_process) before giving up the slot that identifies
the layoutget (see referring_call_exists).
Move the filelayout_check_layout nfs4_find_get_device call outside of
the pnfs_layout_process call tree.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In preparation for moving the filelayout getdeviceinfo call from
filelayout_alloc_lseg called by pnfs_process_layout
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
URLs to ftp.kernel.org are still exist though the service is closed [0].
This commit fixes the URLs to use www.kernel.org instead.
[0] https://www.kernel.org/shutting-down-ftp-services.html
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Now that nfs_rename()'s d_move has moved within the RPC task's rpc_call_done
callback, rehashing new_dentry will actually rehash the old dentry's name
in nfs_rename(). d_move() is going to rehash the new dentry for us anyway,
so doing it again here is unnecessary.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 920b4530fb ("NFS: nfs_rename() handle -ERESTARTSYS dentry left behind")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Here is a single kernfs fix for 4.11-rc4 that resolves a reported issue.
It has been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWNedpw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykLkgCdEVdmtWb9Fd0igfh7bSWBHdD9W20An3vKOror
nTP7sT8FwSWGKdOpIaik
=0Eht
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"Here is a single kernfs fix for 4.11-rc4 that resolves a reported
issue.
It has been in linux-next with no reported issues"
* tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()
inodes relating to the inline_data and metadata checksum features.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAljXHNMACgkQ8vlZVpUN
gaPwoggAiodb37DHZ/X6fnRr8314OJT8mRUbUK3aDagCRb0Kp9iFAwwpHIG8Gxw1
akI7Jy8VWLC4EbHb9wzXFEO7wl/IBLq3t70Vid2cBR302gblhIIz6hkHrQ9RIlW3
MH5sFhXiVq4WYPuxQFWS6ohg6/SYTwcgI9rXxEnkLVmOiG2Ov2/v4/wiflau8vgK
fNYyncHSylwJ5QIaT8mUIawetlunEHO0Vz5AZNzkcMhkzUHxmRWvMtGWcvwukstb
7vXZhN5HHB8RZ33qcdtuAaNBHwBmrU/acicIpsvL/jfkFWlJTS0PBRUvwxnPeebo
G0xRDEIwpZoy5h8fxzIxqh+CQqg6QA==
=/ycw
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix a memory leak on an error path, and two races when modifying
inodes relating to the inline_data and metadata checksum features"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix two spelling nits
ext4: lock the xattr block before checksuming it
jbd2: don't leak memory if setting up journal fails
ext4: mark inode dirty after converting inline directory
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAljW4wYACgkQ8vlZVpUN
gaPYugf9ExFbJhN+iYqUVbGXPvlr5VpEtDeVt7IfO3a37hqCEQ0IEPzksNIfUFul
B8/rYXpz0B5gqCJeo66CGLkb1SVvSoSKCq9/BTQtugohxM7sGxDFTmdB+A+u0QJH
leILfaMFuj0DhVOrdYVpGh7e1XPgSTUWy6/G42OJqf3SV2WxGRJtyBfmghZxEdiY
XYCGqjq47yOIPvzB+ufKe1hnphKMgxlHeuPvByzPCvOs58GlxAYR3Ycuvjc/nz+8
QVlAEPpGhf9ytEXELsxq/ZbsNj9xtXsNAzkAoMK+xZ2JCxIHRcS1ay/iAwxw+d9r
bnlpI+8tQ79GIGCv3cusJSwq7j1iuQ==
=wPlW
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypto fixes from Ted Ts'o:
"A code cleanup and bugfix for fs/crypto"
* tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: eliminate ->prepare_context() operation
fscrypt: remove broken support for detecting keyring key revocation
We must lock the xattr block before calculating or verifying the
checksum in order to avoid spurious checksum failures.
https://bugzilla.kernel.org/show_bug.cgi?id=193661
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
This patch allow write data to normal file when writting
new checkpoint.
We relax three limitations for write_begin path:
1. data allocation
2. node allocation
3. variables in checkpoint
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
that is added to the ready list that contains a valid NAPI ID and it will
use that for busy polling until the ready list goes empty. Once the ready
list goes empty the NAPI ID is reset and busy polling is disabled until a
new socket is added to the ready list.
In addition when we insert a new socket into the epoll we record the NAPI
ID and assume we are going to receive events on it. If that doesn't occur
it will be evicted as the active NAPI ID and we will resume normal
behavior.
An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
options to spread the incoming connections to specific worker threads
based on the incoming queue. This enables epoll for each worker thread
to have only sockets that receive packets from a single queue. So when an
application calls epoll_wait() and there are no events available to report,
busy polling is done on the associated queue to pull the packets.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch flips the logic we were using to determine if the busy polling
has timed out. The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds to show the max number of volatile operations which are
conducting concurrently.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In below concurrent case, allocated nid can be loaded into free nid cache
and be allocated again.
Thread A Thread B
- f2fs_create
- f2fs_new_inode
- alloc_nid
- __insert_nid_to_list(ALLOC_NID_LIST)
- f2fs_balance_fs_bg
- build_free_nids
- __build_free_nids
- scan_nat_page
- add_free_nid
- __lookup_nat_cache
- f2fs_add_link
- init_inode_metadata
- new_inode_page
- new_node_page
- set_node_addr
- alloc_nid_done
- __remove_nid_from_list(ALLOC_NID_LIST)
- __insert_nid_to_list(FREE_NID_LIST)
This patch makes nat cache lookup and free nid list operation being atomical
to avoid this race condition.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use set_page_private marcro instead of operte page struct
directly
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing garbage collection, we try to record segment offset which
locates at next one of last victim, using it as the start offset in
next searching.
But in some corner cases, recorded offset may cross the end of main
segment area, it will cause incorrectly searching in dirty_segmap
bitmap. This patch adds modular operation to avoid this issue.
Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull btrfs fixes from Chris Mason:
"Zygo tracked down a very old bug with inline compressed extents.
I didn't tag this one for stable because I want to do individual
tested backports. It's a little tricky and I'd rather do some extra
testing on it along the way"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: add missing memset while reading compressed inline extents
Btrfs: fix regression in lock_delalloc_pages
btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
The latest gcc-7.0.1 snapshot warns about an unintialized variable use:
In file included from fs/reiserfs/lbalance.c:8:0:
fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3':
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);
~~^~~
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);
This happens because the offset/type pair that is stored in
ih.key.u.k_offset_v2 is actually uninitialized when we call
set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both,
all data is correct, but the first of the two reads uninitialized data
for the type field and writes it back before it gets overwritten.
This works around the warning by initializing the k_offset_v2 through
the slightly larger memcpy().
[JK: Remove now unused define and make it obvious we initialize the key]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.
The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.
Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The memory size of f2fs_stat_info also should be calculated.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After filemap_write_and_wait_range fail, the FI_ATOMIC_FILE flags is removed,
so that f2fs should not increase the stat of atomic_write.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The crc_offset towards or beyond the end of block is wrong,
sanity check it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As discuss with Jaegeuk and Chao,
"Once checkpoint is done, f2fs doesn't need to update there-in filename at all."
The disk-level filename is used only one case,
1. create a file A under a dir
2. sync A
3. godown
4. umount
5. mount (roll_forward)
Only the rename/cross_rename changes the filename, if it happens,
a. between step 1 and 2, the sync A will caused checkpoint, so that,
the roll_forward at step 5 never happens.
b. after step 2, the roll_forward happens, file A will roll forward
to the result as after step 1.
So that, any updating the disk filename is useless, just cleanup it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
free_nid_bitmap and free_nid_count in update_free_nid_bitmap should be
updated atomically, use nid_list_lock cover them to avoid race in
concurrent scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>