Its only caller always passes S_IFREG as the @type parameter. As an
additional clean-up, add a kerneldoc comment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that its two callers have their own version-specific instance of
this function, do_nfsd_create() is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The NFSv3 CREATE and NFSv4 OPEN(CREATE) use cases are about to
diverge such that it makes sense to split do_nfsd_create() into one
version for NFSv3 and one for NFSv4.
As a first step, copy do_nfsd_create() to nfs3proc.c and remove
NFSv4-specific logic.
One immediate legibility benefit is that the logic for handling
NFSv3 createhow is now quite straightforward. NFSv4 createhow
has some subtleties that IMO do not belong in generic code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I'd like to move do_nfsd_create() out of vfs.c. Therefore
nfsd_create_setattr() needs to be made publicly visible.
Note that both call sites in vfs.c commit both the new object and
its parent directory, so just combine those common metadata commits
into nfsd_create_setattr().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: The "out" label already invokes fh_drop_write().
Note that fh_drop_write() is already careful not to invoke
mnt_drop_write() if either it has already been done or there is
nothing to drop. Therefore no change in behavior is expected.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As near as I can tell, mode bit masking and setting S_IFREG is
already done by do_nfsd_create() and vfs_create(). The NFSv4 path
(do_open_lookup), for example, does not bother with this special
processing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Technically speaking, objects allocated out of a specific slab cache are
supposed to be freed to that slab cache. The popular slab backends will
take care of this for us, but SLOB famously doesn't. Fix this, even if
slob + xfs are not that common of a combination.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we're validating a recovered xattr log item during log recovery, we
should check the name before starting to allocate resources. This isn't
strictly necessary on its own, but it means that we won't bother with
huge memory allocations during recovery if the attr name is garbage,
which will simplify the changes in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Make sure we screen the "attr flags" field of recovered xattr intent log
items to reject flag bits that we don't know about. This is really the
attr *filter* field from xfs_da_args, so rename the field and create
a mask to make checking for invalid bits easier.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If a setxattr operation finds an xattr structure in leaf format, adding
the attr can fail due to lack of space and hence requires an upgrade to
node format. After this happens, we'll roll the transaction and
re-enter the state machine, at which time we need to perform a second
lookup of the attribute name to find its new location. This lookup
attaches a new da state structure to the xfs_attr_item but doesn't free
the old one (from the leaf lookup) and leaks it. Fix that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
kmemleak reported that we lost an xfs_da_state while removing xattrs in
generic/020:
unreferenced object 0xffff88801c0e4b40 (size 480):
comm "attr", pid 30515, jiffies 4294931061 (age 5.960s)
hex dump (first 32 bytes):
78 bc 65 07 00 c9 ff ff 00 30 60 1c 80 88 ff ff x.e......0`.....
02 00 00 00 00 00 00 00 80 18 83 4e 80 88 ff ff ...........N....
backtrace:
[<ffffffffa023ef4a>] xfs_da_state_alloc+0x1a/0x30 [xfs]
[<ffffffffa021b6f3>] xfs_attr_node_hasname+0x23/0x90 [xfs]
[<ffffffffa021c6f1>] xfs_attr_set_iter+0x441/0xa30 [xfs]
[<ffffffffa02b5104>] xfs_xattri_finish_update+0x44/0x80 [xfs]
[<ffffffffa02b515e>] xfs_attr_finish_item+0x1e/0x40 [xfs]
[<ffffffffa0244744>] xfs_defer_finish_noroll+0x184/0x740 [xfs]
[<ffffffffa02a6473>] __xfs_trans_commit+0x153/0x3e0 [xfs]
[<ffffffffa021d149>] xfs_attr_set+0x469/0x7e0 [xfs]
[<ffffffffa02a78d9>] xfs_xattr_set+0x89/0xd0 [xfs]
[<ffffffff812e6512>] __vfs_removexattr+0x52/0x70
[<ffffffff812e6a08>] __vfs_removexattr_locked+0xb8/0x150
[<ffffffff812e6af6>] vfs_removexattr+0x56/0x100
[<ffffffff812e6bf8>] removexattr+0x58/0x90
[<ffffffff812e6cce>] path_removexattr+0x9e/0xc0
[<ffffffff812e6d44>] __x64_sys_lremovexattr+0x14/0x20
[<ffffffff81786b35>] do_syscall_64+0x35/0x80
I think this is a consequence of xfs_attr_node_removename_setup
attaching a new da(btree) state to xfs_attr_item and never freeing it.
I /think/ it's the case that the remove paths could detach the da state
earlier in the remove state machine since nothing else accesses the
state. However, let's future-proof the new xattr code by adding a
catch-all when we free the xfs_attr_item to make sure we never leak the
da state.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Combination of LOOKUP_IS_SCOPED and NULL nd->root.mnt is impossible
after successful path_init(). All places where ->root.mnt might
become NULL do that only if LOOKUP_IS_SCOPED is not there and
path_init() itself can return success without setting nd->root
only if ND_ROOT_PRESET had been set (in which case nd->root
had been set by caller and never changed) or if the name had
been a relative one *and* none of the bits in LOOKUP_IS_SCOPED
had been present.
Since all calls of legitimize_root() must be downstream of successful
path_init(), the check for !nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)
is pure paranoia.
FWIW, it had been discussed (and agreed upon) with Aleksa back when
scoped lookups had been merged; looks like that had fallen through the
cracks back then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
!foo() != 0 is a strange way to spell !foo(); fallout from
"fs: make unlazy_walk() error handling consistent"...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In f2fs_read_inline_data(), it is confused with checking of
inline_data flag, as we checked it before calling. So this
patch add some comments for f2fs_has_inline_data().
Signed-off-by: Chao Liu <liuchao@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
FAT supports creation time but not change time, and there was no
corresponding timestamp for creation time in previous VFS. The original
implementation took the compromise of saving the in-memory change time
into the on-disk creation time field, but this would lead to compatibility
issues with non-linux systems.
To address this issue, this patch changes the behavior of ctime. It will
no longer be loaded and stored from the creation time on disk. Instead of
that, it'll be consistent with the in-memory mtime and share the same
on-disk field. All updates to mtime will also be applied to ctime in
memory, while all updates to ctime will be ignored.
Link: https://lkml.kernel.org/r/20220503152536.2503003-2-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously the protection of kernfs_pr_cont_buf was piggy backed by
rename_lock, which means that pr_cont() needs to be protected under
rename_lock. This can cause potential circular lock dependencies.
If there is an OOM, we have the following call hierarchy:
-> cpuset_print_current_mems_allowed()
-> pr_cont_cgroup_name()
-> pr_cont_kernfs_name()
pr_cont_kernfs_name() will grab rename_lock and call printk. So we have
the following lock dependencies:
kernfs_rename_lock -> console_sem
Sometimes, printk does a wakeup before releasing console_sem, which has
the dependence chain:
console_sem -> p->pi_lock -> rq->lock
Now, imagine one wants to read cgroup_name under rq->lock, for example,
printing cgroup_name in a tracepoint in the scheduler code. They will
be holding rq->lock and take rename_lock:
rq->lock -> kernfs_rename_lock
Now they will deadlock.
A prevention to this circular lock dependency is to separate the
protection of pr_cont_buf from rename_lock. In principle, rename_lock
is to protect the integrity of cgroup name when copying to buf. Once
pr_cont_buf has got its content, rename_lock can be dropped. So it's
safe to drop rename_lock after kernfs_name_locked (and
kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock
to protect pr_cont_buf.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch allows expired client with lock state to be in COURTESY
state. Lock conflict with COURTESY client is resolved by the fs/lock
code using the lm_lock_expirable and lm_expire_lock callback in the
struct lock_manager_operations.
If conflict client is in COURTESY state, set it to EXPIRABLE and
schedule the laundromat to run immediately to expire the client. The
callback lm_expire_lock waits for the laundromat to flush its work
queue before returning to caller.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add 2 new callbacks, lm_lock_expirable and lm_expire_lock, to
lock_manager_operations to allow the lock manager to take appropriate
action to resolve the lock conflict if possible.
A new field, lm_mod_owner, is also added to lock_manager_operations.
The lm_mod_owner is used by the fs/lock code to make sure the lock
manager module such as nfsd, is not freed while lock conflict is being
resolved.
lm_lock_expirable checks and returns true to indicate that the lock
conflict can be resolved else return false. This callback must be
called with the flc_lock held so it can not block.
lm_expire_lock is called to resolve the lock conflict if the returned
value from lm_lock_expirable is true. This callback is called without
the flc_lock held since it's allowed to block. Upon returning from
this callback, the lock conflict should be resolved and the caller is
expected to restart the conflict check from the beginnning of the list.
Lock manager, such as NFSv4 courteous server, uses this callback to
resolve conflict by destroying lock owner, or the NFSv4 courtesy client
(client that has expired but allowed to maintains its states) that owns
the lock.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This patch moves create/destroy of laundry_wq from nfs4_state_start
and nfs4_state_shutdown_net to init_nfsd and exit_nfsd to prevent
the laundromat from being freed while a thread is processing a
conflicting lock.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch allows expired client with open state to be in COURTESY
state. Share/access conflict with COURTESY client is resolved by
setting COURTESY client to EXPIRABLE state, schedule laundromat
to run and returning nfserr_jukebox to the request client.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch provides courteous server support for delegation only.
Only expired client with delegation but no conflict and no open
or lock state is allowed to be in COURTESY state.
Delegation conflict with COURTESY/EXPIRABLE client is resolved by
setting it to EXPIRABLE, queue work for the laundromat and return
delay to the caller. Conflict is resolved when the laudromat runs
and expires the EXIRABLE client while the NFS client retries the
OPEN request. Local thread request that gets conflict is doing the
retry in _break_lease.
Client in COURTESY or EXPIRABLE state is allowed to reconnect and
continues to have access to its state. Access to the nfs4_client by
the reconnecting thread and the laundromat is serialized via the
client_lock.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd_splice_actor() checks that the page being spliced does not
match the previous element in the svc_rqst::rq_pages array. We
believe this is to prevent a double put_page() in cases where the
READ payload is partially contained in the xdr_buf's head buffer.
However, the NFSD READ proc functions no longer place any part of
the READ payload in the head buffer, in order to properly support
NFS/RDMA READ with Write chunks. Therefore, simplify the logic in
nfsd_splice_actor() to remove this unnecessary check.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pull io_uring fixes from Jens Axboe:
"Two small changes fixing issues from the 5.18 merge window:
- Fix wrong ordering of a tracepoint (Dylan)
- Fix MSG_RING on IOPOLL rings (me)"
* tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block:
io_uring: don't attempt to IOPOLL for MSG_RING requests
io_uring: fix ordering of args in io_uring_queue_async_work
Yanming reported a kernel bug in Bugzilla kernel [1], which can be
reproduced. The bug message is:
The kernel message is shown below:
kernel BUG at fs/inode.c:611!
Call Trace:
evict+0x282/0x4e0
__dentry_kill+0x2b2/0x4d0
dput+0x2dd/0x720
do_renameat2+0x596/0x970
__x64_sys_rename+0x78/0x90
do_syscall_64+0x3b/0x90
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215895
The bug is due to fuzzed inode has both inline_data and encrypted flags.
During f2fs_evict_inode(), as the inode was deleted by rename(), it
will cause inline data conversion due to conflicting flags. The page
cache will be polluted and the panic will be triggered in clear_inode().
Try fixing the bug by doing more sanity checks for inline data inode in
sanity_check_inode().
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to fix permission consistency issue as all other
mainline filesystems.
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range). Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's nonsensical to register a provided buffer ring, if a classic
provided buffer group with the same ID exists. Depending on the order of
which we decide what type to pick, the other type will never get used.
Explicitly disallow it and return an error if this is attempted.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a
classic provided buffer group. If we unregister the shared ring and
then attempt to use it, buf_pages is zero yet the classic list head
isn't properly initialized. This causes io_buffer_select() to think
that we have classic buffers available, but then we crash when we try
and get one from the list.
Just initialize the list if we unregister a shared buffer ring, leaving
it in a sane state for either re-registration or for attempting to use
it. And do the same for the initial setup from the classic path.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The parameter desc_size in fsverity_create_info() is useless and it is
not referenced anywhere. The greatest meaning of desc_size here is to
indecate the size of struct fsverity_descriptor and futher calculate the
size of signature. However, the desc->sig_size can do it also and it is
indeed, so remove it.
Therefore, it is no need to acquire desc_size by fsverity_get_descriptor()
in ensure_verity_info(), so remove the parameter desc_ret in
fsverity_get_descriptor() too.
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220518132256.2297655-1-chris.zjh@huawei.com
If processing the on-disk mount options fails after any memory was
allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is
leaked. Fix this by calling ext4_fc_free() instead of kfree() directly.
Reproducer:
mkfs.ext4 -F /dev/vdc
tune2fs /dev/vdc -E mount_opts=usrjquota=file
echo clear > /sys/kernel/debug/kmemleak
mount /dev/vdc /vdc
echo scan > /sys/kernel/debug/kmemleak
sleep 5
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
Fixes: 7edfd85b1f ("ext4: Completely separate options parsing and sb setup")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When I wrote the freevxfs driver I had some odd choice of licensing
statements, the options are either GPL (without version) or an odd
BSD-ish licensense with advertising clause.
The GPL vs always meant to be the same as the kernel, that is version
2 only, and the odd BSD-ish license doesn't make much sense. Add
a GPL2.0-only SPDX tag to make the GPL intentions clear and drop the
bogus BSD license.
Acked-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The logic for handling events on child in groups that have a mark on
the parent inode, but without FS_EVENT_ON_CHILD flag in the mask is
duplicated in several places and inconsistent.
Move the logic into the preparation of mark type iterator, so that the
parent mark type will be excluded from all mark type iterations in that
case.
This results in several subtle changes of behavior, hopefully all
desired changes of behavior, for example:
- Group A has a mount mark with FS_MODIFY in mask
- Group A has a mark with ignore mask that does not survive FS_MODIFY
and does not watch children on directory D.
- Group B has a mark with FS_MODIFY in mask that does watch children
on directory D.
- FS_MODIFY event on file D/foo should not clear the ignore mask of
group A, but before this change it does
And if group A ignore mask was set to survive FS_MODIFY:
- FS_MODIFY event on file D/foo should be reported to group A on account
of the mount mark, but before this change it is wrongly ignored
Fixes: 2f02fd3fa1 ("fanotify: fix ignore mask logic for events on child and on dir")
Reported-by: Jan Kara <jack@suse.com>
Link: https://lore.kernel.org/linux-fsdevel/20220314113337.j7slrb5srxukztje@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220511190213.831646-3-amir73il@gmail.com