Fix all kernel-doc warnings in vfs.c:
vfs.c:54: warning: Function parameter or member 'parent' not described in 'ksmbd_vfs_lock_parent'
vfs.c:54: warning: Function parameter or member 'child' not described in 'ksmbd_vfs_lock_parent'
vfs.c:54: warning: No description found for return value of 'ksmbd_vfs_lock_parent'
vfs.c:372: warning: Function parameter or member 'fp' not described in 'ksmbd_vfs_read'
vfs.c:372: warning: Excess function parameter 'fid' description in 'ksmbd_vfs_read'
vfs.c:489: warning: Function parameter or member 'fp' not described in 'ksmbd_vfs_write'
vfs.c:489: warning: Excess function parameter 'fid' description in 'ksmbd_vfs_write'
vfs.c:555: warning: Function parameter or member 'path' not described in 'ksmbd_vfs_getattr'
vfs.c:555: warning: Function parameter or member 'stat' not described in 'ksmbd_vfs_getattr'
vfs.c:555: warning: Excess function parameter 'work' description in 'ksmbd_vfs_getattr'
vfs.c:555: warning: Excess function parameter 'fid' description in 'ksmbd_vfs_getattr'
vfs.c:555: warning: Excess function parameter 'attrs' description in 'ksmbd_vfs_getattr'
vfs.c:572: warning: Function parameter or member 'p_id' not described in 'ksmbd_vfs_fsync'
vfs.c:595: warning: Function parameter or member 'work' not described in 'ksmbd_vfs_remove_file'
vfs.c:595: warning: Function parameter or member 'path' not described in 'ksmbd_vfs_remove_file'
vfs.c:595: warning: Excess function parameter 'name' description in 'ksmbd_vfs_remove_file'
vfs.c:633: warning: Function parameter or member 'work' not described in 'ksmbd_vfs_link'
vfs.c:805: warning: Function parameter or member 'fp' not described in 'ksmbd_vfs_truncate'
vfs.c:805: warning: Excess function parameter 'fid' description in 'ksmbd_vfs_truncate'
vfs.c:846: warning: Excess function parameter 'size' description in 'ksmbd_vfs_listxattr'
vfs.c:953: warning: Function parameter or member 'option' not described in 'ksmbd_vfs_set_fadvise'
vfs.c:953: warning: Excess function parameter 'options' description in 'ksmbd_vfs_set_fadvise'
vfs.c:1167: warning: Function parameter or member 'um' not described in 'ksmbd_vfs_lookup_in_dir'
vfs.c:1203: warning: Function parameter or member 'work' not described in 'ksmbd_vfs_kern_path_locked'
vfs.c:1641: warning: No description found for return value of 'ksmbd_vfs_init_kstat'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix 12 of 17 kernel-doc warnings in auth.c:
auth.c:221: warning: Function parameter or member 'conn' not described in 'ksmbd_auth_ntlmv2'
auth.c:221: warning: Function parameter or member 'cryptkey' not described in 'ksmbd_auth_ntlmv2'
auth.c:305: warning: Function parameter or member 'blob_len' not described in 'ksmbd_decode_ntlmssp_auth_blob'
auth.c:305: warning: Function parameter or member 'conn' not described in 'ksmbd_decode_ntlmssp_auth_blob'
auth.c:305: warning: Excess function parameter 'usr' description in 'ksmbd_decode_ntlmssp_auth_blob'
auth.c:385: warning: Function parameter or member 'blob_len' not described in 'ksmbd_decode_ntlmssp_neg_blob'
auth.c:385: warning: Function parameter or member 'conn' not described in 'ksmbd_decode_ntlmssp_neg_blob'
auth.c:385: warning: Excess function parameter 'rsp' description in 'ksmbd_decode_ntlmssp_neg_blob'
auth.c:385: warning: Excess function parameter 'sess' description in 'ksmbd_decode_ntlmssp_neg_blob'
auth.c:413: warning: Function parameter or member 'conn' not described in 'ksmbd_build_ntlmssp_challenge_blob'
auth.c:413: warning: Excess function parameter 'rsp' description in 'ksmbd_build_ntlmssp_challenge_blob'
auth.c:413: warning: Excess function parameter 'sess' description in 'ksmbd_build_ntlmssp_challenge_blob'
The other 5 are only present when a W=1 kernel build is done or
when scripts/kernel-doc is run with -Wall. They are:
auth.c:81: warning: No description found for return value of 'ksmbd_gen_sess_key'
auth.c:385: warning: No description found for return value of 'ksmbd_decode_ntlmssp_neg_blob'
auth.c:413: warning: No description found for return value of 'ksmbd_build_ntlmssp_challenge_blob'
auth.c:577: warning: No description found for return value of 'ksmbd_sign_smb2_pdu'
auth.c:628: warning: No description found for return value of 'ksmbd_sign_smb3_pdu'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().
This is less verbose.
Note that the upper limit of ida_simple_get() is exclusive, but the one of
ida_alloc_range() is inclusive. So change a 0xFFFFFFFF into a 0xFFFFFFFE in
order to keep the same behavior.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If existing lease state and request state are same, don't increment
epoch in create context.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb2_set_ea() can be called in parent inode lock range.
So add get_write argument to smb2_set_ea() not to call nested
mnt_want_write().
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If file opened with v2 lease is upgraded with v1 lease, smb server
should response v2 lease create context to client.
This patch fix smb2.lease.v2_epoch2 test failure.
This test case assumes the following scenario:
1. smb2 create with v2 lease(R, LEASE1 key)
2. smb server return smb2 create response with v2 lease context(R,
LEASE1 key, epoch + 1)
3. smb2 create with v1 lease(RH, LEASE1 key)
4. smb server return smb2 create response with v2 lease context(RH,
LEASE1 key, epoch + 2)
i.e. If same client(same lease key) try to open a file that is being
opened with v2 lease with v1 lease, smb server should return v2 lease.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The SMB2 Protocol requires that "The first byte of the Direct TCP
transport packet header MUST be zero (0x00)"[1]. Commit 1c1bcf2d3e
("ksmbd: validate smb request protocol id") removed the validation of
this 1-byte zero. Add the validation back now.
[1]: [MS-SMB2] - v20230227, page 30.
https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/MS-SMB2/%5bMS-SMB2%5d-230227.pdf
Fixes: 1c1bcf2d3e ("ksmbd: validate smb request protocol id")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tom Talpey <tom@talpey.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In __cachefiles_prepare_write(), the start and pos variables were made
unsigned 64-bit so that the casts in the checking could be got rid of -
which should be fine since absolute file offsets can't be negative, except
that an error code may be obtained from vfs_llseek(), which *would* be
negative. This breaks the error check.
Fix this for now by reverting pos and start to be signed and putting back
the casts. Unfortunately, the error value checks cannot be replaced with
IS_ERR_VALUE() as long might be 32-bits.
Fixes: 7097c96411 ("cachefiles: Fix __cachefiles_prepare_write()")
Reported-by: Simon Horman <horms@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401071152.DbKqMQMu-lkp@intel.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
cc: Yiqun Leng <yqleng@linux.alibaba.com>
cc: Jia Zhu <zhujia.zj@bytedance.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Return statement was not needed at end of cifs_chan_update_iface
Suggested-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
The return values for cifs_chan_update_iface() didn't match what the
documentation said and nothing was checking them anyway. Just make it
a void function.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We return early if "iface" is NULL so there is no need to check here.
Delete those checks.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc
ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI
LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI=
=p6PB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iov_iter cleanups from Christian Brauner:
"This contains a minor cleanup. The patches drop an unused argument
from import_single_range() allowing to replace import_single_range()
with import_ubuf() and dropping import_single_range() completely"
* tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: replace import_single_range() with import_ubuf()
iov_iter: remove unused 'iov' argument from import_single_range()
Even though it seems to be able to resolve some names of
case-insensitive directories, the lack of d_hash and d_compare means we
end up with a broken state in the d_cache. Considering it was never a
goal to support these two together, and we are preparing to use
d_revalidate in case-insensitive filesystems, which would make the
combination even more broken, reject any attempt to get a casefolded
inode from ecryptfs.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZU0egAKCRCRxhvAZXjc
onAqAP9s2ohvjE4QE2ad7svXOzNWKesGcyDyoEBwBpt3Yq8hvAEA+J4xiaMBlRAg
FmBobDwtcvOzxL1q+BbB3IsmmuFrRww=
=ZS18
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.cachefiles' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs cachefiles updates from Christian Brauner:
"This contains improvements for on-demand cachefiles.
If the daemon crashes and the on-demand cachefiles fd is unexpectedly
closed in-flight requests and subsequent read operations associated
with the fd will fail with EIO. This causes issues in various
scenarios as this failure is currently unrecoverable.
The work contained in this pull request introduces a failover mode and
enables the daemon to recover in-flight requested-related objects. A
restarted daemon will be able to process requests as usual.
This requires that in-flight requests are stored during daemon crash
or while the daemon is offline. In addition, a handle to
/dev/cachefiles needs to be stored.
This can be done by e.g., systemd's fdstore (cf. [1]) which enables
the restarted daemon to recover state.
Three new states are introduced in this patchset:
(1) CLOSE
Object is closed by the daemon.
(2) OPEN
Object is open and ready for processing. IOW, the open request
has been handled successfully.
(3) REOPENING
Object has been previously closed and is now reopened due to a
read request.
A restarted daemon can recover the /dev/cachefiles fd from systemd's
fdstore and writes "restore" to the device. This causes the object
state to be reset from CLOSE to REOPENING and reinitializes the
object.
The daemon may now handle the open request. Any in-flight operations
are restored and handled avoiding interruptions for users"
Link: https://systemd.io/FILE_DESCRIPTOR_STORE [1]
* tag 'vfs-6.8.cachefiles' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
cachefiles: add restore command to recover inflight ondemand read requests
cachefiles: narrow the scope of triggering EPOLLIN events in ondemand mode
cachefiles: resend an open request if the read request's object is closed
cachefiles: extract ondemand info field from cachefiles_object
cachefiles: introduce object ondemand state
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
=bF+R
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs rw updates from Christian Brauner:
"This contains updates from Amir for read-write backing file helpers
for stacking filesystems such as overlayfs:
- Fanotify is currently in the process of introducing pre content
events. Roughly, a new permission event will be added indicating
that it is safe to write to the file being accessed. These events
are used by hierarchical storage managers to e.g., fill the content
of files on first access.
During that work we noticed that our current permission checking is
inconsistent in rw_verify_area() and remap_verify_area().
Especially in the splice code permission checking is done multiple
times. For example, one time for the whole range and then again for
partial ranges inside the iterator.
In addition, we mostly do permission checking before we call
file_start_write() except for a few places where we call it after.
For pre-content events we need such permission checking to be done
before file_start_write(). So this is a nice reason to clean this
all up.
After this series, all permission checking is done before
file_start_write().
As part of this cleanup we also massaged the splice code a bit. We
got rid of a few helpers because we are alredy drowning in special
read-write helpers. We also cleaned up the return types for splice
helpers.
- Introduce generic read-write helpers for backing files. This lifts
some overlayfs code to common code so it can be used by the FUSE
passthrough work coming in over the next cycles. Make Amir and
Miklos the maintainers for this new subsystem of the vfs"
* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: fix __sb_write_started() kerneldoc formatting
fs: factor out backing_file_mmap() helper
fs: factor out backing_file_splice_{read,write}() helpers
fs: factor out backing_file_{read,write}_iter() helpers
fs: prepare for stackable filesystems backing file helpers
fsnotify: optionally pass access range in file permission hooks
fsnotify: assert that file_start_write() is not held in permission hooks
fsnotify: split fsnotify_perm() into two hooks
fs: use splice_copy_file_range() inline helper
splice: return type ssize_t from all helpers
fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
fs: move file_start_write() into direct_splice_actor()
fs: fork splice_file_range() from do_splice_direct()
fs: create {sb,file}_write_not_started() helpers
fs: create file_write_started() helper
fs: create __sb_write_started() helper
fs: move kiocb_start_write() into vfs_iocb_iter_write()
fs: move permission hook out of do_iter_read()
fs: move permission hook out of do_iter_write()
fs: move file_start_write() into vfs_iter_write()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZU0CgAKCRCRxhvAZXjc
osncAQDSJK0frJL+72NqXxa4YNzivrnuw6fhp5iaDAEqxdm8ygEAoJWyh7Rmkt8G
drAXWGyGnCYqv7UgC6axLyciid7TxQg=
=vJuv
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"This contains the work to retrieve detailed information about mounts
via two new system calls. This is hopefully the beginning of the end
of the saga that started with fsinfo() years ago.
The LWN articles in [1] and [2] can serve as a summary so we can avoid
rehashing everything here.
At LSFMM in May 2022 we got into a room and agreed on what we want to
do about fsinfo(). Basically, split it into pieces. This is the first
part of that agreement. Specifically, it is concerned with retrieving
information about mounts. So this only concerns the mount information
retrieval, not the mount table change notification, or the extended
filesystem specific mount option work. That is separate work.
Currently mounts have a 32bit id. Mount ids are already in heavy use
by libmount and other low-level userspace but they can't be relied
upon because they're recycled very quickly. We agreed that mounts
should carry a unique 64bit id by which they can be referenced
directly. This is now implemented as part of this work.
The new 64bit mount id is exposed in statx() through the new
STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is
returned. If it is raised and the kernel supports the new 64bit mount
id the flag is raised in the result mask and the new 64bit mount id is
returned. New and old mount ids do not overlap so they cannot be
conflated.
Two new system calls are introduced that operate on the 64bit mount
id: statmount() and listmount(). A summary of the api and usage can be
found on LWN as well (cf. [3]) but of course, I'll provide a summary
here as well.
Both system calls rely on struct mnt_id_req. Which is the request
struct used to pass the 64bit mount id identifying the mount to
operate on. It is extensible to allow for the addition of new
parameters and for future use in other apis that make use of mount
ids.
statmount() mimicks the semantics of statx() and exposes a set flags
that userspace may raise in mnt_id_req to request specific information
to be retrieved. A statmount() call returns a struct statmount filled
in with information about the requested mount. Supported requests are
indicated by raising the request flag passed in struct mnt_id_req in
the @mask argument in struct statmount.
Currently we do support:
- STATMOUNT_SB_BASIC:
Basic filesystem info
- STATMOUNT_MNT_BASIC
Mount information (mount id, parent mount id, mount attributes etc)
- STATMOUNT_PROPAGATE_FROM
Propagation from what mount in current namespace
- STATMOUNT_MNT_ROOT
Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla)
- STATMOUNT_MNT_POINT
Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt)
- STATMOUNT_FS_TYPE
Name of the filesystem type as the magic number isn't enough due to submounts
The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE
are appended to the end of the struct. Userspace can use the offsets
in @fs_type, @mnt_root, and @mnt_point to reference those strings
easily.
The struct statmount reserves quite a bit of space currently for
future extensibility. This isn't really a problem and if this bothers
us we can just send a follow-up pull request during this cycle.
listmount() is given a 64bit mount id via mnt_id_req just as
statmount(). It takes a buffer and a size to return an array of the
64bit ids of the child mounts of the requested mount. Userspace can
thus choose to either retrieve child mounts for a mount in batches or
iterate through the child mounts. For most use-cases it will be
sufficient to just leave space for a few child mounts. But for big
mount tables having an iterator is really helpful. Iterating through a
mount table works by setting @param in mnt_id_req to the mount id of
the last child mount retrieved in the previous listmount() call"
Link: https://lwn.net/Articles/934469 [1]
Link: https://lwn.net/Articles/829212 [2]
Link: https://lwn.net/Articles/950569 [3]
* tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
add selftest for statmount/listmount
fs: keep struct mnt_id_req extensible
wire up syscalls for statmount/listmount
add listmount(2) syscall
statmount: simplify string option retrieval
statmount: simplify numeric option retrieval
add statmount(2) syscall
namespace: extract show_path() helper
mounts: keep list of mounts in an rbtree
add unique mount ID
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUx4wAKCRCRxhvAZXjc
osaNAQC/c+xXVfiq/pFbuK9MQLna4RGZaGcG9k312YniXbHq0AD9HAf4aPcZwPy1
/wkD4pauj3UZ3f0xBSyazGBvAXyN0Qc=
=iFAQ
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs super updates from Christian Brauner:
"This contains the super work for this cycle including the long-awaited
series by Jan to make it possible to prevent writing to mounted block
devices:
- Writing to mounted devices is dangerous and can lead to filesystem
corruption as well as crashes. Furthermore syzbot comes with more
and more involved examples how to corrupt block device under a
mounted filesystem leading to kernel crashes and reports we can do
nothing about. Add tracking of writers to each block device and a
kernel cmdline argument which controls whether other writeable
opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are
allowed.
Note that this effectively only prevents modification of the
particular block device's page cache by other writers. The actual
device content can still be modified by other means - e.g. by
issuing direct scsi commands, by doing writes through devices lower
in the storage stack (e.g. in case loop devices, DM, or MD are
involved) etc. But blocking direct modifications of the block
device page cache is enough to give filesystems a chance to perform
data validation when loading data from the underlying storage and
thus prevent kernel crashes.
Syzbot can use this cmdline argument option to avoid uninteresting
crashes. Also users whose userspace setup does not need writing to
mounted block devices can set this option for hardening. We expect
that this will be interesting to quite a few workloads.
Btrfs is currently opted out of this because they still haven't
merged patches we require for this to work from three kernel
releases ago.
- Reimplement block device freezing and thawing as holder operations
on the block device.
This allows us to extend block device freezing to all devices
associated with a superblock and not just the main device. It also
allows us to remove get_active_super() and thus another function
that scans the global list of superblocks.
Freezing via additional block devices only works if the filesystem
chooses to use @fs_holder_ops for these additional devices as well.
That currently only includes ext4 and xfs.
Earlier releases switched get_tree_bdev() and mount_bdev() to use
@fs_holder_ops. The remaining nilfs2 open-coded version of
mount_bdev() has been converted to rely on @fs_holder_ops as well.
So block device freezing for the main block device will continue to
work as before.
There should be no regressions in functionality. The only special
case is btrfs where block device freezing for the main block device
never worked because sb->s_bdev isn't set. Block device freezing
for btrfs can be fixed once they can switch to @fs_holder_ops but
that can happen whenever they're ready"
* tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
block: Fix a memory leak in bdev_open_by_dev()
super: don't bother with WARN_ON_ONCE()
super: massage wait event mechanism
ext4: Block writes to journal device
xfs: Block writes to log device
fs: Block writes to mounted block devices
btrfs: Do not restrict writes to btrfs devices
block: Add config option to not allow writing to mounted devices
block: Remove blkdev_get_by_*() functions
bcachefs: Convert to bdev_open_by_path()
fs: handle freezing from multiple devices
fs: remove dead check
nilfs2: simplify device handling
fs: streamline thaw_super_locked
ext4: simplify device handling
xfs: simplify device handling
fs: simplify setup_bdev_super() calls
blkdev: comment fs_holder_ops
porting: document block device freeze and thaw changes
fs: remove unused helper
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
=0bRl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
Since the read operation beyond the ValidDataLength returns zero,
if we just extend the size of the file, we don't need to zero the
extended part, but only change the DataLength without changing
the ValidDataLength.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
In stream extension directory entry, the ValidDataLength
field describes how far into the data stream user data has
been written, and the DataLength field describes the file
size.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Replaced the internal table lookup algorithm with ffs of
the bitops library with better performance.
Use it to increase the single processing length of the
exfat_find_free_bitmap function, from single-byte search to long type.
Signed-off-by: John Sanpe <sanpeqf@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Replace the internal table lookup algorithm with the hweight
library, which has instruction set acceleration capabilities.
Use it to increase the length of a single calculation of
the exfat_find_free_bitmap function to the long type.
Signed-off-by: John Sanpe <sanpeqf@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
smb2_compound_op(SMB2_OP_GET_REPARSE) already checks if ioctl response
has a valid reparse data buffer's length, so there's no need to check
it again in parse_reparse_point().
In order to get rid of duplicate check, validate reparse data buffer's
length also in cifs_query_reparse_point().
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
As this function now destroys the svc_serv, this is a better name.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
sv_refcnt is no longer useful.
lockd and nfs-cb only ever have the svc active when there are a non-zero
number of threads, so sv_refcnt mirrors sv_nrthreads.
nfsd also keeps the svc active between when a socket is added and when
the first thread is started, but we don't really need a refcount for
that. We can simply not destroy the svc while there are any permanent
sockets attached.
So remove sv_refcnt and the get/put functions.
Instead of a final call to svc_put(), call svc_destroy() instead.
This is changed to also store NULL in the passed-in pointer to make it
easier to avoid use-after-free situations.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A future patch will remove refcounting on svc_serv as it is of little
use.
It is currently used to keep the svc around while the pool_stats file is
open.
Change this to get the pointer, protected by the mutex, only in
seq_start, and the release the mutex in seq_stop.
This means that if the nfsd server is stopped and restarted while the
pool_stats file it open, then some pool stats info could be from the
first instance and some from the second. This might appear odd, but is
unlikely to be a problem in practice.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Callback operations enum is defined in client and server, move it to
common header file.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We check "state" for NULL on the previous line so it can't be NULL here.
No need to check again.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202312031425.LffZTarR-lkp@intel.com/
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Avoid the use of an atomic bitop, and prepare for adding a run-time
switch for using splice reads.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
RQ_SPLICE_OK is a bit of a layering violation. Also, a subsequent
patch is going to provide a mechanism for always disabling splice
reads.
Splicing is an issue only for NFS READs, so refactor nfsd_read() to
check the auth type directly instead of relying on an rq_flag
setting.
The new helper will be added into the NFSv4 read path in a
subsequent patch.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Al Viro notes that normal system calls hold f_pos_lock when calling
->iterate_shared and ->llseek; however nfsd_readdir() does not take
that mutex when calling these methods.
It should be safe however because the struct file acquired by
nfsd_readdir() is not visible to other threads.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This trace point was for debugging the DRC's garbage collection. In
the field it's just noise.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
workqueue: nfsd_file_delayed_close [nfsd] hogged CPU for >13333us 8
times, consider switching to WQ_UNBOUND
There's no harm in closing a cached file descriptor on another core.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The usage of read_seqbegin_or_lock() in nfsd_copy_write_verifier()
is wrong. "seq" is always even and thus "or_lock" has no effect,
this code can never take ->writeverf_lock for writing.
I guess this is fine, nfsd_copy_write_verifier() just copies 8 bytes
and nfsd_reset_write_verifier() is supposed to be very rare operation
so we do not need the adaptive locking in this case.
Yet the code looks wrong and sub-optimal, it can use read_seqbegin()
without changing the behaviour.
[ cel: Note also that it eliminates this Sparse warning:
fs/nfsd/nfssvc.c:360:6: warning: context imbalance in 'nfsd_copy_write_verifier' -
different lock contexts for basic block
]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We've had a number of attempts at different NFSv4 client tracking
methods over the years, but now nfsdcld has emerged as the clear winner
since the others (recoverydir and the usermodehelper upcall) are
problematic.
As a case in point, the recoverydir backend uses MD5 hashes to encode
long form clientid strings, which means that nfsd repeatedly gets dinged
on FIPS audits, since MD5 isn't considered secure. Its use of MD5 is not
cryptographically significant, so there is no danger there, but allowing
us to compile that out allows us to sidestep the issue entirely.
As a prelude to eventually removing support for these client tracking
methods, add a new Kconfig option that enables them. Mark it deprecated
and make it default to N.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Query dir responses don't provide enough information on reparse points
such as major/minor numbers and symlink targets other than reparse
tags, however we don't need to unconditionally revalidate them only
because they are reparse points. Instead, revalidate them only when
their ctime or reparse tag has changed.
For instance, Windows Server updates ctime of reparse points when
their data have changed.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Change SMB2_set_eof() to take eof as CPU order rather than __le64 and pass
it directly rather than by pointer. This moves the conversion down into
SMB_set_eof() rather than all of its callers and means we don't need to
undo it for the traceline.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
The kfree() function was called in one case by
the allocate_mr_list() function during error handling
even if the passed variable contained a null pointer.
This issue was detected by using the Coccinelle software.
Thus use another label.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steve French <stfrench@microsoft.com>
Recently, cifs_chan_update_iface was modified to not
remove an iface if a suitable replacement was not found.
With that, there were two conditionals that were exactly
the same. This change removes that extra condition check.
Also, fixed a logging in the same function to indicate
the correct message.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Parse reparse points in SMB3 posix query info as they will be
supported and required by the new specification.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use smb2_compound_op() with SMB2_OP_GET_REPARSE to get reparse point.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add support for creating symlinks via IO_REPARSE_TAG_SYMLINK reparse
points in SMB2+.
These are fully supported by most SMB servers and documented in
MS-FSCC. Also have the advantage of requiring fewer roundtrips as
their symlink targets can be parsed directly from CREATE responses on
STATUS_STOPPED_ON_SYMLINK errors.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311260838.nx5mkj1j-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The client was sending an SMB2_CREATE request without setting
OPEN_REPARSE_POINT flag thus failing the entire hardlink operation.
Fix this by setting OPEN_REPARSE_POINT in create options for
SMB2_CREATE request when the source inode is a repase point.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The client was sending an SMB2_CREATE request without setting
OPEN_REPARSE_POINT flag thus failing the entire rename operation.
Fix this by setting OPEN_REPARSE_POINT in create options for
SMB2_CREATE request when the source inode is a repase point.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reduce number of roundtrips to server when querying reparse points in
->query_path_info() by sending a single compound request of
create+get_reparse+get_info+close.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add support for creating special files (e.g. char/block devices,
sockets, fifos) via NFS reparse points on SMB2+, which are fully
supported by most SMB servers and documented in MS-FSCC.
smb2_get_reparse_inode() creates the file with a corresponding reparse
point buffer set in @iov through a single roundtrip to the server.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311260746.HOJ039BV-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Make smb2_compound_op() accept up to MAX_COMPOUND(5) commands to be
sent over a single compounded request.
This will allow next commits to read and write reparse files through a
single roundtrip to the server.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes no-op checkpatch errors and warnings.
Signed-off-by: Pierre Mariani <pierre.mariani@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix kernel-doc warnings found when using "W=1".
file.c:1385: warning: Excess function parameter 'time' description in 'ubifs_update_time'
and 9 warnings like this one:
file.c:326: warning: No description found for return value of 'allocate_budget'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202312030417.66c5PwHj-lkp@intel.com/
Cc: Richard Weinberger <richard@nod.at>
Cc: linux-mtd@lists.infradead.org
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
[rw: massaged patch to resolve conflict ]
Signed-off-by: Richard Weinberger <richard@nod.at>
For error handling path in ubifs_symlink(), inode will be marked as
bad first, then iput() is invoked. If inode->i_link is initialized by
fscrypt_encrypt_symlink() in encryption scenario, inode->i_link won't
be freed by callchain ubifs_free_inode -> fscrypt_free_inode in error
handling path, because make_bad_inode() has changed 'inode->i_mode' as
'S_IFREG'.
Following kmemleak is easy to be reproduced by injecting error in
ubifs_jnl_update() when doing symlink in encryption scenario:
unreferenced object 0xffff888103da3d98 (size 8):
comm "ln", pid 1692, jiffies 4294914701 (age 12.045s)
backtrace:
kmemdup+0x32/0x70
__fscrypt_encrypt_symlink+0xed/0x1c0
ubifs_symlink+0x210/0x300 [ubifs]
vfs_symlink+0x216/0x360
do_symlinkat+0x11a/0x190
do_syscall_64+0x3b/0xe0
There are two ways fixing it:
1. Remove make_bad_inode() in error handling path. We can do that
because ubifs_evict_inode() will do same processes for good
symlink inode and bad symlink inode, for inode->i_nlink checking
is before is_bad_inode().
2. Free inode->i_link before marking inode bad.
Method 2 is picked, it has less influence, personally, I think.
Cc: stable@vger.kernel.org
Fixes: 2c58d548f5 ("fscrypt: cache decrypted symlink target in ->i_link")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
new helpers:
- bch2_csum_to_text()
- bch2_csum_err_msg()
standardize our checksum error messages a bit, and print out the
checksums a bit more nicely.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this definitely should _not_ be 1, and we don't actually want any
concurrency limiting at all here - btree node read completions are
getting blocked behind btree node write submissions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Seeing weird latency issues in the btree node read path - add one
bch2_btree_node_read_done().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Seeing strange performance issues that might be caused by memory
pressure causing prefetched nodes to be evicted before they're used.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_FS_fsck_done -> BCH_FS_fsck_running; set when we might be fixing
fsck errors. Also; set fix_errors to ask by default when fsck is
running.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New flag so that triggers can distinguish whether we're running
transactional or atomic triggers (or gc) - unifying the callbacks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_MEMBER_DURABILITY() was not present initially; a value of 0 means
use the default, nonzero means use v - 1.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use the new bch_member->seq, sb->write_time fields to detect split brain
and kick out devices when necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add new fields for split brain detection:
- bch_member->seq, which tracks the sequence number of the last superblock
write that happened to each member device
- bch_sb->write_time, which tracks the time of the last superblock write,
to allow detection of when two members have diverged but had the same
number of superblock writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
nochanges means "we cannot issue writes at all"; it's possible to go
into a pseudo read-write mode where we pin dirty metadata in memory,
which is used for fsck in dry run mode and doing journal replay on a
read only mount, but we do not want to allow an actual read-write mount
in nochanges mode.
But we do always want to allow early read-write, during recovery - this
patch clarifies that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the loop in netfs_rreq_unmark_after_write() that removes the PG_fscache
from folios after they've been written to the cache, as soon as we remove
the mark from a multipage folio, it can get split - and then we might see a
fragment of folio again.
Guard against this by advancing the 'unlocked' tracker to the index of the
last page in the folio to avoid a double removal of the PG_fscache mark.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
- Fix another regression in the NFSD administrative API
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmWYCHIACgkQM2qzM29m
f5c7FRAAoXVRhP/vyua6PcfF8Je0bg83fhvqG8ppbLE6EjFzbMM7fZ9G5MudtY7/
w4PuDJ8rhQFl8oQTy004qkqV3JfZGo7buCumKoj8IA8m67cQ0WyB6Lz1fSW9BGHq
cc0PWQR/z88laRvCPViW+jOsLx5f+VvvyEsDUvEYdG6xXtTVhDfNxYQFPc3bRfd0
Sf5cPBuBbMAlz9PZm/c276y46rkBP3+2ONx+zHz0k/ZR/EgJ0RoxmzZIrAUguCRy
fAUW0xvrAuTMY6LMb/0YyKH6IOqYoImKWMGjhi0VFFG5AuxbmU7lLVo/S9tUlWMs
aAcyEtyuVU1422dikF/FRtstswfz64fj+PqIwiq8qjt14WB+BX/tV1tjwAjuI2py
YVt9T50FkhG42IG+Nkub7SXm6OuinfiUBcAilPMfBnSapT5l6BvVwrSXGIApcF87
NmQvRqjO4tEYE6/aYsUV/jZGQVYPTspvhv5kzXMQnL/6h/KlhrNLvTF9gTJFs+xM
ahY7I2nVmXayoLfrweaWX/a1VNMeYLvVOUPqm/DFiWWw51TqkmD8CHQwCsBiR0i9
iVu10eh0x9gakPBDVFuCLG+BOL+vsUxfgbeAOJdHYC0s78YLRovyFsDTHxRNv7hW
hDp+eblGUb+v5kNPbEZJKMmWXCwivOHnacc2EJyA1UcGRLL7I24=
=TKjN
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix another regression in the NFSD administrative API
* tag 'nfsd-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: drop the nfsd_put helper
If try_to_free_buffers() succeeded and then folio_alloc_buffers() failed,
grow_dev_folio() would return success. This would be incorrect; memory
allocation failure is supposed to result in a failure. It's a harmless
bug; the caller will simply go around the loop one more time and
grow_dev_folio() will correctly return a failure that time. But it was an
unintended change and looks like a more serious bug than it is.
While I'm in here, improve the commentary about why we return success even
though we failed.
Link: https://lkml.kernel.org/r/20240101093848.2017115-1-willy@infradead.org
Fixes: 6d840a1877 ("buffer: return bool from grow_dev_folio()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQGyBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmWWRDMACgkQiiy9cAdy
T1E+tAv4g15Kh9JeSaO/x1oyujKSDnEALWGvriwj9EC6+Z+MzQQyyPxB4uUyRuSN
uts/QgvpK4LW2jIhyTaBfU9xRfhJYup1B+vaBtij6NwKfnxYo1PJCdAj8yelAfCW
7mgqSiIGMRbRFz3jnFbb5vQS3H32Hpje0h32Rh3LiKrS6q+9exCJDHA6jqQ5JwvP
/BrDeJ0vykyUjvjxcYQaeWZDkXEy7bysVVq+3qOEr9HSwX7Jv1f6JOLLpUoTTWx9
LPkBus02JOAWX6FUWewVdx1ar9cTafA54rzS3hrpT2lxjuMhSmL6v9Br9MCsAMnI
oWMWfjyefAM6ss4m/pAD178WZ61f1urIRrTRJJUwcWA+XjXTF6dXXZvMFdzqjYW5
k6nMhDfFXzSzywwSFmlQwh/00NPVzX7DuGjgd3ZdJ7z6bvFvltwZtkRPMVK7zvHc
X8LWAWYuaLku4yj3PJa3FiBrSYPsztTX3NLqQq+5vcxxTDCOcFduXComvvEosSiq
4UbhxHU=
=0P5G
-----END PGP SIGNATURE-----
Merge tag '6.7-rc8-smb3-mchan-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Three important multichannel smb3 client fixes found in recent
testing:
- fix oops due to incorrect refcounting of interfaces after
disabling multichannel
- fix possible unrecoverable session state after disabling
multichannel with active sessions
- fix two places that were missing use of chan_lock"
* tag '6.7-rc8-smb3-mchan-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not depend on release_iface for maintaining iface_list
cifs: cifs_chan_is_iface_active should be called with chan_lock held
cifs: after disabling multichannel, mark tcon for reconnect
The checking of @c->nroot->flags and @c->dirty_[n|p]n_cnt in function
nothing_to_commit() is not atomic, which could be raced with modifying
of lpt, for example:
P1 P2 P3
run_gc
ubifs_garbage_collect
do_commit
ubifs_return_leb
ubifs_lpt_lookup_dirty
dirty_cow_nnode
do_commit
nothing_to_commit
if (test_bit(DIRTY_CNODE, &c->nroot->flags)
// false
test_and_set_bit(DIRTY_CNODE, &nnode->flags)
c->dirty_nn_cnt += 1
ubifs_assert(c, c->dirty_nn_cnt == 0)
// false !
Fetch a reproducer in Link:
UBIFS error (ubi0:0 pid 2747): ubifs_assert_failed
UBIFS assert failed: c->dirty_pn_cnt == 0, in fs/ubifs/commit.c
Call Trace:
ubifs_ro_mode+0x58/0x70 [ubifs]
ubifs_assert_failed+0x6a/0x90 [ubifs]
do_commit+0x5b7/0x930 [ubifs]
ubifs_run_commit+0xc6/0x1a0 [ubifs]
ubifs_sync_fs+0xd8/0x110 [ubifs]
sync_filesystem+0xb4/0x120
do_syscall_64+0x6f/0x140
Fix it by checking @c->dirty_[n|p]n_cnt and @c->nroot state with
@c->lp_mutex locked.
Fixes: 944fdef52c ("UBIFS: do not start the commit if there is nothing to commit")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218162
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
An issue can occur between write-streaming (storing dirty data in partial
non-uptodate pages) and a cachefiles object being culled to make space.
The problem occurs because the cache object is only marked in use while
there are files open using it. Once it has been released, it can be culled
and the cookie marked disabled.
At this point, a streaming write is permitted to occur (if the cache is
active, we require pages to be prefetched and cached), but the cache can
become active again before this gets flushed out - and then two effects can
occur:
(1) The cache may be asked to write out a region that's less than its DIO
block size (assumed by cachefiles to be PAGE_SIZE) - and this causes
one of two debugging statements to be emitted.
(2) netfs_how_to_modify() gets confused because it sees a page that isn't
allowed to be non-uptodate being uptodate and tries to prefetch it -
leading to a warning that PG_fscache is set twice.
Fix this by the following means:
(1) Add a netfs_inode flag to disallow write-streaming to an inode and set
it if we ever do local caching of that inode. It remains set for the
lifetime of that inode - even if the cookie becomes disabled.
(2) If the no-write-streaming flag is set, then make netfs_how_to_modify()
always want to prefetch instead.
(3) If netfs_how_to_modify() decides it wants to prefetch a folio, but
that folio has write-streamed data in it, then it requires the folio
be flushed first.
(4) Export a counter of the number of times we wanted to prefetch a
non-uptodate page, but found it had write-streamed data in it.
(5) Export a counter of the number of times we cancelled a write to the
cache because it didn't DIO align and remove the debug statements.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
With 16a26b20d2 ("ubifs: authentication: Add hashes to index nodes")
insert_node() and insert_dent() got a new function parameter 'hash'. Add
a description for this new parameter.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311051618.D7YUE1Rr-lkp@intel.com/
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Use the correct function name in the kernel-doc comment to prevent
a kernel-doc warning:
auth.c:30: warning: expecting prototype for ubifs_node_calc_hash(). Prototype was for __ubifs_node_calc_hash() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: lore.kernel.org/r/202311052125.gE1Rylox-lkp@intel.com
Cc: Richard Weinberger <richard@nod.at>
Cc: linux-mtd@lists.infradead.org
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Simplify ubifs_hmac_wkm() by using crypto_shash_tfm_digest() instead of
an alloc+init+update+final sequence. This should also improve
performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Since JBD2 takes care of all metadata writeback errors of fs dev,
ext4_check_bdev_write_error() is useful only in nojournal mode.
Move it into '!ext4_handle_valid(handle)' branch.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-6-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This is a replacement solution of commit bc71726c72 ("ext4: abort
the filesystem if failed to async write metadata buffer"), JBD2 can
detect metadata writeback error of fs dev by 'j_fs_dev_wb_err'.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-5-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since 'JBD2_CHECKPOINT_IO_ERROR' and j_atomic_flags' are not useful
anymore after fs dev's errseq is imported into jbd2, just remove them.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-4-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now JBD2 detects metadata writeback error of fs dev according to errseq.
Replace journal state flag by checking errseq.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-3-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add errseq in journal, so that JBD2 can detect whether metadata is
successfully written to fs bdev. This patch adds detection in recovery
process to replace original solution(using local variable wb_err).
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After first execution of mb_find_order_for_block():
'fe_start' is the value of 'block' passed in mb_find_extent().
'fe_len' is the difference between the length of order-chunk and
remainder of the block divided by order-chunk.
And 'next' does not require initialization after above modifications.
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231113082617.11258-1-gouhao@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As an optimization, I was trying to work on exiting early from this
function if dealing with unwritten extent since they anyways read 0.
However, it was realised that there are certain code paths that can
end up calling ext4_block_zero_page_range() for an unwritten bh that
might still have data in pagecache. In this case, we can't exit early
and we do require to process the bh and zero out the pagecache to ensure
that a writeback can't kick in at a later time and flush the stale
pagecache to disk.
Since, adding the logic to exit early for unwritten bh was turning out
to be much more nuanced and the current code already handles it well,
just add a comment to explicitly document this behavior.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/d859b7ae5fe42e6626479b91ed9f4da3aae4c597.1698856309.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
dioread_nolock was originally disabled as a default option for bs < ps
scenarios due to a data corruption issue. Since then, we've had some
fixes in this area which address such issues. Enable dioread_nolock by
default and remove the experimental warning message for bs < ps path.
dioread for bs < ps has been tested on a 64k pagesize machine using:
kvm-xfstest -C 3 -g auto
with the following configs:
64k adv bigalloc_4k bigalloc_64k data_journal encrypt
dioread_nolock dioread_nolock_4k ext3 ext3conv nojournal
And no new regressions were seen compared to baseline kernel.
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20231101154717.531865-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
'blocks_per_page' is always 1 after 'if (blocks_per_page >= 2)',
'pnum' and 'block' are equal in this case.
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231024035215.29474-1-gouhao@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It's not safe to call nfsd_put once nfsd_last_thread has been called, as
that function will zero out the nn->nfsd_serv pointer.
Drop the nfsd_put helper altogether and open-code the svc_put in its
callers instead. That allows us to not be reliant on the value of that
pointer when handling an error.
Fixes: 2a501f55cd ("nfsd: call nfsd_last_thread() before final nfsd_put()")
Reported-by: Zhi Li <yieli@redhat.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Jeffrey Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As the ei->entries array is fixed for the duration of the eventfs_inode,
it can be used to skip over already read entries in eventfs_iterate().
That is, if ctx->pos is greater than zero, there's no reason in doing the
loop across the ei->entries array for the entries less than ctx->pos.
Instead, start the lookup of the entries at the current ctx->pos.
Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sGM5mqX+DhOg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240104220048.494956957@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order to apply a shortcut to skip over the current ctx->pos
immediately, by using the ei->entries array, the reading of that array
should be first. Moving the array reading before the linked list reading
will make the shortcut change diff nicer to read.
Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sGM5mqX+DhOg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240104220048.333115095@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ctx->pos was only updated when it added an entry, but the "skip to
current pos" check (c--) happened for every loop regardless of if the
entry was added or not. This inconsistency caused readdir to be incorrect.
It was due to:
for (i = 0; i < ei->nr_entries; i++) {
if (c > 0) {
c--;
continue;
}
mutex_lock(&eventfs_mutex);
/* If ei->is_freed then just bail here, nothing more to do */
if (ei->is_freed) {
mutex_unlock(&eventfs_mutex);
goto out;
}
r = entry->callback(name, &mode, &cdata, &fops);
mutex_unlock(&eventfs_mutex);
[..]
ctx->pos++;
}
But this can cause the iterator to return a file that was already read.
That's because of the way the callback() works. Some events may not have
all files, and the callback can return 0 to tell eventfs to skip the file
for this directory.
for instance, we have:
# ls /sys/kernel/tracing/events/ftrace/function
format hist hist_debug id inject
and
# ls /sys/kernel/tracing/events/sched/sched_switch/
enable filter format hist hist_debug id inject trigger
Where the function directory is missing "enable", "filter" and
"trigger". That's because the callback() for events has:
static int event_callback(const char *name, umode_t *mode, void **data,
const struct file_operations **fops)
{
struct trace_event_file *file = *data;
struct trace_event_call *call = file->event_call;
[..]
/*
* Only event directories that can be enabled should have
* triggers or filters, with the exception of the "print"
* event that can have a "trigger" file.
*/
if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) {
if (call->class->reg && strcmp(name, "enable") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &ftrace_enable_fops;
return 1;
}
if (strcmp(name, "filter") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &ftrace_event_filter_fops;
return 1;
}
}
if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) ||
strcmp(trace_event_name(call), "print") == 0) {
if (strcmp(name, "trigger") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &event_trigger_fops;
return 1;
}
}
[..]
return 0;
}
Where the function event has the TRACE_EVENT_FL_IGNORE_ENABLE set.
This means that the entries array elements for "enable", "filter" and
"trigger" when called on the function event will have the callback return
0 and not 1, to tell eventfs to skip these files for it.
Because the "skip to current ctx->pos" check happened for all entries, but
the ctx->pos++ only happened to entries that exist, it would confuse the
reading of a directory. Which would cause:
# ls /sys/kernel/tracing/events/ftrace/function/
format hist hist hist_debug hist_debug id inject inject
The missing "enable", "filter" and "trigger" caused ls to show "hist",
"hist_debug" and "inject" twice.
Update the ctx->pos for every iteration to keep its update and the "skip"
update consistent. This also means that on error, the ctx->pos needs to be
decremented if it was incremented without adding something.
Link: https://lore.kernel.org/all/20240104150500.38b15a62@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240104220048.172295263@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 493ec81a8f ("eventfs: Stop using dcache_readdir() for getdents()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If ei->is_freed is set in eventfs_iterate(), it means that the directory
that is being iterated on is in the process of being freed. Just exit the
loop immediately when that is ever detected, and separate out the return
of the entry->callback() from ei->is_freed.
Link: https://lore.kernel.org/linux-trace-kernel/20240104220048.016261289@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For backchannel requests that lookup the appropriate nfs_client, use the
state-management rpc_clnt's rpc_timeout parameters for the backchannel's
response. When the nfs_client cannot be found, fall back to using the
xprt's default timeout parameters.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
bpf_cgroup_from_id() is basically a wrapper to cgroup_get_from_id(),
that is relying on kernfs to determine the right cgroup associated to
the target id.
As a kfunc, it has the potential to be attached to any function through
BPF, particularly in contexts where certain locks are held.
However, kernfs is not using an irq safe spinlock for kernfs_idr_lock,
that means any kernfs function that is acquiring this lock can be
interrupted and potentially hit bpf_cgroup_from_id() in the process,
triggering a deadlock.
For example, it is really easy to trigger a lockdep splat between
kernfs_idr_lock and rq->_lock, attaching a small BPF program to
__set_cpus_allowed_ptr_locked() that just calls bpf_cgroup_from_id():
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.7.0-rc7-virtme #5 Not tainted
-----------------------------------------------------
repro/131 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffffb2dc4578 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1d/0x80
and this task is already holding:
ffff911cbecaf218 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x50/0xc0
which would create a new lock dependency:
(&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
but this new dependency connects a HARDIRQ-irq-safe lock:
(&rq->__lock){-.-.}-{2:2}
... which became HARDIRQ-irq-safe at:
lock_acquire+0xbf/0x2b0
_raw_spin_lock_nested+0x2e/0x40
scheduler_tick+0x5d/0x170
update_process_times+0x9c/0xb0
tick_periodic+0x27/0xe0
tick_handle_periodic+0x24/0x70
__sysvec_apic_timer_interrupt+0x64/0x1a0
sysvec_apic_timer_interrupt+0x6f/0x80
asm_sysvec_apic_timer_interrupt+0x1a/0x20
memcpy+0xc/0x20
arch_dup_task_struct+0x15/0x30
copy_process+0x1ce/0x1eb0
kernel_clone+0xac/0x390
kernel_thread+0x6f/0xa0
kthreadd+0x199/0x230
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1b/0x30
to a HARDIRQ-irq-unsafe lock:
(kernfs_idr_lock){+.+.}-{2:2}
... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0xbf/0x2b0
_raw_spin_lock+0x30/0x40
__kernfs_new_node.isra.0+0x83/0x280
kernfs_create_root+0xf6/0x1d0
sysfs_init+0x1b/0x70
mnt_init+0xd9/0x2a0
vfs_caches_init+0xcf/0xe0
start_kernel+0x58a/0x6a0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xc5/0xe0
secondary_startup_64_no_verify+0x178/0x17b
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(kernfs_idr_lock);
local_irq_disable();
lock(&rq->__lock);
lock(kernfs_idr_lock);
<Interrupt>
lock(&rq->__lock);
*** DEADLOCK ***
Prevent this deadlock condition converting kernfs_idr_lock to a raw irq
safe spinlock.
The performance impact of this change should be negligible and it also
helps to prevent similar deadlock conditions with any other subsystems
that may depend on kernfs.
Fixes: 332ea1f697 ("bpf: Add bpf_cgroup_from_id() kfunc")
Cc: stable <stable@kernel.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231229074916.53547-1-andrea.righi@canonical.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kfree() function was called in one case by
the bl_resolve_deviceid() function during error handling
even if the passed data structure member contained a null pointer.
This issue was detected by using the Coccinelle software.
Thus use an other label.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFS already has writepages and migrate_folio, so it does not need to
implement writepage. The writepage operation is deprecated as it leads
to worse performance under high memory pressure due to folios being
written out in LRU order rather than sequentially within a file.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that we're calculating how large a remaining IO should be based
on the current request's offset, we no longer need to track bytes_left on
each struct nfs_direct_req. Drop the field, and clean up the direct
request tracepoints.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of relying on the value of the 'bytes_left' field, we should
calculate the layout size based on the offset of the request that is
being written out.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: 954998b60c ("NFS: Fix error handling for O_DIRECT write scheduling")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With this we can see the dentry -> inode linkage that's being
revalidated. A fileid of 0 means "negative dentry".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We do async renames in other cases besides sillyrenames now. This
tracepoint name is now misleading.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a call to the v4 d_revalidate entrypoint, just like the v3 one.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Once the client has processed the CB_LAYOUTRECALL, but has not yet
successfully returned the layout, the server is supposed to switch to
returning NFS4ERR_RETURNCONFLICT. This patch ensures that we handle
that return value correctly.
Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the server is recalling a layout, and sends us a list of referring
calls that we can see are complete, then we should just trust that the
stateid argument is correct, even if the sequence id doesn't match the
one we hold.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the server gives us a set of referring calls, to tell us that the
NFSv4.1 callback needs to be ordered with respect to those calls, then
we may want to make that information available to the operations. In
certain cases, it may allow them to optimise their behaviour due to the
extra knowledge.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The subjective cred (task->cred) can potentially be overridden and
subsquently freed in non-RCU context, which could lead to a panic if we
try to use it in cred_fscmp(). Use __task_cred(), which returns the
objective cred (task->real_cred) instead.
Fixes: 0eb43812c0 ("NFS: Clear the file access cache upon login")
Fixes: 5e9a7b9c2e ("NFS: Fix up a sparse warning")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Again we have claimed regressions for walking a directory tree, this time
with the "find" utility which always tries to optimize away asking for any
attributes until it has a complete list of entries. This behavior makes
the readdir plus heuristic do the wrong thing, which causes a storm of
GETATTRs to determine each entry's type in order to continue the walk.
For v4 add the type attribute to each READDIR request to include it no
matter the heuristic. This allows a simple `find` command to proceed
quickly through a directory tree.
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We noticed a SCSI device that refused to allow READ CAPACITY when the
device had a PR with exclusive access, registrants only. The result of
this situation is that the blocklayout driver adds a pnfs_block_dev of zero
length which always fails the offset_in_map tests. Instead of continuously
trying to do pNFS for this case, just mark the device as unavailable which
will allow the client to fallback to the MDS for the duration of
PNFS_DEVICE_RETRY_TIMEOUT.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The error path for blocklayout's device lookup is missing a reference drop
for the case where a lookup finds the device, but the device is marked with
NFS_DEVICEID_UNAVAILABLE.
Fixes: b3dce6a2f0 ("pnfs/blocklayout: handle transient devices")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fix the proc/fs/fscache symlink to point to "netfs" not "../netfs".
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
In v9fs_upload_to_server(), we pass the error to netfslib to terminate the
subreq rather than the amount of data written - even if we did actually
write something.
Further, we assume that the write is always entirely done if successful -
but it might have been partially complete - as returned by
p9_client_write(), but we ignore that.
Fix this by indicating the amount written by preference and only returning
the error if we didn't write anything.
(We might want to return both in future if both are available as this
might be useful as to whether we retry or not.)
Suggested-by: Dominique Martinet <asmadeus@codewreck.org>
Link: https://lore.kernel.org/r/ZZULNQAZ0n0WQv7p@codewreck.org/
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Do a couple of cleanups to 9p:
(1) Remove a couple of unused variables.
(2) Turn a BUG_ON() into a warning, consolidate with another warning and
make the warning message include the inode number rather than
whatever's in i_private (which will get hashed anyway).
Suggested-by: Dominique Martinet <asmadeus@codewreck.org>
Link: https://lore.kernel.org/r/ZZULNQAZ0n0WQv7p@codewreck.org/
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Merge an ACPI power management change, ACPI backlight driver changes, APEI
updates and ACPI extlog driver changes for 6.8-rc1:
- Modify the ACPI LPIT table handling code to avoid u32 multiplication
overflows in state residency computations (Nikita Kiryushin).
- Drop an unused helper function from the ACPI backlight (video) driver
and add a clarifying comment to it (Hans de Goede).
- Update the ACPI backlight driver to avoid using uninitialized memory
in some cases (Nikita Kiryushin).
- Add ACPI backlight quirk for the Colorful X15 AT 23 laptop (Yuluo
Qiu).
- Add support for vendor-defined error types to the ACPI APEI error
injection code (Avadhut Naik).
- Adjust APEI to properly set MF_ACTION_REQUIRED on synchronous memory
failure events, so they are handled differently from the asynchronous
ones (Shuai Xue).
- Fix NULL pointer dereference check in the ACPI extlog driver (Prarit
Bhargava).
- Adjust the ACPI extlog driver to clear the Extended Error Log status
when RAS_CEC handled the error (Tony Luck).
* acpi-pm:
ACPI: LPIT: Avoid u32 multiplication overflow
* acpi-video:
ACPI: video: Add quirk for the Colorful X15 AT 23 Laptop
ACPI: video: check for error while searching for backlight device parent
ACPI: video: Drop should_check_lcd_flag()
ACPI: video: Add comment about acpi_video_backlight_use_native() usage
* acpi-apei:
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events
ACPI: APEI: EINJ: Add support for vendor defined error types
platform/chrome: cros_ec_debugfs: Fix permissions for panicinfo
fs: debugfs: Add write functionality to debugfs blobs
ACPI: APEI: EINJ: Refactor available_error_type_show()
* acpi-extlog:
ACPI: extlog: Clear Extended Error Log status when RAS_CEC handled the error
ACPI: extlog: fix NULL pointer dereference check
Instead of walking the dentries on mount/remount to update the gid values of
all the dentries if a gid option is specified on mount, just update the root
inode. Add .getattr, .setattr, and .permissions on the tracefs inode
operations to update the permissions of the files and directories.
For all files and directories in the top level instance:
/sys/kernel/tracing/*
It will use the root inode as the default permissions. The inode that
represents: /sys/kernel/tracing (or wherever it is mounted).
When an instance is created:
mkdir /sys/kernel/tracing/instance/foo
The directory "foo" and all its files and directories underneath will use
the default of what foo is when it was created. A remount of tracefs will
not affect it.
If a user were to modify the permissions of any file or directory in
tracefs, it will also no longer be modified by a change in ownership of a
remount.
The events directory, if it is in the top level instance, will use the
tracefs root inode as the default ownership for itself and all the files and
directories below it.
For the events directory in an instance ("foo"), it will keep the ownership
of what it was when it was created, and that will be used as the default
ownership for the files and directories beneath it.
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240103215016.1e0c9811@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The eventfs creates dynamically allocated dentries and inodes. Using the
dcache_readdir() logic for its own directory lookups requires hiding the
cursor of the dcache logic and playing games to allow the dcache_readdir()
to still have access to the cursor while the eventfs saved what it created
and what it needs to release.
Instead, just have eventfs have its own iterate_shared callback function
that will fill in the dent entries. This simplifies the code quite a bit.
Link: https://lore.kernel.org/linux-trace-kernel/20240104015435.682218477@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The "lookup" parameter is a way to differentiate the call to
create_file/dir_dentry() from when it's just a lookup (no need to up the
dentry refcount) and accessed via a readdir (need to up the refcount).
But reality, it just makes the code more complex. Just up the refcount and
let the caller decide to dput() the result or not.
Link: https://lore.kernel.org/linux-trace-kernel/20240103102553.17a19cea@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20240104015435.517502710@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Fix a NULL kernel dereference in set_gid() on tracefs mounting.
When tracefs is mounted with "gid=1000", it will update the existing
dentries to have the new gid. The tracefs_inode which is retrieved
by a container_of(dentry->d_inode) has flags to see if the inode
belongs to the eventfs system.
The issue that was fixed was if getdents() was called on tracefs
that was previously mounted, and was not closed. It will leave
a "cursor dentry" in the subdirs list of the current dentries that
set_gid() walks. On a remount of tracefs, the container_of(dentry->d_inode)
will dereference a NULL pointer and cause a crash when referenced.
Simply have a check for dentry->d_inode to see if it is NULL and if
so, skip that entry.
- Fix the bits of the eventfs_inode structure. The "is_events" bit
was taken from the nr_entries field, but the nr_entries field wasn't
updated to be 30 bits and was still 31. Including the "is_freed" bit
this would use 33 bits which would make the structure use another
integer for just one bit.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZTAdxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quC9APwO307eRre10oAscdis90nh8jN9lg2T
bcaN5QKwcQgHDAEA3r/93A5UvczCp1NhSDEdBoL1NmRyYD034sYtaa8SpgI=
=WTpg
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix a NULL kernel dereference in set_gid() on tracefs mounting.
When tracefs is mounted with "gid=1000", it will update the existing
dentries to have the new gid. The tracefs_inode which is retrieved by
a container_of(dentry->d_inode) has flags to see if the inode belongs
to the eventfs system.
The issue that was fixed was if getdents() was called on tracefs that
was previously mounted, and was not closed. It will leave a "cursor
dentry" in the subdirs list of the current dentries that set_gid()
walks. On a remount of tracefs, the container_of(dentry->d_inode)
will dereference a NULL pointer and cause a crash when referenced.
Simply have a check for dentry->d_inode to see if it is NULL and if
so, skip that entry.
- Fix the bits of the eventfs_inode structure.
The "is_events" bit was taken from the nr_entries field, but the
nr_entries field wasn't updated to be 30 bits and was still 31.
Including the "is_freed" bit this would use 33 bits which would make
the structure use another integer for just one bit.
* tag 'trace-v6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Fix bitwise fields for "is_events"
tracefs: Check for dentry->d_inode exists in set_gid()
The 9p filesystem is calling netfs_inode_init() in v9fs_init_inode() -
before the struct inode fields have been initialised from the obtained file
stats (ie. after v9fs_stat2inode*() has been called), but netfslib wants to
set a couple of its fields from i_size.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Fix __cachefiles_prepare_write() to correctly determine whether the
requested write will fit correctly with the DIO alignment.
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Yiqun Leng <yqleng@linux.alibaba.com>
Tested-by: Jia Zhu <zhujia.zj@bytedance.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
A flag was needed to denote which eventfs_inode was the "events"
directory, so a bit was taken from the "nr_entries" field, as there's not
that many entries, and 2^30 is plenty. But the bit number for nr_entries
was not updated to reflect the bit taken from it, which would add an
unnecessary integer to the structure.
Link: https://lore.kernel.org/linux-trace-kernel/20240102151832.7ca87275@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 7e8358edf5 ("eventfs: Fix file and directory uid and gid ownership")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If a getdents() is called on the tracefs directory but does not get all
the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs
dentry. This cursor dentry does not have a d_inode for it. Before
referencing tracefs_inode from the dentry, the d_inode must first be
checked if it has content. If not, then it's not a tracefs_inode and can
be ignored.
The following caused a crash:
#define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count)
#define BUF_SIZE 256
#define TDIR "/tmp/file0"
int main(void)
{
char buf[BUF_SIZE];
int fd;
int n;
mkdir(TDIR, 0777);
mount(NULL, TDIR, "tracefs", 0, NULL);
fd = openat(AT_FDCWD, TDIR, O_RDONLY);
n = getdents64(fd, buf, BUF_SIZE);
ret = mount(NULL, TDIR, NULL, MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME,
"gid=1000");
return 0;
}
That's because the 256 BUF_SIZE was not big enough to read all the
dentries of the tracefs file system and it left a "cursor" dentry in the
subdirs of the tracefs root inode. Then on remounting with "gid=1000",
it would cause an iteration of all dentries which hit:
ti = get_tracefs(dentry->d_inode);
if (ti && (ti->flags & TRACEFS_EVENT_INODE))
eventfs_update_gid(dentry, gid);
Which crashed because of the dereference of the cursor dentry which had a NULL
d_inode.
In the subdir loop of the dentry lookup of set_gid(), if a child has a
NULL d_inode, simply skip it.
Link: https://lore.kernel.org/all/20240102135637.3a21fb10@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240102151249.05da244d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 7e8358edf5 ("eventfs: Fix file and directory uid and gid ownership")
Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Steal time account support along with selftest
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmWQ+cgACgkQrUjsVaLH
LAckBA//R4X9L5ugfPdDunp3ntjZXmNtBS5pM2jD+UvaoFn2kOA1o5kOD5mXluuh
0imNjVuzlrX7XoAATQ4BoeoXg0whDbnv/8TE13KqSl1PfNziH2p5YD2DuHXPST3B
V2VHrGACZ4wN074Whztl0oYI72obGnmpcsiglifkeCvRPerghHuDu40eUaWvCmgD
DPwT+bjBgxkDZ4IheyytUrPql6reALh1Qo1vfj0FsJAgj+MAqQeD8n6rixSPnOdE
9XXa4jdu7ycDp675VX/DVEWsNBQGPrsRK/uCiMksO36td+wLCKpAkvX95mE0w/L8
qFJ+dN1c+1ZkjooHdVLfq2MjxaIRwmIowk7SeJbpvGIf/zG3r7eany7eXJT0+NjO
22j5FY2z1NqcSG6Fazx76Qp2vVBVbxHShP9h7d6VTZYS7XENjmV6IWHpTSuSF8+n
puj8Nf5C7WuqbySirSgQndDuKawn9myqfXXEoAuSiZ+kVyYEl8QnXm2gAIcxRDHX
x+NDPMv0DpMBRO9qa/tXeqgNue/XOTJwgbmXzAlCNff3U7hPIHJ/5aZiJ/Re5TeE
DxiU9AmIsNN2Bh0csS/wQbdScIqkOdOiDYEwT1DXOJWpmhiyCW7vR8ltaIuMJ4vP
DtlfuUlSe4aml957nAiqqyjQAY/7gqmpoaGwu+lmrOX1K7fdtF0=
=FeiG
-----END PGP SIGNATURE-----
Merge tag 'kvm-riscv-6.8-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv changes for 6.8 part #1
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Steal time account support along with selftest
The kernel thread function jfs_lazycommit() and jfs_sync() invoke the
try_to_freeze() in its loop. But all the kernel threads are no-freezable
by default. So if we want to make a kernel thread to be freezable, we have
to invoke set_freezable() explicitly.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Now that we have dynamically resizable btree paths,
check_directory_structure() can check one path - inode up to the root -
in a single transaction.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
reattach_inode() was broken w.r.t. snapshots - we'd lookup the subvolume
to look up lost+found, but if we're in an interior node snapshot that
didn't make any sense.
Instead, this adds a dirent path for creating in a specific snapshot,
skipping the subvolume; and we also make sure to create lost+found in
the root snapshot, to avoid conflicts with lost+found being created in
overlapping snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
refactoring the BTREE_ITER_WITH_UPDATES code, prep for removing the flag
and making it always-on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
refactoring the BTREE_ITER_WITH_UPDATES code, prep for removing the flag
and making it always-on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
XXX: we're allocating memory with btree locks held - bad
We need to plumb through an error path so we can do
allocate_dropping_locks() - but we're merging this now because it fixes
a transaction path overflow caused by indirect extent fragmentation, and
the resize path is rare.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since the btree_paths array is now about to become growable, we have to
be careful not to refer to paths by pointer across contexts where they
may be reallocated.
This fixes the remaining btree_interior_update() paths - split and
merge.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to plumb through dynamically growable btree_paths; this patch
replaces most BTREE_ITER_MAX references with trans->nr_paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the reflink triggers are also bumping up against the maximum number of
paths in a transaction - and generating proportional numbers of updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Some tweaks to greatly reduce locking overhead for the list of btree
transactions, so that it can always be enabled: leave btree_trans
objects on the list when they're on the percpu single item freelist,
and only check for duplicates in the same process when
CONFIG_BCACHEFS_DEBUG is enabled
- don't zero out the full btree_trans() unless we allocated it from
the mempool
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Upcoming patches are going to be changing trans->paths to a
reallocatable buffer. We need to guard against use after free when it's
used by other threads; this introduces RCU protection to those paths and
changes them to check for trans->paths == NULL
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
path->idx is now a code smell: we should be using path_idx_t, since it's
stable across btree path reallocation.
This is also a bit faster, using the same loop counter vs. fetching
path->idx from each path we iterate over.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Bit of cleanup & modernization: also moving this code to util.c, it'll
be used by userspace as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>