-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZUpEaAAKCRCRxhvAZXjc
ounBAQCAoS66gnOZ+k4kOWwB2zZ1Ueh3dPFC7IcEZ+pwFS8hpAEAxUQxV0TSWf5l
W/1oKRtAJyuSYvehHeMUSJmHVBiM8w4=
=bNm0
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fanotify fsid updates from Christian Brauner:
"This work is part of the plan to enable fanotify to serve as a drop-in
replacement for inotify. While inotify is availabe on all filesystems,
fanotify currently isn't.
In order to support fanotify on all filesystems two things are needed:
(1) all filesystems need to support AT_HANDLE_FID
(2) all filesystems need to report a non-zero f_fsid
This contains (1) and allows filesystems to encode non-decodable file
handlers for fanotify without implementing any exportfs operations by
encoding a file id of type FILEID_INO64_GEN from i_ino and
i_generation.
Filesystems that want to opt out of encoding non-decodable file ids
for fanotify that don't support NFS export can do so by providing an
empty export_operations struct.
This also partially addresses (2) by generating f_fsid for simple
filesystems as well as freevxfs. Remaining filesystems will be dealt
with by separate patches.
Finally, this contains the patch from the current exportfs maintainers
which moves exportfs under vfs with Chuck, Jeff, and Amir as
maintainers and vfs.git as tree"
* tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: create an entry for exportfs
fs: fix build error with CONFIG_EXPORTFS=m or not defined
freevxfs: derive f_fsid from bdev->bd_dev
fs: report f_fsid from s_dev for "simple" filesystems
exportfs: support encoding non-decodeable file handles by default
exportfs: define FILEID_INO64_GEN* file handle types
exportfs: make ->encode_fh() a mandatory method for NFS export
exportfs: add helpers to check if filesystem can encode/decode file handles
Similar to the common FILEID_INO32* file handle types, define common
FILEID_INO64* file handle types.
The type values of FILEID_INO64_GEN and FILEID_INO64_GEN_PARENT are the
values returned by fuse and xfs for 64bit ino encoded file handle types.
Note that these type value are filesystem specific and they do not define
a universal file handle format, for example:
fuse encodes FILEID_INO64_GEN as [ino-hi32,ino-lo32,gen] and xfs encodes
FILEID_INO64_GEN as [hostr-order-ino64,gen] (a.k.a xfs_fid64).
The FILEID_INO64_GEN fhandle type is going to be used for file ids for
fanotify from filesystems that do not support NFS export.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZPYlzAAKCRDh3BK/laaZ
PEcxAP4suFAlonGntKJ5ltR+7ZN+WYdiraQ+5c6ISBFc+pFXgQD7B0xhztV4umSF
III+pbD6lE5gP5u7+Kw/pOnTI42yTQ8=
=aPjn
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Revert non-waiting FLUSH due to a regression
- Fix a lookup counter leak in readdirplus
- Add an option to allow shared mmaps in no-cache mode
- Add btime support and statx intrastructure to the protocol
- Invalidate positive/negative dentry on failed create/delete
* tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: conditionally fill kstat in fuse_do_statx()
fuse: invalidate dentry on EEXIST creates or ENOENT deletes
fuse: cache btime
fuse: implement statx
fuse: add ATTR_TIMEOUT macro
fuse: add STATX request
fuse: handle empty request_mask in statx
fuse: write back dirty pages before direct write in direct_io_relax mode
fuse: add a new fuse init flag to relax restrictions in no cache mode
fuse: invalidate page cache pages before direct write
fuse: nlookup missing decrement in fuse_direntplus_link
Revert "fuse: in fuse_flush only wait if someone wants the return code"
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
Not all inode attributes are supported by all filesystems, but for the
basic stats (which are returned by stat(2) and friends) all of them will
have some value, even if that doesn't reflect a real attribute of the file.
Btime is different, in that filesystems are free to report or not report a
value in statx. If the value is available, then STATX_BTIME bit is set in
stx_mask.
When caching the value of btime, remember the availability of the attribute
as well as the value (if available). This is done by using the
FUSE_I_BTIME bit in fuse_inode->state to indicate availability, while using
fuse_inode->inval_mask & STATX_BTIME to indicate the state of the cache
itself (i.e. set if cache is invalid, and cleared if cache is valid).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow querying btime. When btime is requested in mask, then FUSE_STATX
request is sent. Otherwise keep using FUSE_GETATTR.
The userspace interface for statx matches that of the statx(2) API.
However there are limitations on how this interface is used:
- returned basic stats and btime are used, stx_attributes, etc. are
ignored
- always query basic stats and btime, regardless of what was requested
- requested sync type is ignored, the default is passed to the server
- if server returns with some attributes missing from the result_mask,
then no attributes will be cached
- btime is not cached yet (next patch will fix that)
For new inodes initialize fi->inval_mask to "all invalid", instead of "all
valid" as previously. Also only clear basic stats from inval_mask when
caching attributes. This will result in the caching logic not thinking
that btime is cached.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FOPEN_DIRECT_IO is usually set by fuse daemon to indicate need of strong
coherency, e.g. network filesystems. Thus shared mmap is disabled since it
leverages page cache and may write to it, which may cause inconsistence.
But FOPEN_DIRECT_IO can be used not for coherency but to reduce memory
footprint as well, e.g. reduce guest memory usage with virtiofs.
Therefore, add a new fuse init flag FUSE_DIRECT_IO_RELAX to relax
restrictions in that mode, currently, it allows shared mmap. One thing to
note is to make sure it doesn't break coherency in your use case.
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-44-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is just a safety precaution to avoid checking flags on memory that was
initialized on the user space side. libfuse zeroes struct fuse_init_out
outarg, but this is not guranteed to be done in all implementations.
Better is to act on flags and to only apply flags2 when FUSE_INIT_EXT is
set.
There is a risk with this change, though - it might break existing user
space libraries, which are already using flags2 without setting
FUSE_INIT_EXT.
The corresponding libfuse patch is here
https://github.com/libfuse/libfuse/pull/662
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Fixes: 53db28933e ("fuse: extend init flags")
Cc: <stable@vger.kernel.org> # v5.17
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of
FUSE_NOTIFY_INVAL_ENTRY is effective.
This is needed for backports of this feature, otherwise the server could
just check the protocol version.
Fixes: 4f8d37020e ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY")
Cc: <stable@vger.kernel.org> # v6.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCY/ysqAAKCRDh3BK/laaZ
PLSfAP9c4z/KOZj/Am8nQ0mqI8Ss0Ei+hyu8Vow6NoyJkR4NvwD/SxEeI2+rpXj7
TGOA+6k4chLc7QIFBsIwGvQgeXld1gA=
=s4b/
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fix regression in fileattr permission checking
- Fix possible hang during PID namespace destruction
- Add generic support for request extensions
- Add supplementary group list extension
- Add limited support for supplying supplementary groups in create
requests
- Documentation fixes
* tag 'fuse-update-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add inode/permission checks to fileattr_get/fileattr_set
fuse: fix all W=1 kernel-doc warnings
fuse: in fuse_flush only wait if someone wants the return code
fuse: optional supplementary group in create requests
fuse: add request extension
Permission to create an object (create, mkdir, symlink, mknod) needs to
take supplementary groups into account.
Add a supplementary group request extension. This can contain an arbitrary
number of group IDs and can be added to any request. This extension is not
added to any request by default.
Add FUSE_CREATE_SUPP_GROUP init flag to enable supplementary group info in
creation requests. This adds just a single supplementary group that
matches the parent group in the case described above. In other cases the
extension is not added.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This cycle we ported all filesystems to the new posix acl api. While
looking at further simplifications in this area to remove the last
remnants of the generic dummy posix acl handlers we realized that we
regressed fuse daemons that don't set FUSE_POSIX_ACL but still make use
of posix acls.
With the change to a dedicated posix acl api interacting with posix acls
doesn't go through the old xattr codepaths anymore and instead only
relies the get acl and set acl inode operations.
Before this change fuse daemons that don't set FUSE_POSIX_ACL were able
to get and set posix acl albeit with two caveats. First, that posix acls
aren't cached. And second, that they aren't used for permission checking
in the vfs.
We regressed that use-case as we currently refuse to retrieve any posix
acls if they aren't enabled via FUSE_POSIX_ACL. So older fuse daemons
would see a change in behavior.
We can restore the old behavior in multiple ways. We could change the
new posix acl api and look for a dedicated xattr handler and if we find
one prefer that over the dedicated posix acl api. That would break the
consistency of the new posix acl api so we would very much prefer not to
do that.
We could introduce a new ACL_*_CACHE sentinel that would instruct the
vfs permission checking codepath to not call into the filesystem and
ignore acls.
But a more straightforward fix for v6.2 is to do the same thing that
Overlayfs does and give fuse a separate get acl method for permission
checking. Overlayfs uses this to express different needs for vfs
permission lookup and acl based retrieval via the regular system call
path as well. Let fuse do the same for now. This way fuse can continue
to refuse to retrieve posix acls for daemons that don't set
FUSE_POSXI_ACL for permission checking while allowing a fuse server to
retrieve it via the usual system calls.
In the future, we could extend the get acl inode operation to not just
pass a simple boolean to indicate rcu lookup but instead make it a flag
argument. Then in addition to passing the information that this is an
rcu lookup to the filesystem we could also introduce a flag that tells
the filesystem that this is a request from the vfs to use these acls for
permission checking. Then fuse could refuse the get acl request for
permission checking when the daemon doesn't have FUSE_POSIX_ACL set in
the same get acl method. This would also help Overlayfs and allow us to
remove the second method for it as well.
But since that change is more invasive as we need to update the get acl
inode operation for multiple filesystems we should not do this as a fix
for v6.2. Instead we will do this for the v6.3 merge window.
Fwiw, since posix acls are now always correctly translated in the new
posix acl api we could also allow them to be used for daemons without
FUSE_POSIX_ACL that are not mounted on the host. But this is behavioral
change and again if dones should be done for v6.3. For now, let's just
restore the original behavior.
A nice side-effect of this change is that for fuse daemons with and
without FUSE_POSIX_ACL the same code is used for posix acls in a
backwards compatible way. This also means we can remove the legacy xattr
handlers completely. We've also added comments to explain the expected
behavior for daemons without FUSE_POSIX_ACL into the code.
Fixes: 318e66856d ("xattr: use posix acl api")
Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Force unmount of FUSE severes the connection with the user space, even if
there are still open files. Subsequent remount tries to re-use the
superblock held by the open files, which is meaningless in the FUSE case
after disconnect - reused super block doesn't have userspace counterpart
attached to it and is incapable of doing any IO.
This patch adds the functionality only for the block-device-based supers,
since the primary use case of the feature is to gracefully handle force
unmount of external devices, mounted with FUSE. This can be further
extended to cover all superblocks, if the need arises.
Signed-off-by: Daniil Lunev <dlunev@chromium.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
...to help userland apps that need to identify FUSE mounts.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When the per inode DAX hint changes while the file is still *opened*, it
is quite complicated and maybe fragile to dynamically change the DAX
state.
Hence mark the inode and corresponding dentries as DONE_CACHE once the
per inode DAX hint changes, so that the inode instance will be evicted
and freed as soon as possible once the file is closed and the last
reference to the inode is put. And then when the file gets reopened next
time, the new instantiated inode will reflect the new DAX state.
In summary, when the per inode DAX hint changes for an *opened* file, the
DAX state of the file won't be updated until this file is closed and
reopened later.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Among the FUSE_INIT phase, client shall advertise per inode DAX if it's
mounted with "dax=inode". Then server is aware that client is in per
inode DAX mode, and will construct per-inode DAX attribute accordingly.
Server shall also advertise support for per inode DAX. If server doesn't
support it while client is mounted with "dax=inode", client will
silently fallback to "dax=never" since "dax=inode" is advisory only.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
DAX may be limited in some specific situation. When the number of usable
DAX windows is under watermark, the recalim routine will be triggered to
reclaim some DAX windows. It may have a negative impact on the
performance, since some processes may need to wait for DAX windows to be
recalimed and reused then. To mitigate the performance degradation, the
overall DAX window need to be expanded larger.
However, simply expanding the DAX window may not be a good deal in some
scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB
(512 * 64 bytes) memory footprint will be consumed for page descriptors
inside guest, which is greater than the memory footprint if it uses
guest page cache when DAX disabled. Thus it'd better disable DAX for
those files smaller than 32KB, to reduce the demand for DAX window and
thus avoid the unworthy memory overhead.
Per inode DAX feature is introduced to address this issue, by offering a
finer grained control for dax to users, trying to achieve a balance
between performance and memory overhead.
The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether
DAX should be enabled or not for corresponding file. Currently the state
whether DAX is enabled or not for the file is initialized only when
inode is instantiated.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We add 'always', 'never', and 'inode' (default). '-o dax' continues to
operate the same which is equivalent to 'always'.
The following behavior is consistent with that on ext4/xfs:
- The default behavior (when neither '-o dax' nor
'-o dax=always|never|inode' option is specified) is equal to 'inode'
mode, while 'dax=inode' won't be printed among the mount option list.
- The 'inode' mode is only advisory. It will silently fallback to 'never'
mode if fuse server doesn't support that.
Also noted that by the time of this commit, 'inode' mode is actually equal
to 'always' mode, before the per inode DAX flag is introduced in the
following patch.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When a new inode is created, send its security context to server along with
creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK).
This gives server an opportunity to create new file and set security
context (possibly atomically). In all the configurations it might not be
possible to set context atomically.
Like nfs and ceph, use security_dentry_init_security() to dermine security
context of inode and send it with create, mkdir, mknod, and symlink
requests.
Following is the information sent to server.
fuse_sectx_header, fuse_secctx, xattr_name, security_context
- struct fuse_secctx_header
This contains total number of security contexts being sent and total
size of all the security contexts (including size of
fuse_secctx_header).
- struct fuse_secctx
This contains size of security context which follows this structure.
There is one fuse_secctx instance per security context.
- xattr name string
This string represents name of xattr which should be used while setting
security context.
- security context
This is the actual security context whose size is specified in
fuse_secctx struct.
Also add the FUSE_SECURITY_CTX flag for the `flags` field of the
fuse_init_out struct. When this flag is set the kernel will append the
security context for a newly created inode to the request (create, mkdir,
mknod, and symlink). The server is responsible for ensuring that the inode
appears atomically (preferrably) with the requested security context.
For example, If the server is using SELinux and backed by a "real" linux
file system that supports extended attributes it can write the security
context value to /proc/thread-self/attr/fscreate before making the syscall
to create the inode.
This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FUSE_INIT flags are close to running out, so add another 32bits worth of
space.
Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in. If this
flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi
field is allocated to contain the high 32bits of the flags.
A flags_hi field is also added to fuse_init_out, allocated out of the
remaining unused fields.
Known userspace implementations of the fuse protocol have been checked to
accept the extended FUSE_INIT request, but this might cause problems with
other implementations. If that happens to be the case, the protocol
negotiation will have to be extended with an extra initialization request
roundtrip.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If writeback_cache is enabled, then the size, mtime and ctime attributes of
regular files are always valid in the kernel's cache. They are retrieved
from userspace only when the inode is freshly looked up.
Add a more generic "cache_mask", that indicates which attributes are
currently valid in cache.
This patch doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In case of writeback_cache fuse_fillattr() would revert the queried
attributes to the cached version.
Move this to fuse_change_attributes() in order to manage the writeback
logic in a central helper. This will be necessary for patches that follow.
Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling
fuse_change_attributes(), so this should not change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There are two instances of "bool is_wb = fc->writeback_cache" where the
actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)".
Clean up these cases by storing the second condition in the local variable.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.
Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.
The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).
Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.
- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call
- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper
Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Syzkaller reports a null pointer dereference in fuse_test_super() that is
caused by sb->s_fs_info being NULL.
This is due to the fact that fuse_fill_super() is initializing s_fs_info,
which is too late, it's already on the fs_supers list. The initialization
needs to be done in sget_fc() with the sb_lock held.
Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into
fuse_get_tree().
After this ->kill_sb() will always be called with non-NULL ->s_fs_info,
hence fuse_mount_destroy() can drop the test for non-NULL "fm".
Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com
Fixes: 5d5b74aa9c ("fuse: allow sharing existing sb")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
1. call fuse_mount_destroy() for open coded variants
2. before deactivate_locked_super() don't need fuse_mount destruction since
that will now be done (if ->s_fs_info is not cleared)
3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the
regular pattern can be used
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The ->put_super callback is called from generic_shutdown_super() in case of
a fully initialized sb. This is called from kill_***_super(), which is
called from ->kill_sb instances.
Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the
reference to the fuse_conn, while it does the same on each error case
during sb setup.
This patch moves the destruction from fuse_put_super() to
fuse_mount_destroy(), called at the end of all ->kill_sb instances. A
follup patch will clean up the error paths.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Checking "fm" works because currently sb->s_fs_info is cleared on error
paths; however, sb->s_root is what generic_shutdown_super() checks to
determine whether the sb was fully initialized or not.
This change will allow cleanup of sb setup error paths.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTcNYgAKCRDh3BK/laaZ
PCaHAQCTfaWYK+H6kNIffxnRG+yGsMAmoZLGbc5l9M5GeTEmtgEA0vkmWbDt4h/I
PSA0Q2T0gzSsBdxkF4jQHqKO+HGMVg8=
=Vh9A
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Allow mounting an active fuse device. Previously the fuse device
would always be mounted during initialization, and sharing a fuse
superblock was only possible through mount or namespace cloning
- Fix data flushing in syncfs (virtiofs only)
- Fix data flushing in copy_file_range()
- Fix a possible deadlock in atomic O_TRUNC
- Misc fixes and cleanups
* tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: remove unused arg in fuse_write_file_get()
fuse: wait for writepages in syncfs
fuse: flush extending writes
fuse: truncate pagecache on atomic_o_trunc
fuse: allow sharing existing sb
fuse: move fget() to fuse_get_tree()
fuse: move option checking into fuse_fill_super()
fuse: name fs_context consistently
fuse: fix use after free in fuse_read_interrupt()
In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called. This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.
This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.
Introduce sync buckets, that hold a counter for the number of outstanding
write requests. On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.
It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets. Descendant buckets must
not complete until the parent completes. Add a count to the child (new)
bucket until the (parent) old bucket completes.
Use RCU protection to dereference the current bucket and to wake up an
emptied bucket. Use fc->lock to protect against parallel assignments to
the current bucket.
This leaves just the counter to be a possible scalability issue. The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab251e ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Make it possible to create a new mount from a already working server.
Here's a detailed description of the problem from Jakob:
"The background for this question is occasional problems we see with our
fuse filesystem [1] and mount namespaces. On a usual client, we have
system-wide, autofs managed mountpoints. When a new mount namespace is
created (which can be done unprivileged in combination with user
namespaces), it can happen that a mountpoint is used inside the new
namespace but idle in the root mount namespace. So autofs unmounts the
parent, system-wide mountpoint. But the fuse module stays active and
still serves mountpoint in the child mount namespace. Because the fuse
daemon also blocks other system wide resources corresponding to the
mountpoint, this situation effectively prevents new mounts until the
child mount namespaces closes.
[1] https://github.com/cvmfs/cvmfs"
Reported-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount
options are present can be moved from fuse_get_tree() into
fuse_fill_super() where the value of the options are consumed.
This relaxes semantics of reusing a fuse blockdev mount using the device
name. Before this patch presence of these options were enforced but values
ignored, after this patch these options are completely ignored in this
case.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use invalidate_lock instead of fuse's private i_mmap_sem. The intended
purpose is exactly the same. By this conversion we fix a long standing
race between hole punching and read(2) / readahead(2) paths that can
lead to stale page cache contents.
CC: Miklos Szeredi <miklos@szeredi.hu>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...)
with ourarg containing nodeid and generation.
If a fuse inode is found in inode cache with the same nodeid but different
generation, the existing fuse inode should be unhashed and marked "bad" and
a new inode with the new generation should be hashed instead.
This can happen, for example, with passhrough fuse filesystem that returns
the real filesystem ino/generation on lookup and where real inode numbers
can get recycled due to real files being unlinked not via the fuse
passthrough filesystem.
With current code, this situation will not be detected and an old fuse
dentry that used to point to an older generation real inode, can be used to
access a completely new inode, which should be accessed only via the new
dentry.
Note that because the FORGET message carries the nodeid w/o generation, the
server should wait to get FORGET counts for the nlookup counts of the old
and reused inodes combined, before it can free the resources associated to
that nodeid.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This function used to be called from fuse_dentry_automount(). This code
was moved to fuse_get_tree_submount() in the same file since then. It
is unlikely there will ever be another user. No need to be extern in
this case.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We recently fixed an infinite loop by setting the SB_BORN flag on
submounts along with the write barrier needed by super_cache_count().
This is the job of vfs_get_tree() and FUSE shouldn't have to care
about the barrier at all.
Split out some code from fuse_dentry_automount() to the dedicated
fuse_get_tree_submount() handler for submounts and call vfs_get_tree().
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The creation of a submount is open-coded in fuse_dentry_automount().
This brings a lot of complexity and we recently had to fix bugs
because we weren't setting SB_BORN or because we were unlocking
sb->s_umount before sb was fully configured. Most of these could
have been avoided by using the mount API instead of open-coding.
Basically, this means coming up with a proper ->get_tree()
implementation for submounts and call vfs_get_tree(), or better
fc_mount().
The creation of the superblock for submounts is quite different from
the root mount. Especially, it doesn't require to allocate a FUSE
filesystem context, nor to parse parameters.
Introduce a dedicated context ops for submounts to make this clear.
This is just a placeholder for now, fuse_get_tree_submount() will
be populated in a subsequent patch.
Only visible change is that we stop allocating/freeing a useless FUSE
filesystem context with submounts.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Even if POSIX doesn't mandate it, linux users legitimately expect sync() to
flush all data and metadata to physical storage when it is located on the
same system. This isn't happening with virtiofs though: sync() inside the
guest returns right away even though data still needs to be flushed from
the host page cache.
This is easily demonstrated by doing the following in the guest:
$ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
sync() = 0 <0.024068>
and start the following in the host when the 'dd' command completes
in the guest:
$ strace -T -e fsync /usr/bin/sync virtiofs/foo
fsync(3) = 0 <10.371640>
There are no good reasons not to honor the expected behavior of sync()
actually: it gives an unrealistic impression that virtiofs is super fast
and that data has safely landed on HW, which isn't the case obviously.
Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS
request type for this purpose. Provision a 64-bit placeholder for possible
future extensions. Since the file server cannot handle the wait == 0 case,
we skip it to avoid a gratuitous roundtrip. Note that this is
per-superblock: a FUSE_SYNCFS is send for the root mount and for each
submount.
Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in
the file server is treated as permanent success. This ensures
compatibility with older file servers: the client will get the current
behavior of sync() not being propagated to the file server.
Note that such an operation allows the file server to DoS sync(). Since a
typical FUSE file server is an untrusted piece of software running in
userspace, this is disabled by default. Only enable it with virtiofs for
now since virtiofsd is supposedly trusted by the guest kernel.
Reported-by: Robert Krawitz <rlk@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull misc vfs updates from Al Viro:
"Assorted stuff all over the place"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
useful constants: struct qstr for ".."
hostfs_open(): don't open-code file_dentry()
whack-a-mole: kill strlen_user() (again)
autofs: should_expire() argument is guaranteed to be positive
apparmor:match_mn() - constify devpath argument
buffer: a small optimization in grow_buffers
get rid of autofs_getpath()
constify dentry argument of dentry_path()/dentry_path_raw()
If an incoming FUSE request can't fit on the virtqueue, the request is
placed onto a workqueue so a worker can try to resubmit it later where
there will (hopefully) be space for it next time.
This is fine for requests that aren't larger than a virtqueue's maximum
capacity. However, if a request's size exceeds the maximum capacity of the
virtqueue (even if the virtqueue is empty), it will be doomed to a life of
being placed on the workqueue, removed, discovered it won't fit, and placed
on the workqueue yet again.
Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
Descriptors) of the virtio spec:
"A driver MUST NOT create a descriptor chain longer than the Queue
Size of the device."
To fix this, limit the number of pages FUSE will use for an overall
request. This way, each request can realistically fit on the virtqueue
when it is decomposed into a scattergather list and avoid violating section
2.6.5.3.1 of the virtio spec.
Signed-off-by: Connor Kuehl <ckuehl@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fuse client needs to send additional information to file server when it
calls SETXATTR(system.posix_acl_access), so add extra flags field to the
structure.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>