Registers fscache context for primary data blob. Also move the
initialization of s_op and related fields forward, since anonymous
inode will be allocated under the super block when registering the
fscache context.
Something worth mentioning about the cleanup routine.
1. The fscache context will instantiate anonymous inodes under the super
block. Release these anonymous inodes when .put_super() is called, or
we'll get "VFS: Busy inodes after unmount." warning.
2. The fscache context is initialized prior to the root inode. If
.kill_sb() is called when mount failed, .put_super() won't be called
when root inode has not been initialized yet. Thus .kill_sb() shall
also contain the cleanup routine.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-16-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add erofs_fscache_read_folios() helper reading from fscache. It supports
on-demand read semantics. That is, it will make the backend prepare for
the data when cache miss. Once data ready, it will read from the cache.
This helper can then be used to implement .readpage()/.readahead() of
on-demand read semantics.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-15-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Introduce one anonymous inode for data blobs so that erofs can cache
metadata directly within such anonymous inode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-14-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Introduce a context structure for managing data blobs, and helper
functions for initializing and cleaning up this context structure.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-13-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
A new fscache based mode is going to be introduced for erofs, in which
case on-demand read semantics is implemented through fscache.
As the first step, register fscache volume for each erofs filesystem.
That means, data blobs can not be shared among erofs filesystems. In the
following iteration, we are going to introduce the domain semantics, in
which case several erofs filesystems can belong to one domain, and data
blobs can be shared among these erofs filesystems of one domain.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-12-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Until then erofs is exactly blockdev based filesystem.
A new fscache-based mode is going to be introduced for erofs to support
scenarios where on-demand read semantics is needed, e.g. container
image distribution. In this case, erofs could be mounted from data blobs
through fscache.
Add a helper checking which mode erofs works in, and twist the code in
preparation for the upcoming fscache mode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-11-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
... so that it can be used in the following introduced fscache mode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-10-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add tracepoints for on-demand read mode. Currently following tracepoints
are added:
OPEN request / COPEN reply
CLOSE request
READ request / CREAD reply
write through anonymous fd
release of anonymous fd
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-8-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Enable on-demand read mode by adding an optional parameter to the "bind"
command.
On-demand mode will be turned on when this parameter is "ondemand", i.e.
"bind ondemand". Otherwise cachefiles will work in the original mode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-7-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Implement the data plane of on-demand read mode.
The early implementation [1] place the entry to
cachefiles_ondemand_read() in fscache_read(). However, fscache_read()
can only detect if the requested file range is fully cache miss, whilst
we need to notify the user daemon as long as there's a hole inside the
requested file range.
Thus the entry is now placed in cachefiles_prepare_read(). When working
in on-demand read mode, once a hole detected, the read routine will send
a READ request to the user daemon. The user daemon needs to fetch the
data and write it to the cache file. After sending the READ request, the
read routine will hang there, until the READ request is handled by the
user daemon. Then it will retry to read from the same file range. If no
progress encountered, the read routine will fail then.
A new NETFS_SREQ_ONDEMAND flag is introduced to indicate that on-demand
read should be done when a cache miss encountered.
[1] https://lore.kernel.org/all/20220406075612.60298-6-jefflexu@linux.alibaba.com/ #v8
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Notify the user daemon that cookie is going to be withdrawn, providing a
hint that the associated anonymous fd can be closed.
Be noted that this is only a hint. The user daemon may close the
associated anonymous fd when receiving the CLOSE request, then it will
receive another anonymous fd when the cookie gets looked up. Or it may
ignore the CLOSE request, and keep writing data through the anonymous
fd. However the next time the cookie gets looked up, the user daemon
will still receive another new anonymous fd.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add a refcount to avoid the deadlock in on-demand read mode. The
on-demand read mode will pin the corresponding cachefiles object for
each anonymous fd. The cachefiles object is unpinned when the anonymous
fd gets closed. When the user daemon exits and the fd of
"/dev/cachefiles" device node gets closed, it will wait for all
cahcefiles objects getting withdrawn. Then if there's any anonymous fd
getting closed after the fd of the device node, the user daemon will
hang forever, waiting for all objects getting withdrawn.
To fix this, add a refcount indicating if there's any object pinned by
anonymous fds. The cachefiles cache gets unbound and withdrawn when the
refcount is decreased to 0. It won't change the behaviour of the
original mode, in which case the cachefiles cache gets unbound and
withdrawn as long as the fd of the device node gets closed.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-4-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Fscache/CacheFiles used to serve as a local cache for a remote
networking fs. A new on-demand read mode will be introduced for
CacheFiles, which can boost the scenario where on-demand read semantics
are needed, e.g. container image distribution.
The essential difference between these two modes is seen when a cache
miss occurs: In the original mode, the netfs will fetch the data from
the remote server and then write it to the cache file; in on-demand
read mode, fetching the data and writing it into the cache is delegated
to a user daemon.
As the first step, notify the user daemon when looking up cookie. In
this case, an anonymous fd is sent to the user daemon, through which the
user daemon can write the fetched data to the cache file. Since the user
daemon may move the anonymous fd around, e.g. through dup(), an object
ID uniquely identifying the cache file is also attached.
Also add one advisory flag (FSCACHE_ADV_WANT_CACHE_SIZE) suggesting that
the cache file size shall be retrieved at runtime. This helps the
scenario where one cache file contains multiple netfs files, e.g. for
the purpose of deduplication. In this case, netfs itself has no idea the
size of the cache file, whilst the user daemon should give the hint on
it.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-3-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Extract the generic routine of writing data to cache files, and make it
generally available.
This will be used by the following patch implementing on-demand read
mode. Since it's called inside CacheFiles module, make the interface
generic and unrelated to netfs_cache_resources.
It is worth noting that, ki->inval_counter is not initialized after
this cleanup. It shall not make any visible difference, since
inval_counter is no longer used in the write completion routine, i.e.
cachefiles_write_complete().
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch enables idmapped mounts for erofs, since all dedicated helpers
for this functionality existsm, so, in this patch we just pass down the
user_namespace argument from the VFS methods to the relevant helpers.
Simple idmap example on erofs image:
1. mkdir dir
2. touch dir/file
3. mkfs.erofs erofs.img dir
4. mount -t erofs -o loop erofs.img /mnt/erofs/
5. ls -ln /mnt/erofs/
total 0
-rw-rw-r-- 1 1000 1000 0 May 17 15:26 file
6. mount-idmapped --map-mount b:1000:1001:1 /mnt/erofs/ /mnt/scratch_erofs/
7. ls -ln /mnt/scratch_erofs/
total 0
-rw-rw-r-- 1 1001 1001 0 May 17 15:26 file
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Link: https://lore.kernel.org/r/20220517104103.3570721-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Implement export operations in order to make EROFS support accessing
inodes with filehandles so that it can be exported via NFS and used
by overlayfs.
Without this patch, 'exportfs -rv' will report:
exportfs: /root/erofs_mp does not support NFS export
Also tested with unionmount-testsuite and the testcase below passes now:
./run --ov --erofs --verify hard-link
For more details about the testcase, see:
https://github.com/amir73il/unionmount-testsuite/pull/6
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220425040712.91685-1-hongnan.li@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
I got some KASAN report as below:
[ 46.959738] ==================================================================
[ 46.960430] BUG: KASAN: use-after-free in z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] Read of size 4074 at addr ffff8880300c2f8e by task fssum/188
...
[ 46.960430] Call Trace:
[ 46.960430] <TASK>
[ 46.960430] dump_stack_lvl+0x41/0x5e
[ 46.960430] print_report.cold+0xb2/0x6b7
[ 46.960430] ? z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] kasan_report+0x8a/0x140
[ 46.960430] ? z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] kasan_check_range+0x14d/0x1d0
[ 46.960430] memcpy+0x20/0x60
[ 46.960430] z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] z_erofs_decompress_pcluster+0xaae/0x1080
The root cause is that the tail pcluster won't be a complete filesystem
block anymore. So if ztailpacking is used, the second part of an
uncompressed tail pcluster may not be ``rq->pageofs_out``.
Fixes: ab749badf9 ("erofs: support unaligned data decompression")
Fixes: cecf864d3d ("erofs: support inline data decompression")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220512115833.24175-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Some comments haven't been useful anymore since the code updated.
Let's drop them instead.
Link: https://lore.kernel.org/r/20220506194612.117120-2-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The big pcluster feature has been merged for a year, it has been mostly
stable now.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220407050505.12683-1-huyue2@coolpad.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmJkGaQACgkQiiy9cAdy
T1G7Pgv/SHmxsBGIT+YyW+ppWZGIMMpCld3YbovtN2EWdt3NXwvnGrTTFkXe0acJ
+63NjNnCoDTKy+XHN0HhMqLsJSep6J63e5rmiuvUxKQIR0n1WzZl7W1YpHKptZog
wMkObv0UzKgr6gLEwQeMMqvcyxOVmgssAXou4N8p6rDJLFijM2kcVjpB/B9uyUAR
JG0ss8lhX7+YTcRuI0QqyulHlUTiGwu4/XjO9oWs3bF2faADVIfZXOKGMFfRpjCK
YDGdh+HieW5y2SngvstdBuVxmZXLjWjWwe9mQCUaF7khZJ0acuGeQh9BCPNopjUD
0CBYe9JWM2NxTjXqmhspCGUi40+EedZgIMxukRyl7MrBX1wBF0ErSYpGmYSZQqoH
u2R9Pr8tUwuzVfQ6s7VhWACckKFNwI53lHOhY1ikc2M/NWeLy961Wi9JRykac8cZ
ElkJkJUdGttntjwilcKp7NuWmHssDzxNH103WAr3GZrhBDgntUHFBpsLqzQmQW3a
Gp3jkj1e
=Y9T8
-----END PGP SIGNATURE-----
Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
- cap maximum sector size reported to avoid mount problems
- reference count fix
- fix filename rename race
* tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION
ksmbd: increment reference count of parent fp
ksmbd: remove filename in ksmbd_file
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmJjYJUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsZZEADS/dD7pZKxBLcHTCGJAik1/IIv/3ynTrOp
o86uV0AH+nL6lUyBU7uTTQVFn9Hjh6T10ZfRmcU+1Xb8G4obHTrQJkk5evwCNPng
9CUW2fwQa+6H4Ui8TU7f1rLcLlm+AUSVmab6h/20X5ldMwzF1JhcE11qtqZw0ti7
mDPwmxEsx7KMvMy59awA+5IpnXHxe5SvHXuzLsMNwux6dH7VauxE8R+Y+HHVzLWc
fM2dbEU2Hq5nL23DedMw3ZaHwhQTiWdOQA0386iDB6cJdFv19iw+ApD4KS/qAT2X
URQ3pmyNOXvOsBosVL4za7VVCoUlA23ZSMoU82p2K3NK4NGfV7S4oIeO5ZR2BK/C
bIC4c2gutIbYrtdSITBW4z2tj+26BBZS7LaT3Bek/3BL+GjQuM6vK8N4ZhRXPC+l
vWAwXUnWSyXR4+HWpvm3ewlrSY5CQjfsZgU1PIYybhTf/oo2BxX/HQfRk3XKfLIR
89gvITTUrC8B4dgPgLs/MF4Ercmoa2//2yL0onBEwdC2b1lRqD/bGM2FAYolzLzf
W1+BrFj3sRUjexdO7ChrtZvAWo59REAxXdP/3h+NbkIz8sunG2Vpf9NX3cXj30n3
bZL3SkEbtFxpnXspRSQRnmL6DMLPMa7UC+MxpNBAV0g6aMmuSdxqiIZR10+kIEyR
PJyRqbRefg==
=Huq7
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just two small fixes - one fixing a potential leak for the iovec for
larger requests added in this cycle, and one fixing a theoretical leak
with CQE_SKIP and IOPOLL"
* tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block:
io_uring: fix leaks on IOPOLL and CQE_SKIP
io_uring: free iovec if file assignment fails
injection testing. Change ext4's fallocate to update consistently
drop set[ug]id bits when an fallocate operation might possibly change
the user-visible contents of a file. Also, improve handling of
potentially invalid values in the the s_overhead_cluster superblock
field to avoid ext4 returning a negative number of free blocks.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmJinf8ACgkQ8vlZVpUN
gaOHgQf+MKgUZgteYogLzoP3mF1kSycOGawk4wZ3QHOLz7AvsV2p9J8BWihbS/EK
dBydfXbTMvCUrjWmpqb5dHECRzxdfxOJ0SPJtibc8DZaJc9ImNFmgSp9kyJ3uRaN
cPGO6Lz2RXpdumVMPPLwzUJdVyrLi0K6I1NYSocxKgribePzd+xil8S9zRZj8Bpe
RaeH0EytcRj2CI5qs5mI/mOPBAMsZeczd3HInI3gyCgP2I4ZOfsADne3APx57mcI
IGKf77nvIwMHeKel3MGYfFPitEs5cZpHUhHplCMtgFsO8H0IR93tqnlaCvTM7VAZ
Slamgl7pfcXFcLZP+pm0QL/82ub7iw==
=FIds
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix some syzbot-detected bugs, as well as other bugs found by I/O
injection testing.
Change ext4's fallocate to consistently drop set[ug]id bits when an
fallocate operation might possibly change the user-visible contents of
a file.
Also, improve handling of potentially invalid values in the the
s_overhead_cluster superblock field to avoid ext4 returning a negative
number of free blocks"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: fix a potential race while discarding reserved buffers after an abort
ext4: update the cached overhead value in the superblock
ext4: force overhead calculation if the s_overhead_cluster makes no sense
ext4: fix overhead calculation to account for the reserved gdt blocks
ext4, doc: fix incorrect h_reserved size
ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
ext4: fix use-after-free in ext4_search_dir
ext4: fix bug_on in start_this_handle during umount filesystem
ext4: fix symlink file size not match to file content
ext4: fix fallocate to use file_modified to update permissions consistently
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmJi1i4ACgkQiiy9cAdy
T1G9qgv/fY5M1iyz7ddwSbFeX2MPL3225C/LuWf1cW/deUuf83/RyjefaI/8GSjG
DT+0mvpow/vvOWWxS+ZR+O3vp32Q/twO4NhgTU5FQicACU1AmofcY/klVUCm1W9f
oS6q/dlvWaqlqDLajprPOEuh5ZG01GlkmF0lTvo00ze40PlDfpXZEpmoCHtNPUx9
VUFvq+Bxh+3kvWaHk7YKN0ENilSZ9UYy1KIClpQXlXU4rRwT9MORiljwXB+FYw9E
suKsAyUk7ek01TBFUjcRC7OYjlq2t7wcvsOlLLwvMlisgauJZgPHAv1igJF04fVw
jMIE9RwH5EuPHP5JrVj+w3mkk9XXG4xbXREfmNCC4V/V1a3ZefU3r0lj1VBV4HIi
p2FpVvEBU+DkFuO8pwA2xv56ykCXooAftk1xX9Rj27ICEL/OPqiXbEdb46wTT+Nj
cf8L0qumOvV4/IMRre94RJCcwn4IfTW0O4VGCAhk3U+qR1MZzmGtpvjzMVdUOad+
C7jaDuwd
=XdOO
-----END PGP SIGNATURE-----
Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four fixes, two of them for stable:
- fcollapse fix
- reconnect lock fix
- DFS oops fix
- minor cleanup patch"
* tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: destage any unwritten data to the server before calling copychunk_write
cifs: use correct lock type in cifs_reconnect()
cifs: fix NULL ptr dereference in refresh_mounts()
cifs: Use kzalloc instead of kmalloc/memset
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYmF+/wAKCRCRxhvAZXjc
otmDAP47jPBjTS+gdnMy8fP6ymZu2+So3gex8N777x23mTvQ5AEAy9s4tKMb5UA5
rwQa8vRgkUBAyiB9yjloNOdN65X1cAE=
=WLsw
-----END PGP SIGNATURE-----
Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull mount_setattr fix from Christian Brauner:
"The recent cleanup in e257039f0f ("mount_setattr(): clean the
control flow and calling conventions") switched the mount attribute
codepaths from do-while to for loops as they are more idiomatic when
walking mounts.
However, we did originally choose do-while constructs because if we
request a mount or mount tree to be made read-only we need to hold
writers in the following way: The mount attribute code will grab
lock_mount_hash() and then call mnt_hold_writers() which will
_unconditionally_ set MNT_WRITE_HOLD on the mount.
Any callers that need write access have to call mnt_want_write(). They
will immediately see that MNT_WRITE_HOLD is set on the mount and the
caller will then either spin (on non-preempt-rt) or wait on
lock_mount_hash() (on preempt-rt).
The fact that MNT_WRITE_HOLD is set unconditionally means that once
mnt_hold_writers() returns we need to _always_ pair it with
mnt_unhold_writers() in both the failure and success paths.
The do-while constructs did take care of this. But Al's change to a
for loop in the failure path stops on the first mount we failed to
change mount attributes _without_ going into the loop to call
mnt_unhold_writers().
This in turn means that once we failed to make a mount read-only via
mount_setattr() - i.e. there are already writers on that mount - we
will block any writers indefinitely. Fix this by ensuring that the for
loop always unsets MNT_WRITE_HOLD including the first mount we failed
to change to read-only. Also sprinkle a few comments into the cleanup
code to remind people about what is happening including myself. After
all, I didn't catch it during review.
This is only relevant on mainline and was reported by syzbot. Details
about the syzbot reports are all in the commit message"
* tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: unset MNT_WRITE_HOLD on failure
This is a fix for commit f6795053da ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out
of mm/mmap.c into include/linux/sched/mm.h
If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
Catalin (ARM64) said
"We should have fixed hugetlb_get_unmapped_area() as well when we added
support for 52-bit VA. The reason for commit f6795053da was to
prevent normal mmap() from returning addresses above 48-bit by default
as some user-space had hard assumptions about this.
It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
but I doubt anyone would notice. It's more likely that the current
behaviour would cause issues, so I'd rather have them consistent.
Basically when arm64 gained support for 52-bit addresses we did not
want user-space calling mmap() to suddenly get such high addresses,
otherwise we could have inadvertently broken some programs (similar
behaviour to x86 here). Hence we added commit f6795053da. But we
missed hugetlbfs which could still get such high mmap() addresses. So
in theory that's a potential regression that should have bee addressed
at the same time as commit f6795053da (and before arm64 enabled
52-bit addresses)"
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
Fixes: f6795053da ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org> [5.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
because the copychunk_write might cover a region of the file that has not yet
been sent to the server and thus fail.
A simple way to reproduce this is:
truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile
the issue is that the 'pwrite 0k 128k' becomes rearranged on the wire with
the 'fcollapse 16k 24k' due to write-back caching.
fcollapse is implemented in cifs.ko as a SMB2 IOCTL(COPYCHUNK_WRITE) call
and it will fail serverside since the file is still 0b in size serverside
until the writes have been destaged.
To avoid this we must ensure that we destage any unwritten data to the
server before calling COPYCHUNK_WRITE.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1997373
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
TCP_Server_Info::origin_fullpath and TCP_Server_Info::leaf_fullpath
are protected by refpath_lock mutex and not cifs_tcp_ses_lock
spinlock.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fix use-after-free of the on-stack z_erofs_decompressqueue;
- Fix sysfs documentation Sphinx warnings.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYl2A0REceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBEe7AQCh7aNhYEncBnoHvrB276HCP0xmMwmc0gPq
6UeSWgarpAEArgfgLyo8tFgS+gvXbhB8/P1FqEMwaRZVLWathXID3QU=
=GFAS
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"One patch to fix a use-after-free race related to the on-stack
z_erofs_decompressqueue, which happens very rarely but needs to be
fixed properly soon.
The other patch fixes some sysfs Sphinx warnings"
* tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors
erofs: fix use-after-free of on-stack io[]
This reverts commit 5a519c8fe4.
It turns out that making the pipe almost arbitrarily large has some
rather unexpected downsides. The kernel test robot reports a kernel
warning that is due to pipe->max_usage now growing to the point where
the iter_file_splice_write() buffer allocation can no longer be
satisfied as a slab allocation, and the
int nbufs = pipe->max_usage;
struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
GFP_KERNEL);
code sequence there will now always fail as a result.
That code could be modified to use kvcalloc() too, but I feel very
uncomfortable making those kinds of changes for a very niche use case
that really should have other options than make these kinds of
fundamental changes to pipe behavior.
Maybe the CRIU process dumping should be multi-threaded, and use
multiple pipes and multiple cores, rather than try to use one larger
pipe to minimize splice() calls.
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/all/20220420073717.GD16310@xsang-OptiPlex-9020/
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Last cycle we extended the idmapped mounts infrastructure to support
idmapped mounts of idmapped filesystems (No such filesystem yet exist.).
Since then, the meaning of an idmapped mount is a mount whose idmapping
is different from the filesystems idmapping.
While doing that work we missed to adapt the acl translation helpers.
They still assume that checking for the identity mapping is enough. But
they need to use the no_idmapping() helper instead.
Note, POSIX ACLs are always translated right at the userspace-kernel
boundary using the caller's current idmapping and the initial idmapping.
The order depends on whether we're coming from or going to userspace.
The filesystem's idmapping doesn't matter at the border.
Consequently, if a non-idmapped mount is passed we need to make sure to
always pass the initial idmapping as the mount's idmapping and not the
filesystem idmapping. Since it's irrelevant here it would yield invalid
ids and prevent setting acls for filesystems that are mountable in a
userns and support posix acls (tmpfs and fuse).
I verified the regression reported in [1] and verified that this patch
fixes it. A regression test will be added to xfstests in parallel.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
Fixes: bd303368b7 ("fs: support mapped mounts of mapped filesystems")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org> # 5.17
Cc: <regressions@lists.linux.dev>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kzalloc rather than duplicating its implementation, which
makes code simple and easy to understand.
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If all completed requests in io_do_iopoll() were marked with
REQ_F_CQE_SKIP, we'll not only skip CQE posting but also
io_free_batch_list() leaking memory and resources.
Move @nr_events increment before REQ_F_CQE_SKIP check. We'll potentially
return the value greater than the real one, but iopolling will deal with
it and the userspace will re-iopoll if needed. In anyway, I don't think
there are many use cases for REQ_F_CQE_SKIP + IOPOLL.
Fixes: 83a13a4181 ("io_uring: tweak iopoll CQE_SKIP event counting")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5072fc8693fbfd595f89e5d4305bfcfd5d2f0a64.1650186611.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We just return failure in this case, but we need to release the iovec
first. If we're doing IO with more than FAST_IOV segments, then the
iovec is allocated and must be freed.
Reported-by: syzbot+96b43810dfe9c3bb95ed@syzkaller.appspotmail.com
Fixes: 584b0180f0 ("io_uring: move read/write file prep state into actual opcode handler")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: MAINTAINERS, binfmt, and
mm (tmpfs, secretmem, kasan, kfence, pagealloc, zram, compaction,
hugetlb, vmalloc, and kmemleak)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
hugetlb: do not demote poisoned hugetlb pages
mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
mm: fix unexpected zeroed page mapping with zram swap
mm, page_alloc: fix build_zonerefs_node()
mm, kfence: support kmem_dump_obj() for KFENCE objects
kasan: fix hw tags enablement when KUNIT tests are disabled
irq_work: use kasan_record_aux_stack_noalloc() record callstack
mm/secretmem: fix panic when growing a memfd_secret
tmpfs: fix regressions from wider use of ZERO_PAGE
MAINTAINERS: Broadcom internal lists aren't maintainers
Commit 925346c129 ("fs/binfmt_elf: fix PT_LOAD p_align values for
loaders") was an attempt to fix regressions due to 9630f0d60f
("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").
But regressionss continue to be reported:
https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/https://bugzilla.kernel.org/show_bug.cgi?id=215720https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info
This patch reverts the fix, so the original can also be reverted.
Fixes: 925346c129 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmJY2BIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpielEACDqz5oShRkOHGAaqo0jbOy7C9I0hwUXS9Y
QeIH7umrSUBgDXLIq3jbsgK3d0nroaYHTU8aa64XnyB0lGouBTiw+I8FZfxNlW7w
HI+AUZn/m0wvTuKcw//cSX2CP2pcVPfmao+JskU6kyrsoj0nkQNx2NNHXbVQ3cFC
DKlZE7ZulvzfM+xC0aJxIYUWLECzqgZvicn+mqeqVQX9QJ3k/637GTnVu83QSpIC
0Hw/isuGuaK+0nurwc4Rx9ZojItVYyPPt3a+8ImtGaJlhyeg9bHLMJZbBLvrlqjd
AS2iVPaQOhFfxt8qe7ETHpIUmkBQZIckavsCCO7sfFVKtGRNA0kQkAYLjXLDLP8T
1DXn8VQHsGHHcBd2vTZURng32AniOCzQkshLGF8s/yYuoHp7JODDhwu4xO5HC5eN
rD7SNDcW9mYlEmVtCuoeKCrRknHL+x3ZTlAPTXr3DgjtgB7phBUmLqZU7riwy7vs
0oGD/uXf5jT7Ujw2fZNF7LFetQVJkCi92p1+IGBO0hXQShZ5IsCITk95V2d+2cVs
J1r7ZkXiXiWHkHDQQfuKNeVoUXRe10a4+xOeOCMrwuOgKpKIyoN7ofb6Lp4SRE+j
f6PvGXvfwIRz9McJwg5IWCIEVASysn4s17bgnaMKXpRCgf67Z4HMTcBO1PHRORBD
ssBXiVxOhA==
=aTrS
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.18-2022-04-14' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Ensure we check and -EINVAL any use of reserved or struct padding.
Although we generally always do that, it's missed in two spots for
resource updates, one for the ring fd registration from this merge
window, and one for the extended arg. Make sure we have all of them
handled. (Dylan)
- A few fixes for the deferred file assignment (me, Pavel)
- Add a feature flag for the deferred file assignment so apps can tell
we handle it correctly (me)
- Fix a small perf regression with the current file position fix in
this merge window (me)
* tag 'io_uring-5.18-2022-04-14' of git://git.kernel.dk/linux-block:
io_uring: abort file assignment prior to assigning creds
io_uring: fix poll error reporting
io_uring: fix poll file assign deadlock
io_uring: use right issue_flags for splice/tee
io_uring: verify pad field is 0 in io_get_ext_arg
io_uring: verify resv is 0 in ringfd register/unregister
io_uring: verify that resv2 is 0 in io_uring_rsrc_update2
io_uring: move io_uring_rsrc_update2 validation
io_uring: fix assign file locking issue
io_uring: stop using io_wq_work as an fd placeholder
io_uring: move apoll->events cache
io_uring: io_kiocb_update_pos() should not touch file for non -1 offset
io_uring: flag the fact that linked file assignment is sane
If we (re-)calculate the file system overhead amount and it's
different from the on-disk s_overhead_clusters value, update the
on-disk version since this can take potentially quite a while on
bigalloc file systems.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
If the file system does not use bigalloc, calculating the overhead is
cheap, so force the recalculation of the overhead so we don't have to
trust the precalculated overhead in the superblock.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Currently ksmbd is using ->f_bsize from vfs_statfs() as sector size.
If fat/exfat is a local share, ->f_bsize is a cluster size that is too
large to be used as a sector size. Sector sizes larger than 4K cause
problem occurs when mounting an iso file through windows client.
The error message can be obtained using Mount-DiskImage command,
the error is:
"Mount-DiskImage : The sector size of the physical disk on which the
virtual disk resides is not supported."
This patch reports fixed 4KB sector size if ->s_blocksize is bigger
than 4KB.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add missing increment reference count of parent fp in
ksmbd_lookup_fd_inode().
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If the filename is change by underlying rename the server, fp->filename
and real filename can be different. This patch remove the uses of
fp->filename in ksmbd and replace it with d_path().
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The kernel calculation was underestimating the overhead by not taking
into account the reserved gdt blocks. With this change, the overhead
calculated by the kernel matches the overhead calculation in mke2fs.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmJYRzcACgkQiiy9cAdy
T1EECAwAuCUHV174bwd5XQBeP7aouw6upKPF9jbXXJpQy/NKxwFAGpiKaEiGAGjb
dA1M+wdBZfeIQQW7pCuFWuWvtagJChEdnAKUamnX+G+l2iWCxsJz64nisc27u6CR
z/n6YfIT8OPVQLhfU9xgyXmmmDSSJy6YFNgL3Bh+poR/8764HhEkCgeV3P3kD4Vo
uvj9p6M4K1Ddh4v914QgDhLzk1UsR++zHCE9IuXrxNDgzrQ5Kh5VFn52vaEaNz7g
hIr790gALsGNnLrapiaD3L2alrGwr3GO3WT5pqAdCgzTWq5bngBBQYl0qSWnSHs+
OhnFFBtZ1efRX2bp6SeR8r2z5hNj4YiLJCo/cjoJBIU7W4YDOdLo+UVm/sW9xKgA
tbw5lYWNDMpHygkb+/uSnw807y1cz/3Ip9QbN/ESQP6b9hg/woAXkjgiMVDcew0P
nNcQEz0LZHUAzidIAHlByTo17QNQdHOhhZYQuA48dH4KJJrSEWXmiG9n9Qp4Q6cX
6aueD1Lw
=q5N5
-----END PGP SIGNATURE-----
Merge tag '5.18-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- two fixes related to unmount
- symlink overflow fix
- minor netfs fix
- improved tracing for crediting (flow control)
* tag '5.18-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: verify that tcon is valid before dereference in cifs_kill_sb
cifs: potential buffer overflow in handling symlinks
cifs: Split the smb3_add_credits tracepoint
cifs: release cached dentries only if mount is complete
cifs: Check the IOCB_DIRECT flag, not O_DIRECT