Highlights include:
Stable fixes:
- pNFS/flexfiles: Fix infinite looping when the RDMA connection errors out
Bugfixes:
- NFS: fix port value parsing
- SUNRPC: Reinitialise the backchannel request buffers before reuse
- SUNRPC: fix expiry of auth creds
- NFSv4: Fix races in the legacy idmapper upcall
- NFS: O_DIRECT fixes from Jeff Layton
- NFSv4.1: Fix OP_SEQUENCE error handling
- SUNRPC: Fix an RPC/RDMA performance regression
- NFS: Fix case insensitive renames
- NFSv4/pnfs: Fix a use-after-free bug in open
- NFSv4.1: RECLAIM_COMPLETE must handle EACCES
Features:
- NFSv4.1: session trunking enhancements
- NFSv4.2: READ_PLUS performance optimisations
- NFS: relax the rules for rsize/wsize mount options
- NFS: don't unhash dentry during unlink/rename
- SUNRPC: Fail faster on bad verifier
- NFS/SUNRPC: Various tracing improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmLzy24ACgkQZwvnipYK
APJJvhAAnVmv3jeOLjwDm+3X9xBevTq8PXAikWVeJgbSKCdjfqXO1J+XF49MpxXl
N8PQZMqnwWkF3WvhvYobblwGSl6gJ3RnhVuTgdv4jiPl0ZWyS3XngJY0dCTnNgdL
5O9AjoOMtnGVZwN5j8agymA8f2TUcel5mED6sAk10t2zDZY7VxuQqjp0m696jYjF
0PKBmxoC4+6tXtYJS7d2PGRCTjfEUx2BLnnGuLOKsB7X8f63XmxJiu8/AIiY7spr
M/M+BjAF3ok86dT1LjGlkvSNp23H70Wmsv98udfvxIWBm1l4972oCR/CS+kh3/mU
dYM+oQ3JxxgKEN7Fdak+zU/+qma9q5z2rPFpSIT1fMEuaKN/7H2cbiHPi5RnEBLa
AHWilX/lWBIMnZJZd9g3yYcGe6E/pkT6TqW5JY+2510koyfNER4IismAWMx2iYKU
D0WSZOkmEBS/OYZxpTnqGwvS4L1szo9DN3c+yG2KXLifnmVPpjXZO25wahqSuUo3
V6eYUCXRJmVg+IuXGsMNdrjxGYxD12xChoYzx5RlXls2lwHGeZr+iG3aL3+XayHa
I1Kji3500UmfEOUUUr4UiQ428dOdL3QqNzVzdymN8Vh4d7v64LUL0GSseY+10Xrs
xcbR6l/hwjBIo+I1Bi2mmv3W10tKErFy9eBIKzql3D6VHg7ESOo=
=U00h
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- pNFS/flexfiles: Fix infinite looping when the RDMA connection
errors out
Bugfixes:
- NFS: fix port value parsing
- SUNRPC: Reinitialise the backchannel request buffers before reuse
- SUNRPC: fix expiry of auth creds
- NFSv4: Fix races in the legacy idmapper upcall
- NFS: O_DIRECT fixes from Jeff Layton
- NFSv4.1: Fix OP_SEQUENCE error handling
- SUNRPC: Fix an RPC/RDMA performance regression
- NFS: Fix case insensitive renames
- NFSv4/pnfs: Fix a use-after-free bug in open
- NFSv4.1: RECLAIM_COMPLETE must handle EACCES
Features:
- NFSv4.1: session trunking enhancements
- NFSv4.2: READ_PLUS performance optimisations
- NFS: relax the rules for rsize/wsize mount options
- NFS: don't unhash dentry during unlink/rename
- SUNRPC: Fail faster on bad verifier
- NFS/SUNRPC: Various tracing improvements"
* tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
NFS: Improve readpage/writepage tracing
NFS: Improve O_DIRECT tracing
NFS: Improve write error tracing
NFS: don't unhash dentry during unlink/rename
NFSv4/pnfs: Fix a use-after-free bug in open
NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
SUNRPC: Don't reuse bvec on retransmission of the request
SUNRPC: Reinitialise the backchannel request buffers before reuse
NFSv4.1: RECLAIM_COMPLETE must handle EACCES
NFSv4.1 probe offline transports for trunking on session creation
SUNRPC create a function that probes only offline transports
SUNRPC export xprt_iter_rewind function
SUNRPC restructure rpc_clnt_setup_test_and_add_xprt
NFSv4.1 remove xprt from xprt_switch if session trunking test fails
SUNRPC create an rpc function that allows xprt removal from rpc_clnt
SUNRPC enable back offline transports in trunking discovery
SUNRPC create an iterator to list only OFFLINE xprts
NFSv4.1 offline trunkable transports on DESTROY_SESSION
SUNRPC add function to offline remove trunkable transports
SUNRPC expose functions for offline remote xprt functionality
...
- hardware poisoning support for 1GB hugepages, from Naoya Horiguchi
- highmem documentation fixups from Fabio De Francesco
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYvMFaQAKCRDdBJ7gKXxA
jrM7AQCsaVsrRJgVMNqjDvLAsWFEm48s6/smQT4UZ9n/rPQUsAEA0iRpOfbjcxwK
npAml1mp5uP2AkimTVLB6Bokt6N+wAg=
=9Wta
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull remaining MM updates from Andrew Morton:
"Three patch series - two that perform cleanups and one feature:
- hugetlb_vmemmap cleanups from Muchun Song
- hardware poisoning support for 1GB hugepages, from Naoya Horiguchi
- highmem documentation fixups from Fabio De Francesco"
* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
Documentation/mm: add details about kmap_local_page() and preemption
highmem: delete a sentence from kmap_local_page() kdocs
Documentation/mm: rrefer kmap_local_page() and avoid kmap()
Documentation/mm: avoid invalid use of addresses from kmap_local_page()
Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
highmem: specify that kmap_local_page() is callable from interrupts
highmem: remove unneeded spaces in kmap_local_page() kdocs
mm, hwpoison: enable memory error handling on 1GB hugepage
mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
mm, hwpoison: make __page_handle_poison returns int
mm, hwpoison: set PG_hwpoison for busy hugetlb pages
mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
mm, hwpoison, hugetlb: support saving mechanism of raw error pages
mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
mm: hugetlb_vmemmap: replace early_param() with core_param()
mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
...
The goto out calls kfree(value) on an uninitialized pointer. Just
return directly as the other error paths do.
Fixes: 460bbf2990 ("fs/ntfs3: Do not change mode if ntfs_set_ea failed")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Since the function wnd_bits is defined but not called in any file, it is
a useless function, and we delete it in view of the brevity of the code.
Remove some warnings found by running scripts/kernel-doc, which is
caused by using 'make W=1'.
fs/ntfs3/bitmap.c:54:19: warning: unused function 'wnd_bits' [-Wunused-function].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Work on "courteous server", which was introduced in 5.19, continues
apace. This release introduces a more flexible limit on the number
of NFSv4 clients that NFSD allows, now that NFSv4 clients can remain
in courtesy state long after the lease expiration timeout. The
client limit is adjusted based on the physical memory size of the
server.
The NFSD filecache is a cache of files held open by NFSv4 clients or
recently touched by NFSv2 or NFSv3 clients. This cache had some
significant scalability constraints that have been relieved in this
release. Thanks to all who contributed to this work.
A data corruption bug found during the most recent NFS bake-a-thon
that involves NFSv3 and NFSv4 clients writing the same file has been
addressed in this release.
This release includes several improvements in CPU scalability for
NFSv4 operations. In addition, Neil Brown provided patches that
simplify locking during file lookup, creation, rename, and removal
that enables subsequent work on making these operations more
scalable. We expect to see that work materialize in the next
release.
There are also numerous single-patch fixes, clean-ups, and the
usual improvements in observability.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLujF0ACgkQM2qzM29m
f5c14w/+Lsoryo5vdTXMAZBXBNvVdXQmHqLIEotJEVA3sECr+Kad2bF8rFaCWVzS
Gf9KhTetmcDO9O73I5I/UtJ2qFT9B4I6baSGpOzIkjsM/aKeEEQpbpdzPYhKrCEv
bQu3P54js7snH4YV8s0I39nBFOWdnahYXaw7peqE/2GOHxaR2mz88bkrQ+OybCxz
KETqUxA6bKzkOT61S0nHcnQKd8HQzhocMDtrxtANHGsMM167ngI1dw4tUQAtfAUI
s9R+GS6qwiKgwGz1oqhTR6LA/h4DROxPnc7AieuD9FvuAnR3kXw61bGMN5Biwv2T
JZUTBbQvWhNasSV+7qOY9nBu+sHVC6Q7OZ5C9F/KjMyqCioDX0DnbxX9uKP20CDd
EAAMS8n4Tdgd4KRBWdkLXPzizWYAjZQmFIJtcZne1JzGZ4IWRnikgM5qD6n1VviZ
kcPRm5EN3DRHA+Hte4jG0EHIrE/7g5gnf+zr9dWl3uNhZtfTmumCfU16YYmKG8pP
QN4kXBR2w7dAvp8nRaOsY6bBFLDAk/jHbpY8Q4xoUO4tsojfWayCTGVFOrecOjxv
uSn0LhiidC5pLlkcPgwemhysVywDzr+gGXBRJXeUOHfdd05Q2gbFK8OpqDSvJ3dZ
aC/RxFvHc8jaktUcuIjkE6Rsz6AVaAH3EZj84oMZ4hZhyGbEreg=
=PEJ3
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Work on 'courteous server', which was introduced in 5.19, continues
apace. This release introduces a more flexible limit on the number of
NFSv4 clients that NFSD allows, now that NFSv4 clients can remain in
courtesy state long after the lease expiration timeout. The client
limit is adjusted based on the physical memory size of the server.
The NFSD filecache is a cache of files held open by NFSv4 clients or
recently touched by NFSv2 or NFSv3 clients. This cache had some
significant scalability constraints that have been relieved in this
release. Thanks to all who contributed to this work.
A data corruption bug found during the most recent NFS bake-a-thon
that involves NFSv3 and NFSv4 clients writing the same file has been
addressed in this release.
This release includes several improvements in CPU scalability for
NFSv4 operations. In addition, Neil Brown provided patches that
simplify locking during file lookup, creation, rename, and removal
that enables subsequent work on making these operations more scalable.
We expect to see that work materialize in the next release.
There are also numerous single-patch fixes, clean-ups, and the usual
improvements in observability"
* tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (78 commits)
lockd: detect and reject lock arguments that overflow
NFSD: discard fh_locked flag and fh_lock/fh_unlock
NFSD: use (un)lock_inode instead of fh_(un)lock for file operations
NFSD: use explicit lock/unlock for directory ops
NFSD: reduce locking in nfsd_lookup()
NFSD: only call fh_unlock() once in nfsd_link()
NFSD: always drop directory lock in nfsd_unlink()
NFSD: change nfsd_create()/nfsd_symlink() to unlock directory before returning.
NFSD: add posix ACLs to struct nfsd_attrs
NFSD: add security label to struct nfsd_attrs
NFSD: set attributes when creating symlinks
NFSD: introduce struct nfsd_attrs
NFSD: verify the opened dentry after setting a delegation
NFSD: drop fh argument from alloc_init_deleg
NFSD: Move copy offload callback arguments into a separate structure
NFSD: Add nfsd4_send_cb_offload()
NFSD: Remove kmalloc from nfsd4_do_async_copy()
NFSD: Refactor nfsd4_do_copy()
NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2)
NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2)
...
Don't leak request pointers, but use the "device:inode" labelling that
is used by all the other trace points. Furthermore, replace use of page
indexes with an offset, again in order to align behaviour with other
NFS trace points.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a
task") started looking up tasks by PID when deleting a CPU timer.
When a non-leader thread calls execve, it will switch PIDs with the leader
process. Then, as it calls exit_itimers, posix_cpu_timer_del cannot find
the task because the timer still points out to the old PID.
That means that armed timers won't be disarmed, that is, they won't be
removed from the timerqueue_list. exit_itimers will still release their
memory, and when that list is later processed, it leads to a
use-after-free.
Clean up the timers from the de-threaded task before freeing them. This
prevents a reported use-after-free.
Fixes: 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a task")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220809170751.164716-1-cascardo@canonical.com
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLyXZ4ACgkQ+7dXa6fL
C2tJCxAAhH4ufVMk1r8H/lflCulaBw917zgoOO/daO9FjKDtL6hReTN6OgJLLULw
a1ZMsRxl1NJy5duGYRswUR6gS8y4LzTdILvqfvP+9YSoZzou+uTHHuKFJheA6x++
/cRQFBQmx8JTBrEQXSHULa3FfKsU0UCwxCzy7q4o5zJogK6yiLhApxORIB5ZYi7W
k/3iKA0isqKsSEwmRt3D9ekypb3E/8QGaQV7Bng6/1wldFFmL1g5w/ubY8TCsefV
gnNUA3Ops3LsYj9a0vQGJ6lXKIol2nFClvmFM1qvb09u4PaVa2tofXd5FQuy/iC/
z2+ULUtBi7gs+DjsxPBf1cnbeA3BzNu5BfdQFd3QJksKFHcIMGsKnTQehPuVPhdn
p0IGZrf3/sncmXWVZ322w+R0TLiwB0CX9fh4WS5aYw9WHtRg+FGeFCoVLSV774M0
OiRxefg6A4sBMP2XwHICRXWPtqnBj2PYhQ6bj6pCHJg4sGfrYnf/q7vWFI1rJ9JW
258lhyWaaO0SqPMtdYamBQzPQAAsuG72Mrf5k3pFLWIuPPb9Ho3B7fwbtjcMQcbv
q+apsiiPk/aqbjbI8YclOD+XgRQMUaRO2s2+JCAQvoFoWPRLYE6ZaTDdesN0nHJB
PgTFACx/dPtTCpQxIUDldDsj8wk4iPmPZiQ31gVzKr53v2Jvkhc=
=LWMd
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache updates from David Howells:
- Fix a cookie access ref leak if a cookie is invalidated a second time
before the first invalidation is actually processed.
- Add a tracepoint to log cookie lookup failure
* tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: add tracepoint when failing cookie
fscache: don't leak cookie access refs if invalidation is in progress or failed
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLpXWwACgkQ+7dXa6fL
C2tpSg//WX4+1XvsehIZxykJYmQ7XjVjrg9uGIEaslUry2nqEECDeK7LppKLuCoD
1xMBc5GRn4h5YbBCGbMWnr+r6eezcXdwr5Fv+On4fD3l5AWTk5xTaR8v1So3G9bK
nAkWaoSrFapuifWgBkDGy9NhLW/k6d24BAKCqGhOlPcVzi4B95C4gQ5sIlMlH9xw
jUkHB+HSLZj/yBV47sWQUs0uy5fCiOgm3qgWYl2cTHD4qJ052i0ELO30uAv9Oykq
YjgOUryZlCP/vLSWzOeK8bgINjCLetd+2jGT96bV15BouQ4uoHSrgt7ACtp1I3NP
WAPGdzlzXZqaYSrqX2sPP/Tmq0/SipoKCRqK5e2yetmqs7zznKZe0ATwFhjJYp7L
IOULEaOAyrAeT9KQEwqtANXXoUxyEUqz2GKiybXWC8oUFJ0RAKpSf4LRi9cpoaRz
VDmi0F4/ZA+nMDOPPdbSZTtDCu+68Lz2l6Z7ONwIZ1qzyXJB4BC4wb+9rx7UftA9
QuX9TqGUtjlWiPcM0qvdFoPOIA3xOXbZVtlvqygFxzDDXGlRVZdUa3uqMjELKzK1
v1yRZIS00zqWM4lWFyaQNCZHwFhlnixRzkAmvQcioAb/NJL3eA4n3FbL+RBTSrWP
XEtgpI32ROobA3S+69LomYH4AisMgP9XDP04mOQqQfIxg9GFdEI=
=fXqy
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fix AFS refcount handling.
The first patch converts afs to use refcount_t for its refcounts and
the second patch fixes afs_put_call() and afs_put_server() to save the
values they're going to log in the tracepoint before decrementing the
refcount"
* tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix access after dec in put functions
afs: Use refcount_t rather than atomic_t
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYvIYIgAKCRCRxhvAZXjc
omE8AQDAZG2YjNJfMnUUaaWYO3+zTaHlQp7OQkQTXIHfcfViXQD4vPt3Wxh3rrF+
J8fwNcWmXhSNei8HP6cA06QmSajnDQ==
=GF9/
-----END PGP SIGNATURE-----
Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull setgid updates from Christian Brauner:
"This contains the work to move setgid stripping out of individual
filesystems and into the VFS itself.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires
additional privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
- S_ISGID stripping logic is entangled with umask stripping.
For example, if the umask removes the S_IXGRP bit from the file
about to be created then the S_ISGID bit will be kept.
The inode_init_owner() helper is responsible for S_ISGID stripping
and is called before posix_acl_create(). So we can end up with two
different orderings:
1. FS without POSIX ACL support
First strip umask then strip S_ISGID in inode_init_owner().
In other words, if a filesystem doesn't support or enable POSIX
ACLs then umask stripping is done directly in the vfs before
calling into the filesystem:
2. FS with POSIX ACL support
First strip S_ISGID in inode_init_owner() then strip umask in
posix_acl_create().
In other words, if the filesystem does support POSIX ACLs then
unmask stripping may be done in the filesystem itself when
calling posix_acl_create().
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_ISGID
inheritance.
(Note that the commit message of commit 1639a49ccd ("fs: move
S_ISGID stripping into the vfs_*() helpers") gets the ordering
between inode_init_owner() and posix_acl_create() the wrong way
around. I realized this too late.)
- Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security
issue.
Note that mandating the use of inode_init_owner() was proposed as
an alternative solution but that wouldn't fix the ordering issues
and there are examples such as afs where the use of
inode_init_owner() isn't possible.
In any case, we should also try the cleaner and generalized
solution first before resorting to this approach.
- We still have S_ISGID inheritance bugs years after the initial
round of S_ISGID inheritance fixes:
e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes")
01ea173e10 ("xfs: fix up non-directory creation in SGID directories")
fd84bfdddd ("ceph: fix up non-directory creation in SGID directories")
All of this led us to conclude that the current state is too messy.
While we won't be able to make it completely clean as
posix_acl_create() is still a filesystem specific call we can improve
the S_SIGD stripping situation quite a bit by hoisting it out of
inode_init_owner() and into the respective vfs creation operations.
The obvious advantage is that we don't need to rely on individual
filesystems getting S_ISGID stripping right and instead can
standardize the ordering between S_ISGID and umask stripping directly
in the VFS.
A few short implementation notes:
- The stripping logic needs to happen in vfs_*() helpers for the sake
of stacking filesystems such as overlayfs that rely on these
helpers taking care of S_ISGID stripping.
- Security hooks have never seen the mode as it is ultimately seen by
the filesystem because of the ordering issue we mentioned. Nothing
is changed for them. We simply continue to strip the umask before
passing the mode down to the security hooks.
- The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
bpf, tmpfs.
We've audited all callchains as best as we could. More details can
be found in the commit message to 1639a49ccd ("fs: move S_ISGID
stripping into the vfs_*() helpers")"
* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
ceph: rely on vfs for setgid stripping
fs: move S_ISGID stripping into the vfs_*() helpers
fs: Add missing umask strip in vfs_tmpfile
fs: add mode_strip_sgid() helper
It's possible for a request to invalidate a fscache_cookie will come in
while we're already processing an invalidation. If that happens we
currently take an extra access reference that will leak. Only call
__fscache_begin_cookie_access if the FSCACHE_COOKIE_DO_INVALIDATE bit
was previously clear.
Also, ensure that we attempt to clear the bit when the cookie is
"FAILED" and put the reference to avoid an access leak.
Fixes: 85e4ea1049 ("fscache: Fix invalidation/lookup race")
Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmLwixsACgkQiiy9cAdy
T1FT6Av+NY7R2f7/vDRzX/8u1fFXmyQDU2/dl3S0ysqqFEYE5hnHgbFtVP1IYwId
/IAV8R6E/YMm8Nkbkf173EqHI8E9QNUSngTCk1bcvsiNSW4z3QOsujs9B6oTUeXM
ARthU4eFJQW0wcTyjElVcm59I/jdtpQKaoVDQ/uXlCjPMT+0BUHS4mFRJkAx6DrX
7wOO0vJ/WCqO2u0rSvK4unOZsvhrKHibtPMvQSiV6k3a+LVCJ5/5lwBpENKK4Mdw
n4LeNhSDtm6HE6WDFIxSj008WPX7CF3JvX+zvZJEsB7uFNry/GVkWySPLqVYUmtp
aSftdnWnw0Mv9AG3L4OMzyCQQP89ccoV81d6lyyJEBaGiOFhH8L1v6b7qAStCZue
zyyFWIVUXnstQKwDY8QWLKjmi7H+I1drFExxn0vABInPcaijQE8Mct0fSt015Tc1
EuCsO1JRyo+zn0mx7a1L80kHUH+yvHjKkfbINvPSa+qTFy21bgQZcG5CWSPpjadr
4zibpKko
=cDaA
-----END PGP SIGNATURE-----
Merge tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd updates from Steve French:
- fixes for memory access bugs (out of bounds access, oops, leak)
- multichannel fixes
- session disconnect performance improvement, and session register
improvement
- cleanup
* tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix heap-based overflow in set_ntacl_dacl()
ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT
ksmbd: prevent out of bound read for SMB2_WRITE
ksmbd: fix use-after-free bug in smb2_tree_disconect
ksmbd: fix memory leak in smb2_handle_negotiate
ksmbd: fix racy issue while destroying session on multichannel
ksmbd: use wait_event instead of schedule_timeout()
ksmbd: fix kernel oops from idr_remove()
ksmbd: add channel rwlock
ksmbd: replace sessions list in connection with xarray
MAINTAINERS: ksmbd: add entry for documentation
ksmbd: remove unused ksmbd_share_configs_cleanup function
* more new_sync_{read,write}() speedups - ITER_UBUF introduction
* ITER_PIPE cleanups
* unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
* making ITER_PIPE take high-order pages without splitting them
* handling copy_page_from_iter() for high-order pages properly
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYvHI8QAKCRBZ7Krx/gZQ
62CQAPsGlbebqBeAT2pMulaGDxfLAsgz5Yf4BEaMLhPtRqFOQgD+KrZQId7Sd8O0
3IWucpTb2c4jvLlXhGMS+XWnusQH+AQ=
=pBux
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more iov_iter updates from Al Viro:
- more new_sync_{read,write}() speedups - ITER_UBUF introduction
- ITER_PIPE cleanups
- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
- making ITER_PIPE take high-order pages without splitting them
- handling copy_page_from_iter() for high-order pages properly
* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
fix copy_page_from_iter() for compound destinations
hugetlbfs: copy_page_to_iter() can deal with compound pages
copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
expand those iov_iter_advance()...
pipe_get_pages(): switch to append_pipe()
get rid of non-advancing variants
ceph: switch the last caller of iov_iter_get_pages_alloc()
9p: convert to advancing variant of iov_iter_get_pages_alloc()
af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: saner helper for page array allocation
fold __pipe_get_pages() into pipe_get_pages()
ITER_XARRAY: don't open-code DIV_ROUND_UP()
unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
unify xarray_get_pages() and xarray_get_pages_alloc()
unify pipe_get_pages() and pipe_get_pages_alloc()
iov_iter_get_pages(): sanity-check arguments
iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
...
here nothing even looks at the iov_iter after the call, so we couldn't
care less whether it advances or not.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and untangle the cleanup on failure to add into pipe.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of the users immediately follow successful iov_iter_get_pages()
with advancing by the amount it had returned.
Provide inline wrappers doing that, convert trivial open-coded
uses of those.
BTW, iov_iter_get_pages() never returns more than it had been asked
to; such checks in cifs ought to be removed someday...
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother
with rather non-obvious use of iov_iter_advance() in there.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It it inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others since there
is no specific or abbreviated name for it when it is first introduced.
Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now.
This commit also updates the document about "hugetlb_free_vmemmap" by the
way discussed in thread [1].
Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
NFS unlink() (and rename over existing target) must determine if the
file is open, and must perform a "silly rename" instead of an unlink (or
before rename) if it is. Otherwise the client might hold a file open
which has been removed on the server.
Consequently if it determines that the file isn't open, it must block
any subsequent opens until the unlink/rename has been completed on the
server.
This is currently achieved by unhashing the dentry. This forces any
open attempt to the slow-path for lookup which will block on i_rwsem on
the directory until the unlink/rename completes. A future patch will
change the VFS to only get a shared lock on i_rwsem for unlink, so this
will no longer work.
Instead we introduce an explicit interlock. A special value is stored
in dentry->d_fsdata while the unlink/rename is running and
->d_revalidate blocks while that value is present. When ->d_revalidate
unblocks, the dentry will be invalid. This closes the race
without requiring exclusion on i_rwsem.
d_fsdata is already used in two different ways.
1/ an IS_ROOT directory dentry might have a "devname" stored in
d_fsdata. Such a dentry doesn't have a name and so cannot be the
target of unlink or rename. For safety we check if an old devname
is still stored, and remove it if it is.
2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct
nfs_unlinkdata' stored in d_fsdata. While this is set maydelete()
will fail, so an unlink or rename will never proceed on such
a dentry.
Neither of these can be in effect when a dentry is the target of unlink
or rename. So we can expect d_fsdata to be NULL, and store a special
value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate
that any lookup will be blocked.
The d_count() is incremented under d_lock() when a lookup finds the
dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under
the same lock to avoid any races.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In this cycle, we mainly fixed some corner cases that manipulate a per-file
compression flag inappropriately. And, we found f2fs counted valid blocks in a
section incorrectly when zone capacity is set, and thus, fixed it with
additional sysfs entry to check it easily. Lastly, this series includes
several patches with respect to the new atomic write support such as a
couple of bug fixes and re-adding atomic_write_abort support that we removed
by mistake in the previous release.
Enhancement:
- add sysfs entries to understand atomic write operations and zone
capacity
- introduce memory mode to get a hint for low-memory devices
- adjust the waiting time of foreground GC
- decompress clusters under softirq to avoid non-deterministic latency
- do not skip updating inode when retrying to flush node page
- enforce single zone capacity
Bug fix:
- set the compression/no-compression flags correctly
- revive F2FS_IOC_ABORT_VOLATILE_WRITE
- check inline_data during compressed inode conversion
- understand zone capacity when calculating valid block count
As usual, the series includes several minor clean-ups and sanity checks.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmLxOccACgkQQBSofoJI
UNId/g/+Nx3FK874cyobE1PnPpUtfxLGqO9fjhrbje3bniTpgE9NtJUFg5hRQkxE
XHuufMrW++aBhn2ESMjbfdQ3v6vy5XUy7bi4FR71KxW4qp15mAqjTPfAZBFKZfMv
lCv54NKlura91GhI9Dl6JgGe1+MwNXIxVROyGvjXYogF0DWl+iJh4vYuCFUguiNU
mP6FmnZvbtK89jYxODoqwQaC+b6DV7ceaQ+c0dtS5TRvsUNv5mjWDeTvPMgk3At/
mAuWYXfIrf5xfDY93JPbrJhBLvu7Ey3EfXBnaFGRYbYxYYub9JZ4+/5di/rB9jRc
9AZ6LcLX3aKaT71EWa9vdCIffz8/PcSRjsmpEuVs7KNySwcnolnb1tAzlJPKy2AV
IJliY1Ef0+jrpg2lHYZoMb5qvo80c3xlyxlgZt0LSZKf1Wo41sjJVt6ZS7WLhHXu
OlzeI7lZBS9RKPUtU5cGNWkmZqamvmq09mMvqF4IUIaY40MizKZoV0yh9BjuUoxM
xniBIlC/q0HvwmbQ2OtNKDgv7+FdxrRlaDyhhkppa3UA8ZK3Edch26N9pBoh/r33
zJIR2BwCGmHz7yaX4HGzSt1phex2ABIGuZ4vBaGI7XDuYUD1tCZpC8wMCs2X3pKo
ldQz3uu0GA0BSsNKpRks2dwRF0JJVGTk8UwcSXPwTdTTdqyhmvI=
=dJ41
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we mainly fixed some corner cases that manipulate a
per-file compression flag inappropriately. And, we found f2fs counted
valid blocks in a section incorrectly when zone capacity is set, and
thus, fixed it with additional sysfs entry to check it easily.
Lastly, this series includes several patches with respect to the new
atomic write support such as a couple of bug fixes and re-adding
atomic_write_abort support that we removed by mistake in the previous
release.
Enhancements:
- add sysfs entries to understand atomic write operations and zone
capacity
- introduce memory mode to get a hint for low-memory devices
- adjust the waiting time of foreground GC
- decompress clusters under softirq to avoid non-deterministic
latency
- do not skip updating inode when retrying to flush node page
- enforce single zone capacity
Bug fixes:
- set the compression/no-compression flags correctly
- revive F2FS_IOC_ABORT_VOLATILE_WRITE
- check inline_data during compressed inode conversion
- understand zone capacity when calculating valid block count
As usual, the series includes several minor clean-ups and sanity
checks"
* tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
f2fs: use onstack pages instead of pvec
f2fs: intorduce f2fs_all_cluster_page_ready
f2fs: clean up f2fs_abort_atomic_write()
f2fs: handle decompress only post processing in softirq
f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED
f2fs: do not set compression bit if kernel doesn't support
f2fs: remove device type check for direct IO
f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data
f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE
f2fs: fix to do sanity check on segment type in build_sit_entries()
f2fs: obsolete unused MAX_DISCARD_BLOCKS
f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
f2fs: introduce sysfs atomic write statistics
f2fs: don't bother wait_ms by foreground gc
f2fs: invalidate meta pages only for post_read required inode
f2fs: allow compression of files without blocks
f2fs: fix to check inline_data during compressed inode conversion
f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
f2fs: fix to invalidate META_MAPPING before DIO write
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYvD4IQAKCRDh3BK/laaZ
PDHHAP93H+2E9c6biGd5pEaL2ABChRY+wsQURzD+SZ6AL3JaUQD+KHxA4q0MJvws
L6CWcf2XptUDCLe3P6sgSTvv5Gk1OAM=
=FoXF
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fix an issue with reusing the bdi in case of block based filesystems
- Allow root (in init namespace) to access fuse filesystems in user
namespaces if expicitly enabled with a module param
- Misc fixes
* tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: retire block-device-based superblock on force unmount
vfs: function to prevent re-use of block-device-based superblocks
virtio_fs: Modify format for virtio_fs_direct_access
virtiofs: delete unused parameter for virtio_fs_cleanup_vqs
fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other
fuse: Remove the control interface for virtio-fs
fuse: ioctl: translate ENOSYS
fuse: limit nsec
fuse: avoid unnecessary spinlock bump
fuse: fix deadlock between atomic O_TRUNC and page invalidation
fuse: write inode in fuse_release()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYvDeCAAKCRDh3BK/laaZ
PGjCAP9TVuId3X7Akroc9W+qswPzwlW3fwtE6+9F6ABeNJNPZAEAgU2bp95vqZRh
OWP+ptnskceBcX/cRkfxkmgtiNE21wk=
=sucY
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
"Just a small update"
* tag 'ovl-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix spelling mistakes
ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh()
ovl: improve ovl_get_acl() if POSIX ACL support is off
ovl: fix some kernel-doc comments
ovl: warn if trusted xattr creation fails
If something manages to set the maximum file size to MAX_OFFSET+1, this
can cause the xfs and ext4 filesystems at least to become corrupt.
Ordinarily, the kernel protects against userspace trying this by
checking the value early in the truncate() and ftruncate() system calls
calls - but there are at least two places that this check is bypassed:
(1) Cachefiles will round up the EOF of the backing file to DIO block
size so as to allow DIO on the final block - but this might push
the offset negative. It then calls notify_change(), but this
inadvertently bypasses the checking. This can be triggered if
someone puts an 8EiB-1 file on a server for someone else to try and
access by, say, nfs.
(2) ksmbd doesn't check the value it is given in set_end_of_file_info()
and then calls vfs_truncate() directly - which also bypasses the
check.
In both cases, it is potentially possible for a network filesystem to
cause a disk filesystem to be corrupted: cachefiles in the client's
cache filesystem; ksmbd in the server's filesystem.
nfsd is okay as it checks the value, but we can then remove this check
too.
Fix this by adding a check to inode_newsize_ok(), as called from
setattr_prepare(), thereby catching the issue as filesystems set up to
perform the truncate with minimal opportunity for bypassing the new
check.
Fixes: 1f08c925e7 ("cachefiles: Implement backing file wrangling")
Fixes: f441584858 ("cifsd: add file operations")
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: stable@kernel.org
Acked-by: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Steve French <sfrench@samba.org>
cc: Hyunchul Lee <hyc.lee@gmail.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmLvF24ACgkQiiy9cAdy
T1HJjQv/aA6mIAdxCXQEmlp2RqaHldjQg/9MeG5/1si3cP60exLUbt5CY6mHMrOF
xj1mR22/6KR4DuhuwADkzYwhr0itmR98f5BpNjgwUTfXse1IjnNDjc6S4nWJybhv
mmnMjtMHwV3o5/gTC5YS2lA6zYWyhkiYQeMpmNkMJByNgK9dM+vPp97D8XwAm6s6
vFGpypWWCp1NWm6fHGLFO48FnbB2UnmTem36blN8Uz6fqNuBO6nsOLuXe0ovu7vH
5fOpE+6tXq3hD8YLTPRN8Zzl5h0wINVmYS/uXDvDdLJpL7vpCuy7hYqGaPQXH3KN
pXNVg8l1Uy2RhKKQQFoxjIe3y0hNBG83x3AyvarP/kA6ZqOnz+k9r2KvOtz4oWht
HK6FNtA3g+8hIRznYpNc7FM9F0MM5ToDi/sh+imVE4V09tPjLlx3XTnzcWqfs0oT
1q92yR+KxyYQ3OzcyCiQoNU9pcitX9CUcWItpCkk9FNjgasXhwk/yZKQiNfK3qXL
NQIYkz66
=uhLb
-----END PGP SIGNATURE-----
Merge tag '5.20-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"Mostly cleanup, including smb1 refactoring:
- multichannel perf improvement
- move additional SMB1 code to not be compiled in when legacy support
is disabled.
- bug fixes, including one important one for memory leak
- various cleanup patches
We are still working on and testing some deferred close improvements
including an important lease break fix for case when multiple deferred
closes are still open, and also some additional perf improvements -
those are not included here"
* tag '5.20-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module number
cifs: alloc_mid function should be marked as static
cifs: remove "cifs_" prefix from init/destroy mids functions
cifs: remove useless DeleteMidQEntry()
cifs: when insecure legacy is disabled shrink amount of SMB1 code
cifs: trivial style fixup
cifs: fix wrong unlock before return from cifs_tree_connect()
cifs: avoid use of global locks for high contention data
cifs: remove remaining build warnings
cifs: list_for_each() -> list_for_each_entry()
cifs: update MAINTAINERS file with reviewers
smb2: small refactor in smb2_check_message()
cifs: Fix memory leak when using fscache
cifs: remove minor build warning
cifs: remove some camelCase and also some static build warnings
cifs: remove unnecessary (void*) conversions.
cifs: remove unnecessary type castings
cifs: remove redundant initialization to variable mnt_sign_enabled
smb3: check xattr value length earlier
fatfs, autofs, squashfs, procfs, etc.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYu9BeQAKCRDdBJ7gKXxA
jp1DAP4mjCSvAwYzXklrIt+Knv3CEY5oVVdS+pWOAOGiJpldTAD9E5/0NV+VmlD9
kwS/13j38guulSlXRzDLmitbg81zAAI=
=Zfum
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"Updates to various subsystems which I help look after. lib, ocfs2,
fatfs, autofs, squashfs, procfs, etc. A relatively small amount of
material this time"
* tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
scripts/gdb: ensure the absolute path is generated on initial source
MAINTAINERS: kunit: add David Gow as a maintainer of KUnit
mailmap: add linux.dev alias for Brendan Higgins
mailmap: update Kirill's email
profile: setup_profiling_timer() is moslty not implemented
ocfs2: fix a typo in a comment
ocfs2: use the bitmap API to simplify code
ocfs2: remove some useless functions
lib/mpi: fix typo 'the the' in comment
proc: add some (hopefully) insightful comments
bdi: remove enum wb_congested_state
kernel/hung_task: fix address space of proc_dohung_task_timeout_secs
lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t()
squashfs: support reading fragments in readahead call
squashfs: implement readahead
squashfs: always build "file direct" version of page actor
Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
fs/ocfs2: Fix spelling typo in comment
ia64: old_rr4 added under CONFIG_HUGETLB_PAGE
proc: fix test for "vsyscall=xonly" boot option
...
- a couple of fixes
- add a tracepoint for fid refcounting
- some cleanup/followup on fid lookup
- some cleanup around req refcounting
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmLuKqkACgkQq06b7GqY
5nA+RBAAvuA6AjKQSvsNxHsqSFMahwoE3cCPlF/QlmAnnZa1DzmRI3kKTAuuDKkE
Q4c7DjxzDJCWRTlkpNUCGZycHqpu1QQbfoTo43sIqob7W8xlajeemc5Fxqtw5sPM
m+0SzN7vvNJpCr+6pxMXwwGHSZmZOvpFjwj7cUjhpF/V1WO8bNxfCGzQdF0hX1Vn
2HoBbFUOmacL9Z/pF3O/ZG9LwCFuRQsH0EmbFBlJy1WdDtlVHTXlzDaGQ7EGaY4D
17UR6iITsj9ozacLhvk094PLIc3/RHDGLrm3C4Ka3zmUI7BsiYWPDmai3Pu/DNqn
JJ5sZkdrVowxyBbGxw8GpZ4YDJtGsU5XglFPdkw+ZazxhZLNEIstPXg0HTZybrOe
GE+WskWB0qS+RpX0tnYEcX6qOHWm3/63Yq5NG6A9tLQUSFku02jS/bCQSLBrmGWW
Js24IWvFSTvl6XytHeldYhJP618pNUxXRSqgYv96vT/LI3mUrIMN+IVBNPujO6p2
jIYXNoaqLoY3efXKW/WQmp7C/52ZP4ly4fOiz7qtHTQsCcIk8Xo6zwHtm/FkNEqc
sMZdqgLrxKPNBAlT8iEtt//wU2fB7mFt988p0pc+5lAK5t0h67KZJV6vDwTAMObX
wV6Ht+QhOHtJwO779fk8FhZPNRPYZYMutIyFNlRx+4gCpBuqJPc=
=941k
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.20' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- a couple of fixes
- add a tracepoint for fid refcounting
- some cleanup/followup on fid lookup
- some cleanup around req refcounting
* tag '9p-for-5.20' of https://github.com/martinetd/linux:
net/9p: Initialize the iounit field during fid creation
net: 9p: fix refcount leak in p9_read_work() error handling
9p: roll p9_tag_remove into p9_req_put
9p: Add client parameter to p9_req_put()
9p: Drop kref usage
9p: Fix some kernel-doc comments
9p fid refcount: cleanup p9_fid_put calls
9p fid refcount: add a 9p_fid_ref tracepoint
9p fid refcount: add p9_fid_get/put wrappers
9p: Fix minor typo in code comment
9p: Remove unnecessary variable for old fids while walking from d_parent
9p: Make the path walk logic more clear about when cloning is required
9p: Track the root fid with its own variable during lookups
- Instantiate glocks ouside of the glock state engine, in the contect of
the process taking the glock. This moves unnecessary complexity out
of the core glock code. Clean up the instantiate logic to be more
sensible.
- In gfs2_glock_async_wait(), cancel pending locking request upon
failure. Make sure all glocks are left in a consistent state.
- Various other minor cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmLtdg8UHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrqvA//WRdBtVgT7/5pkjljRolkBZ8B3sYx
T2KlHuiQdvnTGf2dWnOOoUzEZvPXPUovUZMA4dHx0jcRpOi4BsYGz986K/Zpq5hs
vieFEoKQdWk9O9NoNdRJN8Rl1tHTwejZi+kLerhYoJzgMC8AvgieLGO0Ol4Y0joc
lxop/8L1Tn2GiCN4NcBN7Eg2CC4ke58KZcMgWhWVBR2ZJe9/qdqlVEiehiSbCiiN
l89vsYLrG6bMylvNPc+AiyEvIGF5qkEHAErPIs7SfrjNRRWVhkmvTCWAO6JnehTQ
XwqYQiAWCXfxBXUYG1VSCgjmTynmO2yg1Slt+86OauI9ka+ow8epSmHh95TT1JcY
pmVF6CYhLI49dNl3R68CFlQ+Ov6iGt6gx9KEud5oE/Ew0vd/WIyi2/jSGrX59S07
zktMzEDjn31+jw31Raxc6+TQEU+0jQHCwzKWjbJ0tYy3nBdkCyefHwm199Ff40M/
6jHWaH/qcyuq8crrc8PLSJOguSd7FdfdFhXEmpaH2CPybvfuEVJfig4vYee3YtSx
KtZvgpy3bxBCfBDD7CPKfKMLrKrklYH+h7/lhCxbuSH0HvyS0ayXhmSvhXgfn+4e
uWY5yk7gHAaaKGOBkkYwFAWV7X32LS0ndWzI8Ac8m20ifV0eeveRNEX0A/fHIX2U
DlbhYq889mc2P70=
=qFus
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.19-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Instantiate glocks ouside of the glock state engine, in the contect
of the process taking the glock. This moves unnecessary complexity
out of the core glock code. Clean up the instantiate logic to be more
sensible.
- In gfs2_glock_async_wait(), cancel pending locking request upon
failure. Make sure all glocks are left in a consistent state.
- Various other minor cleanups and fixes.
* tag 'gfs2-v5.19-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: List traversal in do_promote is safe
gfs2: do_promote glock holder stealing fix
gfs2: Use better variable name
gfs2: Make go_instantiate take a glock
gfs2: Add new go_held glock operation
gfs2: Revert 'Fix "truncate in progress" hang'
gfs2: Instantiate glocks ouside of glock state engine
gfs2: Fix up gfs2_glock_async_wait
gfs2: Minor gfs2_glock_nq_m cleanup
gfs2: Fix spelling mistake in comment
gfs2: Rewrap overlong comment in do_promote
gfs2: Remove redundant NULL check before kfree
On a higly fragmented filesystem a Direct IO write can fail with -ENOSPC error
even though the filesystem has sufficient number of free blocks.
This occurs if the file offset range on which the write operation is being
performed has a delalloc extent in the cow fork and this delalloc extent
begins much before the Direct IO range.
In such a scenario, xfs_reflink_allocate_cow() invokes xfs_bmapi_write() to
allocate the blocks mapped by the delalloc extent. The extent thus allocated
may not cover the beginning of file offset range on which the Direct IO write
was issued. Hence xfs_reflink_allocate_cow() ends up returning -ENOSPC.
The following script reliably recreates the bug described above.
#!/usr/bin/bash
device=/dev/loop0
shortdev=$(basename $device)
mntpnt=/mnt/
file1=${mntpnt}/file1
file2=${mntpnt}/file2
fragmentedfile=${mntpnt}/fragmentedfile
punchprog=/root/repos/xfstests-dev/src/punch-alternating
errortag=/sys/fs/xfs/${shortdev}/errortag/bmap_alloc_minlen_extent
umount $device > /dev/null 2>&1
echo "Create FS"
mkfs.xfs -f -m reflink=1 $device > /dev/null 2>&1
if [[ $? != 0 ]]; then
echo "mkfs failed."
exit 1
fi
echo "Mount FS"
mount $device $mntpnt > /dev/null 2>&1
if [[ $? != 0 ]]; then
echo "mount failed."
exit 1
fi
echo "Create source file"
xfs_io -f -c "pwrite 0 32M" $file1 > /dev/null 2>&1
sync
echo "Create Reflinked file"
xfs_io -f -c "reflink $file1" $file2 &>/dev/null
echo "Set cowextsize"
xfs_io -c "cowextsize 16M" $file1 > /dev/null 2>&1
echo "Fragment FS"
xfs_io -f -c "pwrite 0 64M" $fragmentedfile > /dev/null 2>&1
sync
$punchprog $fragmentedfile
echo "Allocate block sized extent from now onwards"
echo -n 1 > $errortag
echo "Create 16MiB delalloc extent in CoW fork"
xfs_io -c "pwrite 0 4k" $file1 > /dev/null 2>&1
sync
echo "Direct I/O write at offset 12k"
xfs_io -d -c "pwrite 12k 8k" $file1
This commit fixes the bug by invoking xfs_bmapi_write() in a loop until disk
blocks are allocated for atleast the starting file offset of the Direct IO
write range.
Fixes: 3c68d44a2b ("xfs: allocate direct I/O COW blocks in iomap_begin")
Reported-and-Root-caused-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: slight editing to make the locking less grody, and fix some style things]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Every now and then, I see the following hang during mount time
quotacheck when running fstests. Turning on KASAN seems to make it
happen somewhat more frequently. I've edited the backtrace for brevity.
XFS (sdd): Quotacheck needed: Please wait.
XFS: Assertion failed: bp->b_flags & _XBF_DELWRI_Q, file: fs/xfs/xfs_buf.c, line: 2411
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1831409 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 0 PID: 1831409 Comm: mount Tainted: G W 5.19.0-rc6-xfsx #rc6 09911566947b9f737b036b4af85e399e4b9aef64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Code: a0 8f 41 a0 e8 45 fe ff ff 8a 1d 2c 36 10 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 c0 f0 4f a0 e8 10 f0 02 e1 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
RSP: 0018:ffffc900078c7b30 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8880099ac000 RCX: 000000007fffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0418fa0
RBP: ffff8880197bc1c0 R08: 0000000000000000 R09: 000000000000000a
R10: 000000000000000a R11: f000000000000000 R12: ffffc900078c7d20
R13: 00000000fffffff5 R14: ffffc900078c7d20 R15: 0000000000000000
FS: 00007f0449903800(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005610ada631f0 CR3: 0000000014dd8002 CR4: 00000000001706f0
Call Trace:
<TASK>
xfs_buf_delwri_pushbuf+0x150/0x160 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
xfs_qm_flush_one+0xd6/0x130 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
xfs_qm_dquot_walk.isra.0+0x109/0x1e0 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
xfs_qm_quotacheck+0x319/0x490 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
xfs_qm_mount_quotas+0x65/0x2c0 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
xfs_mountfs+0x6b5/0xab0 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
xfs_fs_fill_super+0x781/0x990 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368]
get_tree_bdev+0x175/0x280
vfs_get_tree+0x1a/0x80
path_mount+0x6f5/0xaa0
__x64_sys_mount+0x103/0x140
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
I /think/ this can happen if xfs_qm_flush_one is racing with
xfs_qm_dquot_isolate (i.e. dquot reclaim) when the second function has
taken the dquot flush lock but xfs_qm_dqflush hasn't yet locked the
dquot buffer, let alone queued it to the delwri list. In this case,
flush_one will fail to get the dquot flush lock, but it can lock the
incore buffer, but xfs_buf_delwri_pushbuf will then trip over this
ASSERT, which checks that the buffer isn't on a delwri list. The hang
results because the _delwri_submit_buffers ignores non DELWRI_Q buffers,
which means that xfs_buf_iowait waits forever for an IO that has not yet
been scheduled.
AFAICT, a reasonable solution here is to detect a dquot buffer that is
not on a DELWRI list, drop it, and return -EAGAIN to try the flush
again. It's not /that/ big of a deal if quotacheck writes the dquot
buffer repeatedly before we even set QUOTA_CHKD.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If a blkdev_issue_flush fails, fsync needs to report that to upper
levels. Modify xfs_file_fsync to capture the errors, while trying to
flush as much data and log updates to disk as possible.
If log writes cannot flush the data device, we need to shut down the log
immediately because we've violated a log invariant. Modify this code to
check the return value of blkdev_issue_flush as well.
This behavior seems to go back to about 2.6.15 or so, which makes this
fixes tag a bit misleading.
Link: https://elixir.bootlin.com/linux/v2.6.15/source/fs/xfs/xfs_vnodeops.c#L1187
Fixes: b5071ada51 ("xfs: remove xfs_blkdev_issue_flush")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
Rename generic mid functions to same style, i.e. without "cifs_"
prefix.
cifs_{init,destroy}_mids() -> {init,destroy}_mids()
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
DeleteMidQEntry() was just a proxy for cifs_mid_q_entry_release().
- remove DeleteMidQEntry()
- rename cifs_mid_q_entry_release() to release_mid()
- rename kref_put() callback _cifs_mid_q_entry_release to __release_mid
- rename AllocMidQEntry() to alloc_mid()
- rename cifs_delete_mid() to delete_mid()
Update callers to use new names.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently much of the smb1 code is built even when
CONFIG_CIFS_ALLOW_INSECURE_LEGACY is disabled.
Move cifssmb.c to only be compiled when insecure legacy is disabled,
and move various SMB1/CIFS helper functions to that ifdef. Some
functions that were not SMB1/CIFS specific needed to be moved out of
cifssmb.c
This shrinks cifs.ko by more than 10% which is good - but also will
help with the eventual movement of the legacy code to a distinct
module. Follow on patches can shrink the number of ifdefs by
code restructuring where smb1 code is wedged in functions that
should be calling dialect specific helper functions instead,
and also by moving some functions from file.c/dir.c/inode.c into
smb1 specific c files.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Since pvec have 15 pages, it not a multiple of 4, when write compressed
pages, write in 64K as a unit, it will call pagevec_lookup_range_tag
agagin, sometimes this will take a lot of time.
Use onstack pages instead of pvec to mitigate this problem.
Signed-off-by: Fengnan Chang <fengnanchang@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When write total cluster, all pages is uptodate, there is not need to call
f2fs_prepare_compress_overwrite, intorduce f2fs_all_cluster_page_ready
to avoid this.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_abort_atomic_write() has checked whether current inode is
atomic_write one or not, it's redundant to check in its caller,
remove it for cleanup.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now decompression is being handled in workqueue and it makes read I/O
latency non-deterministic, because of the non-deterministic scheduling
nature of workqueues. So, I made it handled in softirq context only if
possible, not in low memory devices, since this modification will
maintain decompresion related memory a little longer.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a file has FI_COMPRESS_RELEASED, all writes for it should not be
allowed. However, as of now, in case of compress_mode=user, writes
triggered by IOCTLs like F2FS_IOC_DE/COMPRESS_FILE are allowed unexpectly,
which could crash that file.
To fix it, let's do not allow F2FS_IOC_DE/COMPRESS_IOCTL if a file already
has FI_COMPRESS_RELEASED flag.
This is the reproduction process:
1. $ touch ./file
2. $ chattr +c ./file
3. $ dd if=/dev/random of=./file bs=4096 count=30 conv=notrunc
4. $ dd if=/dev/zero of=./file bs=4096 count=34 seek=30 conv=notrunc
5. $ sync
6. $ do_compress ./file ; call F2FS_IOC_COMPRESS_FILE
7. $ get_compr_blocks ./file ; call F2FS_IOC_GET_COMPRESS_BLOCKS
8. $ release ./file ; call F2FS_IOC_RELEASE_COMPRESS_BLOCKS
9. $ do_compress ./file ; call F2FS_IOC_COMPRESS_FILE again
10. $ get_compr_blocks ./file ; call F2FS_IOC_GET_COMPRESS_BLOCKS again
This reproduction process is tested in 128kb cluster size.
You can find compr_blocks has a negative value.
Fixes: 5fdb322ff2 ("f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE")
Signed-off-by: Junbeom Yeom <junbeom.yeom@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Jaewook Kim <jw5454.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If kernel doesn't have CONFIG_F2FS_FS_COMPRESSION, a file having FS_COMPR_FL via
ioctl(FS_IOC_SETFLAGS) is unaccessible due to f2fs_is_compress_backend_ready().
Let's avoid it.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To ensure serialized IOs, f2fs allows only LFS mode for zoned
device. Remove redundant check for direct IO.
Signed-off-by: Eunhee Rho <eunhee83.rho@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is issue as follows when test f2fs atomic write:
F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
F2FS-fs (loop0): invalid crc_offset: 0
F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=1, run fsck to fix.
F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
==================================================================
BUG: KASAN: null-ptr-deref in f2fs_get_dnode_of_data+0xac/0x16d0
Read of size 8 at addr 0000000000000028 by task rep/1990
CPU: 4 PID: 1990 Comm: rep Not tainted 5.19.0-rc6-next-20220715 #266
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0x91
print_report.cold+0x49a/0x6bb
kasan_report+0xa8/0x130
f2fs_get_dnode_of_data+0xac/0x16d0
f2fs_do_write_data_page+0x2a5/0x1030
move_data_page+0x3c5/0xdf0
do_garbage_collect+0x2015/0x36c0
f2fs_gc+0x554/0x1d30
f2fs_balance_fs+0x7f5/0xda0
f2fs_write_single_data_page+0xb66/0xdc0
f2fs_write_cache_pages+0x716/0x1420
f2fs_write_data_pages+0x84f/0x9a0
do_writepages+0x130/0x3a0
filemap_fdatawrite_wbc+0x87/0xa0
file_write_and_wait_range+0x157/0x1c0
f2fs_do_sync_file+0x206/0x12d0
f2fs_sync_file+0x99/0xc0
vfs_fsync_range+0x75/0x140
f2fs_file_write_iter+0xd7b/0x1850
vfs_write+0x645/0x780
ksys_write+0xf1/0x1e0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
As 3db1de0e58 commit changed atomic write way which new a cow_inode for
atomic write file, and also mark cow_inode as FI_ATOMIC_FILE.
When f2fs_do_write_data_page write cow_inode will use cow_inode's cow_inode
which is NULL. Then will trigger null-ptr-deref.
To solve above issue, introduce FI_COW_FILE flag for COW inode.
Fiexes: 3db1de0e582c("f2fs: change the current atomic write way")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS_IOC_ABORT_VOLATILE_WRITE was used to abort a atomic write before.
However it was removed accidentally. So revive it by changing the name,
since volatile write had gone.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Fiexes: 7bc155fec5b3("f2fs: kill volatile write support")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Improve scalability of the XFS log by removing spinlocks and global
synchronization points.
- Add security labels to whiteout inodes to match the other filesystems.
- Clean up per-ag pointer passing to simplify call sites.
- Reduce verifier overhead by precalculating more AG geometry.
- Implement fast-path lockless lookups in the buffer cache to reduce
spinlock hammering.
- Make attr forks a permanent part of the inode structure to fix a UAF
bug and because most files these days tend to have security labels and
soon will have parent pointers too.
- Clean up XFS_IFORK_Q usage and give it a better name.
- Fix more UAF bugs in the xattr code.
- SOB my tags.
- Fix some typos in the timestamp range documentation.
- Fix a few more memory leaks.
- Code cleanups and typo fixes.
- Fix an unlocked inode fork pointer access in getbmap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmLmrLkACgkQ+H93GTRK
tOviexAAo7mJ03hCCWnnkcEYbVQNMH4WRuCpR45D8lz4PU/s6yL7/uxuyodc0dMm
/ZUWjCas1GMZmbOkCkL9eeatrZmgT5SeDbYc4EtHicHYi4sTgCB7ymx0soCUHXYi
7c0kdz+eQ/oY4QvY6JZwbFkRENDL2pkxM9itGHZT0OXHmAnGcIYvzP5Vuc2GtelL
0VWCcpusG0uck3+P1qa8e+TtkR2HU5PVGgAU7OhmAIs07aE3AheVEsPydgGKSIS9
PICnMg1oIgly4VQi28cp/5hU+Au6yBMGogxW8ultPFlM5RWKFt8MKUUhclzS+hZL
9dGSZ3JjpZrdmuUa9mdPnr1MsgrTF6CWHAeUsblSXUzjRT8S3Yz8I3gUMJAA/H17
ZGBu55+TlZtE4ZsK3q/4pqZXfylaaumbEqEi5lJX+7/IYh/WLAgxJihWSpSK2B4a
VBqi12EvMlrjZ4vrD2hqVEJAlguoWiqxgv2gXEZ5wy9dfvzGgysXwAigj0YQeJNQ
J++AYwdYs0pCK0O4eTGZsvp+6o9wj92irtrxwiucuKreDZTOlpCBOAXVTxqom1nX
1NS1YmKvC/RM1na6tiOIundwypgSXUe32qdan34xEWBVPY0mnSpX0N9Lcyoc0xbg
kajAKK9TIy968su/eoBuTQf2AIu1jbWMBNZSg9oELZjfrm0CkWM=
=fNjj
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"The biggest changes for this release are the log scalability
improvements, lockless lookups for the buffer cache, and making the
attr fork a permanent part of the incore inode in preparation for
directory parent pointers.
There's also a bunch of bug fixes that have accumulated since -rc5. I
might send you a second pull request with some more bug fixes that I'm
still working on.
Once the merge window ends, I will hand maintainership back to Dave
Chinner until the 6.1-rc1 release so that I can conduct the design
review for the online fsck feature, and try to get it merged.
Summary:
- Improve scalability of the XFS log by removing spinlocks and global
synchronization points.
- Add security labels to whiteout inodes to match the other
filesystems.
- Clean up per-ag pointer passing to simplify call sites.
- Reduce verifier overhead by precalculating more AG geometry.
- Implement fast-path lockless lookups in the buffer cache to reduce
spinlock hammering.
- Make attr forks a permanent part of the inode structure to fix a
UAF bug and because most files these days tend to have security
labels and soon will have parent pointers too.
- Clean up XFS_IFORK_Q usage and give it a better name.
- Fix more UAF bugs in the xattr code.
- SOB my tags.
- Fix some typos in the timestamp range documentation.
- Fix a few more memory leaks.
- Code cleanups and typo fixes.
- Fix an unlocked inode fork pointer access in getbmap"
* tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits)
xfs: delete extra space and tab in blank line
xfs: fix NULL pointer dereference in xfs_getbmap()
xfs: Fix typo 'the the' in comment
xfs: Fix comment typo
xfs: don't leak memory when attr fork loading fails
xfs: fix for variable set but not used warning
xfs: xfs_buf cache destroy isn't RCU safe
xfs: delete unnecessary NULL checks
xfs: fix comment for start time value of inode with bigtime enabled
xfs: fix use-after-free in xattr node block inactivation
xfs: lockless buffer lookup
xfs: remove a superflous hash lookup when inserting new buffers
xfs: reduce the number of atomic when locking a buffer after lookup
xfs: merge xfs_buf_find() and xfs_buf_get_map()
xfs: break up xfs_buf_find() into individual pieces
xfs: add in-memory iunlink log item
xfs: add log item precommit operation
xfs: combine iunlink inode update functions
xfs: clean up xfs_iunlink_update_inode()
xfs: double link the unlinked inode list
...
superblock and improved the performance of the online resizing of file
systems with bigalloc enabled. Fixed a lot of bugs, in particular for
the inline data feature, potential races when creating and deleting
inodes with shared extended attribute blocks, and the handling
directory blocks which are corrupted.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmLrRpEACgkQ8vlZVpUN
gaN2OAf/a2lxhZ6DvkTfWjni0BG6i2ajd11lT93N+we0wWk9f0VdNR2JUdTum0HL
UtwP48km+FuqkbDrODlbSky5V+IZhd90ihLyPbPKSU/52c7d6IxNOCz2Fxq981j2
Ik6QgdegvCaUDHmluJfYYcS5Pa97HXtSb6VVi1RAvHHFbYDSChObs76ZQWBmhsSh
Mo84mFGS7BDIVNVkg4PBMx4b3iFvKfE1AUdfA5dhB4GXHgDA+77GByw+RjdQ6Dh/
W0l5AVAXbK7BYSVX6Cg41WUMYOBu58Hrh/CHL1DWv3khvjgxLqM7ERAFOISVI3Ax
vCXPXfjpbTFElUQuOw4m33vixaFU+A==
=xTsM
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Add new ioctls to set and get the file system UUID in the ext4
superblock and improved the performance of the online resizing of file
systems with bigalloc enabled.
Fixed a lot of bugs, in particular for the inline data feature,
potential races when creating and deleting inodes with shared extended
attribute blocks, and the handling of directory blocks which are
corrupted"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (37 commits)
ext4: add ioctls to get/set the ext4 superblock uuid
ext4: avoid resizing to a partial cluster size
ext4: reduce computation of overhead during resize
jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted
ext4: block range must be validated before use in ext4_mb_clear_bb()
mbcache: automatically delete entries from cache on freeing
mbcache: Remove mb_cache_entry_delete()
ext2: avoid deleting xattr block that is being reused
ext2: unindent codeblock in ext2_xattr_set()
ext2: factor our freeing of xattr block reference
ext4: fix race when reusing xattr blocks
ext4: unindent codeblock in ext4_xattr_block_set()
ext4: remove EA inode entry from mbcache on inode eviction
mbcache: add functions to delete entry if unused
mbcache: don't reclaim used entries
ext4: make sure ext4_append() always allocates new block
ext4: check if directory block is within i_size
ext4: reflect mb_optimize_scan value in options file
ext4: avoid remove directory when directory is corrupted
ext4: aligned '*' in comments
...
Here is the set of driver core and kernfs changes for 6.0-rc1.
"biggest" thing in here is some scalability improvements for kernfs for
large systems. Other than that, included in here are:
- arch topology and cache info changes that have been reviewed
and discussed a lot.
- potential error path cleanup fixes
- deferred driver probe cleanups
- firmware loader cleanups and tweaks
- documentation updates
- other small things
All of these have been in the linux-next tree for a while with no
reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYuqCnw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ym/JgCcCnaycJY00ZPRQm3LQCyzfJ0HgqoAn2qxGV+K
NKycLeXZSnuvIA87dycE
=/4Jk
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / kernfs updates from Greg KH:
"Here is the set of driver core and kernfs changes for 6.0-rc1.
The "biggest" thing in here is some scalability improvements for
kernfs for large systems. Other than that, included in here are:
- arch topology and cache info changes that have been reviewed and
discussed a lot.
- potential error path cleanup fixes
- deferred driver probe cleanups
- firmware loader cleanups and tweaks
- documentation updates
- other small things
All of these have been in the linux-next tree for a while with no
reported problems"
* tag 'driver-core-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (63 commits)
docs: embargoed-hardware-issues: fix invalid AMD contact email
firmware_loader: Replace kmap() with kmap_local_page()
sysfs docs: ABI: Fix typo in comment
kobject: fix Kconfig.debug "its" grammar
kernfs: Fix typo 'the the' in comment
docs: driver-api: firmware: add driver firmware guidelines. (v3)
arch_topology: Fix cache attributes detection in the CPU hotplug path
ACPI: PPTT: Leave the table mapped for the runtime usage
cacheinfo: Use atomic allocation for percpu cache attributes
drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist
MAINTAINERS: Change mentions of mpm to olivia
docs: ABI: sysfs-devices-soc: Update Lee Jones' email address
docs: ABI: sysfs-class-pwm: Update Lee Jones' email address
Documentation/process: Add embargoed HW contact for LLVM
Revert "kernfs: Change kernfs_notify_list to llist."
ACPI: Remove the unused find_acpi_cpu_cache_topology()
arch_topology: Warn that topology for nested clusters is not supported
arch_topology: Add support for parsing sockets in /cpu-map
arch_topology: Set cluster identifier in each core/thread from /cpu-map
arch_topology: Limit span of cpu_clustergroup_mask()
...
lockd doesn't currently vet the start and length in nlm4 requests like
it should, and can end up generating lock requests with arguments that
overflow when passed to the filesystem.
The NLM4 protocol uses unsigned 64-bit arguments for both start and
length, whereas struct file_lock tracks the start and end as loff_t
values. By the time we get around to calling nlm4svc_retrieve_args,
we've lost the information that would allow us to determine if there was
an overflow.
Start tracking the actual start and len for NLM4 requests in the
nlm_lock. In nlm4svc_retrieve_args, vet these values to ensure they
won't cause an overflow, and return NLM4_FBIG if they do.
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=392
Reported-by: Jan Kasiak <j.kasiak@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # 5.14+
As all inode locking is now fully balanced, fh_put() does not need to
call fh_unlock().
fh_lock() and fh_unlock() are no longer used, so discard them.
These are the only real users of ->fh_locked, so discard that too.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When locking a file to access ACLs and xattrs etc, use explicit locking
with inode_lock() instead of fh_lock(). This means that the calls to
fh_fill_pre/post_attr() are also explicit which improves readability and
allows us to place them only where they are needed. Only the xattr
calls need pre/post information.
When locking a file we don't need I_MUTEX_PARENT as the file is not a
parent of anything, so we can use inode_lock() directly rather than the
inode_lock_nested() call that fh_lock() uses.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When creating or unlinking a name in a directory use explicit
inode_lock_nested() instead of fh_lock(), and explicit calls to
fh_fill_pre_attrs() and fh_fill_post_attrs(). This is already done
for renames, with lock_rename() as the explicit locking.
Also move the 'fill' calls closer to the operation that might change the
attributes. This way they are avoided on some error paths.
For the v2-only code in nfsproc.c, the fill calls are not replaced as
they aren't needed.
Making the locking explicit will simplify proposed future changes to
locking for directories. It also makes it easily visible exactly where
pre/post attributes are used - not all callers of fh_lock() actually
need the pre/post attributes.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd_lookup() takes an exclusive lock on the parent inode, but no
callers want the lock and it may not be needed at all if the
result is in the dcache.
Change nfsd_lookup_dentry() to not take the lock, and call
lookup_one_len_locked() which takes lock only if needed.
nfsd4_open() currently expects the lock to still be held, but that isn't
necessary as nfsd_validate_delegated_dentry() provides required
guarantees without the lock.
NOTE: NFSv4 requires directory changeinfo for OPEN even when a create
wasn't requested and no change happened. Now that nfsd_lookup()
doesn't use fh_lock(), we need to explicitly fill the attributes
when no create happens. A new fh_fill_both_attrs() is provided
for that task.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
On non-error paths, nfsd_link() calls fh_unlock() twice. This is safe
because fh_unlock() records that the unlock has been done and doesn't
repeat it.
However it makes the code a little confusing and interferes with changes
that are planned for directory locking.
So rearrange the code to ensure fh_unlock() is called exactly once if
fh_lock() was called.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Some error paths in nfsd_unlink() allow it to exit without unlocking the
directory. This is not a problem in practice as the directory will be
locked with an fh_put(), but it is untidy and potentially confusing.
This allows us to remove all the fh_unlock() calls that are immediately
after nfsd_unlink() calls.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd_create() usually returns with the directory still locked.
nfsd_symlink() usually returns with it unlocked. This is clumsy.
Until recently nfsd_create() needed to keep the directory locked until
ACLs and security label had been set. These are now set inside
nfsd_create() (in nfsd_setattr()) so this need is gone.
So change nfsd_create() and nfsd_symlink() to always unlock, and remove
any fh_unlock() calls that follow calls to these functions.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
pacl and dpacl pointers are added to struct nfsd_attrs, which requires
that we have an nfsd_attrs_free() function to free them.
Those nfsv4 functions that can set ACLs now set up these pointers
based on the passed in NFSv4 ACL.
nfsd_setattr() sets the acls as appropriate.
Errors are handled as with security labels.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A single change for this cycle to simplify handling of the memory page
used as super block buffer during mount (from Fabio).
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYuoYcQAKCRDdoc3SxdoY
dhWYAQDcDZIzcGFrKTpmmpPI9Ep6rCk64V1cIvTKOik+Hp3/8wEA+btduq8w2Wlo
rrCypMOPBIygJxi9h1kugcP41HJLTwo=
=InBn
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
"A single change for this cycle to simplify handling of the memory page
used as super block buffer during mount (from Fabio)"
* tag 'zonefs-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Call page_address() on page acquired with GFP_KERNEL flag
- Skip writeback for pages that are completely beyond EOF
- Minor code cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmK92MsACgkQ+H93GTRK
tOuw5w//UpB7MqPpZyfqu5nIehVy9zJX8KacZHI3k82ID48fCpa/Q+KpfxHZFbMa
eqSzl2wHEv/bzH6I1UCtq2aA7y6ZhpiR3LiIV2PKEy/vgYtPeVSNhkIzwuK9YYPN
Wj1fLRUi4M4Y86a97vNSzdsdnHLCRnUHY2C35ywRvRCY0opUKSWGF45mD5Mj+0BK
Kl6TfPlyyR0ugBxOf2cJWA/uL/YA5sHT32uViSPxyFpXufyILz8rjxzhRwdSUwFV
TEoyTey4LgoADOA2X5Czxg0CIKfjM2xVnU/ybsdL/q0Mj5U1uzuJPsNY/M4pG0X8
v1/01NSSHVH37Kmr/3L15+doZDQGOLrRoWwStag2tRd6jNRNN7YctnanGoiR6elE
ElU64mDVjc1FJf1J7JF5IObUJMfw0/ulHX5rHYWZlbbO1rnjD6+fVYVQbMW7xsmu
o8fYiZHrkWBo5sYkWgIYhoD8fq3oMkYaryDTEtSS/Jy1tpu2gZpXzaMusKpu9qSe
yc93TTnqqgS9LrSy0gJJqMzR5SavDFZvFhcgIjZXcbonz5uPwYTY7D5lD7vOLqbR
b3oV3+lOQs15nhXs0RO4g1bFIIe9dqAZRlPJX1vYvgjlA3VSQWLcqtlxgpGYptf6
XT/yftI/+sVfTQSC6QAKmp7DTEtYLRKfpNSD9L52TKBvYigQPxo=
=aDOK
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The most notable change in this first batch is that we no longer
schedule pages beyond i_size for writeback, preferring instead to let
truncate deal with those pages.
Next week, there may be a second pull request to remove
iomap_writepage from the other two filesystems (gfs2/zonefs) that use
iomap for buffered IO. This follows in the same vein as the recent
removal of writepage from XFS, since it hasn't been triggered in a few
years; it does nothing during direct reclaim; and as far as the people
who examined the patchset can tell, it's moving the codebase in the
right direction.
However, as it was a late addition to for-next, I'm holding off on
that section for another week of testing to see if anyone can come up
with a solid reason for holding off in the meantime.
Summary:
- Skip writeback for pages that are completely beyond EOF
- Minor code cleanups"
* tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
dax: set did_zero to true when zeroing successfully
iomap: set did_zero to true when zeroing successfully
iomap: skip pages past eof in iomap_do_writepage()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLoE6kACgkQxWXV+ddt
WDvdNw/7BunZ/ZnUUADehHkXvS9PcJjgt0pV5hC73H5nujnhi3aO8frpgBbVbl5a
Pq6UHLRPUMB2YUfm+HCmUyKtZ1zLeiEPSHe9VuRNzylj6jN6K4ebylsGGOpGpdfc
mwN1oJ2KkUKsXLyYXRd/cWr9oiYcTdeznSG4vbPXoCI52aZOfkqkGhgsG/sY4Lyz
fUk9MoaClTD5+eisuF4XWmRgizvaaVrXB5MIaDgODIOQ2YHBJ1Ahdsz2+FsyDQ9S
1Haz7DqMPkImZSlNfwogn30RuXQOAk27rXex34ZkNHwpCvi4TEyxnU67NXv5vlwG
xBl+QHbMHZaJJYZd7gxJHuRo///B42A1RdaKB9uoFri6C/IlNQXt8xnKapL7yo5Q
mCIwpelhUKzDNsrivw5p0Nttz9b7ugwvpHush3E2DHaaMCIi1dJ6Kkq0YDn+iI9f
xLUNU/Bc25A3vdL/IS606EZ4apJObAr3ZflB4XV+8yKmca4dxB2Xoh2NPJsJdpHJ
1riua8WBxZms8PZyS6CSpIE4Y8XSi87YdOq8iavRU8uhiiIvkAPLYdqOPWB4DUBm
mq58GoJN46XFd7eo/V7aanCb459+Xs9CnGnosFA28WWAaM8TqGalmZZthzEUBDXI
P3kHwe8ujIF6w6EPB6/KZ7PqTa5bAi2fBjnDy/QbXTe6nZpJbM4=
=o20T
-----END PGP SIGNATURE-----
Merge tag 'affs-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull affs fix from David Sterba:
"One update to AFFS, switching away from the kmap/kmap_atomic API"
* tag 'affs-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
affs: use memcpy_to_page and remove replace kmap_atomic()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLnyNUACgkQxWXV+ddt
WDt9vA/9HcF+v5EkknyW07tatTap/Hm/ZB86Z5OZi6ikwIEcHsWhp3rUICejm88e
GecDPIluDtCtyD6x4stuqkwOm22aDP5q2T9H6+gyw92ozyb436OV1Z8IrmftzXKY
EpZO70PHZT+E6E/WYvyoTmmoCrjib7YlqCWZZhSLUFpsqqlOInmHEH49PW6KvM4r
acUZ/RxHurKdmI3kNY6ECbAQl6CASvtTdYcVCx8fT2zN0azoLIQxpYa7n/9ca1R6
8WnYilCbLbNGtcUXvO2M3tMZ4/5kvxrwQsUn93ccCJYuiN0ASiDXbLZ2g4LZ+n56
JGu+y5v5oBwjpVf+46cuvnENP5BQ61594WPseiVjrqODWnPjN28XkcVC0XmPsiiZ
lszeHO2cuIrIFoCah8ELMl8usu8+qxfXmPxIXtPu9rEyKsDtOjxVYc8SMXqLp0qQ
qYtBoFm0JcZHqtZRpB+dhQ37/xXtH4ljUi/mI6x8iALVujeR273URs7yO9zgIdeW
uZoFtbwpHFLUk+TL7Ku82/zOXp3fCwtDpNmlYbxeMbea/be3ShjncM4+mYzvHYri
dYON2LFrq+mnRDqtIXTCaAYwX7zU8Y18Ev9QwlNll8dKlKwS89+jpqLoa+eVYy3c
/HitHFza70KxmOj4dvDVZlzDpPvl7kW1UBkmskg4u3jnNWzedkM=
=sS1q
-----END PGP SIGNATURE-----
Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings some long awaited changes, the send protocol bump,
otherwise lots of small improvements and fixes. The main core part is
reworking bio handling, cleaning up the submission and endio and
improving error handling.
There are some changes outside of btrfs adding helpers or updating
API, listed at the end of the changelog.
Features:
- sysfs:
- export chunk size, in debug mode add tunable for setting its size
- show zoned among features (was only in debug mode)
- show commit stats (number, last/max/total duration)
- send protocol updated to 2
- new commands:
- ability write larger data chunks than 64K
- send raw compressed extents (uses the encoded data ioctls),
ie. no decompression on send side, no compression needed on
receive side if supported
- send 'otime' (inode creation time) among other timestamps
- send file attributes (a.k.a file flags and xflags)
- this is first version bump, backward compatibility on send and
receive side is provided
- there are still some known and wanted commands that will be
implemented in the near future, another version bump will be
needed, however we want to minimize that to avoid causing
usability issues
- print checksum type and implementation at mount time
- don't print some messages at mount (mentioned as people asked about
it), we want to print messages namely for new features so let's
make some space for that
- big metadata - this has been supported for a long time and is
not a feature that's worth mentioning
- skinny metadata - same reason, set by default by mkfs
Performance improvements:
- reduced amount of reserved metadata for delayed items
- when inserted items can be batched into one leaf
- when deleting batched directory index items
- when deleting delayed items used for deletion
- overall improved count of files/sec, decreased subvolume lock
contention
- metadata item access bounds checker micro-optimized, with a few
percent of improved runtime for metadata-heavy operations
- increase direct io limit for read to 256 sectors, improved
throughput by 3x on sample workload
Notable fixes:
- raid56
- reduce parity writes, skip sectors of stripe when there are no
data updates
- restore reading from on-disk data instead of using stripe cache,
this reduces chances to damage correct data due to RMW cycle
- refuse to replay log with unknown incompat read-only feature bit
set
- zoned
- fix page locking when COW fails in the middle of allocation
- improved tracking of active zones, ZNS drives may limit the
number and there are ENOSPC errors due to that limit and not
actual lack of space
- adjust maximum extent size for zone append so it does not cause
late ENOSPC due to underreservation
- mirror reading error messages show the mirror number
- don't fallback to buffered IO for NOWAIT direct IO writes, we don't
have the NOWAIT semantics for buffered io yet
- send, fix sending link commands for existing file paths when there
are deleted and created hardlinks for same files
- repair all mirrors for profiles with more than 1 copy (raid1c34)
- fix repair of compressed extents, unify where error detection and
repair happen
Core changes:
- bio completion cleanups
- don't double defer compression bios
- simplify endio workqueues
- add more data to btrfs_bio to avoid allocation for read requests
- rework bio error handling so it's same what block layer does,
the submission works and errors are consumed in endio
- when asynchronous bio offload fails fall back to synchronous
checksum calculation to avoid errors under writeback or memory
pressure
- new trace points
- raid56 events
- ordered extent operations
- super block log_root_transid deprecated (never used)
- mixed_backref and big_metadata sysfs feature files removed, they've
been default for sufficiently long time, there are no known users
and mixed_backref could be confused with mixed_groups
Non-btrfs changes, API updates:
- minor highmem API update to cover const arguments
- switch all kmap/kmap_atomic to kmap_local
- remove redundant flush_dcache_page()
- address_space_operations::writepage callback removed
- add bdev_max_segments() helper"
* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
btrfs: fix repair of compressed extents
btrfs: remove the start argument to check_data_csum and export
btrfs: pass a btrfs_bio to btrfs_repair_one_sector
btrfs: simplify the pending I/O counting in struct compressed_bio
btrfs: repair all known bad mirrors
btrfs: merge btrfs_dev_stat_print_on_error with its only caller
btrfs: join running log transaction when logging new name
btrfs: simplify error handling in btrfs_lookup_dentry
btrfs: send: always use the rbtree based inode ref management infrastructure
btrfs: send: fix sending link commands for existing file paths
btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
btrfs: zoned: wait until zone is finished when allocation didn't progress
btrfs: zoned: write out partially allocated region
btrfs: zoned: activate necessary block group
btrfs: zoned: activate metadata block group on flush_space
btrfs: zoned: disable metadata overcommit for zoned
btrfs: zoned: introduce space_info->active_total_bytes
btrfs: zoned: finish least available block group on data bg allocation
btrfs: let can_allocate_chunk return error
...
Remove the obsolete 'efivars' sysfs based interface to the EFI variable
store, now that all users have moved to the efivarfs pseudo file system,
which was created ~10 years ago to address some fundamental shortcomings
in the sysfs based driver.
Move the 'business logic' related to which EFI variables are important
and may affect the boot flow from the efivars support layer into the
efivarfs pseudo file system, so it is no longer exposed to other parts
of the kernel.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmLhuYIACgkQw08iOZLZ
jyRYYgwAwUvbFtBL+uOTB+jynOq1MkM1xrNSeEgzPyLhh7CXSa/e+48XqpQjCUxY
8HNqDvd4toiCeVKp25AO+FPrQBjT7NdyjnwGMP3DLkGCOSIrVqVJ/2OmbOY52Kzy
z1qbF2+7ak1TUsGzMnJHVDB+baGnUlZ39DuR6IAWstM9tkH/QEnRJV+ejS4h7jdN
XW42/M96mAYFfT4Q4gs1HJMPWLP22xoEbnb9SkOyFCB2tHDjrY94CqH8L7Ee1DgA
7sc2cAZ2mPlS8Rbs8NY8IA/LpN4tRu2EhbIxtmKMRgXA88aTKcsub9Tq9RXxzE/x
BpkH77mGbSBSrntg0gskKCrYGyJaGS9EYnHVy5HPU8hJagK295NmO3c/trHTb90Z
1RZClYsIUbNXe06e9rdk8w8Ozn+7ABstxzLCvj2MThtTBnAbjU0kQ9VeD0xwMvOp
EaF6D6tbEmHrZkGtKP2WEyOZm0tF2I3OP1U9LzjyTpvsIOINhqfVozlhTB6pU5YE
kW9E59i+
=VVQ6
-----END PGP SIGNATURE-----
Merge tag 'efi-efivars-removal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull efivars sysfs interface removal from Ard Biesheuvel:
"Remove the obsolete 'efivars' sysfs based interface to the EFI
variable store, now that all users have moved to the efivarfs pseudo
file system, which was created ~10 years ago to address some
fundamental shortcomings in the sysfs based driver.
Move the 'business logic' related to which EFI variables are important
and may affect the boot flow from the efivars support layer into the
efivarfs pseudo file system, so it is no longer exposed to other parts
of the kernel"
* tag 'efi-efivars-removal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: vars: Move efivar caching layer into efivarfs
efi: vars: Switch to new wrapper layer
efi: vars: Remove deprecated 'efivars' sysfs interface
- Enable mirrored memory for arm64
- Fix up several abuses of the efivar API
- Refactor the efivar API in preparation for moving the 'business logic'
part of it into efivarfs
- Enable ACPI PRM on arm64
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmLhuDIACgkQw08iOZLZ
jyS9IQv/Wc2nhjN50S3gfrL+68/el/hGdP/J0FK5BOOjNosG2t1ZNYZtSthXqpPH
hRrTU2m6PpQUalRpFDyLiHkJvdBFQe4VmvrzBa3TIBIzyflLQPJzkWrqThPchV+B
qi4lmCtTDNIEJmayewqx1wWA+QmUiyI5zJ8wrZp84LTctBPL75seVv0SB20nqai0
3/I73omB2RLVGpCpeWvb++vePXL8euFW3FEwCTM8hRboICjORTyIZPy8Y5os+3xT
UgrIgVDOtn1Xwd4tK0qVwjOVA51east4Fcn3yGOrL40t+3SFm2jdpAJOO3UvyNPl
vkbtjvXsIjt3/oxreKxXHLbamKyueWIfZRyCLsrg6wrr96oypPk6ID4iDCQoen/X
Zf0VjM2vmvSd4YgnEIblOfSBxVg48cHJA4iVHVxFodNTrVnzGGFYPTmNKmJqo+Xn
JeUILM7jlR4h/t0+cTTK3Busu24annTuuz5L5rjf4bUm6pPf4crb1yJaFWtGhlpa
er233D6O
=zI0R
-----END PGP SIGNATURE-----
Merge tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
- Enable mirrored memory for arm64
- Fix up several abuses of the efivar API
- Refactor the efivar API in preparation for moving the 'business
logic' part of it into efivarfs
- Enable ACPI PRM on arm64
* tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits)
ACPI: Move PRM config option under the main ACPI config
ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64
ACPI: PRM: Change handler_addr type to void pointer
efi: Simplify arch_efi_call_virt() macro
drivers: fix typo in firmware/efi/memmap.c
efi: vars: Drop __efivar_entry_iter() helper which is no longer used
efi: vars: Use locking version to iterate over efivars linked lists
efi: pstore: Omit efivars caching EFI varstore access layer
efi: vars: Add thin wrapper around EFI get/set variable interface
efi: vars: Don't drop lock in the middle of efivar_init()
pstore: Add priv field to pstore_record for backend specific use
Input: applespi - avoid efivars API and invoke EFI services directly
selftests/kexec: remove broken EFI_VARS secure boot fallback check
brcmfmac: Switch to appropriate helper to load EFI variable contents
iwlwifi: Switch to proper EFI variable store interface
media: atomisp_gmin_platform: stop abusing efivar API
efi: efibc: avoid efivar API for setting variables
efi: avoid efivars layer when loading SSDTs from variables
efi: Correct comment on efi_memmap_alloc
memblock: Disable mirror feature if kernelcore is not specified
...
One of the goals is to reduce the overhead of using ->read_iter()
and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
has a surprising amount of overhead, in particular inside iocb_flags().
That's why the beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
=9UCV
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()
sure preemption is disabled in start_dir_add()/ end_dir_add() sections (on
non-RT it's automatic, on RT it needs to to be done explicitly) and moving
wakeups from __d_lookup_done() inside of such to the end of those sections.
Wakeups can be safely delayed for as long as ->d_lock on in-lookup
dentry is held; proving that has caught a bug in d_add_ci() that allows
memory corruption when sufficiently bogus ntfs (or case-insensitive xfs)
image is mounted. Easily fixed, fortunately.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurAlAAKCRBZ7Krx/gZQ
6x0mAP9JI80PC/lkYLda+AJ7NmweorBDwrOxzB34biXtyhYDDQEAvdrV07LUkETM
FDN0+jgSpUikcs/kz5NxVBPRRN+RRAY=
=qpTA
-----END PGP SIGNATURE-----
Merge tag 'pull-work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs dcache updates from Al Viro:
"The main part here is making parallel lookups safe for RT - making
sure preemption is disabled in start_dir_add()/ end_dir_add() sections
(on non-RT it's automatic, on RT it needs to to be done explicitly)
and moving wakeups from __d_lookup_done() inside of such to the end of
those sections.
Wakeups can be safely delayed for as long as ->d_lock on in-lookup
dentry is held; proving that has caught a bug in d_add_ci() that
allows memory corruption when sufficiently bogus ntfs (or
case-insensitive xfs) image is mounted. Easily fixed, fortunately"
* tag 'pull-work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/dcache: Move wakeup out of i_seq_dir write held region.
fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
d_add_ci(): make sure we don't miss d_lookup_done()
magical no_llseek thing and makes checks consistent. In particular,
ad-hoc "can we do splice via internal pipe" checks got saner (and
somewhat more permissive, which is what Jason had been after, AFAICT)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYug2xgAKCRBZ7Krx/gZQ
6wxWAQDqeg+xMq2FGPXmgjCa+Cp3PXH96Lp6f3hHzakIDx+t8gEAxvuiXAD22Mct
6S1SKuGj0iDIuM4L7hUiWTiY/bDXSAc=
=3EC/
-----END PGP SIGNATURE-----
Merge tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs lseek updates from Al Viro:
"Jason's lseek series.
Saner handling of 'lseek should fail with ESPIPE' - this gets rid of
the magical no_llseek thing and makes checks consistent.
In particular, the ad-hoc "can we do splice via internal pipe" checks
got saner (and somewhat more permissive, which is what Jason had been
after, AFAICT)"
* tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: remove no_llseek
fs: check FMODE_LSEEK to control internal pipe splicing
vfio: do not set FMODE_LSEEK flag
dma-buf: remove useless FMODE_LSEEK flag
fs: do not compare against ->llseek
fs: clear or set FMODE_LSEEK based on llseek function
the next dentry in nameidata simplifies life considerably,
especially if we delay fetching ->d_inode until step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYug0NQAKCRBZ7Krx/gZQ
66TIAQDziGCTjwvyzm+E/HDcWNl2LjZ4kbsuL3yv2DTW9jExyAD/UuptUgy2RKT1
RfJET/hnPhlIbWxbjhYLRtIj8QpcUAY=
=II7C
-----END PGP SIGNATURE-----
Merge tag 'pull-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs namei updates from Al Viro:
"RCU pathwalk cleanups.
Storing sampled ->d_seq of the next dentry in nameidata simplifies
life considerably, especially if we delay fetching ->d_inode until
step_into()"
* tag 'pull-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
step_into(): move fetching ->d_inode past handle_mounts()
lookup_fast(): don't bother with inode
follow_dotdot{,_rcu}(): don't bother with inode
step_into(): lose inode argument
namei: stash the sampled ->d_seq into nameidata
namei: move clearing LOOKUP_RCU towards rcu_read_unlock()
switch try_to_unlazy_next() to __legitimize_mnt()
follow_dotdot{,_rcu}(): change calling conventions
namei: get rid of pointless unlikely(read_seqcount_retry(...))
__follow_mount_rcu(): verify that mount_lock remains unchanged
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into their
own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
/58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
=DgUi
-----END PGP SIGNATURE-----
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
Function ni_ins_new_attr now returns ERR_PTR(err),
so we check it now in other functions like ni_expand_mft_list
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Added comments to code
Added new function run_clone to make a copy of run
Added done and undo labels for restoring after errors
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
There are repetitive steps in case of bad inode
This commit wraps them in function
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Added some comments in frecord.c for more context.
Also changed run_lookup to static because it's an internal function.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In some cases xattr is too fragmented,
so we need to load it before writing.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In some cases we need to return ENOSPC
Fixes xfstest generic/213
Fixes: 114346978c ("fs/ntfs3: Check new size for limits")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
ntfs_set_ea can fail with NOSPC, so we don't need to
change mode in this situation.
Fixes xfstest generic/449
Fixes: be71b5cba2 ("fs/ntfs3: Add attrib operations")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
This function always returns 0, and ignores the nofail boolean. Drop the
nofail argument, make the function void return and fix up the callers.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This fixes a race between changing the ext4 superblock uuid and operations
like mounting, resizing, changing features, etc.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jeremy Bongio <bongiojp@gmail.com>
Link: https://lore.kernel.org/r/20220721224422.438351-1-bongiojp@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch avoids an attempt to resize the filesystem to an
unaligned cluster boundary. An online resize to a size that is not
integral to cluster size results in the last iteration attempting to
grow the fs by a negative amount, which trips a BUG_ON and leaves the fs
with a corrupted in-memory superblock.
Signed-off-by: Oleg Kiselev <okiselev@amazon.com>
Link: https://lore.kernel.org/r/0E92A0AB-4F16-4F1A-94B7-702CC6504FDE@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch avoids doing an O(n**2)-complexity walk through every flex group.
Instead, it uses the already computed overhead information for the newly
allocated space, and simply adds it to the previously calculated
overhead stored in the superblock. This drastically reduces the time
taken to resize very large bigalloc filesystems (from 3+ hours for a
64TB fs down to milliseconds).
Signed-off-by: Oleg Kiselev <okiselev@amazon.com>
Link: https://lore.kernel.org/r/CE4F359F-4779-45E6-B6A9-8D67FDFF5AE2@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Block range to free is validated in ext4_free_blocks() using
ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb().
However in some situations on bigalloc file system the range might be
adjusted after the validation in ext4_free_blocks() which can lead to
troubles on corrupted file systems such as one found by syzkaller that
resulted in the following BUG
kernel BUG at fs/ext4/ext4.h:3319!
PREEMPT SMP NOPTI
CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
RIP: 0010:ext4_free_blocks+0x95e/0xa90
Call Trace:
<TASK>
? lock_timer_base+0x61/0x80
? __es_remove_extent+0x5a/0x760
? __mod_timer+0x256/0x380
? ext4_ind_truncate_ensure_credits+0x90/0x220
ext4_clear_blocks+0x107/0x1b0
ext4_free_data+0x15b/0x170
ext4_ind_truncate+0x214/0x2c0
? _raw_spin_unlock+0x15/0x30
? ext4_discard_preallocations+0x15a/0x410
? ext4_journal_check_start+0xe/0x90
? __ext4_journal_start_sb+0x2f/0x110
ext4_truncate+0x1b5/0x460
? __ext4_journal_start_sb+0x2f/0x110
ext4_evict_inode+0x2b4/0x6f0
evict+0xd0/0x1d0
ext4_enable_quotas+0x11f/0x1f0
ext4_orphan_cleanup+0x3de/0x430
? proc_create_seq_private+0x43/0x50
ext4_fill_super+0x295f/0x3ae0
? snprintf+0x39/0x40
? sget_fc+0x19c/0x330
? ext4_reconfigure+0x850/0x850
get_tree_bdev+0x16d/0x260
vfs_get_tree+0x25/0xb0
path_mount+0x431/0xa70
__x64_sys_mount+0xe2/0x120
do_syscall_64+0x5b/0x80
? do_user_addr_fault+0x1e2/0x670
? exc_page_fault+0x70/0x170
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fdf4e512ace
Fix it by making sure that the block range is properly validated before
used every time it changes in ext4_free_blocks() or ext4_mb_clear_bb().
Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c
Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use the fact that entries with elevated refcount are not removed from
the hash and just move removal of the entry from the hash to the entry
freeing time. When doing this we also change the generic code to hold
one reference to the cache entry, not two of them, which makes code
somewhat more obvious.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently when we decide to reuse xattr block we detect the case when
the last reference to xattr block is being dropped at the same time and
cancel the reuse attempt. Convert ext2 to a new scheme when as soon as
matching mbcache entry is found, we wait with dropping the last xattr
block reference until mbcache entry reference is dropped (meaning either
the xattr block reference is increased or we decided not to reuse the
block).
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Replace one else in ext2_xattr_set() with a goto. This makes following
code changes simpler to follow. No functional changes.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220712105436.32204-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Free of xattr block reference is opencode in two places. Factor it out
into a separate function and use it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4_xattr_block_set() decides to remove xattr block the following
race can happen:
CPU1 CPU2
ext4_xattr_block_set() ext4_xattr_release_block()
new_bh = ext4_xattr_block_cache_find()
lock_buffer(bh);
ref = le32_to_cpu(BHDR(bh)->h_refcount);
if (ref == 1) {
...
mb_cache_entry_delete();
unlock_buffer(bh);
ext4_free_blocks();
...
ext4_forget(..., bh, ...);
jbd2_journal_revoke(..., bh);
ext4_journal_get_write_access(..., new_bh, ...)
do_get_write_access()
jbd2_journal_cancel_revoke(..., new_bh);
Later the code in ext4_xattr_block_set() finds out the block got freed
and cancels reusal of the block but the revoke stays canceled and so in
case of block reuse and journal replay the filesystem can get corrupted.
If the race works out slightly differently, we can also hit assertions
in the jbd2 code.
Fix the problem by making sure that once matching mbcache entry is
found, code dropping the last xattr block reference (or trying to modify
xattr block in place) waits until the mbcache entry reference is
dropped. This way code trying to reuse xattr block is protected from
someone trying to drop the last reference to xattr block.
Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove unnecessary else (and thus indentation level) from a code block
in ext4_xattr_block_set(). It will also make following code changes
easier. No functional changes.
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we remove EA inode from mbcache as soon as its xattr refcount
drops to zero. However there can be pending attempts to reuse the inode
and thus refcount handling code has to handle the situation when
refcount increases from zero anyway. So save some work and just keep EA
inode in mbcache until it is getting evicted. At that moment we are sure
following iget() of EA inode will fail anyway (or wait for eviction to
finish and load things from the disk again) and so removing mbcache
entry at that moment is fine and simplifies the code a bit.
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
it is unused and also add a function to wait for entry to become unused
- mb_cache_entry_wait_unused(). We do not share code between the two
deleting function as one of them will go away soon.
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Do not reclaim entries that are currently used by somebody from a
shrinker. Firstly, these entries are likely useful. Secondly, we will
need to keep such entries to protect pending increment of xattr block
refcount.
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_append() must always allocate a new block, otherwise we run the
risk of overwriting existing directory block corrupting the directory
tree in the process resulting in all manner of problems later on.
Add a sanity check to see if the logical block is already allocated and
error out if it is.
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently ext4 directory handling code implicitly assumes that the
directory blocks are always within the i_size. In fact ext4_append()
will attempt to allocate next directory block based solely on i_size and
the i_size is then appropriately increased after a successful
allocation.
However, for this to work it requires i_size to be correct. If, for any
reason, the directory inode i_size is corrupted in a way that the
directory tree refers to a valid directory block past i_size, we could
end up corrupting parts of the directory tree structure by overwriting
already used directory blocks when modifying the directory.
Fix it by catching the corruption early in __ext4_read_dirblock().
Addresses Red-Hat-Bugzilla: #2070205
CVE: CVE-2022-1184
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add support to display the mb_optimize_scan value in
/proc/fs/ext4/<dev>/options file. The option is only
displayed when the value is non default.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20220704054603.21462-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now if check directoy entry is corrupted, ext4_empty_dir may return true
then directory will be removed when file system mounted with "errors=continue".
In order not to make things worse just return false when directory is corrupted.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220622090223.682234-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When migrating to extents, the checksum seed of temporary inode
need to be replaced by inode's, otherwise the inode checksums
will be incorrect when swapping the inodes data.
However, the temporary inode can not match it's checksum to
itself since it has lost it's own checksum seed.
mkfs.ext4 -F /dev/sdc
mount /dev/sdc /mnt/sdc
xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/sdc/testfile
chattr -e /mnt/sdc/testfile
chattr +e /mnt/sdc/testfile
umount /dev/sdc
fsck -fn /dev/sdc
========
...
Pass 1: Checking inodes, blocks, and sizes
Inode 13 passes checks, but checksum does not match inode. Fix? no
...
========
The fix is simple, save the checksum seed of temporary inode, and
recover it after migrating to extents.
Fixes: e81c9302a6 ("ext4: set csum seed in tmp inode while migrating to extents")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220617062515.2113438-1-lilingfeng3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use the EXT4_INODE_HAS_XATTR_SPACE macro to more accurately
determine whether the inode have xattr space.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the ext4 inode does not have xattr space, 0 is returned in the
get_max_inline_xattr_value_size function. Otherwise, the function returns
a negative value when the inode does not contain EXT4_STATE_XATTR.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When adding an xattr to an inode, we must ensure that the inode_size is
not less than EXT4_GOOD_OLD_INODE_SIZE + extra_isize + pad. Otherwise,
the end position may be greater than the start position, resulting in UAF.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220616021358.2504451-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A race can occur in the unlikely event ext4 is unable to allocate a
physical cluster for a delayed allocation in a bigalloc file system
during writeback. Failure to allocate a cluster forces error recovery
that includes a call to mpage_release_unused_pages(). That function
removes any corresponding delayed allocated blocks from the extent
status tree. If a new delayed write is in progress on the same cluster
simultaneously, resulting in the addition of an new extent containing
one or more blocks in that cluster to the extent status tree, delayed
block accounting can be thrown off if that delayed write then encounters
a similar cluster allocation failure during future writeback.
Write lock the i_data_sem in mpage_release_unused_pages() to fix this
problem. Ext4's block/cluster accounting code for bigalloc relies on
i_data_sem for mutual exclusion, as is found in the delayed write path,
and the locking in mpage_release_unused_pages() is missing.
Cc: stable@kernel.org
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20220615160530.1928801-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We catch an assert problem in jbd2_journal_commit_transaction() when
doing fsstress and request falut injection tests. The problem is
happened in a race condition between jbd2_journal_commit_transaction()
and ext4_end_io_end(). Firstly, ext4_writepages() writeback dirty pages
and start reserved handle, and then the journal was aborted due to some
previous metadata IO error, jbd2_journal_abort() start to commit current
running transaction, the committing procedure could be raced by
ext4_end_io_end() and lead to subtract j_reserved_credits twice from
commit_transaction->t_outstanding_credits, finally the
t_outstanding_credits is mistakenly smaller than t_nr_buffers and
trigger assert.
kjournald2 kworker
jbd2_journal_commit_transaction()
write_unlock(&journal->j_state_lock);
atomic_sub(j_reserved_credits, t_outstanding_credits); //sub once
jbd2_journal_start_reserved()
start_this_handle() //detect aborted journal
jbd2_journal_free_reserved() //get running transaction
read_lock(&journal->j_state_lock)
__jbd2_journal_unreserve_handle()
atomic_sub(j_reserved_credits, t_outstanding_credits);
//sub again
read_unlock(&journal->j_state_lock);
journal->j_running_transaction = NULL;
J_ASSERT(t_nr_buffers <= t_outstanding_credits) //bomb!!!
Fix this issue by using journal->j_state_lock to protect the subtraction
in jbd2_journal_commit_transaction().
Fixes: 96f1e09745 ("jbd2: avoid long hold times of j_state_lock while committing a transaction")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220611130426.2013258-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_log_start_commit() is not used outside of jbd2 so unexport it. Also
make __jbd2_log_start_commit() static when we are at it.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jbd2 exports jbd2_journal_enable_debug and __jbd2_debug() depite the
first is used only in fs/jbd2/journal.c and the second only within jbd2
code. Remove the pointless exports make jbd2_journal_enable_debug
static.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The name of jbd_debug() is confusing as all functions inside jbd2 have
jbd2_ prefix. Rename jbd_debug() to jbd2_debug(). No functional changes.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We use jbd_debug() in some places in ext4. It seems a bit strange to use
jbd2 debugging output function for ext4 code. Also these days
ext4_debug() uses dynamic printk so each debug message can be enabled /
disabled on its own so the time when it made some sense to have these
combined (to allow easier common selecting of messages to report) has
passed. Just convert all jbd_debug() uses in ext4 to ext4_debug().
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After each buddy split, mb_mark_used will search the proper order
for the block which may consume some loop in mb_find_order_for_block.
In fact, we can reuse the order and buddy generated by the buddy split.
Reviewed by: lei.rao@intel.com
Signed-off-by: hanjinke <hanjinke.666@bytedance.com>
Link: https://lore.kernel.org/r/20220606155305.74146-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When the EXT4_IOC_RESIZE_FS ioctl is complete, update the backup
superblocks. We don't do this for the old-style resize ioctls since
they are quite ancient, and only used by very old versions of
resize2fs --- and we don't want to update the backup superblocks every
time EXT4_IOC_GROUP_ADD is called, since it might get called a lot.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220629040026.112371-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When doing an online resize, the on-disk superblock on-disk wasn't
updated. This means that when the file system is unmounted and
remounted, and the on-disk overhead value is non-zero, this would
result in the results of statfs(2) to be incorrect.
This was partially fixed by Commits 10b01ee92d ("ext4: fix overhead
calculation to account for the reserved gdt blocks"), 85d825dbf4
("ext4: force overhead calculation if the s_overhead_cluster makes no
sense"), and eb7054212e ("ext4: update the cached overhead value in
the superblock").
However, since it was too expensive to forcibly recalculate the
overhead for bigalloc file systems at every mount, this didn't fix the
problem for bigalloc file systems. This commit should address the
problem when resizing file systems with the bigalloc feature enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220629040026.112371-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit 6493792d32 ("ext4: convert symlink external data block
mapping to bdev"), create new symlink with inline_data is not supported,
but it missing to handle the leftover inlined symlinks, which could
cause below error message and fail to read symlink.
ls: cannot read symbolic link 'foo': Structure needs cleaning
EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block
2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080
(length 1)
Fix this regression by adding ext4_read_inline_link(), which read the
inline data directly and convert it through a kmalloced buffer.
Fixes: 6493792d32 ("ext4: convert symlink external data block mapping to bdev")
Cc: stable@kernel.org
Reported-by: Torge Matthies <openglfreak@googlemail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Torge Matthies <openglfreak@googlemail.com>
Link: https://lore.kernel.org/r/20220630090100.2769490-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
API:
- Make proc files report fips module name and version.
Algorithms:
- Move generic SHA1 code into lib/crypto.
- Implement Chinese Remainder Theorem for RSA.
- Remove blake2s.
- Add XCTR with x86/arm64 acceleration.
- Add POLYVAL with x86/arm64 acceleration.
- Add HCTR2.
- Add ARIA.
Drivers:
- Add support for new CCP/PSP device ID in ccp.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmLosAAACgkQxycdCkmx
i6dvgxAAzcw0cKMuq3dbQamzeVu1bDW8rPb7yHnpXal3ao5ewa15+hFjsKhdh/s3
cjM5Lu7Qx4lnqtsh2JVSU5o2SgEpptxXNfxAngcn46ld5EgV/G4DYNKuXsatMZ2A
erCzXqG9dDxJmREat+5XgVfD1RFVsglmEA/Nv4Rvn+9O4O6PfwRa8GyUzeKC+byG
qs/1JyiPqpyApgzCvlQFAdTF4PM7ruDtg3mnMy2EKAzqj4JUseXRi1i81vLVlfBL
T40WESG/CnOwIF5MROhziAtkJMS4Y4v2VQ2++1p0gwG6pDCnq4w7u9cKPXYfNgZK
fMVCxrNlxIH3W99VfVXbXwqDSN6qEZtQvhnliwj9aEbEltIoH+B02wNfS/BDsTec
im+5NCnNQ6olMPyL0yHrMKisKd+DwTrEfYT5H2kFhcdcYZncQ9C6el57kimnJRzp
4ymPRudCKm/8weWGTtmjFMi+PFP4LgvCoR+VMUd+gVe91F9ZMAO0K7b5z5FVDyDf
wmsNBvsEnTdm/r7YceVzGwdKQaP9sE5wq8iD/yySD1PjlmzZos1CtCrqAIT/v2RK
pQdZCIkT8qCB+Jm03eEd4pwjEDnbZdQmpKt4cTy0HWIeLJVG1sXPNpgwPCaBEV4U
g0nctILtypChlSDmuGhTCyuElfMg6CXt4cgSZJTBikT+QcyWOm4=
=rfWK
-----END PGP SIGNATURE-----
Merge tag 'v5.20-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Make proc files report fips module name and version
Algorithms:
- Move generic SHA1 code into lib/crypto
- Implement Chinese Remainder Theorem for RSA
- Remove blake2s
- Add XCTR with x86/arm64 acceleration
- Add POLYVAL with x86/arm64 acceleration
- Add HCTR2
- Add ARIA
Drivers:
- Add support for new CCP/PSP device ID in ccp"
* tag 'v5.20-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (89 commits)
crypto: tcrypt - Remove the static variable initialisations to NULL
crypto: arm64/poly1305 - fix a read out-of-bound
crypto: hisilicon/zip - Use the bitmap API to allocate bitmaps
crypto: hisilicon/sec - fix auth key size error
crypto: ccree - Remove a useless dma_supported() call
crypto: ccp - Add support for new CCP/PSP device ID
crypto: inside-secure - Add missing MODULE_DEVICE_TABLE for of
crypto: hisilicon/hpre - don't use GFP_KERNEL to alloc mem during softirq
crypto: testmgr - some more fixes to RSA test vectors
cyrpto: powerpc/aes - delete the rebundant word "block" in comments
hwrng: via - Fix comment typo
crypto: twofish - Fix comment typo
crypto: rmd160 - fix Kconfig "its" grammar
crypto: keembay-ocs-ecc - Drop if with an always false condition
Documentation: qat: rewrite description
Documentation: qat: Use code block for qat sysfs example
crypto: lib - add module license to libsha1
crypto: lib - make the sha1 library optional
crypto: lib - move lib/sha1.c into lib/crypto/
crypto: fips - make proc files report fips module name and version
...
The netfs_write_begin() won't set the folio if the return value
is non-zero.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Clear O_TRUNC from the flags sent in the MDS create request.
`atomic_open' is called before permission check. We should not do any
modification to the file here. The caller will do the truncation
afterward.
Fixes: 124e68e740 ("ceph: file operations")
Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The f_frsize maybe changed in the quota size is less than the defualt
4MB.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When the quota is approaching we need to notify it to the MDS as
soon as possible, or the client could write to the directory more
than expected.
This will flush the dirty caps without delaying after each write,
though this couldn't prevent the real size of a directory exceed
the quota but could prevent it as soon as possible.
Link: https://tracker.ceph.com/issues/56180
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the 'i_inline_version' is 1, that means the file is just new
created and there shouldn't have any inline data in it, we should
skip retrieving the inline data from MDS.
This also could help reduce possiblity of dead lock issue introduce
by the inline data and Fcr caps.
Gradually we will remove the inline feature from kclient after ceph's
scrub too have support to unline the inline data, currently this
could help reduce the teuthology test failures.
This is possiblly could also fix a bug that for some old clients if
they couldn't explictly uninline the inline data when writing, the
inline version will keep as 1 always. We may always reading non-exist
data from inline data.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For async create we will always try to choose the auth MDS of frag
the dentry belonged to of the parent directory to send the request
and ususally this works fine, but if the MDS migrated the directory
to another MDS before it could be handled the request will be
forwarded. And then the auth cap will be changed.
We need to update the auth cap in this case before the request is
forwarded.
Link: https://tracker.ceph.com/issues/55857
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Willy requested that we change this back to warning on folio->private
being non-NULl. He's trying to kill off the PG_private flag, and so we'd
like to catch where it's non-NULL.
Add a VM_WARN_ON_FOLIO (since it doesn't exist yet) and change over to
using that instead of VM_BUG_ON_FOLIO along with testing the ->private
pointer.
[ xiubli: define VM_WARN_ON_FOLIO macro in case DEBUG_VM is disabled
reported by kernel test robot <lkp@intel.com> ]
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
"was_async" is a bit misleadingly named. It's supposed to indicate
whether it's safe to call blocking operations from the context you're
calling it from, but it sounds like it's asking whether this was done
via async operation. For ceph, this it's always called from kernel
thread context so it should be safe to set this to false.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There's no reason we need to lock the inode for write in order to handle
an llseek. I suspect this should have been dropped in 2013 when we
stopped doing vmtruncate in llseek.
With that gone, ceph_llseek is functionally equivalent to
generic_file_llseek, so just call that after getting the size.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When handle_cap_grant is called on an IMPORT op, then the snap_rwsem is
held and the function is expected to release it before returning. It
currently fails to do that in all cases which could lead to a deadlock.
Fixes: 6f05b30ea0 ("ceph: reset i_requested_max_size if file write is not wanted")
Link: https://tracker.ceph.com/issues/55857
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The MDS tries to enforce a limit on the total key/values in extended
attributes. However, this limit is enforced only if doing a synchronous
operation (MDS_OP_SETXATTR) -- if we're buffering the xattrs, the MDS
doesn't have a chance to enforce these limits.
This patch adds support for decoding the xattrs maximum size setting that is
distributed in the mdsmap. Then, when setting an xattr, the kernel client
will revert to do a synchronous operation if that maximum size is exceeded.
While there, fix a dout() that would trigger a printk warning:
[ 98.718078] ------------[ cut here ]------------
[ 98.719012] precision 65536 too large
[ 98.719039] WARNING: CPU: 1 PID: 3755 at lib/vsprintf.c:2703 vsnprintf+0x5e3/0x600
...
Link: https://tracker.ceph.com/issues/55725
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
And for the 'Xs' caps for getxattr we will also choose the auth MDS,
because the MDS side code is buggy due to setxattr won't notify the
replica MDSes when the values changed and the replica MDS will return
the old values. Though we will fix it in MDS code, but this still
makes sense for old ceph.
Link: https://tracker.ceph.com/issues/55331
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the connection was accidently closed due to the socket issue or
something else the clients will try to open the opened sessions, the
MDSes will send the session open reply one more time if the clients
support the notify feature.
When the clients retry to open the sessions the s_seq will be 0 as
default, we need to update it anyway.
Link: https://tracker.ceph.com/issues/53911
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In async unlink case the kclient won't wait for the first reply
from MDS and just drop all the links and unhash the dentry and then
succeeds immediately.
For any new create/link/rename,etc requests followed by using the
same file names we must wait for the first reply of the inflight
unlink request, or the MDS possibly will fail these following
requests with -EEXIST if the inflight async unlink request was
delayed for some reasons.
And the worst case is that for the none async openc request it will
successfully open the file if the CDentry hasn't been unlinked yet,
but later the previous delayed async unlink request will remove the
CDenty. That means the just created file is possiblly deleted later
by accident.
We need to wait for the inflight async unlink requests to finish
when creating new files/directories by using the same file names.
Link: https://tracker.ceph.com/issues/55332
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>