In gfs2_write_jdata_batch(), to compute the number of blocks, compute
the total size of the folio batch instead of the number of pages it
contains. Not a functional change.
Note that we don't currently allow mounting filesystems with a block
size bigger than the page size. We could change that after converting
the page cache to folios. The page cache would then only contain
block-size or bigger folios, so rounding wouldn't become an issue here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Get rid of the generation parameter of gfs2_alloc_blocks(): we only ever
set the generation of the current inode while creating it, so do so
directly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- "kdump: use generic functions to simplify crashkernel reservation in
arch", from Baoquan He. This is mainly cleanups and consolidation of
the "crashkernel=" kernel parameter handling.
- After much discussion, David Laight's "minmax: Relax type checks in
min() and max()" is here. Hopefully reduces some typecasting and the
use of min_t() and max_t().
- A group of patches from Oleg Nesterov which clean up and slightly fix
our handling of reads from /proc/PID/task/... and which remove
task_struct.therad_group.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
=On4u
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"As usual, lots of singleton and doubleton patches all over the tree
and there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- 'kdump: use generic functions to simplify crashkernel reservation
in arch', from Baoquan He. This is mainly cleanups and
consolidation of the 'crashkernel=' kernel parameter handling
- After much discussion, David Laight's 'minmax: Relax type checks in
min() and max()' is here. Hopefully reduces some typecasting and
the use of min_t() and max_t()
- A group of patches from Oleg Nesterov which clean up and slightly
fix our handling of reads from /proc/PID/task/... and which remove
task_struct.thread_group"
* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
scripts/gdb/vmalloc: disable on no-MMU
scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
.mailmap: add address mapping for Tomeu Vizoso
mailmap: update email address for Claudiu Beznea
tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
.mailmap: map Benjamin Poirier's address
scripts/gdb: add lx_current support for riscv
ocfs2: fix a spelling typo in comment
proc: test ProtectionKey in proc-empty-vm test
proc: fix proc-empty-vm test with vsyscall
fs/proc/base.c: remove unneeded semicolon
do_io_accounting: use sig->stats_lock
do_io_accounting: use __for_each_thread()
ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
ocfs2: fix a typo in a comment
scripts/show_delta: add __main__ judgement before main code
treewide: mark stuff as __ro_after_init
fs: ocfs2: check status values
proc: test /proc/${pid}/statm
compiler.h: move __is_constexpr() to compiler.h
...
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
Function print_message() in quota.c doesn't return a meaningful return
value. Turn it into a void function and stop abusing it for setting
variable error to 0 in gfs2_quota_check().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When intializing a struct, all fields that are not explicitly mentioned
are zeroed out already.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Make sure we don't skip accounting for quota changes with the
quota=account mount option.
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
=2PXg
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner:
"The 's_xattr' field of 'struct super_block' currently requires a
mutable table of 'struct xattr_handler' entries (although each handler
itself is const). However, no code in vfs actually modifies the
tables.
This changes the type of 's_xattr' to allow const tables, and modifies
existing file systems to move their tables to .rodata. This is
desirable because these tables contain entries with function pointers
in them; moving them to .rodata makes it considerably less likely to
be modified accidentally or maliciously at runtime"
* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
const_structs.checkpatch: add xattr_handler
net: move sockfs_xattr_handlers to .rodata
shmem: move shmem_xattr_handlers to .rodata
overlayfs: move xattr tables to .rodata
xfs: move xfs_xattr_handlers to .rodata
ubifs: move ubifs_xattr_handlers to .rodata
squashfs: move squashfs_xattr_handlers to .rodata
smb: move cifs_xattr_handlers to .rodata
reiserfs: move reiserfs_xattr_handlers to .rodata
orangefs: move orangefs_xattr_handlers to .rodata
ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
ntfs3: move ntfs_xattr_handlers to .rodata
nfs: move nfs4_xattr_handlers to .rodata
kernfs: move kernfs_xattr_handlers to .rodata
jfs: move jfs_xattr_handlers to .rodata
jffs2: move jffs2_xattr_handlers to .rodata
hfsplus: move hfsplus_xattr_handlers to .rodata
hfs: move hfs_xattr_handlers to .rodata
gfs2: move gfs2_xattr_handlers_max to .rodata
fuse: move fuse_xattr_handlers to .rodata
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
=7yD1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.
This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.
Is it far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.
This change already makes file_dentry() moot, but for now we did
not change this helper just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.
After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().
- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.
Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
files since a very long time because of the nasty interactions
between the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.
I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:
https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Cleanups:
- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().
- Annotate struct watch_filter with the new __counted_by attribute.
- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.
- Simplify file cleanup when the file has never been opened.
- Use module helper instead of open-coding it.
- Predict error unlikely for stale retry.
- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.
Fixes:
- Fix readahead on block devices.
- Fix writeback when layztime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.
- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
With all users converted, remove the old create_empty_buffers() and rename
folio_create_empty_buffers() to create_empty_buffers().
Link: https://lkml.kernel.org/r/20231016201114.1928083-28-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the folio APIs, saving four hidden calls to compound_head().
Link: https://lkml.kernel.org/r/20231016201114.1928083-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove several folio->page->folio conversions. Also use __GFP_NOFAIL
instead of calling yield() and the new get_nth_bh().
Link: https://lkml.kernel.org/r/20231016201114.1928083-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the folio APIs, removing numerous hidden calls to compound_head().
Also remove the stale comment about the page being looked up if it's NULL.
Link: https://lkml.kernel.org/r/20231016201114.1928083-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Header gfs2_ondisk.h defines GFS2_BASIC_BLOCK and GFS2_BASIC_BLOCK_SHIFT
in a misguided attempt to abstract away the fact that sectors on block
devices are 512 or (1 << 9) bytes in size. Stop using those definitions.
I would be inclinded to remove those definitions altogether, but the
gfs2 user-space tools are using them.
In addition, instead of GFS2_SB(inode)->sd_sb.sb_bsize_shift, simply use
inode->i_blkbits.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Andrew Price <anprice@redhat.com>
Add a missing initialization of variable ap in setattr_chown().
Without, chown() may be able to bypass quotas.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In recent discussions around some performance improvements in the file
handling area we discussed switching the file cache to rely on
SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
freeing for files completely. This is a pretty sensitive change overall
but it might actually be worth doing.
The main downside is the subtlety. The other one is that we should
really wait for Jann's patch to land that enables KASAN to handle
SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
exists.
With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
which requires a few changes. So it isn't sufficient anymore to just
acquire a reference to the file in question under rcu using
atomic_long_inc_not_zero() since the file might have already been
recycled and someone else might have bumped the reference.
In other words, callers might see reference count bumps from newer
users. For this reason it is necessary to verify that the pointer is the
same before and after the reference count increment. This pattern can be
seen in get_file_rcu() and __files_get_rcu().
In addition, it isn't possible to access or check fields in struct file
without first aqcuiring a reference on it. Not doing that was always
very dodgy and it was only usable for non-pointer data in struct file.
With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
reference under rcu or they must hold the files_lock of the fdtable.
Failing to do either one of this is a bug.
Thanks to Jann for pointing out that we need to ensure memory ordering
between reallocations and pointer check by ensuring that all subsequent
loads have a dependency on the second load in get_file_rcu() and
providing a fixup that was folded into this patch.
Cc: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This makes it harder for accidental or malicious changes to
gfs2_xattr_handlers_max at runtime.
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-13-wedsonaf@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a kthread_stop_put() helper that stops a thread and puts its task
struct. Use it to replace the various instances of kthread_stop()
followed by put_task_struct().
Remove the kthread_stop_put() macro in usbip that is similar but doesn't
return the result of kthread_stop().
[agruenba@redhat.com: fix kerneldoc comment]
Link: https://lkml.kernel.org/r/20230911111730.2565537-1-agruenba@redhat.com
[akpm@linux-foundation.org: document kthread_stop_put()'s argument]
Link: https://lkml.kernel.org/r/20230907234048.2499820-1-agruenba@redhat.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In RCU mode, we might race with gfs2_evict_inode(), which zeroes
->i_gl. Freeing of the object it points to is RCU-delayed, so
if we manage to fetch the pointer before it's been replaced with
NULL, we are fine. Check if we'd fetched NULL and treat that
as "bail out and tell the caller to get out of RCU mode".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When lots of quota changes are made, there may be cases in which an
inode's quota information is increased and then decreased, such as when
blocks are added to a file, then deleted from it. If the timing is
right, function do_qc can add pending quota changes to a transaction,
then later, another call to do_qc can negate those changes, resulting
in a net gain of 0. The quota_change information is recorded in the qc
buffer (and qd element of the inode as well). The buffer is added to the
transaction by the first call to do_qc, but a subsequent call changes
the value from non-zero back to zero. At that point it's too late to
remove the buffer_head from the transaction. Later, when the quota sync
code is called, the zero-change qd element is discovered and flagged as
an assert warning. If the fs is mounted with errors=panic, the kernel
will panic.
This is usually seen when files are truncated and the quota changes are
negated by punch_hole/truncate which uses gfs2_quota_hold and
gfs2_quota_unhold rather than block allocations that use gfs2_quota_lock
and gfs2_quota_unlock which automatically do quota sync.
This patch solves the problem by adding a check to qd_check_sync such
that net-zero quota changes already added to the transaction are no
longer deemed necessary to be synced, and skipped.
In this case references are taken for the qd and the slot from do_qc
so those need to be put. The normal sequence of events for a normal
non-zero quota change is as follows:
gfs2_quota_change
do_qc
qd_hold
slot_hold
Later, when the changes are to be synced:
gfs2_quota_sync
qd_fish
qd_check_sync
gets qd ref via lockref_get_not_dead
do_sync
do_qc(QC_SYNC)
qd_put
lockref_put_or_lock
qd_unlock
qd_put
lockref_put_or_lock
In the net-zero change case, we add a check to qd_check_sync so it puts
the qd and slot references acquired in gfs2_quota_change and skip the
unneeded sync.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
During direct reads and writes, the caller is holding the inode glock in
deferred mode, which doesn't allow metadata updates. However, a previous
change caused callers to update the inode modification time before carrying out
direct writes, which caused the inode glock to be converted to exclusive mode
for the timestamp update, only to be immediately converted back to deferred
mode for the direct write. This locks out other direct readers and writers
and wreaks havoc on performance.
Fix that by reverting to not updating the inode modification time for direct
writes.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Those helpers don't add any clarity and are easy to use wrong. Spell
them out to make more obvious what's happening.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before commit b77b4a4815 ("gfs2: Rework freeze / thaw logic"), the
freeze glock was kept around in the glock cache in shared mode without
being actively held while a filesystem is in thawed state. In that
state, memory pressure could have eventually evicted the freeze glock,
and the freeze_go_demote_ok callback was needed to prevent that from
happening.
With the freeze / thaw rework, the freeze glock is now always actively
held in shared mode while a filesystem is thawed, and the
freeze_go_demote_ok hack is no longer needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When trying to upgrade the iopen glock, gfs2_upgrade_iopen_glock() tries
to take the iopen glock with the LM_FLAG_TRY_1CB flag set before trying
to take it without the LM_FLAG_TRY or LM_FLAG_TRY_1CB flags set. Both
calls will cause the lock contention bast callbacks to be invoked
throughout the cluster, and we really don't need them to be invoked
twice. Remove the first LM_FLAG_TRY_1CB call to eliminate unnecessary
dlm traffic.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch eef46ab713 introduced a new gfs2 quota=quiet mount option.
Checks for the new option were added to quota.c, but a check in
gfs2_quota_lock_check() was overlooked. This patch adds the missing
check.
Fixes: eef46ab713 ("gfs2: Introduce new quota=quiet mount option")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_scan_glock_lru would only try to free
glocks that had a reference count of 0. But if the reference count ever
got to 0, the glock should have already been freed.
Shrinker function gfs2_dispose_glock_lru checks whether glocks on the
LRU are demote_ok, and if so, tries to demote them. But that's only
possible if the reference count is at least 1.
This patch changes gfs2_scan_glock_lru so it will try to demote and/or
dispose of glocks that have a reference count of 1 and which are either
demotable, or are already unlocked.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
On a thawed filesystem, the freeze glock is held in shared mode. In
order to initiate a cluster-wide freeze, the node initiating the freeze
drops the freeze glock and grabs it in exclusive mode. The other nodes
recognize this as contention on the freeze glock; function
freeze_go_callback is invoked. This indicates to them that they must
freeze the filesystem locally, drop the freeze glock, and then
re-acquire it in shared mode before being able to unfreeze the
filesystem locally.
While a node is trying to re-acquire the freeze glock in shared mode,
additional contention can occur. In that case, the node must behave in
the same way as above.
Unfortunately, freeze_go_callback() contains a check that causes it to
bail out when the freeze glock isn't held in shared mode. Fix that to
allow the glock to be unlocked or held in shared mode.
In addition, update a reference to trylock_super() which has been
renamed to super_trylock_shared() in the meantime.
Fixes: b77b4a4815 ("gfs2: Rework freeze / thaw logic")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Fix a glock state (non-)transition bug when a dlm request times out
and is canceled, and we have locking requests that can now be granted
immediately.
- Various fixes and cleanups in how the logd and quotad daemons are
woken up and terminated.
- Fix several bugs in the quota data reference counting and shrinking.
Free quota data objects synchronously in put_super() instead of
letting call_rcu() run wild.
- Make sure not to deallocate quota data during a withdraw; rather, defer
quota data deallocation to put_super(). Withdraws can happen in
contexts in which callers on the stack are holding quota data references.
- Many minor quota fixes and cleanups by Bob.
- Update the the mailing list address for gfs2 and dlm. (It's the same
list for both and we are moving it to gfs2@lists.linux.dev.)
- Various other minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmT3T7UUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqxhw/+IWp+4cY4htNkTRG7xkheTeQ+5whG
NU40mp7Hj+WY5GoHqsk676q1pBkVAq5mNN1kt9S/oC6lLHrdu1HLpdIkgFow2nAC
nDqlEqx9/Da9Q4H/+K442usO90S4o1MmOXOE9xcGcvJLqK4FLDOVDXbUWa43OXrK
4HxgjgGSNPF4itD+U0o/V4f19jQ+4cNwmo6hGLyhsYillaUHiIXJezQlH7XycByM
qGJqlG6odJ56wE38NG8Bt9Lj+91PsLLqO1TJxSYyzpf0h9QGQ2ySvu6/esTXwxWO
XRuT4db7yjyAUhJoJMw+YU77xWQTz0/jriIDS7VqzvR1ns3GPaWdtb31TdUTBG4H
IvBA8ep3oxHtcYFoPzCLBXgOIDej6KjAgS3RSv51yLeaZRHFUBc21fTSXbcDTIUs
gkusZlRNQ9ANdBCVyf8hZxbE54HnaBJ8dKMZtynOXJEHs0EtGV8YKCNIpkFLxOvE
vZkKcRsmVtuZ9fVhX1iH7dYmcsCMPI8RNo47k7hHk2EG8dU+eqyPSbi4QCmErNFf
DlqX+fIuiDtOkbmWcrb2qdphn6j6bMLhDaOMJGIBOmgOPi+AU9dNAfmtu1cG4u1b
2TFyUISayiwqHJQgguzvDed15fxexYdgoLB7O9t9TMbCENxisguNa5TsAN6ZkiLQ
0hY6h80xSR2kCPU=
=EonA
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix a glock state (non-)transition bug when a dlm request times out
and is canceled, and we have locking requests that can now be granted
immediately
- Various fixes and cleanups in how the logd and quotad daemons are
woken up and terminated
- Fix several bugs in the quota data reference counting and shrinking.
Free quota data objects synchronously in put_super() instead of
letting call_rcu() run wild
- Make sure not to deallocate quota data during a withdraw; rather,
defer quota data deallocation to put_super(). Withdraws can happen in
contexts in which callers on the stack are holding quota data
references
- Many minor quota fixes and cleanups by Bob
- Update the the mailing list address for gfs2 and dlm. (It's the same
list for both and we are moving it to gfs2@lists.linux.dev)
- Various other minor cleanups
* tag 'gfs2-v6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (51 commits)
MAINTAINERS: Update dlm mailing list
MAINTAINERS: Update gfs2 mailing list
gfs2: change qd_slot_count to qd_slot_ref
gfs2: check for no eligible quota changes
gfs2: Remove useless assignment
gfs2: simplify slot_get
gfs2: Simplify qd2offset
gfs2: introduce qd_bh_get_or_undo
gfs2: Remove quota allocation info from quota file
gfs2: use constant for array size
gfs2: Set qd_sync_gen in do_sync
gfs2: Remove useless err set
gfs2: Small gfs2_quota_lock cleanup
gfs2: move qdsb_put and reduce redundancy
gfs2: improvements to sysfs status
gfs2: Don't try to sync non-changes
gfs2: Simplify function need_sync
gfs2: remove unneeded pg_oflow variable
gfs2: remove unneeded variable done
gfs2: pass sdp to gfs2_write_buf_to_page
...
Variable qd_slot_count is a reference count, not a count of slots. This
patch renames it to qd_slot_ref to make that more clear.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_quota_sync would always allocate a page
full of memory and increment its quota sync generation number. This
happened even when the system was completely idle or if no blocks were
allocated or quota changes made. This patch adds function qd_changed
to determine if any changes have been made that qualify for a
quota sync. If not, it avoids the memory allocation and bumping the
generation number, along with all the additional work it would do.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This assignment is unnecessary because if error was not already 0, it
would have branched to an error label already.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Simplify function slot_get and get rid of the goto that jumps into the
middle of an else branch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This is a minor cleanup of function qd2offset.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch is an attempt to force some consistency in quota sync
processing. Two functions (qd_fish and gfs2_quota_unlock) called
qd_check_sync, after which they both called bh_get, and if that failed,
they took the same steps to undo the actions of qd_check_sync.
This patch introduces a new function, qd_bh_get_or_undo, which performs
the same steps, reducing code redundancy.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function do_sync called gfs2_qa_get and put for quota allocation data.
But the inode in question is the system master quota file, which is
never subject to quotas. Therefore, a qa structure should be unnecessary
and if anything accesses it, it's probably a bug.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_quota_unlock declared an array of 4 qd elements. We have a
constant for that, we should be using it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Func do_sync was called in two places: gfs2_quota_unlock and
gfs2_quota_sync. In gfs2_quota_sync it updated qd_sync_gen to the latest
superblock sync gen, if do_sync was successful. In gfs2_quota_unlock it
didn't update the value. That can only lead to extra work, for example,
if the value is synced by gfs2_quota_unlock but still has the old value.
This patch moves the setting of qd_sync_gen inside do_sync so we are
guaranteed consistency.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_adjust_quota set variable err, then set it again to a
different value. This patch removes the redundant set.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
No need to set error = 0 since it's set further down.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch looks more invasive than it is. It simply moves function
qdsb_put before qd_unlock, then changes qd_unlock to call it rather than
open coding it. Again, this reduces redundancy.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch adds some new fields to the gfs2 status file in sysfs to aid
in debugging.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function need_sync is supposed to determine if a qd element needs to be
synced. If the "change" (qd_change) is zero, it does not need to be
synced because there's literally no change in the value. Before this
patch need_sync returned false if value < 0. That should be <= 0.
This patch changes the check to <=.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch simplifies function need_sync by eliminating a variable in
favor of just returning the appropriate value as soon as we know it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_write_disk_quota checks if its write overflows onto
another page, and if so, does a second write. Before this patch it kept
two variables for this, but only one is needed. This patch simplifies
it by eliminating pg_oflow.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_write_buf_to_page uses variable done to exit its loop, but
it's unnecessary if we just code an infinite loop and exit when we need.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch passes the superblock pointer to gfs2_write_buf_to_page so it
becomes more apparent it's dealing with the system quota file.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Like the previous patch, we now pass the superblock pointer to function
gfs2_write_disk_quota. This makes the code more understandable, since it
only operates on the quota inode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this change function gfs2_adjust_quota's first parameter was an
gfs2_inode pointer. But it always pointed to the quota inode. Here we
switch that to pass the superblock pointer, sdp, so it is easier to read
the code and understand that it's only dealing with the quota inode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since patch 845802b112 function gfs2_write_buf_to_page checks if the
target inode is jdata or ordered. This function only operates on the
system quota file, which is always jdata, so the check for jdata is
useless. This patch removes it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch adds a new mount option quota=quiet which is the same as
quota=on but it suppresses gfs2 quota error messages.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add the device name to the names of the gfs2_logd and gfs2_quotad kernel
threads to allow for easier identification.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename the "gfs_recovery" workqueue to "gfs2_recovery", and
gfs_recovery_wq to gfs2_recovery_wq.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_withdraw() tries to synchronize concurrent callers by
atomically setting the SDF_WITHDRAWN flag in the first caller, setting
the SDF_WITHDRAW_IN_PROG flag to indicate that a withdraw is in
progress, performing the actual withdraw, and clearing the
SDF_WITHDRAW_IN_PROG flag when done. All other callers wait for the
SDF_WITHDRAW_IN_PROG flag to be cleared before returning.
This leaves a small window in which callers can find the SDF_WITHDRAWN
flag set before the SDF_WITHDRAW_IN_PROG flag has been set, causing them
to return prematurely, before the withdraw has been completed.
Fix that by setting the SDF_WITHDRAWN and SDF_WITHDRAW_IN_PROG flags
atomically.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Immediately stop the logd and quotad kernel threads when a filesystem
withdraw is detected: those threads aren't doing anything useful after a
withdraw. (Depends on the extra logd and quotad task struct references
held since commit 7a109f383fa3 ("gfs2: Fix asynchronous thread
destruction").)
In addition, check for kthread_should_stop() in the wait condition in
gfs2_quotad() to stop immediately when kthread_stop() is called.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The kernel threads are currently stopped and destroyed synchronously by
gfs2_make_fs_ro() and gfs2_put_super(), and asynchronously by
signal_our_withdraw(), with no synchronization, so the synchronous and
asynchronous contexts can race with each other.
First, when creating the kernel threads, take an extra task struct
reference so that the task struct won't go away immediately when they
terminate. This allows those kthreads to terminate immediately when
they're done rather than hanging around as zombies until they are reaped
by kthread_stop(). When kthread_stop() is called on a terminated
kthread, it will return immediately.
Second, in signal_our_withdraw(), once the SDF_JOURNAL_LIVE flag has
been cleared, wake up the logd and quotad wait queues instead of
stopping the logd and quotad kthreads. The kthreads are then expected
to terminate automatically within short time, but if they cannot, they
will not block the withdraw.
For example, if a user process and one of the kthread decide to withdraw
at the same time, only one of them will perform the actual withdraw and
the other will wait for it to be done. If the kthread ends up being the
one to wait, the withdrawing user process won't be able to stop it.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
[ 81.372851][ T5532] CPU: 1 PID: 5532 Comm: syz-executor.0 Not tainted 6.2.0-rc1-syzkaller-dirty #0
[ 81.382080][ T5532] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
[ 81.392343][ T5532] Call Trace:
[ 81.395654][ T5532] <TASK>
[ 81.398603][ T5532] dump_stack_lvl+0x1b1/0x290
[ 81.418421][ T5532] gfs2_assert_warn_i+0x19a/0x2e0
[ 81.423480][ T5532] gfs2_quota_cleanup+0x4c6/0x6b0
[ 81.428611][ T5532] gfs2_make_fs_ro+0x517/0x610
[ 81.457802][ T5532] gfs2_withdraw+0x609/0x1540
[ 81.481452][ T5532] gfs2_inode_refresh+0xb2d/0xf60
[ 81.506658][ T5532] gfs2_instantiate+0x15e/0x220
[ 81.511504][ T5532] gfs2_glock_wait+0x1d9/0x2a0
[ 81.516352][ T5532] do_sync+0x485/0xc80
[ 81.554943][ T5532] gfs2_quota_sync+0x3da/0x8b0
[ 81.559738][ T5532] gfs2_sync_fs+0x49/0xb0
[ 81.564063][ T5532] sync_filesystem+0xe8/0x220
[ 81.568740][ T5532] generic_shutdown_super+0x6b/0x310
[ 81.574112][ T5532] kill_block_super+0x79/0xd0
[ 81.578779][ T5532] deactivate_locked_super+0xa7/0xf0
[ 81.584064][ T5532] cleanup_mnt+0x494/0x520
[ 81.593753][ T5532] task_work_run+0x243/0x300
[ 81.608837][ T5532] exit_to_user_mode_loop+0x124/0x150
[ 81.614232][ T5532] exit_to_user_mode_prepare+0xb2/0x140
[ 81.619820][ T5532] syscall_exit_to_user_mode+0x26/0x60
[ 81.625287][ T5532] do_syscall_64+0x49/0xb0
[ 81.629710][ T5532] entry_SYSCALL_64_after_hwframe+0x63/0xcd
In this backtrace, gfs2_quota_sync() takes quota data references and
then calls do_sync(). Function do_sync() encounters filesystem
corruption and withdraws the filesystem, which (among other things) calls
gfs2_quota_cleanup(). Function gfs2_quota_cleanup() wrongly assumes
that nobody is holding any quota data references anymore, and destroys
all quota data objects. When gfs2_quota_sync() then resumes and
dereferences the quota data objects it is holding, those objects are no
longer there.
Function gfs2_quota_cleanup() deals with resource deallocation and can
easily be delayed until gfs2_put_super() in the case of a filesystem
withdraw. In fact, most of the other work gfs2_make_fs_ro() does is
unnecessary during a withdraw as well, so change signal_our_withdraw()
to skip gfs2_make_fs_ro() and perform the necessary steps directly
instead.
Thanks to Edward Adam Davis <eadavis@sina.com> for the initial patches.
Link: https://lore.kernel.org/all/0000000000002b5e2405f14e860f@google.com
Reported-by: syzbot+3f6a670108ce43356017@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_quota_cleanup(), wait for the quota data objects to be freed
before returning. Otherwise, there is no guarantee that the quota data
objects will be gone when their kmem cache is destroyed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fix the refcount of quota data objects created directly by
gfs2_quota_init(): those are placed into the in-memory quota "database"
for eventual syncing to the main quota file, but they are not actively
held and should thus have an initial refcount of 0.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Once a filesystem is withdrawn, don't complain about quota changes
that can't be synced to the main quota file anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename gfs2_qd_dispose() to gfs2_qd_dispose_list(). Move some code
duplicated in gfs2_qd_dispose_list() and gfs2_quota_cleanup() into a
new gfs2_qd_dispose() function.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Change gfs2_quota_cleanup() to move the quota data objects to dispose of
on a dispose list and call gfs2_qd_dispose() on that list, like
gfs2_qd_shrink_scan() does, instead of disposing of the quota data
objects directly.
This may look a bit pointless by itself, but it will make more sense in
combination with a fix that follows.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_qd_isolate must only return LRU_REMOVED when removing the
item from the lru list; otherwise, the number of items on the list will
go wrong.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename the SDF_DEACTIVATING flag to SDF_KILL to make it more obvious
that this relates to the kill_sb filesystem operation.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename sd_glock_wait to sd_kill_wait: we'll use it for other things
related to "killing" a filesystem on unmount soon (kill_sb).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch many of the functions in quota.c got their superblock
pointer, sdp, from the quota_data's glock pointer. That's silly because
the qd already has its own pointer to the superblock (qd_sbd).
This patch changes references to use that instead, eliminating a level
of indirection.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit f07b352021 ("GFS2: Made logd daemon take into account log
demand") changed gfs2_ail_flush_reqd() and gfs2_jrnl_flush_reqd() to
take sd_log_blks_needed into account, but the checks in
gfs2_log_commit() were not updated correspondingly.
Once that is fixed, gfs2_jrnl_flush_reqd() and gfs2_ail_flush_reqd() can
be used in gfs2_log_commit(). Make those two helpers available to
gfs2_log_commit() by defining them above gfs2_log_commit().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When quotad detects an I/O error, it sets sd_log_error and then it wakes
up logd to withdraw the filesystem. However, logd doesn't wake up when
sd_log_error is set. Fix that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
First, function gfs2_ail_flush_reqd checks the SDF_FORCE_AIL_FLUSH flag
to determine if an AIL flush should be forced in low-memory situations.
However, it also immediately clears the flag, and when called repeatedly
as in function gfs2_logd, the flag will be lost. Fix that by pulling
the SDF_FORCE_AIL_FLUSH flag check out of gfs2_ail_flush_reqd.
Second, function gfs2_writepages sets the SDF_FORCE_AIL_FLUSH flag
whether or not enough pages were written. If enough pages could be
written, flushing the AIL is unnecessary, though.
Third, gfs2_writepages doesn't wake up logd after setting the
SDF_FORCE_AIL_FLUSH flag, so it can take a long time for logd to react.
It would be preferable to wake up logd, but that hurts the performance
of some workloads and we don't quite understand why so far, so don't
wake up logd so far.
Fixes: b066a4eebd ("gfs2: forcibly flush ail to relieve memory pressure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Consider the following case:
1. A glock is held in shared mode.
2. A process requests the glock in exclusive mode (rename).
3. Before the lock is granted, more processes (read / ls) request the
glock in shared mode again.
4. gfs2 sends a request to dlm for the lock in exclusive mode because
that holder is at the head of the queue.
5. Somehow the dlm request gets canceled, so dlm sends us back a
response with state == LM_ST_SHARED and LM_OUT_CANCELED. So at that
point, the glock is still held in shared mode.
6. finish_xmote gets called to process the response from dlm. It detects
that the glock is not in the requested mode and no demote is in
progress, so it moves the canceled holder to the tail of the queue
and finds the new holder at the head of the queue. That holder is
requesting the glock in shared mode.
7. finish_xmote calls do_xmote to transition the glock into shared mode,
but the glock is already in shared mode and so do_xmote complains
about that with:
GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target);
Instead, in finish_xmote, after moving the canceled holder to the tail
of the queue, check if any new holders can be granted. Only call
do_xmote to repeat the dlm request if the holder at the head of the
queue is requesting the glock in a mode that is incompatible with the
mode the glock is currently held in.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The last user of this flag was removed in commit b77b4a4815 ("gfs2:
Rework freeze / thaw logic").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Revert the rest of commit 220cca2a4f ("GFS2: Change truncate page
allocation to be GFP_NOFS"):
In gfs2_unstuff_dinode(), there is no need to carry out the page cache
allocation under GFP_NOFS because inodes on the "regular" filesystem are
never un-inlined under memory pressure, so switch back from
find_or_create_page() to grab_cache_page() here as well.
Inodes on the "metadata" filesystem can theoretically be un-inlined
under memory pressure, but any page cache allocations in that context
would happen in GFP_NOFS context because those inodes have
inode->i_mapping->gfp_mask set to GFP_NOFS (see the previous patch).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Set mapping->gfp mask to GFP_NOFS for all metadata inodes so that
allocating pages in the address space of those inodes won't call back
into the filesystem. This allows to switch back from
find_or_create_page() to grab_cache_page() in two places.
Partially reverts commit 220cca2a4f ("GFS2: Change truncate page
allocation to be GFP_NOFS").
Thanks to Dan Carpenter <dan.carpenter@linaro.org> for pointing out a
Smatch static checker warning.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().
Signed-off-by: Minjie Du <duminjie@vivo.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Changes include:
- Allow blocking posix lock requests to be interrupted while waiting.
This requires a cancel request to be sent to the userspace daemon
where posix lock requests are processed across the cluster.
- Fix a posix lock patch from the previous cycle in which lock requests
from different file systems could be mixed up.
- Fix some long standing problems with nfs posix lock cancelation.
- Add a new debugfs file for printing queued callbacks.
- Stop modifying buffers that have been used to receive a message.
- Misc cleanups and some refactoring.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJk8KCgAAoJEDgbc8f8gGmqfk4P/jB4L2qwaamq2mNRxFPXSzpp
y5UiNoMG8Mw4OT9vytu2xzmmrYT7d1TvZ4lNcLYjkNYmcyuTZzu8o/kvGwt9gnXC
94DPmGQb0RQY/pZOdTMcIBplXiCSFpooweFOQjiWo7wlwVlYGVcfEIv9xQTNIT2/
m0niBFEWDDbVudbWXXaa4lnvo07RTmSxiHjtxqbkea2jLUgxw9mYOR8C6De3rlJf
Uh450Kitktak9tywBZa3yj8Cgy8SbiWNHlNvcV1DI3QE7LKOM5+6qVuwERYYx9lw
JbdtEoRr97QFf4w40YrJpAxFBiHCLXAquz3D3cJI8mW0RDqDuGUFU6SfsCfQEza6
Dau6XrtfuumArMn/zViBIase9xkSb36RNFopr2Si6mUoLpPalUPuLr+42qmxZY3c
KOvWis4UFq4OiOqZY5gBBS6IKoJ+X4pVnNJswScvKFA2VBLCf9fucKRoEVOAUTbg
BoJEwOjBQCoaATbGBHjwdjZ4yX/x/tLN0LsPW202QOMGdfSdeD6Wr+COyS916eVK
8Nk3lcBcU21Nhulf2Ci3Zr6B9nG09UqDRHYfH0LJJX0dq++SBRvQvjI2lcdJ0Dvj
We7nVqhcW/r486oS/r8kTXOtctYYMxecoQFYPcVufQAIU8+6YZUD53wui8EyVL/2
3GmejZgMomvGn8D4kNPC
=BBCe
-----END PGP SIGNATURE-----
Merge tag 'dlm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Allow blocking posix lock requests to be interrupted while waiting.
This requires a cancel request to be sent to the userspace daemon
where posix lock requests are processed across the cluster.
- Fix a posix lock patch from the previous cycle in which lock requests
from different file systems could be mixed up.
- Fix some long standing problems with nfs posix lock cancelation.
- Add a new debugfs file for printing queued callbacks.
- Stop modifying buffers that have been used to receive a message.
- Misc cleanups and some refactoring.
* tag 'dlm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix plock lookup when using multiple lockspaces
fs: dlm: don't use RCOM_NAMES for version detection
fs: dlm: create midcomms nodes when configure
fs: dlm: constify receive buffer
fs: dlm: drop rxbuf manipulation in dlm_recover_master_copy
fs: dlm: drop rxbuf manipulation in dlm_copy_master_names
fs: dlm: get recovery sequence number as parameter
fs: dlm: cleanup lock order
fs: dlm: remove clear_members_cb
fs: dlm: add plock dev tracepoints
fs: dlm: check on plock ops when exit dlm
fs: dlm: debugfs for queued callbacks
fs: dlm: remove unused processed_nodes
fs: dlm: add missing spin_unlock
fs: dlm: fix F_CANCELLK to cancel pending request
fs: dlm: allow to F_SETLKW getting interrupted
fs: dlm: remove twice newline
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ
tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa
lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE
pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ
7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v
SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l
l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu
krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK
sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y
tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA
AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs
d0Y/zCZf1A==
=p1bd
-----END PGP SIGNATURE-----
Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
"Pretty quiet round for this release. This contains:
- Add support for zoned storage to ublk (Andreas, Ming)
- Series improving performance for drivers that mark themselves as
needing a blocking context for issue (Bart)
- Cleanup the flush logic (Chengming)
- sed opal keyring support (Greg)
- Fixes and improvements to the integrity support (Jinyoung)
- Add some exports for bcachefs that we can hopefully delete again in
the future (Kent)
- deadline throttling fix (Zhiguo)
- Series allowing building the kernel without buffer_head support
(Christoph)
- Sanitize the bio page adding flow (Christoph)
- Write back cache fixes (Christoph)
- MD updates via Song:
- Fix perf regression for raid0 large sequential writes (Jan)
- Fix split bio iostat for raid0 (David)
- Various raid1 fixes (Heinz, Xueshi)
- raid6test build fixes (WANG)
- Deprecate bitmap file support (Christoph)
- Fix deadlock with md sync thread (Yu)
- Refactor md io accounting (Yu)
- Various non-urgent fixes (Li, Yu, Jack)
- Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li,
Ming, Nitesh, Ruan, Tejun, Thomas, Xu)"
* tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits)
block: use strscpy() to instead of strncpy()
block: sed-opal: keyring support for SED keys
block: sed-opal: Implement IOC_OPAL_REVERT_LSP
block: sed-opal: Implement IOC_OPAL_DISCOVERY
blk-mq: prealloc tags when increase tagset nr_hw_queues
blk-mq: delete redundant tagset map update when fallback
blk-mq: fix tags leak when shrink nr_hw_queues
ublk: zoned: support REQ_OP_ZONE_RESET_ALL
md: raid0: account for split bio in iostat accounting
md/raid0: Fix performance regression for large sequential writes
md/raid0: Factor out helper for mapping and submitting a bio
md raid1: allow writebehind to work on any leg device set WriteMostly
md/raid1: hold the barrier until handle_read_error() finishes
md/raid1: free the r1bio before waiting for blocked rdev
md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
drivers/rnbd: restore sysfs interface to rnbd-client
md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
raid6: test: only check for Altivec if building on powerpc hosts
raid6: test: make sure all intermediate and artifact files are .gitignored
...
* Make large writes to the page cache fill sparse parts of the cache
with large folios, then use large memcpy calls for the large folio.
* Track the per-block dirty state of each large folio so that a
buffered write to a single byte on a large folio does not result in a
(potentially) multi-megabyte writeback IO.
* Allow some directio completions to be performed in the initiating
task's context instead of punting through a workqueue. This will
reduce latency for some io_uring requests.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZM0Z1AAKCRBKO3ySh0YR
pp7BAQCzkKejCM0185tNIH/faHjzidSisNQkJ5HoB4Opq9U66AEA6IPuAdlPlM/J
FPW1oPq33Yn7AV4wXjUNFfDLzVb/Fgg=
=dFBU
-----END PGP SIGNATURE-----
Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"We've got some big changes for this release -- I'm very happy to be
landing willy's work to enable large folios for the page cache for
general read and write IOs when the fs can make contiguous space
allocations, and Ritesh's work to track sub-folio dirty state to
eliminate the write amplification problems inherent in using large
folios.
As a bonus, io_uring can now process write completions in the caller's
context instead of bouncing through a workqueue, which should reduce
io latency dramatically. IOWs, XFS should see a nice performance bump
for both IO paths.
Summary:
- Make large writes to the page cache fill sparse parts of the cache
with large folios, then use large memcpy calls for the large folio.
- Track the per-block dirty state of each large folio so that a
buffered write to a single byte on a large folio does not result in
a (potentially) multi-megabyte writeback IO.
- Allow some directio completions to be performed in the initiating
task's context instead of punting through a workqueue. This will
reduce latency for some io_uring requests"
* tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
iomap: support IOCB_DIO_CALLER_COMP
io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
fs: add IOCB flags related to passing back dio completions
iomap: add IOMAP_DIO_INLINE_COMP
iomap: only set iocb->private for polled bio
iomap: treat a write through cache the same as FUA
iomap: use an unsigned type for IOMAP_DIO_* defines
iomap: cleanup up iomap_dio_bio_end_io()
iomap: Add per-block dirty state tracking to improve performance
iomap: Allocate ifs in ->write_begin() early
iomap: Refactor iomap_write_delalloc_punch() function out
iomap: Use iomap_punch_t typedef
iomap: Fix possible overflow condition in iomap_write_delalloc_scan
iomap: Add some uptodate state handling helpers for ifs state bitmap
iomap: Drop ifs argument from iomap_set_range_uptodate()
iomap: Rename iomap_page to iomap_folio_state and others
iomap: Copy larger chunks from userspace
iomap: Create large folios in the buffered write path
filemap: Allow __filemap_get_folio to allocate large folios
filemap: Add fgf_t typedef
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpbgAKCRCRxhvAZXjc
oi8PAQCtXelGZHmTcmevsO8p4Qz7hFpkonZ/TnxKf+RdnlNgPgD+NWi+LoRBpaAj
xk4z8SqJaTTP4WXrG5JZ6o7EQkUL8gE=
=2e9I
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock updates from Christian Brauner:
"This contains the super rework that was ready for this cycle. The
first part changes the order of how we open block devices and allocate
superblocks, contains various cleanups, simplifications, and a new
mechanism to wait on superblock state changes.
This unblocks work to ultimately limit the number of writers to a
block device. Jan has already scheduled follow-up work that will be
ready for v6.7 and allows us to restrict the number of writers to a
given block device. That series builds on this work right here.
The second part contains filesystem freezing updates.
Overview:
The generic superblock changes are rougly organized as follows
(ignoring additional minor cleanups):
(1) Removal of the bd_super member from struct block_device.
This was a very odd back pointer to struct super_block with
unclear rules. For all relevant places we have other means to get
the same information so just get rid of this.
(2) Simplify rules for superblock cleanup.
Roughly, everything that is allocated during fs_context
initialization and that's stored in fs_context->s_fs_info needs
to be cleaned up by the fs_context->free() implementation before
the superblock allocation function has been called successfully.
After sget_fc() returned fs_context->s_fs_info has been
transferred to sb->s_fs_info at which point sb->kill_sb() if
fully responsible for cleanup. Adhering to these rules means that
cleanup of sb->s_fs_info in fill_super() is to be avoided as it's
brittle and inconsistent.
Cleanup shouldn't be duplicated between sb->put_super() as
sb->put_super() is only called if sb->s_root has been set aka
when the filesystem has been successfully born (SB_BORN). That
complexity should be avoided.
This also means that block devices are to be closed in
sb->kill_sb() instead of sb->put_super(). More details in the
lower section.
(3) Make it possible to lookup or create a superblock before opening
block devices
There's a subtle dependency on (2) as some filesystems did rely
on fill_super() to be called in order to correctly clean up
sb->s_fs_info. All these filesystems have been fixed.
(4) Switch most filesystem to follow the same logic as the generic
mount code now does as outlined in (3).
(5) Use the superblock as the holder of the block device. We can now
easily go back from block device to owning superblock.
(6) Export and extend the generic fs_holder_ops and use them as
holder ops everywhere and remove the filesystem specific holder
ops.
(7) Call from the block layer up into the filesystem layer when the
block device is removed, allowing to shut down the filesystem
without risk of deadlocks.
(8) Get rid of get_super().
We can now easily go back from the block device to owning
superblock and can call up from the block layer into the
filesystem layer when the device is removed. So no need to wade
through all registered superblock to find the owning superblock
anymore"
Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/
* tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits)
super: use higher-level helper for {freeze,thaw}
super: wait until we passed kill super
super: wait for nascent superblocks
super: make locking naming consistent
super: use locking helpers
fs: simplify invalidate_inodes
fs: remove get_super
block: call into the file system for ioctl BLKFLSBUF
block: call into the file system for bdev_mark_dead
block: consolidate __invalidate_device and fsync_bdev
block: drop the "busy inodes on changed media" log message
dasd: also call __invalidate_device when setting the device offline
amiflop: don't call fsync_bdev in FDFMTBEG
floppy: call disk_force_media_change when changing the format
block: simplify the disk_force_media_change interface
nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
xfs use fs_holder_ops for the log and RT devices
xfs: drop s_umount over opening the log and RT devices
ext4: use fs_holder_ops for the log device
ext4: drop s_umount over opening the log device
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In future patches we're going to change how the ctime is updated
to keep track of when it has been queried. The way that the update_time
operation works (and a lot of its callers) make this difficult, since
they grab a timestamp early and then pass it down to eventually be
copied into the inode.
All of the existing update_time callers pass in the result of
current_time() in some fashion. Drop the "time" parameter from
generic_update_time, and rework it to fetch its own timestamp.
This change means that an update_time could fetch a different timestamp
than was seen in inode_needs_update_time. update_time is only ever
called with one of two flag combinations: Either S_ATIME is set, or
S_MTIME|S_CTIME|S_VERSION are set.
With this change we now treat the flags argument as an indicator that
some value needed to be updated when last checked, rather than an
indication to update specific timestamps.
Rework the logic for updating the timestamps and put it in a new
inode_update_timestamps helper that other update_time routines can use.
S_ATIME is as treated as we always have, but if any of the other three
are set, then we attempt to update all three.
Also, some callers of generic_update_time need to know what timestamps
were actually updated. Change it to return an S_* flag mask to indicate
that and rework the callers to expect it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-3-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.
Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Starting with patch 2cb1e08985, gfs2 started using the new function
filemap_splice_read rather than the old (and subsequently deleted)
function generic_file_splice_read.
filemap_splice_read works by taking references to a number of folios in
the page cache and splicing those folios into a pipe. The folios are
then read from the pipe and the folio references are dropped. This can
take an arbitrary amount of time. We cannot allow that in gfs2 because
those folio references will pin the inode glock to the node and prevent
it from being demoted, which can lead to cluster-wide deadlocks.
Instead, use copy_splice_read.
(In addition, the old generic_file_splice_read called into ->read_iter,
which called gfs2_file_read_iter, which took the inode glock during the
operation. The new filemap_splice_read interface does not take the
inode glock anymore. This is fixable, but it still wouldn't prevent
cluster-wide deadlocks.)
Fixes: 2cb1e08985 ("splice: Use filemap_splice_read() instead of generic_file_splice_read()")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_trans_add_meta() checks for the SDF_FROZEN flag to make
sure that no buffers are added to a transaction while the filesystem is
frozen. With the recent freeze/thaw rework, the SDF_FROZEN flag is
cleared after thaw_super() is called, which is sufficient for
serializing freeze/thaw.
However, other filesystem operations started after thaw_super() may now
be calling gfs2_trans_add_meta() before the SDF_FROZEN flag is cleared,
which will trigger the SDF_FROZEN check in gfs2_trans_add_meta(). Fix
that by checking the s_writers.frozen state instead.
In addition, make sure not to call gfs2_assert_withdraw() with the
sd_log_lock spin lock held. Check for a withdrawn filesystem before
checking for a frozen filesystem, and don't pin/add buffers to the
current transaction in case of a failure in either case.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.
For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call
into the buffer_head code when it doesn't exist.
Otherwise this is just Kconfig and ifdef changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
block_page_mkwrite_return is neither block nor mkwrite specific, and
should not be under CONFIG_BLOCK. Move it to mm.h and rename it to
vmf_fs_error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When filesystem blocksize is less than folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and when the
folio is uptodate in pagecache, then even a byte write can cause
an entire folio to be written to disk during writeback. This happens
because we currently don't have a mechanism to track per-block dirty
state within struct iomap_folio_state. We currently only track uptodate
state.
This patch implements support for tracking per-block dirty state in
iomap_folio_state->state bitmap. This should help improve the filesystem
write performance and help reduce write amplification.
Performance testing of below fio workload reveals ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
1. <test_randwrite.fio>
[global]
ioengine=psync
rw=randwrite
overwrite=1
pre_read=1
direct=0
bs=4k
size=1G
dir=./
numjobs=8
fdatasync=1
runtime=60
iodepth=64
group_reporting=1
[fio-run]
2. Also our internal performance team reported that this patch improves
their database workload performance by around ~83% (with XFS on Power)
Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Use the size of the write as a hint for the size of the folio to create.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
While these aren't generally visible from userland, it's best to be
consistent with timestamp handling. When adjusting the quota, update the
mtime and ctime like we would with a write operation on any other inode,
and avoid updating the atime which should only be done for reads.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Message-Id: <20230713135249.153796-1-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-45-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This patch fixes the current handling of F_CANCELLK by not just doing a
unlock as we need to try to cancel a lock at first. A unlock makes sense
on a non-blocking lock request but if it's a blocking lock request we
need to cancel the request until it's not granted yet. This patch is fixing
this behaviour by first try to cancel a lock request and if it's failed
it's unlocking the lock which seems to be granted.
Note: currently the nfs locking handling was disabled by commit
40595cdc93 ("nfs: block notification on fs with its own ->lock").
However DLM was never being updated regarding to this change. Future
patches will try to fix lockd lock requests for DLM. This patch is
currently assuming the upstream DLM lockd handling is correct.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Userspace can freeze a filesystem using the FIFREEZE ioctl or by
suspending the block device; this state persists until userspace thaws
the filesystem with the FITHAW ioctl or resuming the block device.
Since commit 18e9e5104f ("Introduce freeze_super and thaw_super for
the fsfreeze ioctl") we only allow the first freeze command to succeed.
The kernel may decide that it is necessary to freeze a filesystem for
its own internal purposes, such as suspends in progress, filesystem fsck
activities, or quiescing a device prior to removal. Userspace thaw
commands must never break a kernel freeze, and kernel thaw commands
shouldn't undo userspace's freeze command.
Introduce a couple of freeze holder flags and wire it into the
sb_writers state. One kernel and one userspace freeze are allowed to
coexist at the same time; the filesystem will not thaw until both are
lifted.
I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but
for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing
behaviors.
Cc: mcgrof@kernel.org
Cc: jack@suse.cz
Cc: hch@infradead.org
Cc: ruansy.fnst@fujitsu.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
- Move the freeze/thaw logic from glock callback context to process /
worker thread context to prevent deadlocks.
- Fix a quota reference couting bug in do_qc().
- Carry on deallocating inodes even when gfs2_rindex_update() fails.
- Retry filesystem-internal reads when they are interruped by a signal.
- Eliminate kmap_atomic() in favor of kmap_local_page() /
memcpy_{from,to}_page().
- Get rid of noop_direct_IO.
- And a few more minor fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmSkEWIUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqTSQ/+OKFnr1O1pVFP2k4CA5bGuJX4sfOt
TT045OAfy7VuE79q1Xi6HZ880kdP7fg4XUbcrwm45fEt5Q5kbbe6pjP34fCGGVSE
vLmb5BG/msb6fDLRN17S6uzyqKwFv9tGofTAtCtWTs7gmtezpBByHNHAu6duG89c
5LOhUbz3BhIOPrHUs6RB3h/OpmofVNwWCKQmfCEzYsHMzoWz1XuZJMWRb5Ho6BII
mOF0tWlEfr5gdavyWZ9UkuHqe3HI/1PfUNFVAD9S/V3kp2XPnc+HM3yP+S8df4p8
HT5VqVjH3JQL7sf6CnUXo9LP1veB+hHuvAaOyrHIKqHbIHwgxlWIIYcEYWKEGG8p
1CMjm6bI/hLQuFBUrpD1z3ppLavPZdl16Z3kCAVx8Ixl3livqMuiiZGBRdtBGdBr
RT66+SX1GurLp1EcWncqEXdJ5jgYKqVeArZKdh3thXSVaO+b8yq3IeuDRHseLnLA
egyzO3yLocvze3YFiZI4Y0V4ako9NO+2GNPd5+O9Bh9L7RepjiMCloDNpfDPQsTR
fJdW4t9vMts2eCzRASsXBUdUUV0k1Qvxdm+9BbDHFW9YvH7BQzEoi6qTZFsamdGT
NQqKdhrfiKsSEA8HFCgePYPzOGMAyH7fiaQUaXUGwkZy6AVmK+piKlJn+csNKbHQ
3dUo4upAzoeMwsA=
=GKxj
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Move the freeze/thaw logic from glock callback context to process /
worker thread context to prevent deadlocks
- Fix a quota reference couting bug in do_qc()
- Carry on deallocating inodes even when gfs2_rindex_update() fails
- Retry filesystem-internal reads when they are interruped by a signal
- Eliminate kmap_atomic() in favor of kmap_local_page() /
memcpy_{from,to}_page()
- Get rid of noop_direct_IO
- And a few more minor fixes and cleanups
* tag 'gfs2-v6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits)
gfs2: Add quota_change type
gfs2: Use memcpy_{from,to}_page where appropriate
gfs2: Convert remaining kmap_atomic calls to kmap_local_page
gfs2: Replace deprecated kmap_atomic with kmap_local_page
gfs: Get rid of unnucessary locking in inode_go_dump
gfs2: gfs2_freeze_lock_shared cleanup
gfs2: Replace sd_freeze_state with SDF_FROZEN flag
gfs2: Rework freeze / thaw logic
gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
gfs2: Reconfiguring frozen filesystem already rejected
gfs2: Rename gfs2_freeze_lock{ => _shared }
gfs2: Rename the {freeze,thaw}_super callbacks
gfs2: Rename remaining "transaction" glock references
gfs2: retry interrupted internal reads
gfs2: Fix possible data races in gfs2_show_options()
gfs2: Fix duplicate should_fault_in_pages() call
gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
gfs2: Don't remember delete unless it's successful
gfs2: Update rl_unlinked before releasing rgrp lock
gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold
...
Function do_qc has two main uses: (1) to re-sync the local quota changes
(qd) to the master quotas, and (2) normal quota changes. In the case of
normal quota changes, the change can be positive or negative, as the
quota usage goes up and down.
Before this patch function do_qc was distinguishing one from another by
whether the resulting value is or isn't zero: In the case of a re-sync
(called do_sync) the quota value is moved from the temporary value to a
master value, so the amount is added to one and subtracted from the
other. The problem is that since the values can be positive or negative
we can occasionally run into situations where we are not doing a re-sync
but the quota change just happens to cancel out the previous value.
In the case of a re-sync extra references and locks are taken, and so
do_qc needs to release them. In the case of a normal quota change, no
extra references and locks are taken, so it must not try to release
them.
The problem is: if the quota change is not a re-sync but the value just
happens to cancel out the original quota change, the resulting zero
value fools do_qc into thinking this is a re-sync and therefore it must
release the extra references. This results in problems, mainly having to
do with slot reference numbers going smaller than zero.
This patch introduces new constants, QC_SYNC and QC_CHANGE so do_qc can
really tell the difference. For QC_SYNC calls it must release the extra
references acquired by gfs2_quota_unlock's call to qd_check_sync. For
QC_CHANGE calls it does not have extra references to put.
Note that this allows quota changes back to a value of zero, and so I
removed an assert warning related to that.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Replace kmap_local_page() + memcpy() + kunmap_local() sequences with
memcpy_{from,to}_page() where we are not doing anything else with the
mapped page.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Replace the remaining instances of kmap_atomic() ... kunmap_atomic()
with kmap_local_page() ... kunmap_local().
In gfs2_write_buf_to_page(), we can call flush_dcache_page() after
unmapping the page.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().
Therefore, replace kmap_atomic() with kmap_local_page() in
gfs2_internal_read() and stuffed_readpage().
kmap_atomic() disables page-faults and preemption (the latter only for
!PREEMPT_RT kernels), However, the code within the mapping/un-mapping in
gfs2_internal_read() and stuffed_readpage() does not depend on the
above-mentioned side effects.
Therefore, a mere replacement of the old API with the new one is all that
is required (i.e., there is no need to explicitly add any calls to
pagefault_disable() and/or preempt_disable()).
Signed-off-by: Deepak R Varma <drv@mailo.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit 27a2660f1e ("gfs2: Dump nrpages for inodes and their glocks")
added some locking around reading inode->i_data.nrpages. That locking
doesn't do anything really, so get rid of it.
With that, the glock argument to ->go_dump() can be made const again as
well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
All the remaining users of gfs2_freeze_lock_shared() set freeze_gh to
&sdp->sd_freeze_gh and flags to 0, so remove those two parameters.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Replace sd_freeze_state with a new SDF_FROZEN flag.
There no longer is a need for indicating that a freeze is in progress
(SDF_STARTING_FREEZE); we are now protecting the critical sections with
the sd_freeze_mutex.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
So far, at mount time, gfs2 would take the freeze glock in shared mode
and then immediately drop it again, turning it into a cached glock that
can be reclaimed at any time. To freeze the filesystem cluster-wide,
the node initiating the freeze would take the freeze glock in exclusive
mode, which would cause the freeze glock's freeze_go_sync() callback to
run on each node. There, gfs2 would freeze the filesystem and schedule
gfs2_freeze_func() to run. gfs2_freeze_func() would re-acquire the
freeze glock in shared mode, thaw the filesystem, and drop the freeze
glock again. The initiating node would keep the freeze glock held in
exclusive mode. To thaw the filesystem, the initiating node would drop
the freeze glock again, which would allow gfs2_freeze_func() to resume
on all nodes, leaving the filesystem in the thawed state.
It turns out that in freeze_go_sync(), we cannot reliably and safely
freeze the filesystem. This is primarily because the final unmount of a
filesystem takes a write lock on the s_umount rw semaphore before
calling into gfs2_put_super(), and freeze_go_sync() needs to call
freeze_super() which also takes a write lock on the same semaphore,
causing a deadlock. We could work around this by trying to take an
active reference on the super block first, which would prevent unmount
from running at the same time. But that can fail, and freeze_go_sync()
isn't actually allowed to fail.
To get around this, this patch changes the freeze glock locking scheme
as follows:
At mount time, each node takes the freeze glock in shared mode. To
freeze a filesystem, the initiating node first freezes the filesystem
locally and then drops and re-acquires the freeze glock in exclusive
mode. All other nodes notice that there is contention on the freeze
glock in their go_callback callbacks, and they schedule
gfs2_freeze_func() to run. There, they freeze the filesystem locally
and drop and re-acquire the freeze glock before re-thawing the
filesystem. This is happening outside of the glock state engine, so
there, we are allowed to fail.
From a cluster point of view, taking and immediately dropping a glock is
indistinguishable from taking the glock and only dropping it upon
contention, so this new scheme is compatible with the old one.
Thanks to Li Dong <lidong@vivo.com> for reporting a locking bug in
gfs2_freeze_func() in a previous version of this commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
M3VxWD3GZQ==
=KhW4
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
We may someday support folios larger than 4GB, so use a size_t for the
byte count within a folio to prevent unpleasant truncations.
Link: https://lkml.kernel.org/r/20230612210141.730128-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove nine hidden calls to compound_head() by using a folio instead of a
page.
Link: https://lkml.kernel.org/r/20230612210141.730128-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add support for large folios and remove some accesses to page->mapping and
page->index.
Link: https://lkml.kernel.org/r/20230612210141.730128-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove a couple of folio->page conversions in the callers, and two calls
to compound_head() in the function itself. Rename it from
__gfs2_jdata_writepage() to __gfs2_jdata_write_folio().
Link: https://lkml.kernel.org/r/20230612210141.730128-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "gfs2/buffer folio changes for 6.5", v3.
This kind of started off as a gfs2 patch series, then became entwined with
buffer heads once I realised that gfs2 was the only remaining caller of
__block_write_full_page(). For those not in the gfs2 world, the big point
of this series is that block_write_full_page() should now handle large
folios correctly.
This patch (of 14):
Replace a few implicit calls to compound_head() with one explicit one.
Link: https://lkml.kernel.org/r/20230612210141.730128-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230612210141.730128-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rename the SDF_FS_FROZEN flag to SDF_FREEZE_INITIATOR to indicate more
clearly that the node that has this flag set is the initiator of the
freeze.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com
Reconfiguring a frozen filesystem is already rejected in
reconfigure_super(), so there is no need to check for that condition
again at the filesystem level.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename gfs2_freeze_lock to gfs2_freeze_lock_shared to make it a bit more
obvious that this function establishes the "thawed" state of the freeze
glock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename gfs2_freeze to gfs2_freeze_super and gfs2_unfreeze to
gfs2_thaw_super to match the names of the corresponding super
operations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The transaction glock was repurposed to serve as the new freeze glock
years ago. Don't refer to it as the transaction glock anymore.
Also, to be more precise, call it the "freeze glock" instead of the
"freeze lock". Ditto for the journal glock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The iomap-based read operations done by gfs2 for its system files, such
as rindex, may sometimes be interrupted and return -EINTR. This
confuses some users of gfs2_internal_read(). Fix that by retrying
interrupted reads.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Some fields such as gt_logd_secs of the struct gfs2_tune are accessed
without holding the lock gt_spin in gfs2_show_options():
val = sdp->sd_tune.gt_logd_secs;
if (val != 30)
seq_printf(s, ",commit=%d", val);
And thus can cause data races when gfs2_show_options() and other functions
such as gfs2_reconfigure() are concurrently executed:
spin_lock(>->gt_spin);
gt->gt_logd_secs = newargs->ar_commit;
To fix these possible data races, the lock sdp->sd_tune.gt_spin is
acquired before accessing the fields of gfs2_tune and released after these
accesses.
Further changes by Andreas:
- Don't hold the spin lock over the seq_printf operations.
Reported-by: BassCheck <bass@buaa.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_file_buffered_write(), we currently jump from the second call of
function should_fault_in_pages() to above the first call, so
should_fault_in_pages() is getting called twice in a row, causing it to
accidentally fall back to single-page writes rather than trying the more
efficient multi-page writes first.
Fix that by moving the retry label to the correct place, behind the
first call to should_fault_in_pages().
Fixes: e1fa9ea85c ("gfs2: Stop using glock holder auto-demotion for now")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag"), file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Remove .direct_IO from gfs2_aops and set FMODE_CAN_ODIRECT in
gfs2_open_common for regular files that do not use data journalling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
All callers of iomap_file_buffered_write need to updated ki_pos, move it
into common code.
Link: https://lkml.kernel.org/r/20230601145904.1385409-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "cleanup the filemap / direct I/O interaction", v4.
This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O. This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.
This patch (of 12):
The last user of current->backing_dev_info disappeared in commit
b9b1335e64 ("remove bdi_congested() and wb_congested() and related
functions"). Remove the field and all assignments to it.
Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch changes function evict_unlinked_inode so it does not call
gfs2_inode_remember_delete until it gets a good return code from
gfs2_dinode_dealloc.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_free_di was changing the rgrp lvb count of unlinked
dinodes after the lock was released. This patch moves it inside the
lock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch fixes a case in which function gfs2_quota_hold encounters an
assert error and exits. The lack of gfs2_qa_put causes further problems
when the inode is evicted and the get/put count is non-zero.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_dinode_dealloc would abort if it got a
bad return code from gfs2_rindex_update(). The problem is that it left the
dinode in the unlinked (not free) state, which meant subsequent fsck
would clean it up and flag an error. That meant some of our QE tests
would fail.
The sole purpose of gfs2_rindex_update(), in this code path, is to read in
any newer rgrps added by gfs2_grow. But since this is a delete operation
it won't actually use any of those new rgrps. It can really only twiddle
the bits from "Unlinked" to "Free" in an existing rgrp. Therefore the
error should not prevent the transition from unlinked to free.
This patch makes gfs2_dinode_dealloc ignore the bad return code and
proceed with freeing the dinode so the QE tests will not be tripped up.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch introduces a new out_free label and consolidates the three
places function gdlm_put_lock freed the glock. No change in
functionality.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a direct I/O write is performed, iomap_dio_rw() invalidates the
part of the page cache which the write is going to before carrying out
the write. In the odd case, the direct I/O write will be reading from
the same page it is writing to. gfs2 carries out writes with page
faults disabled, so it should have been obvious that this page
invalidation can cause iomap_dio_rw() to never make any progress.
Currently, gfs2 will end up in an endless retry loop in
gfs2_file_direct_write() instead, though.
Break this endless loop by limiting the number of retries and falling
back to buffered I/O after that.
Also simplify should_fault_in_pages() sightly and add a comment to make
the above case easier to understand.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The GFS2 superblock reading code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/087c67d4e4973f949d3519c1e4822784ce583c5a.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:
init_journal()
...
fail_jindex:
gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
if (gfs2_holder_initialized(&ji_gh))
gfs2_glock_dq_uninit(&ji_gh);
fail:
iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
evict()
gfs2_evict_inode()
evict_linked_inode()
ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
<------references the now freed/zeroed sd_jdesc pointer.
The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.
This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.
Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Fix revoke processing at unmount and on read-only remount.
- Refuse reading in inodes with an impossible indirect block height.
- Various minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmRH00EUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpfSw//a0IG/6tVfPwlCPd1qPKN/JbxHQVa
3JtRJXr5+RJ3vIKmJb+AFfEiPHRKgCk28d9hMbW8j3ieKSy/dPJe8ARIZmu8phpK
dOSiue87v4Pz6Tz2qpb6KgzUENohsqjTy0mEAEWxxZb7mWxUqj7t057hiEp/9NX6
jXcfoNInMxnf2HX3d11LoHXYOX37xvEjyhbZ4OmPn+xzmYQcfhuO96CTrIOe7lKe
U8dbFzDwtoUxYljb16do0OLva/YxaNdH33yCzrixVi9ki1aCIH6A2paznlUCzgF1
wzOhzV3kbz5e4flov34X1Jnd5RVkBmzMMM5bkcJzNg6JOSGL0bLg0PGYiXASAfje
KmsNVvq0yMWLLQIr2gMcEVI1IVDxkNKJruizbFfUs/SWHI55eDUZtI5jjA0v0cCw
5HRWialgCNdC9Wd/4HW+DfpztwxR5+j52TCtqgyRGQcGvkNtwGMeGgh9/sreuHhr
7Lod7LpBVmEzlXD97VIcJUcU2vhXtGA25pKXFHCilXh/MdA2UpmBWkVUkNVS/awr
ZTzoi/WkUIWLLVHxHruLOoJBWit8Mhnuxh1rlY0WD3wDEbnwCZKX8ziTYIsVvkWc
d/xjjafrolC6g1m/RoW3rXO/MLXaE0rRy4nb3MLhwsqaAQyZE6puwo40Kj/xNQqf
68kLauI2QFNpFm8=
=l1E9
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v6.3-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix revoke processing at unmount and on read-only remount
- Refuse reading in inodes with an impossible indirect block height
- Various minor cleanups
* tag 'gfs2-v6.3-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: gfs2_ail_empty_gl no log flush on error
gfs2: Issue message when revokes cannot be written
gfs2: Perform second log flush in gfs2_make_fs_ro
gfs2: return errors from gfs2_ail_empty_gl
gfs2: Move variable assignment behind a null pointer check in inode_go_dump
gfs2: Use gfs2_holder_initialized for jindex
gfs2: Eliminate gfs2_trim_blocks
gfs2: Fix inode height consistency check
gfs2: Remove ghs[] from gfs2_unlink
gfs2: Remove ghs[] from gfs2_link
gfs2: Remove duplicate i_nlink check from gfs2_link()
Before this patch, function gfs2_ail_empty_gl called gfs2_log_flush even
in cases where it encountered an error. It should probably skip the log
flush and leave the file system in an inconsistent state, letting a
subsequent withdraw force the journal to be replayed to reestablish
metadata consistency.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_ail_empty_gl would silently return an
error to the caller. This would get silently set into sd_log_error which
would cause a withdraw, but there was no indication why the file system
was withdrawn. This patch adds a fs_err to log the appropriate error
message.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_make_fs_ro called gfs2_log_flush once to
finalize the log. However, if there's dirty metadata, log flushes tend
to sync the metadata and formulate revokes. Before this patch, those
revokes may not be written out to the journal immediately, which meant
unresolved glocks could still have revokes in their ail lists. When the
glock worker runs, it tries to transition the glock, but the unresolved
revokes in the ail still need to be written, so it tries to start a
transaction. It's impossible to start a transaction because at that
point, the SDF_JOURNAL_LIVE flag has been cleared by gfs2_make_fs_ro.
That causes the glock worker to fail, unable to write the revokes. The
calling sequence looked something like this:
gfs2_make_fs_ro
gfs2_log_flush - with GFS2_LOG_HEAD_FLUSH_SHUTDOWN flag set
if (flags & GFS2_LOG_HEAD_FLUSH_SHUTDOWN)
clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
...meanwhile...
glock_work_func
do_xmote
rgrp_go_sync (or possibly inode_go_sync)
...
gfs2_ail_empty_gl
__gfs2_trans_begin
if (unlikely(!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))) {
...
return -EROFS;
The previous patch in the series ("gfs2: return errors from
gfs2_ail_empty_gl") now causes the transaction error to no longer be
ignored, so it causes a warning from MOST of the xfstests:
WARNING: CPU: 11 PID: X at fs/gfs2/super.c:603 gfs2_put_super [gfs2]
which corresponds to:
WARN_ON(gfs2_withdrawing(sdp));
The withdraw was triggered silently from do_xmote by:
if (unlikely(sdp->sd_log_error && !gfs2_withdrawn(sdp)))
gfs2_withdraw_delayed(sdp);
This patch adds a second log_flush to gfs2_make_fs_ro: one to sync the
data and one to sync any outstanding revokes and finalize the journal.
Note that both of these log flushes need to be "special," in other
words, not GFS2_LOG_HEAD_FLUSH_NORMAL.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_ail_empty_gl did not return errors it
encountered from __gfs2_trans_begin. Those errors usually came from the
fact that the file system was made read-only, often due to unmount
(but theoretically could be due to -o remount,ro), which prevented
the transaction from starting.
The inability to start a transaction prevented its revokes from being
properly written to the journal for glocks during unmount (and
transition to ro).
That meant glocks could be unlocked without the metadata properly
revoked in the journal. So other nodes could grab the glock thinking
that their lvb values were correct, but in fact corresponded to the
glock without its revokes properly synced. That presented as lvb
mismatch errors.
This patch allows gfs2_ail_empty_gl to return the error properly to
the caller.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
=pER1
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull acl updates from Christian Brauner:
"After finishing the introduction of the new posix acl api last cycle
the generic POSIX ACL xattr handlers are still around in the
filesystems xattr handlers for two reasons:
(1) Because a few filesystems rely on the ->list() method of the
generic POSIX ACL xattr handlers in their ->listxattr() inode
operation.
(2) POSIX ACLs are only available if IOP_XATTR is raised. The
IOP_XATTR flag is raised in inode_init_always() based on whether
the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
handlers of the filesystem are used to raise IOP_XATTR. Removing
the generic POSIX ACL xattr handlers from all filesystems would
risk regressing filesystems that only implement POSIX ACL support
and no other xattrs (nfs3 comes to mind).
This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
as they don't depend on xattr handlers anymore. So it's now possible
to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
list of all filesystems. This is a crucial step as the generic POSIX
ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
don't depend on the xattr infrastructure anymore.
Adressing problem (1) will require more long-term work. It would be
best to get rid of the ->list() method of xattr handlers completely at
some point.
For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
ACL xattr handler is kept around so they can continue to use
array-based xattr handler indexing.
This update does simplify the ->listxattr() implementation of all
these filesystems however"
* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
acl: don't depend on IOP_XATTR
ovl: check for ->listxattr() support
reiserfs: rework priv inode handling
fs: rename generic posix acl handlers
reiserfs: rework ->listxattr() implementation
fs: simplify ->listxattr() implementation
fs: drop unused posix acl handlers
xattr: remove unused argument
xattr: add listxattr helper
xattr: simplify listxattr helpers
Since commit 27a2660f1e ("gfs2: Dump nrpages for inodes and their
glocks"), inode_go_dump() computes the address of inode within ip before
checking if ip is NULL. This isn't a bug by itself, but it can give
rise to bugs later. Avoid that by checking if ip is NULL first.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch function init_journal() used a local variable jindex to
keep track of whether it needed to dequeue the jindex holder when errors
were found. It also uselessly set the variable just before returning from
the function. This patch simplifies the code by eliminatinng the local
variable in favor of using function gfs2_holder_initialized.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_trim_blocks is not referenced. Eliminate it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The maximum allowed height of an inode's metadata tree depends on the
filesystem block size; it is lower for bigger-block filesystems. When
reading in an inode, make sure that the height doesn't exceed the
maximum allowed height.
Arrays like sd_heightsize are sized to be big enough for any filesystem
block size; they will often be slightly bigger than what's needed for a
specific filesystem.
Reported-by: syzbot+45d4691b1ed3c48eba05@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>