-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
=SVsU
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
bch2_write_super() was looping over online devices multiple times -
dropping and retaking io_ref each time.
This meant it could race with device removal; it could increment the
sequence number on a device but fail to write it - and then if the
device was re-added, it would get confused the next time around thinking
a superblock write was silently dropped.
Fix this by taking io_ref once, and stashing pointers to online devices
in a darray.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_fs_quota_read_inode() wasn't entirely updated to the
bch2_snapshot_tree() helper, which takes rcu lock.
Reported-by: syzbot+a3a9a61224ed3b7f0010@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ancient versions of bcachefs produced packed formats that could
represent keys that our in memory format cannot represent;
bformat_needs_redo() has some tricky shifts to check for this sort of
overflow.
Reported-by: syzbot+594427aebfefeebe91c6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For forwards compatibility we have to allow unknown key types, and only
run the checks that make sense against them.
Fix a missing guard on k.k->type being known.
Reported-by: syzbot+ae4dc916da3ce51f284f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were forgetting to check for jset entries that overrun the end of the
section - both in validate and to_text(); to_text() needs to be safe for
types that fail to validate.
Reported-by: syzbot+c48865e11e7e893ec4ab@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
filefrag (and potentially other utilities that call fiemap) sometimes
pass ULONG_MAX as the length. fiemap_prep clamps excessively large
lengths - but the calculation of end can overflow if it occurs before
calling fiemap_prep. When this happens, filefrag assumes it has read to
the end and exits.
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The bucket_gens array is a single array allocation (one byte per
bucket), and kernel allocations are still limited to INT_MAX.
Check this limit to avoid failing the bucket_gens array allocation.
Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_get_next_dev() and bch2_get_next_online_dev() iterate over devices,
dropping and taking refs as they go; we can't access the previous device
(for ca->dev_idx) after we've dropped our ref to it, unless we take
rcu_read_lock() first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_dev_lookup() is supposed to take a ref on the device it returns, but
for_each_member_device() takes refs as it iterates,
for_each_member_device_rcu() does not.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Normally this is initialized in __bch2_write(), which is executed in a
loop, but the inline data path skips this.
Reported-by: syzbot+fd3ccb331eb21f05d13b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using mutex_lock() inside a wait_event() conditional -
prepare_to_wait() has already flipped task state, so potentially
blocking ops need annotation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- fix a few more deadlocks in recovery
- fix u32/u64 issues in mi_btree_bitmap
- btree key cache shrinker now actually frees, with more instrumentation
coming so we can verify that it's working correctly more easily in the
future
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYmveYACgkQE6szbY3K
bnY+kA//dtdKfPliuQlfYhUvIZSvgEy+KxjDaDmeJFVMKFYHiip/JJt7V/YU7jWW
DmWGtPqo1hoGZnic7h9fstbBCRgUdIYGBInqAWPzL3wmFYe5FPE02KN5fuZ+A+Mp
sn0QFML4oA0uxD7TRXfCeNx3NRwXonztletVskXCuLa0T8iTOTdOuAH0MCGow3OC
mlfRZyK05f6Q0UOIzvntfBl8Tkr+yk5g0GFz54U7s+qs/zJsYiYfruXuq6AxLTy2
xVMDj3Hc6W0vggVpv68HInluubl/b7rVSy+w59GG0D3iQ9/fBdqitFLdw49ReXzi
J//ctZLb3n+IM4lA7t5ev0lY7bvI2FwFNkrL4qW41E4un5eQ3ghbyOmMoz87svyg
4JW/CPGP7uKNVmfRuHn1qhgJ9/vIXkObJVl9GKZF2BylaZwjMM5YrL5MZwrKFQYy
9BMgemgvHFK+wRi74q/OUu3PyH045AoJdIKI66ypFexhGi5YeNqtRHLUKdfSrJR+
eEkAUaHcgLLfxyk+fIRvcSK+Q9j3BibvsSkU3vLSnl2B+xdvfJqb+I/yvBt/SZQW
P09JceDQABRBMu9beVbVqMED+PniSZVfG2eU2jBcZ+jhbGQgmiWfMNJVKS4/0uwz
PRS33P2mViVZ6PJokWyecbgGtVxKrK2ruPdcu6/W05Qi0Vv58+k=
=lCNd
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-04-22' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Nothing too crazy in this one, and it looks like (fingers crossed) the
recovery and repair issues are settling down - although there's going
to be a long tail there, as we've still yet to really ramp up on error
injection or syzbot.
- fix a few more deadlocks in recovery
- fix u32/u64 issues in mi_btree_bitmap
- btree key cache shrinker now actually frees, with more
instrumentation coming so we can verify that it's working
correctly more easily in the future"
* tag 'bcachefs-2024-04-22' of https://evilpiepirate.org/git/bcachefs:
bcachefs: If we run merges at a lower watermark, they must be nonblocking
bcachefs: Fix inode early destruction path
bcachefs: Fix deadlock in journal write path
bcachefs: Tweak btree key cache shrinker so it actually frees
bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong
bcachefs: Fix missing call to bch2_fs_allocator_background_exit()
bcachefs: Check for journal entries overruning end of sb clean section
bcachefs: Fix bio alloc in check_extent_checksum()
bcachefs: fix leak in bch2_gc_write_reflink_key
bcachefs: KEY_TYPE_error is allowed for reflink
bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shift
bcachefs: make sure to release last journal pin in replay
bcachefs: node scan: ignore multiple nodes with same seq if interior
bcachefs: Fix format specifier in validate_bset_keys()
bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINE
Fix another deadlock related to the merge path; previously, we switched
to always running merges at a lower watermark (because they are
noncritical); but when we run at a lower watermark we also need to run
nonblocking or we've introduced a new deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-and-tested-by: s@m-h.ug
discard_new_inode() is the wrong interface to use when we need to free
an inode that was never inserted into the inode hash table; we can
bypass the whole iput() -> evict() path and replace it with
__destroy_inode(); kmem_cache_free() - this fixes a WARN_ON() about
I_NEW.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_journal_write() was incorrectly waiting on earlier journal writes
synchronously; this usually worked because most of the time we'd be
running in the context of a thread that did a journal_buf_put(), but
sometimes we'd be running out of the same workqueue that completes those
prior journal writes.
Additionally, this makes sure to punt to a workqueue before submitting
preflushes - we really don't want to be calling submit_bio() in the main
transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Freeing key cache items is a multi stage process; we need to wait for an
SRCU grace period to elapse, and we handle this ourselves - partially to
avoid callback overhead, but primarily so that when allocating we can
first allocate from the freed items waiting for an SRCU grace period.
Previously, the shrinker was counting the items on the 'waiting for SRCU
grace period' lists as items being scanned, but this meant that too many
items waiting for an SRCU grace period could prevent it from doing any
work at all.
After this, we're seeing that items skipped due to the accessed bit are
the main cause of the shrinker not making any progress, and we actually
want the key cache shrinker to run quite aggressively because reclaimed
items will still generally be found (more compactly) in the btree node
cache - so we also tweak the shrinker to not count those against
nr_to_scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this stores the SRCU sequence number, which we use to check if an SRCU
barrier has elapsed; this is a partial fix for the key cache shrinker
not actually freeing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix a missing bounds check in superblock validation.
Note that we don't yet have repair code for this case - repair code for
individual items is generally low priority, since the whole superblock
is checksummed, validated prior to write, and we have backups.
Reported-by: lei lu <llfamsec@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
KEY_TYPE_error is left behind when we have to delete all pointers in an
extent in fsck; it allows errors to be correctly returned by reads
later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock when journal replay has many keys to insert that
were from fsck, not the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Interior nodes are not really needed, when we have to scan - but if this
pops up for leaf nodes we'll need a real heuristic.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building for 32-bit platforms, for which size_t is 'unsigned int',
there is a warning from a format string in validate_bset_keys():
fs/bcachefs/btree_io.c: In function 'validate_bset_keys':
fs/bcachefs/btree_io.c:891:34: error: format '%lu' expects argument of type 'long unsigned int', but argument 12 has type 'unsigned int' [-Werror=format=]
891 | "bad k->u64s %u (min %u max %lu)", k->u64s,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/bcachefs/btree_io.c:603:32: note: in definition of macro 'btree_err'
603 | msg, ##__VA_ARGS__); \
| ^~~
fs/bcachefs/btree_io.c:887:21: note: in expansion of macro 'btree_err_on'
887 | if (btree_err_on(!bkeyp_u64s_valid(&b->format, k),
| ^~~~~~~~~~~~
fs/bcachefs/btree_io.c:891:64: note: format string is defined here
891 | "bad k->u64s %u (min %u max %lu)", k->u64s,
| ~~^
| |
| long unsigned int
| %u
cc1: all warnings being treated as errors
BKEY_U64s is size_t so the entire expression is promoted to size_t. Use
the '%zu' specifier so that there is no warning regardless of the width
of size_t.
Fixes: 031ad9e7db ("bcachefs: Check for packed bkeys that are too big")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404130747.wH6Dd23p-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202404131536.HdAMBOVc-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
various recovery fixes:
- fixes for the btree_insert_entry being resized on path allocation
btree_path array recently became dynamically resizable, and
btree_insert_entry along with it; this was being observed during
journal replay, when write buffer btree updates don't use the write
buffer and instead use the normal btree update path
- multiple fixes for deadlock in recovery when we need to do lots of
btree node merges; excessive merges were clocking up the whole
pipeline
- write buffer path now correctly does btree node merges when needed
- fix failure to go RW when superblock indicates recovery passes needed
(i.e. to complete an unfinished upgrade)
various unsafety fixes - test case contributed by a user who had two
drives out of a six drive array write out a whole bunch of garbage after
power failure
new (tiny) on disk format feature: since it appears the btree node scan
tool will be a more regular thing (crappy hardware, user error) - this
adds a 64 bit per-device bitmap of regions that have ever had btree
nodes.
a path->should_be_locked fix, from a larger patch series tightening up
invariants and assertions around btree transaction and path locking
state; this particular fix prevents us from keeping around btree_paths
that are no longer needed.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYdaRIACgkQE6szbY3K
bnbqcA/9ETT0Jekf/V4klQmoWj9GX5nQstUz+ENABNNPL+5hld62EojiRvOW2qwU
zVs7O0M59B8/+4v4KJoW+RqnLFjAF4z/Gf+/Uw9WarsHAKIxxFFFARxG93JpGqOn
nGa8RSw0BaYQIdbMR0Bdacc2f0N+JkJQx956/+JV7EG5MAJqXgz00AvIuLqMZ+2t
0m9av3n0tVmstyvvGqk8pouvQjK0XUvIDYN3oiUDl7WXOAIKXDlp6yviiGnTbusq
DssmIt5fdeVBq/DAk5PMNEKM9NUP+weIZW1UWPWINaicarqyV+pn2fhvLrBxVl7q
zBSN3v28viaABKC8A15b2bqj3IT2WIBDoBCEi406akMao9eiVsE6is13rFkPQwQI
Obhc7NNDyOPPTvX25M3tKXpr8rSGoD2qHIMMKMIBe1ZWscj6lMbmUBErwzTOAW4+
pNTvzWT2XwcS7tE8Fx50ZxcehTQl6ir0hQvjJL5JV2po8XMbdGxcImBe6xPmAa3n
/XIzyglL8IvW494wjCsHxtTeOt+f8nW7BXJCrWB71UQeXIXq4b9FADOwWtlGTnxJ
6XNprfi8TSp+RsSRxav6DBw2ou5viGjAjP2ddrO6Lw37XUYV0igS+BeDNEPA4dwI
ZlbCzNE7qSXK2rjmGjyu7GCJ3+NOxJDQ8GdxkTDtpPrBF2kCOkQ=
=NAId
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-04-15' of https://evilpiepirate.org/git/bcachefs
Pull yet more bcachefs fixes from Kent Overstreet:
"This gets recovery working again for the affected user I've been
working with, and I'm still waiting to hear back on other bug reports
but should fix it for everyone else who's been having issues with
recovery.
- Various recovery fixes:
- fixes for the btree_insert_entry being resized on path
allocation btree_path array recently became dynamically
resizable, and btree_insert_entry along with it; this was being
observed during journal replay, when write buffer btree updates
don't use the write buffer and instead use the normal btree
update path
- multiple fixes for deadlock in recovery when we need to do lots
of btree node merges; excessive merges were clocking up the
whole pipeline
- write buffer path now correctly does btree node merges when
needed
- fix failure to go RW when superblock indicates recovery passes
needed (i.e. to complete an unfinished upgrade)
- Various unsafety fixes - test case contributed by a user who had
two drives out of a six drive array write out a whole bunch of
garbage after power failure
- New (tiny) on disk format feature: since it appears the btree node
scan tool will be a more regular thing (crappy hardware, user
error) - this adds a 64 bit per-device bitmap of regions that have
ever had btree nodes.
- A path->should_be_locked fix, from a larger patch series tightening
up invariants and assertions around btree transaction and path
locking state.
This particular fix prevents us from keeping around btree_paths
that are no longer needed"
* tag 'bcachefs-2024-04-15' of https://evilpiepirate.org/git/bcachefs: (24 commits)
bcachefs: set_btree_iter_dontneed also clears should_be_locked
bcachefs: fix error path of __bch2_read_super()
bcachefs: Check for backpointer bucket_offset >= bucket size
bcachefs: bch_member.btree_allocated_bitmap
bcachefs: sysfs internal/trigger_journal_flush
bcachefs: Fix bch2_btree_node_fill() for !path
bcachefs: add safety checks in bch2_btree_node_fill()
bcachefs: Interior known are required to have known key types
bcachefs: add missing bounds check in __bch2_bkey_val_invalid()
bcachefs: Fix btree node merging on write buffer btrees
bcachefs: Disable merges from interior update path
bcachefs: Run merges at BCH_WATERMARK_btree
bcachefs: Fix missing write refs in fs fio paths
bcachefs: Fix deadlock in journal replay
bcachefs: Go rw if running any explicit recovery passes
bcachefs: Standardize helpers for printing enum strs with bounds checks
bcachefs: don't queue btree nodes for rewrites during scan
bcachefs: fix race in bch2_btree_node_evict()
bcachefs: fix unsafety in bch2_stripe_to_text()
bcachefs: fix unsafety in bch2_extent_ptr_to_text()
...
This is part of a larger series cleaning up the semantics of
should_be_locked and adding assertions around it; if we don't need an
iterator/path anymore, it clearly doesn't need to be locked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In __bch2_read_super(), if kstrdup() fails, it needs to release memory
in sb->holder, fix to call bch2_free_super() in the error path.
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a small (64 bit) per-device bitmap that tracks ranges that
have btree nodes, for accelerating btree node scan if it is ever needed.
- New helpers, bch2_dev_btree_bitmap_marked() and
bch2_dev_bitmap_mark(), for checking and updating the bitmap
- Interior btree update path updates the bitmaps when required
- The check_allocations pass has a new fsck_err check,
btree_bitmap_not_marked
- New on disk format version, mi_btree_mitmap, which indicates the new
bitmap is present
- Upgrade table lists the required recovery pass and expected fsck error
- Btree node scan uses the bitmap to skip ranges if we're on the new
version
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouldn't be doing the unlock/relock dance when we're not using a
path - this fixes an assertion pop when called from btree node scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For forwards compatibilyt, we allow bkeys of unknown type in leaf nodes;
we can simply ignore metadata we don't understand. Pointers to btree
nodes must always be of known types, howwever.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>