Commit Graph

4041 Commits

Author SHA1 Message Date
Kent Overstreet
9ab55df599 bcachefs: Walk leaf to root in btree_gc
Next change will move gc_alloc_start initialization into the alloc
trigger, so we have to mark those first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
86d46471d5 bcachefs: Don't block journal when finishing check_allocations()
Blocking the journal was needed to finish checking old style accounting,
but that code is gone and it's not needed in the alloc rewrite,
mark_lock is sufficient for synchronization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
5645c32ccf bcachefs: bch2_fs_get_tree() cleanup
- improve error paths
- call bch2_fs_start() separately, after applying late-parsed options

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
25ee25e637 bcachefs: Kill bch2_mount()
Fold into bch2_fs_get_tree()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
b9efa9673e bcachefs: Eytzinger accumulation for accounting keys
The btree write buffer takes as input keys from the journal, sorts them,
deduplicates them, and flushes them back to the btree in sorted order.

The disk space accounting rewrite is moving accounting to normal btree
keys, with update (in this case deltas) accumulated in the write buffer
and then flushed to the btree; but this is going to increase the number
of keys handled by the write buffer by perhaps as much as a factor of
3x-5x.

The overhead from copying around and sorting this many keys would cause
a significant performance regression, but: there is huge locality in
updates to accounting keys that we can take advantage of.

Instead of appending accounting keys to the list of keys to be sorted,
this patch adds an eytzinger search tree of recently seen accounting
keys. We look up the accounting key in the eytzinger search tree and
apply the delta directly, adding it if it doesn't exist, and
periodically prune the eytzinger tree of unused entries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
20ac515a9c bcachefs: bch_acct_rebalance_work
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
6af91147b6 bcachefs: bch_acct_btree
Add counters for how much disk space we're using per btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
6675c37662 bcachefs: bch_acct_snapshot
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
72c2778780 bcachefs: bch2_fs_usage_base_to_text()
Helper to show raw accounting in sysfs, mainly for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
f93bb76ba2 bcachefs: bch2_fs_accounting_to_text()
Helper to show raw accounting in sysfs, mainly for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
91f44781d5 bcachefs: Convert bch2_compression_stats_to_text() to new accounting
We no longer have to walk the whole btree to calculate compression
stats.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
bfcaa9079d bcachefs: bch_acct_compression
This adds per-compression-type accounting of compressed and uncompressed
size as well as number of extents - meaning we can now see compression
ratio (without walking the whole filesystem).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5668e5deec bcachefs: bch2_verify_accounting_clean()
Verify that the in-memory accounting verifies the on-disk accounting
after a clean shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
00839addfc bcachefs: Convert bch2_replicas_gc2() to new accounting
bch2_replicas_gc2() is used for garbage collection superblock replicas
entries that are empty - this converts it to the new accounting scheme.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
fb23d57a6d bcachefs: Convert gc to new accounting
Rewrite fsck/gc for the new accounting scheme.

This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all trigger in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
4c4a7d48bd bcachefs: Kill replicas_journal_res
More dead code deletion

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
66a57684c6 bcachefs: Kill fs_usage_online
More dead code deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
fe5eddc0d0 bcachefs: Kill bch2_fs_usage_to_text()
Dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
8bb8d683a4 bcachefs: Delete journal-buf-sharded old style accounting
More deletion of dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5b9bc272e6 bcachefs: Kill writing old accounting to journal
More ripping out of the old disk space accounting.

Note that the new disk space accounting is incompatible with the old,
and writing out old style disk space accounting with the new code is
infeasible.

This means upgrading and downgrading past this version requires
regenerating accounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
3afb8dbf03 bcachefs: kill bch2_fs_usage_read()
With bch2_ioctl_fs_usage(), this is now dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
6b39638b84 bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
This converts bch2_ioctl_fs_usage() to read from the new disk
accounting, via bch2_fs_replicas_usage_read().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
72a6bb098c bcachefs: Kill bch2_fs_usage_initialize()
Deleting code for the old disk accounting scheme.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
f5095b9f85 bcachefs: dev_usage updated by new accounting
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.

This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a deice
counter update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2e8d686a4a bcachefs: Coalesce accounting keys before journal replay
This fixes a performance regression in journal replay; without
colaescing accounting keys we have multiple keys at the same position,
which means journal_keys_peek_upto() has to skip past many overwritten
keys - turning journal replay into an O(n^2) algorithm.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
1d16c605cc bcachefs: Disk space accounting rewrite
Main part of the disk accounting rewrite.

This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.

With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.

Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.

An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5d9667d1d6 bcachefs: btree write buffer knows how to accumulate bch_accounting keys
Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.

Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
9dec2a473b bcachefs: Accumulate accounting keys in journal replay
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.

Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.

That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.

There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.

To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2744e5c9eb bcachefs: KEY_TYPE_accounting
New key type for the disk space accounting rewrite.

 - Holds a variable sized array of u64s (may be more than one for
   accounting e.g. compressed and uncompressed size, or buckets and
   sectors for a given data type)

 - Updates are deltas, not new versions of the key: this means updates
   to accounting can happen via the btree write buffer, which we'll be
   teaching to accumulate deltas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Thomas Bertschinger
929d954330 bcachefs: use new mount API
This updates bcachefs to use the new mount API:

- Update the file_system_type to use the new init_fs_context()
  function.

- Define the new fs_context_operations functions.

- No longer register bch2_mount() and bch2_remount(); these are now
  called via the new fs_context functions.

- Define a new helper type, bch2_opts_parse that includes a struct
  bch_opts and additionally a printbuf used to save options that can't
  be parsed until after the FS is opened. This enables us to parse as
  many options as possible prior to opening the filesystem while saving
  those options that need the open FS for later parsing.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
1c12d1caf8 bcachefs: Add error code to defer option parsing
This introduces a new error code, option_needs_open_fs, which is used to
indicate that an attempt was made to parse a mount option prior to
opening a filesystem, when that mount option requires an open filesystem
in order to be validated.

Returning this error results in bch2_parse_one_mount_opt() saving that
option for later parsing, after the filesystem is opened.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
9b7f0b5d3d bcachefs: add printbuf arg to bch2_parse_mount_opts()
Mount options that take the name of a device that may be part of a
filesystem, for example "metadata_target", cannot be validated until
after the filesystem has been opened. However, an attempt to parse those
options may be made prior to the filesystem being opened.

This change adds a printbuf parameter to bch2_parse_mount_opts() which
will be used to save those mount options, when they are supplied prior
to the FS being opened, so that they can be parsed later.

This functionality is not currently needed, but will be used after
bcachefs starts using the new mount API to parse mount options. This is
because using the new mount API, we will process mount options prior to
opening the FS, but the new API doesn't provide a convenient way to
"replay" mount option parsing. So we save these options ourselves to
accomplish this.

This change also splits out the code to parse a single option into
bch2_parse_one_mount_opt(), which will be useful when using the new
mount API which deals with a single mount option at a time.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
7773df19c3 bcachefs: metadata version bucket_stripe_sectors
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.

Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
2612e29142 bcachefs: BCH_DATA_unstriped
Add a new pseudo data type, to track buckets that are members of a
stripe, but have unstriped data in them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
55f7962da3 bcachefs: bch_alloc->stripe_sectors
Add a separate counter to bch_alloc_v4 for amount of striped data; this
lets us separately track striped and unstriped data in a bucket, which
lets us see when erasure coding has failed to update extents with stripe
pointers, and also find buckets to continue updating if we crash mid way
through creating a new stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
c13d526d9d bcachefs: check_key_has_inode()
Consolidate duplicated checks for extents/dirents/xattrs - these keys
should all have a corresponding inode of the correct type.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
51fc436c80 bcachefs: allow passing full device path for target options
The output of mount options such as "metadata_target" in `/proc/mounts`
uses the full path to the device.

mount(8) from util-linux uses the output from `/proc/mounts` to pass
existing mount options when performing a remount, so bcachefs should
accept as input the same form that it prints as output.

Without this change:

$ mount -t bcachefs -o metadata_target=vdb /dev/vdb /mnt
$ strace mount -o remount /mnt
...
fsconfig(4, FSCONFIG_SET_STRING, "metadata_target", "/dev/vdb", 0) = -1 EINVAL (Invalid argument)
...

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
3811f48aa3 bcachefs: bch2_printbuf_strip_trailing_newline()
Add a new helper to fix inode_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
babe30fe8d bcachefs: don't expose "read_only" as a mount option
When "read_only" is exposed as a mount option, it is redundant with the
standard option "ro" and gives users multiple ways to specify that a
bcachefs filesystem should be mounted read-only. This presents the risk
of having inconsistent options specified.

This can be seen when remounting a read-only filesystem in read-write
mode, using mount(8) from util-linux. Because mount(8) parses the
existing mount options from `/proc/mounts` and applies them when
remounting, it can end up applying both "read_only" and "rw":

$ mount img -o ro /mnt
$ strace mount -o remount,rw /mnt
...
fsconfig(4, FSCONFIG_SET_FLAG, "read_only", NULL, 0) = 0
fsconfig(4, FSCONFIG_SET_FLAG, "rw", NULL, 0) = 0
...

Making "read_only" no longer a mount option means this edge case cannot
occur.

Fixes: 62719cf33c ("bcachefs: Fix nochanges/read_only interaction")
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
03ec0927fa bcachefs: make offline fsck set read_only fs flag
A subsequent change will remove "read_only" as a mount option in favor
of the standard option "ro", meaning the userspace fsck command cannot
pass it to the fsck ioctl. Instead, in offline fsck, set "read_only"
kernel-side without trying to parse it as a mount option.

For compatibility with versions of the "bcachefs fsck" command that try
to pass the "read_only" mount opt, remove it from the mount options
string prior to parsing when it is present.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
652bc7fabc bcachefs: btree_ptr_sectors_written() now takes bkey_s_c
this is for the userspace metadata dump tool

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
9cc8eb3098 bcachefs: Check for bsets past bch_btree_ptr_v2.sectors_written
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Uros Bizjak
68573b936d bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg()
Use try_cmpxchg() family of functions instead of
cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
e76a2b65b0 bcachefs: add might_sleep() annotations for fsck_err()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
546b65378d bcachefs: fix missing include
fs-common.h needs dirent.h for enum bch_rename_mode

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Youling Tang
630d565dda bcachefs: Use filemap_read() to simplify the execution flow
Using filemap_read() can reduce unnecessary code execution
for non IOCB_DIRECT paths.

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Youling Tang
da6fa380d3 bcachefs: Align the display format of btrees/inodes/keys
Before patch:
```
 #cat btrees/inodes/keys
 u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0:   mode=40755
   flags= (16300000)
   bi_size=0
```

After patch:
```
 #cat btrees/inodes/keys
 u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0:
   mode=40755
   flags=(16300000)
   bi_size=0
```

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Youling Tang
12e7ff1a1e bcachefs: Fix missing spaces in journal_entry_dev_usage_to_text
Fixed missing spaces displayed in journal_entry_dev_usage_to_text
while adjusting the display format to improve readability.

before:
```
 # bcachefs list_journal -a -t alloc:1:0 /dev/sdb
 ...
     dev_usage: dev=0free: buckets=233180 sectors=0 fragmented=0sb: buckets=13 sectors=6152 fragmented=504journal: buckets=1847 sectors=945664 fragmented=0btree: buckets=20 sectors=10240 fragmented=0user: buckets=1419 sectors=726513 fragmented=15cached: buckets=0 sectors=0 fragmented=0parity: buckets=0 sectors=0 fragmented=0stripe: buckets=0 sectors=0 fragmented=0need_gc_gens: buckets=0 sectors=0 fragmented=0need_discard: buckets=1 sectors=0 fragmented=0
```

after:
```
 # bcachefs list_journal -a -t alloc:1:0 /dev/sdb
 ...
     dev_usage: dev=0
       free: buckets=233180 sectors=0 fragmented=0
       sb: buckets=13 sectors=6152 fragmented=504
       journal: buckets=1847 sectors=945664 fragmented=0
       btree: buckets=20 sectors=10240 fragmented=0
       user: buckets=1419 sectors=726513 fragmented=15
       cached: buckets=0 sectors=0 fragmented=0
       parity: buckets=0 sectors=0 fragmented=0
       stripe: buckets=0 sectors=0 fragmented=0
       need_gc_gens: buckets=0 sectors=0 fragmented=0
       need_discard: buckets=1 sectors=0 fragmented=0
```
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
f369de8267 bcachefs: fix ei_update_lock lock ordering
ei_update_lock is largely vestigal and will probably be removed, but
we're not ready for that just yet.

this fixes some lockdep splats with the new lockdep support for btree
node locks; they're harmless, since we were taking ei_update_lock before
actually locking any btree nodes, but "any btree nodes locked" are now
tracked at the btree_trans level.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
cdda2126ab bcachefs: bch2_btree_reserve_cache_to_text()
Add a pretty printer so the btree reserve cache can be seen in sysfs; as
it pins open_buckets we need it for tracking down open_buckets issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
d06a26d24d bcachefs: sysfs trigger_freelist_wakeup
another debugging knob

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
a1e7a97f22 bcachefs: sysfs internal/trigger_journal_writes
another debugging knob - trigger the journal to do ready journal writes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
26a170aa61 bcachefs: add capacity, reserved to fs_alloc_debug_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
8a3c8303e2 bcachefs: uninline fallocate functions
better stack traces

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
52fd0f9620 bcachefs: btree ids are 64 bit bitmasks
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
3de8fd4a33 bcachefs: Print allocator stuck on timeout in fallocate path
same as in io_write.c, if we're waiting on the allocator for an
excessive amount of time, print what's going on

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
1841027c7d bcachefs: bch2_gc_btree() should not use btree_root_lock
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:10:55 -04:00
Kent Overstreet
f236ea4bca bcachefs: Set PF_MEMALLOC_NOFS when trans->locked
proper lock ordering is: fs_reclaim -> btree node locks

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:10:55 -04:00
Kent Overstreet
f0f3e51148 bcachefs; Use trans_unlock_long() when waiting on allocator
not using unlock_long() blocks key cache reclaim, and the allocator may
take awhile

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:10:55 -04:00
Kent Overstreet
aacd897d4d Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT"
This reverts commit 86d81ec5f5.

This wasn't tested with memcg enabled, it immediately hits a null ptr
deref in list_lru_add().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:01:38 -04:00
Kent Overstreet
fd80d14005 bcachefs: fix scheduling while atomic in break_cycle()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 12:59:28 -04:00
Kent Overstreet
6f692b1672 bcachefs: Fix RCU splat
Reported-by: syzbot+e74fea078710bbca6f4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 12:46:22 -04:00
Kent Overstreet
7d7f71cd87 bcachefs: Add missing bch2_trans_begin()
this fixes a 'transaction should be locked' error in backpointers fsck

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
0f6f8f7693 bcachefs: Fix missing error check in journal_entry_btree_keys_validate()
Closes: https://syzkaller.appspot.com/bug?extid=8996d8f176cf946ef641
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
f49d2c9835 bcachefs: Warn on attempting a move with no replicas
Instead of popping an assert in bch2_write(), WARN and print out some
debugging info.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
ad8b68cd39 bcachefs: bch2_data_update_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
0f1f7324da bcachefs: Log mount failure error code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
8ed58789fc bcachefs: Fix undefined behaviour in eytzinger1_first()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Youling Tang
86d81ec5f5 bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
After commit 230e9fc286 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9 ("kmemcg:
account for certain kmem allocations to memcg")

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
b02f973e67 bcachefs: Fix bch2_inode_insert() race path for tmpfiles
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
0435773239 bcachefs: Fix journal getting stuck on a flush commit
silly race

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
a2d23f3d91 bcachefs: io clock: run timer fns under clock lock
We don't have a way to flush a timer that's executing the callback, and
this is simple and limited enough in scope that we can just use the lock
instead.

Needed for the next patch that adds direct wakeups from the allocator to
copygc, where we're now more frequently calling io_timer_del() on an
expiring timer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-01 22:56:28 -04:00
Kent Overstreet
b5cbb42dc5 bcachefs: Repair fragmentation_lru in alloc_write_key()
fragmentation_lru derives from dirty_sectors, and wasn't being checked.

Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:37:13 -04:00
Kent Overstreet
d39881d2da bcachefs: add check for missing fragmentation in check_alloc_to_lru_ref()
We need to make sure we're not missing any fragmenation entries in the
LRU BTREE after repairing ALLOC BTREE

Also, use the new bch2_btree_write_buffer_maybe_flush() helper; this was
only working without it before since bucket invalidation (usually)
wasn't happening while fsck was running.

Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:35:18 -04:00
Kent Overstreet
92e1c29ae8 bcachefs: bch2_btree_write_buffer_maybe_flush()
Add a new helper for checking references to write buffer btrees, where
we need a flush before we definitively know we have an inconsistency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:34:52 -04:00
Kent Overstreet
ef05bdf5d6 bcachefs: Add missing printbuf_tabstops_reset() calls
Fixes warnings from bch2_print_allocator_stuck()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:14:18 -04:00
Kent Overstreet
67c564111f bcachefs: Fix loop restart in bch2_btree_transactions_read()
Accidental infinite loop; also fix btree_deadlock_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 21:08:48 -04:00
Kent Overstreet
1539bdf516 bcachefs: Fix bch2_read_retry_nodecode()
BCH_READ_NODECODE mode - used by the move paths - really wants to use
only the original rbio, but the retry path really wants to clone - oof.

Make sure to copy the crc of the pointer we read from back to the
original rbio, or we'll see spurious checksum errors later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 20:47:04 -04:00
Kent Overstreet
44ec599035 bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.

Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:47:31 -04:00
Kent Overstreet
a0bd30e4ea bcachefs: Fix shift greater than integer size
Reported-by: syzbot+e5292b50f1957164a4b6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:42:22 -04:00
Kent Overstreet
600b8be5e7 bcachefs: Change bch2_fs_journal_stop() BUG_ON() to warning
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:16:41 -04:00
Kent Overstreet
84db600016 bcachefs: Delete old faulty bch2_trans_unlock() call
the unlock is now in read_extent, this fixes an assertion pop in
read_from_stale_dirty_pointer()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 13:30:13 -04:00
Kent Overstreet
759b2e800f bcachefs: Switch online_reserved shutdown assert to WARN()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 11:06:31 -04:00
Pei Li
64cd7de998 bcachefs: Fix kmalloc bug in __snapshot_t_mut
When allocating too huge a snapshot table, we should fail gracefully
in __snapshot_t_mut() instead of fail in kmalloc().

Reported-by: syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=770e99b65e26fa023ab1
Tested-by: syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 20:51:14 -04:00
Kent Overstreet
64ee1431cc bcachefs: Discard, invalidate workers are now per device
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.

Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 18:47:55 -04:00
Pei Li
472237b69d bcachefs: Fix shift-out-of-bounds in bch2_blacklist_entries_gc
This series fix the shift-out-of-bounds issue in
bch2_blacklist_entries_gc().

Instead of passing 0 to eytzinger0_first() when iterating the entries,
we explicitly check 0 and initialize i to be 0.

syzbot has tested the proposed patch and the reproducer did not trigger
any issue:

Reported-and-tested-by: syzbot+835d255ad6bc7f29ee12@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=835d255ad6bc7f29ee12
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 17:53:31 -04:00
Pei Li
211c581de2 bcachefs: slab-use-after-free Read in bch2_sb_errors_from_cpu
Acquire fsck_error_counts_lock before accessing the critical section
protected by this lock.

syzbot has tested the proposed patch and the reproducer did not trigger
any issue.

Reported-by: syzbot+a2bc0e838efd7663f4d9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a2bc0e838efd7663f4d9
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 17:51:26 -04:00
Kuan-Wei Chiu
1fcce6b8a7 bcachefs: remove heap-related macros and switch to generic min_heap
Drop the heap-related macros from bcachefs and replacing them with the
generic min_heap implementation from include/linux.  By doing so, code
readability is improved by using functions instead of macros.  Moreover,
the min_heap implementation in include/linux adopts a bottom-up variation
compared to the textbook version currently used in bcachefs.  This
bottom-up variation allows for approximately 50% reduction in the number
of comparison operations during heap siftdown, without changing the number
of swaps, thus making it more efficient.

[visitorckw@gmail.com: fix missing assignment of minimum element]
  Link: https://lkml.kernel.org/r/20240602174828.1955320-1-visitorckw@gmail.com
Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu
Link: https://lkml.kernel.org/r/20240524152958.919343-17-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 22:25:00 -07:00
Kuan-Wei Chiu
fd60f7fe69 bcachefs: fix typo
Replace 'utiility' with 'utility'.

Link: https://lkml.kernel.org/r/20240524152958.919343-4-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 22:24:57 -07:00
Kent Overstreet
89d21b69b4 bcachefs: Add missing bch2_journal_do_writes() call
This fixes a rare deadlock when we're doing an emergency shutdown due to
failure to do a journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 12:55:32 -04:00
Kent Overstreet
d6b52f6828 bcachefs: Fix null ptr deref in journal_pins_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 12:07:07 -04:00
Kent Overstreet
36da8e387b bcachefs: Add missing recalc_capacity() call
This fixes filesystem size not changing on device removal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 10:12:51 -04:00
Kent Overstreet
1aaf5cb41b bcachefs: Fix btree_trans list ordering
The debug code relies on btree_trans_list being ordered so that it can
resume on subsequent calls or lock restarts.

However, it was using trans->locknig_wait.task.pid, which is incorrect
since btree_trans objects are cached and reused - typically by different
tasks.

Fix this by switching to pointer order, and also sort them lazily when
required - speeding up the btree_trans_get() fastpath.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 00:57:21 -04:00
Kent Overstreet
de611ab6fc bcachefs: Fix race between trans_put() and btree_transactions_read()
debug.c was using closure_get() on a different thread's closure where
the we don't know if the object being refcounted is alive.

We keep btree_trans objects on a list so they can be printed by debug
code, and because it is cost prohibitive to touch the btree_trans list
every time we allocate and free btree_trans objects, cached objects are
also on this list.

However, we do not want the debug code to see cached but not in use
btree_trans objects - critically because the btree_paths array will have
been freed (if it was reallocated).

closure_get() is also incorrect to use when that get may race with it
hitting zero, i.e. we must already have a ref on the object or know the
ref can't currently hit 0 for other reasons (as used in the cycle
detector).

to fix this, use the previously introduced closure_get_not_zero(),
closure_return_sync(), and closure_init_stack_release(); the debug code
now can only take a ref on a trans object if it's alive and in use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 00:57:21 -04:00
Kent Overstreet
18e92841e8 bcachefs: Make btree_deadlock_to_text() clearer
btree_deadlock_to_text() searches the list of btree transactions to find
a deadlock - when it finds one it's done; it's not like other *_read()
functions that's printing each object.

Factor out btree_deadlock_to_text() to make this clearer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 00:57:21 -04:00
Kent Overstreet
f44cc269a1 bcachefs: fix seqmutex_relock()
We were grabbing the sequence number before unlock incremented it - fix
this by moving the increment to seqmutex_lock() (so the seqmutex_relock()
failure path skips the mutex_trylock()), and returning the sequence
number from unlock(), to make the API simpler and safer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-23 00:57:21 -04:00
Kent Overstreet
9bd01500e4 bcachefs: Fix freeing of error pointers
This fixes incorrect/missign checking of strndup_user() returns.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22 17:22:24 -04:00
Youling Tang
bd4da0462e bcachefs: Move the ei_flags setting to after initialization
`inode->ei_flags` setting and cleaning should be done after initialization,
otherwise the operation is invalid.

Fixes: 9ca4853b98 ("bcachefs: Fix quota support for snapshots")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Kent Overstreet
2fe79ce7d1 bcachefs: Fix a UAF after write_super()
write_super() may reallocate the superblock buffer - but
bch_sb_field_ext was referencing it; don't use it after the write_super
call.

Reported-by: syzbot+8992fc10a192067b8d8a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Kent Overstreet
e6b3a655ac bcachefs: Use bch2_print_string_as_lines for long err
printk strings get truncated to 1024 bytes; if we have a long error
message (journal debug info) we need to use a helper.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Kent Overstreet
dd9086487c bcachefs: Fix I_NEW warning in race path in bch2_inode_insert()
discard_new_inode() is the correct interface for tearing down an indoe
that was fully created but not made visible to other threads, but it
expects I_NEW to be set, which we don't use.

Reported-by: https://github.com/koverstreet/bcachefs/issues/690
Fixes: bcachefs: Fix race path in bch2_inode_insert()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Kent Overstreet
504794067f bcachefs: Replace bare EEXIST with private error codes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Kent Overstreet
f648b6c12b bcachefs: Fix missing alloc_data_type_set()
Incorrect bucket state transition in the discard path; when incrementing
a bucket's generation number that had already been discarded, we were
forgetting to check if it should be need_gc_gens, not free.

This was caught by the .invalid checks in the transaction commit path,
causing us to go emergency read only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Youling Tang
c6cab97cdf bcachefs: fix alignment of VMA for memory mapped files on THP
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs
for read-only mmapped files, such as shared libraries. However, the
kernel makes no attempt to actually align those mappings on 2MB
boundaries, which makes it impossible to use those THPs most of the
time. This issue applies to general file mapping THP as well as
existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily
fixed by using thp_get_unmapped_area for the unmapped_area function
in bcachefs, which is what ext2, ext4, fuse, xfs and btrfs all use.

Similar to commit b0c582233a ("btrfs: fix alignment of VMA for
memory mapped files on THP").

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-20 09:14:58 -04:00
Kent Overstreet
33dfafa902 bcachefs: Fix safe errors by default
i.e. the start of automatic self healing:

If errors=continue or fix_safe, we now automatically fix simple errors
without user intervention.

New error action option: fix_safe

This replaces the existing errors=ro option, which gets a new slot, i.e.
existing errors=ro users now get errors=fix_safe.

This is currently only enabled for a limited set of errors - initially
just disk accounting; errors we would never not want to fix, and we
don't want to require user intervention (i.e. to make sure a bug report
gets filed).

Errors will still be counted in the superblock, so we (developers) will
still know they've been occuring if a bug report gets filed (as bug
reports typically include the errors superblock section).

Eventually we'll be enabling this for a much wider set of errors, after
we've done thorough error injection testing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-20 09:13:09 -04:00
Kent Overstreet
a56da69799 bcachefs: Fix bch2_trans_put()
reference: https://github.com/koverstreet/bcachefs/issues/692

trans->ref is the reference used by the cycle detector, which walks
btree_trans objects of other threads to walk the graph of held locks and
issue wakeups when an abort is required.

We have to wait for the ref to go to 1 before freeing trans->paths or
clearing trans->locking_wait.task.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:34:18 -04:00
Kent Overstreet
0a2a507d40 bcachefs: set_worker_desc() for delete_dead_snapshots
this is long running - help users see what's going on

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
ddd118ab45 bcachefs: Fix bch2_sb_downgrade_update()
Missing enum conversion

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
2e9940d4a1 bcachefs: Handle cached data LRU wraparound
We only have 48 bits for the LRU time field, which is insufficient to
prevent wraparound.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
cff07e2739 bcachefs: Guard against overflowing LRU_TIME_BITS
LRUs only have 48 bits for the time field (i.e. LRU order); thus we need
overflow checks and guards.

Reported-by: syzbot+df3bf3f088dcaa728857@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
1ba44217f8 bcachefs: delete_dead_snapshots() doesn't need to go RW
We've been moving away from going RW lazily; if we want to go RW we do
that in set_may_go_rw(), and if we didn't go RW we don't need to delete
dead snapshots.

Reported-by: syzbot+4366624c0b5aac4906cf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
dbf4d79b7f bcachefs: Fix early init error path in journal code
We shouln't be running the journal shutdown sequence if we never fully
initialized the journal.

Reported-by: syzbot+ffd2270f0bca3322ee00@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
9e7cfb35e2 bcachefs: Check for invalid btree IDs
We can only handle btree IDs up to 62, since the btree id (plus the type
for interior btree nodes) has to fit ito a 64 bit bitmask - check for
invalid ones to avoid invalid shifts later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00
Kent Overstreet
e3fd3faa45 bcachefs: Fix btree ID bitmasks
these should be 64 bit bitmasks, not 32 bit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00
Kent Overstreet
d406545613 bcachefs: Fix shift overflow in read_one_super()
Reported-by: syzbot+9f74cb4006b83e2a3df1@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00
Kent Overstreet
3727ca5604 bcachefs: Fix a locking bug in the do_discard_fast() path
We can't discard a bucket while it's still open; this needs the
bucket_is_open_safe() version, which takes the open_buckets lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00
Kent Overstreet
d47df4f616 bcachefs: Fix array-index-out-of-bounds
We use 0 size arrays as markers, but ubsan doesn't know that - cast them
to a pointer to fix the splat.

Also, make sure this code gets tested a bit more.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00
Kent Overstreet
f770a6e9a3 bcachefs: Fix initialization order for srcu barrier
btree_iter_init() needs to happen before key_cache_init(), to initialize
btree_trans_barrier

Reported-by: syzbot+3cca837c2183f8f6fcaf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:23 -04:00
Mateusz Guzik
267574dee6 bcachefs: remove now spurious i_state initialization
inode_init_always started setting the field to 0.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611120626.513952-5-mjguzik@gmail.com
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-13 13:40:45 +02:00
Kent Overstreet
f2736b9c79 bcachefs: Fix rcu_read_lock() leak in drop_extra_replicas
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-11 18:59:08 -04:00
Kent Overstreet
7124a8982b bcachefs: Add missing bch_inode_info.ei_flags init
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 20:50:14 -04:00
Kent Overstreet
b799220092 bcachefs: Add missing synchronize_srcu_expedited() call when shutting down
We use the polling interface to srcu for tracking pending frees; when
shutting down we don't need to wait for an srcu barrier to free them,
but SRCU still gets confused if we shutdown with an outstanding grace
period.

Reported-by: syzbot+6a038377f0a594d7d44e@syzkaller.appspotmail.com
Reported-by: syzbot+0ece6edfd05ed20e32d9@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
9432e90df1 bcachefs: Check for invalid bucket from bucket_gen(), gc_bucket()
Turn more asserts into proper recoverable error paths.

Reported-by: syzbot+246b47da27f8e7e7d6fb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
9c4acd19bb bcachefs: Replace bucket_valid() asserts in bucket lookup with proper checks
The bucket_gens array and gc_buckets array known their own size; we
should be using those members, and returning an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
e0cb5722e1 bcachefs: Fix snapshot_create_lock lock ordering
======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc2-ktest-00018-gebd1d148b278 #144 Not tainted
------------------------------------------------------
fio/1345 is trying to acquire lock:
ffff88813e200ab8 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_truncate+0x76/0xf0

but task is already holding lock:
ffff888105a1fa38 (&sb->s_type->i_mutex_key#13){+.+.}-{3:3}, at: do_truncate+0x7b/0xc0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&sb->s_type->i_mutex_key#13){+.+.}-{3:3}:
       down_write+0x3d/0xd0
       bch2_write_iter+0x1c0/0x10f0
       vfs_write+0x24a/0x560
       __x64_sys_pwrite64+0x77/0xb0
       x64_sys_call+0x17e5/0x1ab0
       do_syscall_64+0x68/0x130
       entry_SYSCALL_64_after_hwframe+0x4b/0x53

-> #1 (sb_writers#10){.+.+}-{0:0}:
       mnt_want_write+0x4a/0x1d0
       filename_create+0x69/0x1a0
       user_path_create+0x38/0x50
       bch2_fs_file_ioctl+0x315/0xbf0
       __x64_sys_ioctl+0x297/0xaf0
       x64_sys_call+0x10cb/0x1ab0
       do_syscall_64+0x68/0x130
       entry_SYSCALL_64_after_hwframe+0x4b/0x53

-> #0 (&c->snapshot_create_lock){++++}-{3:3}:
       __lock_acquire+0x1445/0x25b0
       lock_acquire+0xbd/0x2b0
       down_read+0x40/0x180
       bch2_truncate+0x76/0xf0
       bchfs_truncate+0x240/0x3f0
       bch2_setattr+0x7b/0xb0
       notify_change+0x322/0x4b0
       do_truncate+0x8b/0xc0
       do_ftruncate+0x110/0x270
       __x64_sys_ftruncate+0x43/0x80
       x64_sys_call+0x1373/0x1ab0
       do_syscall_64+0x68/0x130
       entry_SYSCALL_64_after_hwframe+0x4b/0x53

other info that might help us debug this:

Chain exists of:
  &c->snapshot_create_lock --> sb_writers#10 --> &sb->s_type->i_mutex_key#13

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sb->s_type->i_mutex_key#13);
                               lock(sb_writers#10);
                               lock(&sb->s_type->i_mutex_key#13);
  rlock(&c->snapshot_create_lock);

 *** DEADLOCK ***

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
f9035b0ce6 bcachefs: Fix refcount leak in check_fix_ptrs()
fsck_err() does a goto fsck_err on error; factor out check_fix_ptr() so
that our error label can drop our device ref.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
bf2b356afd bcachefs: Leave a buffer in the btree key cache to avoid lock thrashing
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
2760bfe388 bcachefs: Fix reporting of freed objects from key cache shrinker
We count objects as freed when we move them to the srcu-pending lists
because we're doing the equivalent of a kfree_srcu(); the only
difference is managing the pending list ourself means we can allocate
from the pending list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
9ac3e660ca bcachefs: set sb->s_shrinker->seeks = 0
inodes and dentries are still present in the btree node cache, in much
more compact form

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
bc65e98e68 bcachefs: increase key cache shrinker batch size
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
5ae67abcdf bcachefs: Enable automatic shrinking for rhashtables
Since the key cache shrinker walks the rhashtable, a mostly empty
rhashtable leads to really nasty reclaim performance issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Hongbo Li
26447d224a bcachefs: fix the display format for show-super
There are three keys displayed in non-uniform format.
Let's fix them.

[Before]
```
Label:	testbcachefs
Version:	1.9: (unknown version)
Version upgrade complete:	0.0: (unknown version)
```

[After]
```
Label:					testbcachefs
Version:				1.9: (unknown version)
Version upgrade complete:		0.0: (unknown version)
```

Fixes: 7423330e30 ("bcachefs: prt_printf() now respects \r\n\t")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
dab1870439 bcachefs: fix stack frame size in fsck.c
fsck.c always runs top of the stack so we're not too concerned here;
noinline_for_stack is sufficient

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
04f635ede8 bcachefs: Delete incorrect BTREE_ID_NR assertion
for forwards compat we now explicitly allow mounting and using
filesystems with unknown btrees, and we have to walk them for fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:16 -04:00
Kent Overstreet
1c8cc24eef bcachefs: Fix incorrect error handling found_btree_node_is_readable()
error handling here is slightly odd, which is why we were accidently
calling evict() on an error pointer

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:15 -04:00
Kent Overstreet
161f73c2c7 bcachefs: Split out btree_write_submit_wq
Split the workqueues for btree read completions and btree write
submissions; we don't want concurrency control on btree read
completions, but we do want concurrency control on write submissions,
else blocking in submit_bio() will cause a ton of kworkers to be
allocated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:15 -04:00
Kent Overstreet
319fef29e9 bcachefs: Fix trans->locked assert
in bch2_move_data_btree, we might start with the trans unlocked from a
previous loop iteration - we need a trans_begin() before iter_init().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05 10:44:08 -04:00
Kent Overstreet
fdccb24352 bcachefs: Rereplicate now moves data off of durability=0 devices
This fixes an issue where setting a device to durability=0 after it's
been used makes it impossible to remove.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05 10:44:08 -04:00
Kent Overstreet
9a64e1bfd8 bcachefs: Fix GFP_KERNEL allocation in break_cycle()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05 10:44:08 -04:00
Linus Torvalds
ff9bce3d06 bcachefs fixes for 6.10-rc2
- two downgrade fixes
 - a couple snapshot deletion and repair fixes, thanks to noradtux for
   finding these and providing the image to debug them
 - a couple assert fixes
 - convert to folio helper, from Matthew
 - some improved error messages
 - bit of code reorganization (just moving things around); doing this
   while things are quiet so I'm not rebasing fixes past reorgs
 - don't return -EROFS on inconsistency error in recovery, this confuses
   util-linux and has it retry the mount
 - fix failure to return error on misaligned dio write; reported as an
   issue with coreutils shred
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZYntUACgkQE6szbY3K
 bnbp7hAAvMgBanBT7qq3ac+W3vtgLuIk6gXNB7eRl+QNff7bJ+BzJH4UhCGhbo5g
 WzzQAQ2Zta6NwxbdAcZdL91qe4QDI3ITdIeKBZYtN/C8FySOeEk14K+CNhfQjYgd
 fJP2bx4LuUnyMri1pw8ZF3L/YXMOKhzTF8jLH04etty8Sbxss+zh9Dz6LFXqvloq
 3v0EmbzrgB3KH+zflJ+yxTFUO3/tNYJhZHGXD452AlJYs29bECAAzJ/5gUq43CqQ
 /q+omBqqqf7oJZ84dHIu2piZrUhUJqotLdcIkzlkxDg+hN/BPeY4hv+dw5GNffz7
 hgD6ieWm+0PQrf2WSBGRy7l3DglrwknUgrFSb8PlUAbOsg0TNsN7qjW6LVZSWMZ/
 tBWiUQ95VYtlP8KzwLrIZ+BcP/Jm0X5hIAxui0Diz+exh7onDiY7Gxsp8/r0krYI
 x0s7uLhl73Jb/TO3pX9BS6U+Y0bUu0GJb+TThOLNX961Vg900BmpZvLave6y3U0i
 E09JRetWGK50wgPPvNt7M+s8lhs0Jg+Q+AuHAUd3x8eb1NSMibAvYGzV4oVpElrT
 YAP7vrJSgVdCCpI6qqCt+SgxatNUCSa/sHraJz2XeVGFyE6iLlXylBHabxKPn5P2
 d8jyJ9cEHzumx6tHjLgm09UvoCBg00+ameiNOpjNKbPw6iJXfuw=
 =HDxx
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-05-30' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Assorted odds and ends...

   - two downgrade fixes

   - a couple snapshot deletion and repair fixes, thanks to noradtux for
     finding these and providing the image to debug them

   - a couple assert fixes

   - convert to folio helper, from Matthew

   - some improved error messages

   - bit of code reorganization (just moving things around); doing this
     while things are quiet so I'm not rebasing fixes past reorgs

   - don't return -EROFS on inconsistency error in recovery, this
     confuses util-linux and has it retry the mount

   - fix failure to return error on misaligned dio write; reported as an
     issue with coreutils shred"

* tag 'bcachefs-2024-05-30' of https://evilpiepirate.org/git/bcachefs: (21 commits)
  bcachefs: Fix failure to return error on misaligned dio write
  bcachefs: Don't return -EROFS from mount on inconsistency error
  bcachefs: Fix uninitialized var warning
  bcachefs: Split out sb-errors_format.h
  bcachefs: Split out journal_seq_blacklist_format.h
  bcachefs: Split out replicas_format.h
  bcachefs: Split out disk_groups_format.h
  bcachefs: split out sb-downgrade_format.h
  bcachefs: split out sb-members_format.h
  bcachefs: Better fsck error message for key version
  bcachefs: btree_gc can now handle unknown btrees
  bcachefs: add missing MODULE_DESCRIPTION()
  bcachefs: Fix setting of downgrade recovery passes/errors
  bcachefs: Run check_key_has_snapshot in snapshot_delete_keys()
  bcachefs: Refactor delete_dead_snapshots()
  bcachefs: Fix locking assert
  bcachefs: Fix lookup_first_inode() when inode_generations are present
  bcachefs: Plumb bkey into __btree_err()
  bcachefs: Use copy_folio_from_iter_atomic()
  bcachefs: Fix sb-downgrade validation
  ...
2024-05-31 11:45:41 -07:00
Kent Overstreet
7b038b564b bcachefs: Fix failure to return error on misaligned dio write
This was reported as an error when running coreutils shred.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-29 16:40:30 -04:00
Kent Overstreet
83208cbf2f bcachefs: Don't return -EROFS from mount on inconsistency error
We were accidentally returning -EROFS during recovery on filesystem
inconsistency - since this is what the journal returns on emergency
shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 19:23:03 -04:00
Kent Overstreet
8528bde1b6 bcachefs: Fix uninitialized var warning
Can't actually be used uninitialized, but gcc was being silly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 18:21:51 -04:00
Kent Overstreet
759bb4eabc bcachefs: Split out sb-errors_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 17:33:45 -04:00
Kent Overstreet
5c16c57488 bcachefs: Split out journal_seq_blacklist_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 17:32:03 -04:00
Kent Overstreet
24998050b6 bcachefs: Split out replicas_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 17:32:03 -04:00
Kent Overstreet
1cdcc6e3c2 bcachefs: Split out disk_groups_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 17:32:03 -04:00
Kent Overstreet
4c5eef0c50 bcachefs: split out sb-downgrade_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 17:32:03 -04:00
Kent Overstreet
016c22e410 bcachefs: split out sb-members_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 17:32:03 -04:00
Kent Overstreet
f1d4fed13f bcachefs: Better fsck error message for key version
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
088d0de812 bcachefs: btree_gc can now handle unknown btrees
Compatibility fix - we no longer have a separate table for which order
gc walks btrees in, and special case the stripes btree directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Jeff Johnson
b4131076c1 bcachefs: add missing MODULE_DESCRIPTION()
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/mean_and_variance_test.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
247c056bde bcachefs: Fix setting of downgrade recovery passes/errors
bch2_check_version_downgrade() was setting c->sb.version, which
bch2_sb_set_downgrade() expects to be at the previous version; and it
shouldn't even have been set directly because c->sb.version is updated
by write_super().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
08f50005e0 bcachefs: Run check_key_has_snapshot in snapshot_delete_keys()
delete_dead_snapshots now runs before the main fsck.c passes which check
for keys for invalid snapshots; thus, it needs those checks as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
82af5ceb5d bcachefs: Refactor delete_dead_snapshots()
Consolidate per-key work into delete_dead_snapshots_process_key(), so we
now walk all keys once, not twice.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
218e5e0c2a bcachefs: Fix locking assert
We now track whether a transaction is locked, and verify that we don't
have nodes locked when the transaction isn't locked; reorder relocks to
not pop the new assert.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
9e1a66e668 bcachefs: Fix lookup_first_inode() when inode_generations are present
This function is used for finding the hash seed (which is the same in
all versions of an inode in different snapshots): ff an inode has been
deleted in a child snapshot we need to iterate until we find a live
version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Kent Overstreet
1292bc2ebf bcachefs: Plumb bkey into __btree_err()
It can be useful to know the exact byte offset within a btree node where
an error occured.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:23 -04:00
Matthew Wilcox (Oracle)
b82b6eeefd bcachefs: Use copy_folio_from_iter_atomic()
copy_page_from_iter_atomic() will be removed at some point.
Also fixup a comment for folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-26 22:30:09 -04:00
Kent Overstreet
9242a34b76 bcachefs: Fix sb-downgrade validation
Superblock downgrade entries are only two byte aligned, but section
sizes are 8 byte aligned, which means we have to be careful about
overrun checks; an entry that crosses the end of the section is allowed
(and ignored) as long as it has zero errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-26 12:44:12 -04:00
Kent Overstreet
d509cadc3a bcachefs: Fix debug assert
Reported-by: syzbot+a8074a75b8d73328751e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-26 12:40:30 -04:00
Linus Torvalds
c40b1994b9 bcachefs fixes for 6.10-rc1
Just a few syzbot fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZQk0cACgkQE6szbY3K
 bna7gA/+MSY3I95CwaJ4bBq5SCxOaRcrX099LFh8Zrj+OF+DWE2PtVo1LhhgnYrQ
 KpZrS2Q9Qgb2yVqYzOY6LBfH4il1O/WwvloMG0MbuYiQFu9/JL/6CEK9uFyiGmaC
 fdiFEN3u+8AK6phTFaqUU2ncG0XFQ1Ple5zmFXo4Y3ZJeNaubJeEDac+kbRvOwYh
 rQ6Iy0FNoQymv0BzmuM7g2NsbhdAgHTv7rhGbfpNBZv3lu0yDXsfZZgWTr2oXMSP
 FMhm4bcTGAFp5hbwq9k56ND8oSFpamsH7SwS4bDlEe1CNOfMI1JjnrvSEuDrocAE
 1Jn2J2Gv9NXnEHKamVzzpUILG67buEtYzJyDQk51N4kulgThdpRzjm+11ylD5U0U
 wzIK1HXsKHtRdUiIhQGLCLW61FXM+0QBIk2eXhPq88jsM2zTL7iMbXR3P/nvgUDy
 8ia8g5Q+nKxcb223M8WmK0rBwlaNasE/hXiFT54ntt8bK5nmVJjPMxVXUmYth3hw
 7STkuT0k5jVsMG1NqLkg+wSupj1AuWbD2hIcas7GkxarEYAULbQcClHYGpMll3Tw
 +pJfLjAtBOkcE4TwWDLflVBhwWtdmPNhk51Q3iLVRp0Gm7t0rhE2vE6TjpsIFnrg
 rUAgaqQqQ2WXfsRaGa2wx0tRKoW+8Iigq13ndn1AZIrfEtQkYUs=
 =vuNC
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-05-24' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Nothing exciting, just syzbot fixes (except for the one
  FMODE_CAN_ODIRECT patch).

  Looks like syzbot reports have slowed down; this is all catch up from
  two weeks of conferences.

  Next hardening project is using Thomas's error injection tooling to
  torture test repair"

* tag 'bcachefs-2024-05-24' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: Fix race path in bch2_inode_insert()
  bcachefs: Ensure we're RW before journalling
  bcachefs: Fix shutdown ordering
  bcachefs: Fix unsafety in bch2_dirent_name_bytes()
  bcachefs: Fix stack oob in __bch2_encrypt_bio()
  bcachefs: Fix btree_trans leak in bch2_readahead()
  bcachefs: Fix bogus verify_replicas_entry() assert
  bcachefs: Check for subvolues with bogus snapshot/inode fields
  bcachefs: bch2_checksum() returns 0 for unknown checksum type
  bcachefs: Fix bch2_alloc_ciphers()
  bcachefs: Add missing guard in bch2_snapshot_has_children()
  bcachefs: Fix missing parens in drop_locks_do()
  bcachefs: Improve bch2_assert_pos_locked()
  bcachefs: Fix shift overflows in replicas.c
  bcachefs: Fix shift overflow in btree_lost_data()
  bcachefs: Fix ref in trans_mark_dev_sbs() error path
  bcachefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
  bcachefs: Fix rcu splat in check_fix_ptrs()
2024-05-24 09:07:22 -07:00
Kent Overstreet
d93ff5fa40 bcachefs: Fix race path in bch2_inode_insert()
__destroy_new_inode() is appropriate when we have _just_allocated the
inode, but not when it's been fully initialized and on i_sb_list.

Reported-by: syzbot+a0ddc9873c280a4cb18f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 20:37:47 -04:00
Kent Overstreet
cd3b31f9d4 bcachefs: Ensure we're RW before journalling
Reported-by: syzbot+c60cd352aedb109528bf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 20:17:33 -04:00
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Kent Overstreet
d293ece108 bcachefs: Fix shutdown ordering
the btree key cache uses the srcu struct created/destroyed by
btree_iter.c; btree_iter needs to be exited last.

Reported-by: syzbot+3af9daea347788b15213@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 19:54:03 -04:00
Kent Overstreet
2195b755eb bcachefs: Fix unsafety in bch2_dirent_name_bytes()
Reported-by: syzbot+84fa6fb8c7f98b93cdea@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 19:14:36 -04:00
Kent Overstreet
2ba24864d2 bcachefs: Fix stack oob in __bch2_encrypt_bio()
Reported-by: syzbot+fff6b0fb00259873576a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 19:01:17 -04:00
Kent Overstreet
70dd062e27 bcachefs: Fix btree_trans leak in bch2_readahead()
Reported-by: syzbot+d797fe78808e968d6c84@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 19:01:17 -04:00
Kent Overstreet
5fa421448d bcachefs: Fix bogus verify_replicas_entry() assert
verify_replicas_entry() is only for newly created replicas entries -
existing entries on disk may have unknown data types, and we have real
verifiers for them.

Reported-by: syzbot+73414091bd382684ee2b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-22 19:01:17 -04:00
Linus Torvalds
38da32ee70 bd_inode series
Replacement of bdev->bd_inode with sane(r) set of primitives.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ
 66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj
 mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs=
 =6LRp
 -----END PGP SIGNATURE-----

Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull bdev bd_inode updates from Al Viro:
 "Replacement of bdev->bd_inode with sane(r) set of primitives by me and
  Yu Kuai"

* tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RIP ->bd_inode
  dasd_format(): killing the last remaining user of ->bd_inode
  nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode
  block/bdev.c: use the knowledge of inode/bdev coallocation
  gfs2: more obvious initializations of mapping->host
  fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping
  blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
  grow_dev_folio(): we only want ->bd_inode->i_mapping there
  use ->bd_mapping instead of ->bd_inode->i_mapping
  block_device: add a pointer to struct address_space (page cache of bdev)
  missing helpers: bdev_unhash(), bdev_drop()
  block: move two helpers into bdev.c
  block2mtd: prevent direct access of bd_inode
  dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)
  blkdev_write_iter(): saner way to get inode and bdev
  bcachefs: remove dead function bdev_sectors()
  ext4: remove block_device_ejected()
  erofs_buf: store address_space instead of inode
  erofs: switch erofs_bread() to passing offset instead of block number
2024-05-21 09:51:42 -07:00
Kent Overstreet
765b8cb8ac bcachefs: Check for subvolues with bogus snapshot/inode fields
This fixes an assertion pop in btree_iter.c that checks for forgetting
to pass a snapshot ID when iterating over snapshots btrees.

Reported-by: syzbot+0dfe05235e38653e2aee@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
6b74fdcc8e bcachefs: bch2_checksum() returns 0 for unknown checksum type
This fixes missing guards on trying to calculate a checksum with an
invalid/unknown checksum type; moving the guards up to e.g. btree_io.c
might be "more correct", but doesn't buy us anything - an unknown
checksum type will always be flagged as at least a checksum error so we
aren't losing any safety doing it this way and it makes it less likely
to accidentally pop an assert we don't want.

Reported-by: syzbot+e951ad5349f3a34a715a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
c06a8b7567 bcachefs: Fix bch2_alloc_ciphers()
Don't put error pointers in bch_fs, that's gross.

This fixes (?) the check in bch2_checksum_type_valid() - depending on
our error paths, or depending on what our error paths are doing it at
least makes the code saner.

Reported-by: syzbot+2e3cb81b5d1fe18a374b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
6d48e61364 bcachefs: Add missing guard in bch2_snapshot_has_children()
We additionally need to be going inconsistent if passed an invalid
snapshot ID; that patch will need more thorough testing.

Reported-by: syzbot+1c9fca23fe478633b305@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
6ce26ad376 bcachefs: Fix missing parens in drop_locks_do()
Reported-by: syzbot+95db43b0a06f157ee865@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
25989f4a9b bcachefs: Improve bch2_assert_pos_locked()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
bcfbaea8e5 bcachefs: Fix shift overflows in replicas.c
We can't disallow unknown data_types in verify() - we have to preserve
them unchanged for backwards compat; that means we have to add a few
more guards.

Reported-by: syzbot+249018ea545364f78d04@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
f108ddd467 bcachefs: Fix shift overflow in btree_lost_data()
Reported-by: syzbot+29f65db1a5fe427b5c56@syzkaller.appspotmail.com
Fixes: 55936afe11 ("bcachefs: Flag btrees with missing data")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
9667214b30 bcachefs: Fix ref in trans_mark_dev_sbs() error path
Reported-by: syzbot+5c7f715a7107a608a544@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Youling Tang
54429c902a bcachefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for bcachefs so that noop_direct_IO can eventually be removed.

Similar to commit b294349993 ("xfs: set FMODE_CAN_ODIRECT instead of
a dummy direct_IO method").

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Kent Overstreet
427ba55503 bcachefs: Fix rcu splat in check_fix_ptrs()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-20 05:37:26 -04:00
Linus Torvalds
16dbfae867 bcachefs changes for 6.10-rc1
- More safety fixes, primarily found by syzbot
 
 - Run the upgrade/downgrade paths in nochnages mode. Nochanges mode is
   primarily for testing fsck/recovery in dry run mode, so it shouldn't
   change anything besides disabling writes and holding dirty metadata in
   memory.
 
   The idea here was to reduce the amount of activity if we can't write
   anything out, so that bringing up a filesystem in "super ro" mode
   would be more lilkely to work for data recovery - but norecovery is
   the correct option for this.
 
 - btree_trans->locked; we now track whether a btree_trans has any btree
   nodes locked, and this is used for improved assertions related to
   trans_unlock() and trans_relock(). We'll also be using it for
   improving how we work with lockdep in the future: we don't want
   lockdep to be tracking individual btree node locks because we take too
   many for lockdep to track, and it's not necessary since we have a
   cycle detector.
 
 - Trigger improvements that are prep work for online fsck
 
 - BTREE_TRIGGER_check_repair; this regularizes how we do some repair
   work for extents that goes with running triggers in fsck, and fixes
   some subtle issues with transaction restarts there.
 
 - bch2_snapshot_equiv() has now been ripped out of fsck.c; snapshot
   equivalence classes are for when snapshot deletion leaves behind
   redundant snapshot nodes, but snapshot deletion now cleans this up
   right away, so the abstraction doesn't need to leak.
 
 - Improvements to how we resume writing to the journal in recovery. The
   code for picking the new place to write when reading the journal is
   greatly simplified and we also store the position in the superblock
   for when we don't read the journal; this means that we preserve more
   of the journal for list_journal debugging.
 
 - Improvements to sysfs btree_cache and btree_node_cache, for debugging
   memory reclaim.
 
 - We now detect when we've blocked for 10 seconds on the allocator in
   the write path and dump some useful info.
 
 - Safety fixes for devices references: this is a big series that changes
   almost all device lookups to properly check if the device exists and
   take a reference to it.
 
   Previously we assumed that if a bkey exists that references a device
   then the device must exist, and this was enforced in .invalid methods,
   but this was incorrect because it meant device removal relied on
   accounting being correct to not leave keys pointing to invalid
   devices, and that's not something we can assume.
 
   Getting the "pointer to invalid device" checks out of our .invalid()
   methods fixes some long standing device removal bugs; the only
   outstanding bug with device removal now is a race between the discard
   path and deleting alloc info, which should be easily fixed.
 
 - The allocator now prefers not to expand the new
   member_info.btree_allocated bitmap, meaning if repair ever requires
   scanning for btree nodes (because of a corrupt interior nodes) we
   won't have to scan the whole device(s).
 
 - New coding style document, which among other things talks about the
   correct usage of assertions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZKJQgACgkQE6szbY3K
 bnZETg//SU9H0OHnBSMB/cteF6PKo9QR+dhT+3n+gWTxl0o/egbGTqwbzVqGtd2f
 J6II1BsDk8VoTOb/gFfLRShlmJfnj2jpRThU265faR/7LQYeSaqndDPkjOpTayAD
 Nj/DJyiSUTL753rZh3yUhOpOIHf7iapH6wuaZCPfhdfk+yvZNW8iz07JHjHLKRp8
 I2cFH0r6kN916NdRkt9oDCz68WouT8eWTqwcKra04XsLEZjNJHxLpKMq4M8UdPc7
 YynJPVt+aP8+VduGIq6pV8Co3afCP2oUywo11JpRmvLsw4tex/59wxOYtpMfgn6k
 4H+9WqiBwkbmnLDrfFHWRameS6F/7+GRAOVuz9nkmfk61UPU15gLjSRffqZ6u2YC
 7vbrXgebId/sZXtBpQd83RMMX52BnEJah0upNJ54IsSqfDYkU9lwl6CEyYpcX1hf
 YNBGBTbspZztc3AB13b3ow421FMhaySUg0FDmntMR9O8Z6/BXk7Ykc7b8DPEfrFs
 W6JY7q+ARBxr+EgFcV74fvMCf7NJTAhyv80AKryo7NFU2JZOyyaTxcTGSnolX4Mi
 lyHiOgicmOX+vy3vbC1dZoDcmIDJ4Uc0vixYcpKiZqxlR8XJ+wpevC50TEhxrcW+
 ZO4SloQvgyjI34xu/gZgjRYb3BhXK3x+ougVFpRG8V8zQ/+ccWg=
 =MKrF
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-05-19' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs updates from Kent Overstreet:

 - More safety fixes, primarily found by syzbot

 - Run the upgrade/downgrade paths in nochnages mode. Nochanges mode is
   primarily for testing fsck/recovery in dry run mode, so it shouldn't
   change anything besides disabling writes and holding dirty metadata
   in memory.

   The idea here was to reduce the amount of activity if we can't write
   anything out, so that bringing up a filesystem in "super ro" mode
   would be more lilkely to work for data recovery - but norecovery is
   the correct option for this.

 - btree_trans->locked; we now track whether a btree_trans has any btree
   nodes locked, and this is used for improved assertions related to
   trans_unlock() and trans_relock(). We'll also be using it for
   improving how we work with lockdep in the future: we don't want
   lockdep to be tracking individual btree node locks because we take
   too many for lockdep to track, and it's not necessary since we have a
   cycle detector.

 - Trigger improvements that are prep work for online fsck

 - BTREE_TRIGGER_check_repair; this regularizes how we do some repair
   work for extents that goes with running triggers in fsck, and fixes
   some subtle issues with transaction restarts there.

 - bch2_snapshot_equiv() has now been ripped out of fsck.c; snapshot
   equivalence classes are for when snapshot deletion leaves behind
   redundant snapshot nodes, but snapshot deletion now cleans this up
   right away, so the abstraction doesn't need to leak.

 - Improvements to how we resume writing to the journal in recovery. The
   code for picking the new place to write when reading the journal is
   greatly simplified and we also store the position in the superblock
   for when we don't read the journal; this means that we preserve more
   of the journal for list_journal debugging.

 - Improvements to sysfs btree_cache and btree_node_cache, for debugging
   memory reclaim.

 - We now detect when we've blocked for 10 seconds on the allocator in
   the write path and dump some useful info.

 - Safety fixes for devices references: this is a big series that
   changes almost all device lookups to properly check if the device
   exists and take a reference to it.

   Previously we assumed that if a bkey exists that references a device
   then the device must exist, and this was enforced in .invalid
   methods, but this was incorrect because it meant device removal
   relied on accounting being correct to not leave keys pointing to
   invalid devices, and that's not something we can assume.

   Getting the "pointer to invalid device" checks out of our .invalid()
   methods fixes some long standing device removal bugs; the only
   outstanding bug with device removal now is a race between the discard
   path and deleting alloc info, which should be easily fixed.

 - The allocator now prefers not to expand the new
   member_info.btree_allocated bitmap, meaning if repair ever requires
   scanning for btree nodes (because of a corrupt interior nodes) we
   won't have to scan the whole device(s).

 - New coding style document, which among other things talks about the
   correct usage of assertions

* tag 'bcachefs-2024-05-19' of https://evilpiepirate.org/git/bcachefs: (155 commits)
  bcachefs: add no_invalid_checks flag
  bcachefs: add counters for failed shrinker reclaim
  bcachefs: Fix sb_field_downgrade validation
  bcachefs: Plumb bch_validate_flags to sb_field_ops.validate()
  bcachefs: s/bkey_invalid_flags/bch_validate_flags
  bcachefs: fsync() should not return -EROFS
  bcachefs: Invalid devices are now checked for by fsck, not .invalid methods
  bcachefs: kill bch2_dev_bkey_exists() in bch2_check_fix_ptrs()
  bcachefs: kill bch2_dev_bkey_exists() in bch2_read_endio()
  bcachefs: bch2_dev_get_ioref() checks for device not present
  bcachefs: bch2_dev_get_ioref2(); io_read.c
  bcachefs: bch2_dev_get_ioref2(); debug.c
  bcachefs: bch2_dev_get_ioref2(); journal_io.c
  bcachefs: bch2_dev_get_ioref2(); io_write.c
  bcachefs: bch2_dev_get_ioref2(); btree_io.c
  bcachefs: bch2_dev_get_ioref2(); backpointers.c
  bcachefs: bch2_dev_get_ioref2(); alloc_background.c
  bcachefs: for_each_bset() declares loop iter
  bcachefs: Move BCACHEFS_STATFS_MAGIC value to UAPI magic.h
  bcachefs: Improve sysfs internal/btree_cache
  ...
2024-05-19 13:45:48 -07:00
Linus Torvalds
1b0aabcc9a vfs-6.10.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
 orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
 0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
 =SVsU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
     means we now have six free FMODE_* bits in total (but bit #6
     already got used for FMODE_WRITE_RESTRICTED)

   - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)

   - Add fd_raw cleanup class so we can make use of automatic cleanup
     provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well

   - Optimize seq_puts()

   - Simplify __seq_puts()

   - Add new anon_inode_getfile_fmode() api to allow specifying f_mode
     instead of open-coding it in multiple places

   - Annotate struct file_handle with __counted_by() and use
     struct_size()

   - Warn in get_file() whether f_count resurrection from zero is
     attempted (epoll/drm discussion)

   - Folio-sophize aio

   - Export the subvolume id in statx() for both btrfs and bcachefs

   - Relax linkat(AT_EMPTY_PATH) requirements

   - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
     for dup*() equality replacing kcmp()

  Cleanups:

   - Compile out swapfile inode checks when swap isn't enabled

   - Use (1 << n) notation for FMODE_* bitshifts for clarity

   - Remove redundant variable assignment in fs/direct-io

   - Cleanup uses of strncpy in orangefs

   - Speed up and cleanup writeback

   - Move fsparam_string_empty() helper into header since it's currently
     open-coded in multiple places

   - Add kernel-doc comments to proc_create_net_data_write()

   - Don't needlessly read dentry->d_flags twice

  Fixes:

   - Fix out-of-range warning in nilfs2

   - Fix ecryptfs overflow due to wrong encryption packet size
     calculation

   - Fix overly long line in xfs file_operations (follow-up to FMODE_*
     cleanup)

   - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
     to FMODE_* cleanup)

   - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
     cleanup)

   - Fix stable offset api to prevent endless loops

   - Fix afs file server rotations

   - Prevent xattr node from overflowing the eraseblock in jffs2

   - Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
     operation instead of .open() operation since this caused userspace
     regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
2024-05-13 11:40:06 -07:00
Thomas Bertschinger
07f9a27f19 bcachefs: add no_invalid_checks flag
Setting this flag on a filesystem results in validity checks being
skipped when writing bkeys. This flag will be used by tooling that
deliberately injects corruption into a filesystem in order to exercise
fsck. It shouldn't be set outside of testing/debugging code.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:24:30 -04:00
Daniel Hill
bceacfa97e bcachefs: add counters for failed shrinker reclaim
This adds distinct counters for every reason the btree node shrinker can
fail to free an object - if our shrinker isn't making progress, this
will tell us why.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:24:29 -04:00
Kent Overstreet
692aa7a54b bcachefs: Fix sb_field_downgrade validation
- bch2_sb_downgrade_validate() wasn't checking for a downgrade entry
  extending past the end of the superblock section

- for_each_downgrade_entry() is used in to_text() and needs to work on
  malformed input; it also was missing a check for a field extending
  past the end of the section

Reported-by: syzbot+e49ccab73449180bc9be@syzkaller.appspotmail.com
Fixes: 84f1638795 ("bcachefs: bch_sb_field_downgrade")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
a5c3e265d3 bcachefs: Plumb bch_validate_flags to sb_field_ops.validate()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
65eaf4e24a bcachefs: s/bkey_invalid_flags/bch_validate_flags
We're about to start using bch_validate_flags for superblock section
validation - it's no longer bkey specific.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
d09a8468d9 bcachefs: fsync() should not return -EROFS
fsync has a slightly odd usage of -EROFS, where it means "does not
support fsync". I didn't choose it...

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
99179fb898 bcachefs: Invalid devices are now checked for by fsck, not .invalid methods
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
2f4b4a3b44 bcachefs: kill bch2_dev_bkey_exists() in bch2_check_fix_ptrs()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
02b7fa4fe5 bcachefs: kill bch2_dev_bkey_exists() in bch2_read_endio()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
2c91ab7262 bcachefs: bch2_dev_get_ioref() checks for device not present
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
465bf6f42a bcachefs: bch2_dev_get_ioref2(); io_read.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:36 -04:00
Kent Overstreet
91ffdecfc7 bcachefs: bch2_dev_get_ioref2(); debug.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:35 -04:00
Kent Overstreet
6212ea2497 bcachefs: bch2_dev_get_ioref2(); journal_io.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:35 -04:00
Kent Overstreet
48af853925 bcachefs: bch2_dev_get_ioref2(); io_write.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:35 -04:00
Kent Overstreet
690f7cdf73 bcachefs: bch2_dev_get_ioref2(); btree_io.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:35 -04:00
Kent Overstreet
466298e2f6 bcachefs: bch2_dev_get_ioref2(); backpointers.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:35 -04:00
Kent Overstreet
0e57996c69 bcachefs: bch2_dev_get_ioref2(); alloc_background.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:35 -04:00
Kent Overstreet
b6fb4269e7 bcachefs: for_each_bset() declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:23:34 -04:00
Petr Vorel
e2f48c4809 bcachefs: Move BCACHEFS_STATFS_MAGIC value to UAPI magic.h
Move BCACHEFS_STATFS_MAGIC value to UAPI <linux/magic.h> under
BCACHEFS_SUPER_MAGIC definition (use common approach for name) and reuse the
definition in bcachefs_format.h BCACHEFS_STATFS_MAGIC.

There are other bcachefs magic definitions: BCACHE_MAGIC, BCHFS_MAGIC,
which use UUID_INIT() and are used only in libbcachefs. Therefore move
only BCACHEFS_STATFS_MAGIC value, which can be used outside of
libbcachefs for f_type field in struct statfs in statfs() or fstatfs().

Suggested-by: Su Yue <glass.su@suse.com>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
e11ecc6133 bcachefs: Improve sysfs internal/btree_cache
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
c670509134 bcachefs: Allocator prefers not to expand mi.btree_allocated bitmap
We now have a small bitmap in the member info section of the superblock
for "regions that have btree nodes", so that if we ever have to scan for
btree nodes in repair we don't have to scan the whole device(s).

This tweaks the allocator to prefer allocating from regions that are
already marked in this bitmap.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
40574946b8 bcachefs: Better bucket alloc tracepoints
Tracepoints are garbage, and perf trace even cuts off some of our
fields.

Much nicer to just trace a string, and then we can build nicely
formatted output with printbufs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
6202569777 bcachefs: Move nocow unlock to bch2_write_endio()
This fixes a lifetime issue; bch2_nocow_write_unlock() uses
PTR_BUCKET_POS(), which needs the device - but we drop our ref to the
device in bch2_write_endio().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
b6d29b5869 bcachefs: kill bch2_dev_bkey_exists() in journal_ptrs_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
f4301b635a bcachefs: kill bch2_dev_bkey_exists() in discard_one_bucket_fast()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
bc3204c80a bcachefs: kill bch2_dev_bkey_exists() in check_alloc_info()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
d8585a79be bcachefs: bch2_dev_have_ref()
bch2_dev_bkey_exists() is going away; bch2_dev_have_ref() documents that
we're looking up a device without checking if it's present because we
have a reference to it already.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
222eacabc1 bcachefs: kill bch2_dev_bkey_exists() in data_update_init()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
a9422fd404 bcachefs: kill bch2_dev_bkey_exists() in bkey_pick_read_device()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
db39a35dde bcachefs: pass bch_dev to read_from_stale_dirty_pointer()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
78e9b548f3 bcachefs: bch2_dev_bucket_exists() uses bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
ad897d241b bcachefs: kill bch2_dev_bkey_exists() in btree_gc.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
9cadb4ea56 bcachefs: bch2_extent_normalize() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
8e3cc2003f bcachefs: bch2_bkey_has_target() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
8feecbed24 bcachefs: extent_ptr_invalid() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
3858aa4268 bcachefs: ptr_stale() -> dev_ptr_stale()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
302c980a81 bcachefs: extent_ptr_durability() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
3793b3f91f bcachefs: bch2_extent_merge() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
c387d84413 bcachefs: ec_validate_checksums() -> bch2_dev_tryget()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
8783856ab1 bcachefs: ob_dev()
Wrapper around bch2_dev_have_ref() for open_buckets; we do guarantee
that the device an open_bucket points to exists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
dbd0408087 bcachefs: move replica_set from bch_dev to bch_fs
This is needed for the next patch - the write submit path has to be able
to allocate a replica bio even when we weren't able to get a ref on the
device.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
633cf06944 bcachefs: Kill bch2_dev_bkey_exists() in backpointer code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
1f2f92ec3f bcachefs: PTR_BUCKET_POS() now takes bch_dev
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
fa6cce09f0 bcachefs: bch2_dev_iterate()
New helper for getting refs to devices as we iterate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
cb4d340a10 bcachefs: bch2_evacuate_bucket() -> bch2_dev_tryget()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
07d7c4da7b bcachefs: bch2_bucket_ref_update() now takes bch_dev
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
a7f1c26f59 bcachefs: bch2_trigger_alloc() -> bch2_dev_tryget()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
9b3059a1b3 bcachefs: bch2_check_alloc_key() -> bch2_dev_tryget_noerror()
More elimination of bch2_dev_bkey_exists() usage.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
4cd91e2f87 bcachefs: Convert to bch2_dev_tryget_noerror()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
b07eb8252f bcachefs: bch2_dev_tryget()
Most uses of bch2_dev_bkey_exists() are going away, where we assume that
because a key references a device the device most exist - instead, we'll
be explicitly checking if the device exists and getting a reference to
it.

This adds the new helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
6349b07c25 bcachefs: bch2_have_enough_devs() checks for nonexistent device
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
adf81796ee bcachefs: journal_replay_entry_early() checks for nonexistent device
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
13a16dabde bcachefs: bch2_dev_btree_bitmap_marked() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
267039d0fc bcachefs: Pass device to bch2_bucket_do_index()
Eliminating bch2_dev_bkey_exists() uses and replacing them with proper
checks; this one was unnecessary since the caller already has it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
f5faf43f85 bcachefs: Pass device to bch2_alloc_write_key()
More elimating bch2_dev_bkey_exists()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
23f308ae19 bcachefs: bch2_dev_safe() -> bch2_dev_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
552aa54865 bcachefs: Debug asserts for ca->ref
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
f295298b8c bcachefs: New helpers for device refcounts
This will be used in the next patch for adding some new debug mode
asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
e98786ea85 bcachefs: bch2_print_allocator_stuck()
If we block on the allocator for more than 10 seconds, print out some
useful debugging info.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
9a768ab75b bcachefs: bch2_bkey_drop_ptrs() declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
b895c70326 bcachefs: x-macroize journal flags enums
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
3a718c0647 bcachefs: On device add, prefer unused slots
We can't strictly guarantee that no pointers refer to nonexistent
devices - we attempt to, but we need to be safe when the filesystem is
corrupt.

Therefore, change device_add to try to pick a slot that's never been
used, or the slot that's been unused the longest.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
ffcbec6076 bcachefs: Kill opts.buckets_nouse
Now explicitly allocate and free the buckets_nouse bitmap - this is
going to be used for online fsck.

To go RW when we haven't check allocations, we'll do a much slimmed down
version that just initializes the buckets_nouse bitmaps.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
abe2f470bc bcachefs: simplify bch2_trans_start_alloc_update()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
0acf2169a5 bcachefs: __mark_stripe_bucket() now takes bch_alloc_v4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:21 -04:00
Kent Overstreet
be11ae16c4 bcachefs: __mark_pointer now takes bch_alloc_v4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:21 -04:00