Commit Graph

4156 Commits

Author SHA1 Message Date
Kent Overstreet
7e7595723c bcachefs: bchfs_read(): call trans_begin() on every loop iter
Same as the recent change for __bch2_read(); also, kill now unnecessary
btree_trans_too_many_iters() calls.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
804baca745 bcachefs: kill bch2_btree_iter_peek_and_restart()
dead code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
32ed4a620c bcachefs: Btree path tracepoints
Fastpath tracepoints, rarely needed, only enabled with
CONFIG_BCACHEFS_PATH_TRACEPOINTS.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
abbfc4db50 bcachefs: Add check for btree_path ref overflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
16005147cc bcachefs: Don't delete open files in online fsck
If a file is unlinked but still open, we don't want online fsck to
delete it - or fun inconsistencies will happen.

https://github.com/koverstreet/bcachefs/issues/727

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Youling Tang
094c6a9f5c bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
After commit 230e9fc286 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9 ("kmemcg:
account for certain kmem allocations to memcg")

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
2c377d8a71 bcachefs: fix btree_key_cache sysfs knob
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Youling Tang
082330c361 bcachefs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() to alloc_inode_sb().

It will also fix [1] to avoid the NULL pointer dereference BUG in
list_lru_add() when CONFIG_MEMCG is enabled.

Links:
[1]: https://lore.kernel.org/all/20589721-46c0-4344-b2ef-6ab48bbe2ea5@linux.dev/
[2]: https://lore.kernel.org/all/7db60e36-9c96-4938-a28d-a9745e287386@linux.dev/

Fixes: 86d81ec5f5 ("bcachefs: Mark bch_inode_info as SLAB_ACCOUNT")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
9092a38a3d bcachefs: Opt_durability can now be set via bch2_opt_set_sb()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
4aedeac570 bcachefs: bch2_opt_set_sb() can now set (some) device options
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
afefc986b7 bcachefs: data_allowed is now an opts.h option
need this so cmd_option in userspace can handle it

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Thorsten Blum
8573dd3474 bcachefs: Annotate struct bucket_array with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
bucket to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Nathan Chancellor
5396e5af3c bcachefs: Fix format specifier in bch2_btree_key_cache_to_text()
When building for a 32-bit architecture, for which 'size_t' is
'unsigned int', there is a compiler warning due to use of '%lu':

  In file included from fs/bcachefs/vstructs.h:5,
                   from fs/bcachefs/bcachefs_format.h:80,
                   from fs/bcachefs/bcachefs.h:207,
                   from fs/bcachefs/btree_key_cache.c:3:
  fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text':
  fs/bcachefs/btree_key_cache.c:795:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
    795 |         prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
        |                         ^~~~~~~~~~~~~~~~~~~
  fs/bcachefs/util.h:78:63: note: in definition of macro 'prt_printf'
     78 | #define prt_printf(_out, ...)           bch2_prt_printf(_out, __VA_ARGS__)
        |                                                               ^~~~~~~~~~~
  fs/bcachefs/btree_key_cache.c:795:38: note: format string is defined here
    795 |         prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
        |                                    ~~^
        |                                      |
        |                                      long unsigned int
        |                                    %u
  cc1: all warnings being treated as errors

Use the proper specifier, '%zu', to resolve the warning.

Fixes: e447e49977b8 ("bcachefs: key cache can now allocate from pending")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
5f1929f1f0 bcachefs: key cache can now allocate from pending
btree_trans objects can hold the btree_trans_barrier srcu read lock for
an extended amount of time (they shouldn't, but it's difficult to
guarantee).

the srcu barrier blocks memory reclaim, so to avoid too many stranded
key cache items, this uses the new pending_rcu_items to allocate from
pending items - like we did before, but now without a global lock on the
key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
f2bfe7e837 bcachefs: Rip out freelists from btree key cache
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
d2ed0f206a bcachefs: rcu_pending now works in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
8e973a4f3c bcachefs: rcu_pending
Generic data structure for explicitly tracking pending RCU items,
allowing items to be dequeued (i.e. allocate from items pending
freeing). Works with conventional RCU and SRCU, and possibly other RCU
flavors in the future, meaning this can serve as a more generic
replacement for SLAB_TYPESAFE_BY_RCU.

Pending items are tracked in radix trees; if memory allocation fails, we
fall back to linked lists.

A rcu_pending is initialized with a callback, which is invoked when
pending items's grace periods have expired. Two types of callback
processing are handled specially:

- RCU_PENDING_KVFREE_FN

  New backend for kvfree_rcu(). Slightly faster, and eliminates the
  synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an
  rcu_head is allocated if we don't have one and can't use the radix
  tree

  TODO:
  - add a shrinker (as in the existing kvfree_rcu implementation) so that
    memory reclaim can free expired objects if callback processing isn't
    keeping up, and to expedite a grace period if we're under memory
    pressure and too much memory is stranded by RCU

  - add a counter for amount of memory pending

- RCU_PENDING_CALL_RCU_FN

  Accelerated backend for call_rcu() - pending callbacks are tracked in
  a radix tree to eliminate linked list overhead.

to serve as replacement backends for kvfree_rcu() and call_rcu(); these
may be of interest to other uses (e.g. SLAB_TYPESAFE_BY_RCU users).

Note:

Internally, we're using a single rearming call_rcu() callback for
notifications from the core RCU subsystem for notifications when objects
are ready to be processed.

Ideally we would be getting a callback every time a grace period
completes for which we have objects, but that would require multiple
rcu_heads in flight, and since the number of gp sequence numbers with
uncompleted callbacks is not bounded, we can't do that yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
54f7702466 bcachefs: Fix deadlock in __wait_on_freeing_inode()
We can't call __wait_on_freeing_inode() with btree locks held; we're
waiting on another thread that's in evict(), and before it clears that
bit it needs to write that inode to flush timestamps - deadlock.

Fixing this involves a fair amount of re-jiggering to plumb a new
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
112d21fd1a bcachefs: switch to rhashtable for vfs inodes hash
the standard vfs inode hash table suffers from painful lock contention -
this is long overdue

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Reed Riley
27663d7784 bcachefs: Replace div_u64 with div64_u64 where second param is u64
Bcachefs often uses this function to divide by nanosecond times - which
can easily cause problems when cast to u32.  For example, `cat
/sys/fs/bcachefs/*/internal/rebalance_status` would return invalid data
in the `duration waited` field because dividing by the number of
nanoseconds in a minute requires the divisor parameter to be u64.

Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Feiko Nanninga
36f0af4f44 bcachefs: Fix sysfs rebalance duration waited formatting
cat /sys/fs/bcachefs/*/internal/rebalance_status
waiting
  io wait duration:  13.5 GiB
  io wait remaining: 627 MiB
  duration waited:   1392 m

duration waited was increasing at a rate of about 14 times the expected
rate.

div_u64 takes a u32 divisor, but u->nsecs (from time_units[]) can be
bigger than u32.

Signed-off-by: Feiko Nanninga <feiko.nanninga@fnanninga.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Alyssa Ross
a3ed1cc413 bcachefs: Fix negative timespecs
This fixes two problems in the handling of negative times:

 • rem is signed, but the rem * c->sb.nsec_per_time_unit operation
   produced a bogus unsigned result, because s32 * u32 = u32.

 • The timespec was not normalized (it could contain more than a
   billion nanoseconds).

For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).

Cc: stable@vger.kernel.org
Fixes: 595c1e9bab ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
52df04f039 bcachefs: More BCH_SB_MEMBER_INVALID support
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
df88febc20 bcachefs: Simplify bch2_bkey_drop_ptrs()
bch2_bkey_drop_ptrs() had a some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
ec36573dcd bcachefs: Add a cond_resched() to __journal_keys_sort()
Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
5a6e43af1e bcachefs: Fix ca->io_ref usage
ca->io_ref does not protect against the filesystem going way,
c->write_ref does. Much like

0b50b7313e bcachefs: Fix refcounting in discard path

the other async paths need fixing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
53f6619554 bcachefs: BCH_SB_MEMBER_INVALID
Create a sentinal value for "invalid device".

This is needed for removing devices that have stripes on them (force
removing, without evacuating); we need a sentinal value for the stripe
pointers to the device being removed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-03 20:43:14 -04:00
Kent Overstreet
7f12a963b6 bcachefs: fix rebalance accounting
Fixes: 49aa783039 ("bcachefs: Fix rebalance_work accounting")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-01 15:54:40 -04:00
Kent Overstreet
3d3020c461 bcachefs: Mark more errors as autofix
errors that are known to always be safe to fix should be autofix: this
should be most errors even at this point, but that will need some
thorough review.

note that errors are still logged in the superblock, so we'll still know
that they happened.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-31 19:27:01 -04:00
Kent Overstreet
e3e6940940 bcachefs: Revert lockless buffered IO path
We had a report of data corruption on nixos when building installer
images.

https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334

It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined if it's write
calls are being dropped or dirty folios.

Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.

It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisection the kernel config initially pointer the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.

Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.

My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.

Fixes: 7e64c86cdc ("bcachefs: Buffered write path now can avoid the inode lock")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-31 19:26:08 -04:00
Christian Brauner
0fe340a98b inode: port __I_NEW to var event
Port the __I_NEW mechanism to use the new var event mechanism.

Link: https://lore.kernel.org/r/20240823-work-i_state-v3-4-5cd5fd207a57@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:22:39 +02:00
Kent Overstreet
d26935690c bcachefs: Fix bch2_extents_match() false positive
This was caught as a very rare nonce inconsistency, on systems with
encryption and replication (and tiering, or some form of rebalance
operation running):

[Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path
[Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] k:   u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0
[Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only

bch2_extents_match() was reporting true for extents that did not
actually point to the same data.

bch2_extent_match() iterates over pairs of pointers, looking for
pointers that point to the same location on disk (with matching
generation numbers). However one or both extents may have been trimmed
(or merged) and they might not have the same disk offset: it corrects
for this by subtracting the key offset and the checksum entry offset.

However, this failed when an extent was immediately partially
overwritten, and the new overwrite was allocated the next adjacent disk
space.

Normally, with compression off, this would never cause a bug, since the
new extent would have to be immediately after the old extent for the
pointer offsets to match, and the rebalance index update path is not
looking for an extent outside the range of the extent it moved.

However with compression enabled, extents take up less space on disk
than they do in the btree index space - and spuriously matching after
partial overwrite is possible.

To fix this, add a secondary check, that strictly checks that the
regions pointed to on disk overlap.

https://github.com/koverstreet/bcachefs/issues/717

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-26 20:33:12 -04:00
Kent Overstreet
66927b8928 bcachefs: Fix failure to return error in data_update_index_update()
This fixes an assertion pop in io_write.c - if we don't return an error
we're supposed to have completed all the btree updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-26 20:33:12 -04:00
Kent Overstreet
49aa783039 bcachefs: Fix rebalance_work accounting
rebalance_work was keying off of the presence of rebelance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available

Fixes: 20ac515a9c ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-24 10:16:21 -04:00
Kent Overstreet
d3204616a6 bcachefs: Fix failure to flush moves before sleeping in copygc
This fixes an apparent deadlock - rebalance would get stuck trying to
take nocow locks because they weren't being released by copygc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-24 10:16:21 -04:00
Kent Overstreet
a592cdf516 bcachefs: don't use rht_bucket() in btree_key_cache_scan()
rht_bucket() does strange complicated things when a rehash is in
progress.

Instead, just skip scanning when a rehash is in progress: scanning is
going to be more expensive (many more empty slots to cover), and some
sort of infinite loop is being observed

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 10:04:41 -04:00
Kent Overstreet
3e878fe5a0 bcachefs: add missing inode_walker_exit()
fix a small leak

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 10:04:41 -04:00
Kent Overstreet
87313ac1f1 bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
bch2_btree_key_cache_drop() evicts the key cache entry - it's used when
we're doing an update that bypasses the key cache, because for cache
coherency reasons a key can't be in the key cache unless it also exists
in the btree - i.e. creates have to bypass the cache.

After evicting, the path no longer points to a key cache key, and
relock() will always fail if should_be_locked is true.

Prep for improving path->should_be_locked assertions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 03:12:57 -04:00
Yuesong Li
dedb2fe375 bcachefs: Fix double assignment in check_dirent_to_subvol()
ret was assigned twice in check_dirent_to_subvol(). Reported by cocci.

Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:40:52 -04:00
Kent Overstreet
0b50b7313e bcachefs: Fix refcounting in discard path
bch_dev->io_ref does not protect against the filesystem going away;
bch_fs->writes does.

Thus the filesystem write ref needs to be the last ref we release.

Reported-by: syzbot+9e0404b505e604f67e41@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
8ed823b192 bcachefs: Fix compat issue with old alloc_v4 keys
we allow new fields to be added to existing key types, and new versions
should treat them as being zeroed; this was not handled in
alloc_v4_validate.

Reported-by: syzbot+3b2968fa4953885dd66a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
7f2de6947f bcachefs: Fix warning in bch2_fs_journal_stop()
j->last_empty_seq needs to match j->seq when the journal is empty

Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
bdbdd4759f bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
Reported-by: syzbot+47ecc948aadfb2ab3efc@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
cab18be695 bcachefs: Fix replay_now_at() assert
Journal replay, in the slowpath where we insert keys in journal order,
was inserting keys in the wrong order; keys from early repair come last.

Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
6575b8c987 bcachefs: Fix locking in bch2_ioc_setlabel()
Fixes: 7a254053a5 ("bcachefs: support FS_IOC_SETFSLABEL")
Reported-by: syzbot+7e9efdfec27fbde0141d@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
5dbfc4ef72 bcachefs: fix failure to relock in btree_node_fill()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
3c5d0b72a8 bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
We weren't always so strict about trans->locked state - but now we are,
and new assertions are shaking some bugs out.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
1dceae4cc1 bcachefs: unlock_long() before resort in journal replay
Fix another SRCU splat - this one pretty harmless.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
cecc328240 bcachefs: fix missing bch2_err_str()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
b8db1bd802 bcachefs: fix time_stats_to_text()
Fixes: 7423330e30 ("bcachefs: prt_printf() now respects \r\n\t")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
c2a503f3e9 bcachefs: Fix bch2_bucket_gens_init()
Comparing the wrong bpos - this was missed because normally
bucket_gens_init() runs on brand new filesystems, but this bug caused it
to overwrite bucket_gens keys with 0s when upgrading ancient
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
e150a7e89c bcachefs: Fix bch2_trigger_alloc assert
On testing on an old mangled filesystem, we missed a case.

Fixes: bd864bc2d9 ("bcachefs: Fix bch2_trigger_alloc when upgrading from old versions")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
49203a6b9d bcachefs: Fix failure to relock in btree_node_get()
discovered by new trans->locked asserts

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
548e7f5167 bcachefs: setting bcachefs_effective.* xattrs is a noop
bcachefs_effective.* xattrs show the options inherited from parent
directories (as well as explicitly set); this namespace is not for
setting bcachefs options.

Change the .set() handler to a noop so that if e.g. rsync is copying
xattrs it'll do the right thing, and only copy xattrs in the bcachefs.*
namespace. We don't want to return an error, because that will cause
rsync to bail out or get spammy.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
8cc0e50614 bcachefs: Fix "trying to move an extent, but nr_replicas=0"
data_update_init() does a bunch of complicated stuff to decide how many
replicas to add, since we only want to increase an extent's durability
on an explicit rereplicate, but extent pointers may be on devices with
different durability settings.

There was a corner case when evacuating a device that had been set to
durability=0 after data had been written to it, and extents on that
device had already been rereplicated - then evacuate only needs to drop
pointers on that device, not move them.

So the assert for !m->op.nr_replicas was spurious; this was a perfectly
legitimate case that needed to be handled.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
3f53d05041 bcachefs: bch2_data_update_init() cleanup
Factor out some helpers - this function has gotten much too big.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
2102bdac67 bcachefs: Extra debug for data move path
We don't have sufficient information to debug:

https://github.com/koverstreet/bcachefs/issues/726

- print out durability of extent ptrs, when non default
- print the number of replicas we need in data_update_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-19 21:42:08 -04:00
Kent Overstreet
47cdc7b144 bcachefs: Fix incorrect gfp flags
fixes:
00488 WARNING: CPU: 9 PID: 194 at mm/page_alloc.c:4410 __alloc_pages_noprof+0x1818/0x1888
00488 Modules linked in:
00488 CPU: 9 UID: 0 PID: 194 Comm: kworker/u66:1 Not tainted 6.11.0-rc1-ktest-g18fa10d6495f #2931
00488 Hardware name: linux,dummy-virt (DT)
00488 Workqueue: writeback wb_workfn (flush-bcachefs-2)
00488 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00488 pc : __alloc_pages_noprof+0x1818/0x1888
00488 lr : __alloc_pages_noprof+0x5f4/0x1888
00488 sp : ffffff80ccd8ed00
00488 x29: ffffff80ccd8ed00 x28: 0000000000000000 x27: dfffffc000000000
00488 x26: 0000000000000010 x25: 0000000000000002 x24: 0000000000000000
00488 x23: 0000000000000000 x22: 1ffffff0199b1dbe x21: ffffff80cc680900
00488 x20: 0000000000000000 x19: ffffff80ccd8eed0 x18: 0000000000000000
00488 x17: ffffff80cc58a010 x16: dfffffc000000000 x15: 1ffffff00474e518
00488 x14: 1ffffff00474e518 x13: 1ffffff00474e518 x12: ffffffb8104701b9
00488 x11: 1ffffff8104701b8 x10: ffffffb8104701b8 x9 : ffffffc08043cde8
00488 x8 : 00000047efb8fe48 x7 : ffffff80ccd8ee20 x6 : 0000000000048000
00488 x5 : 1ffffff810470138 x4 : 0000000000000050 x3 : 1ffffff0199b1d94
00488 x2 : ffffffb0199b1d94 x1 : 0000000000000001 x0 : ffffffc082387448
00488 Call trace:
00488  __alloc_pages_noprof+0x1818/0x1888
00488  new_slab+0x284/0x2f0
00488  ___slab_alloc+0x208/0x8e0
00488  __kmalloc_noprof+0x328/0x340
00488  __bch2_writepage+0x106c/0x1830
00488  write_cache_pages+0xa0/0xe8

due to __GFP_NOFAIL without allowing reclaim

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-18 20:42:05 -04:00
Kent Overstreet
d9f49c3106 bcachefs: fix field-spanning write warning
attempts to retrofit memory safety onto C are increasingly annoying

------------[ cut here ]------------
memcpy: detected field-spanning write (size 4) of single field "&k.replicas" at fs/bcachefs/replicas.c:454 (size 3)
WARNING: CPU: 5 PID: 6525 at fs/bcachefs/replicas.c:454 bch2_replicas_gc2+0x2cb/0x400 [bcachefs]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
Modules linked in: dm_mod tun nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay msr sctp bcachefs lz4hc_compress lz4_compress libcrc32c xor raid6_pq lz4_decompress pps_ldisc pps_core wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic libchacha sit tunnel4 ip_tunnel af_packet bridge stp llc ip6table_nat ip6table_filter ip6_tables xt_MASQUERADE xt_conntrack iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables tcp_bbr sch_fq_codel efivarfs nls_iso8859_1 nls_cp437 vfat fat cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet r8152 input_leds joydev mii amdgpu mousedev hid_generic usbhid hid ath10k_pci amd_atl edac_mce_amd ath10k_core kvm_amd ath kvm mac80211 bfq crc32_pclmul crc32c_intel polyval_clmulni polyval_generic sha512_ssse3 sha256_ssse3 sha1_ssse3 snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg i2c_algo_bit drm_exec snd_hda_codec r8169 drm_suballoc_helper
aesni_intel gf128mul crypto_simd amdxcp realtek mfd_core tpm_crb drm_buddy snd_hwdep mdio_devres libarc4 cryptd tpm_tis wmi_bmof cfg80211 evdev libphy snd_hda_core tpm_tis_core gpu_sched rapl xhci_pci xhci_hcd snd_pcm drm_display_helper snd_timer tpm sp5100_tco rfkill efi_pstore mpt3sas drm_ttm_helper ahci usbcore libaescfb ccp snd ttm 8250 libahci watchdog soundcore raid_class sha1_generic acpi_cpufreq k10temp 8250_base usb_common scsi_transport_sas i2c_piix4 hwmon video serial_mctrl_gpio serial_base ecdh_generic wmi rtc_cmos backlight ecc gpio_amdpt rng_core gpio_generic button
CPU: 5 UID: 0 PID: 6525 Comm: bcachefs Tainted: G        W          6.11.0-rc1-ojab-00058-g224bc118aec9 #6 6d5debde398d2a84851f42ab300dae32c2992027
Tainted: [W]=WARN
RIP: 0010:bch2_replicas_gc2+0x2cb/0x400 [bcachefs]
Code: c7 c2 60 91 d1 c1 48 89 c6 48 c7 c7 98 91 d1 c1 4c 89 14 24 44 89 5c 24 08 48 89 44 24 20 c6 05 fa 68 04 00 01 e8 05 a3 40 e4 <0f> 0b 4c 8b 14 24 44 8b 5c 24 08 48 8b 44 24 20 e9 55 fe ff ff 8b
RSP: 0018:ffffb434c9263d60 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9a8efa79cc00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffb434c9263de0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: ffff9a8efa73c300 R14: ffff9a8d9e880000 R15: ffff9a8d9e8806f8
FS:  0000000000000000(0000) GS:ffff9a9410c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000565423373090 CR3: 0000000164e30000 CR4: 00000000003506f0
Call Trace:
<TASK>
? __warn+0x97/0x150
? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
? report_bug+0x196/0x1c0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x17/0x80
? __wake_up_klogd.part.0+0x4c/0x80
? asm_exc_invalid_op+0x16/0x20
? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
? bch2_dev_usage_read+0xa0/0xa0 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_dev_usage_read+0xa0/0xa0:
discard_in_flight_remove at /home/ojab/src/bcachefs/fs/bcachefs/alloc_background.c:1712

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-18 20:42:05 -04:00
Kent Overstreet
d6d539c9a7 bcachefs: Reallocate table when we're increasing size
Fixes: c2f6e16a67 ("bcachefs: Increase size of cuckoo hash table on too many rehashes")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-18 20:41:50 -04:00
Kent Overstreet
0e49d3ff12 bcachefs: Fix locking in __bch2_trans_mark_dev_sb()
We run this in full RW mode now, so we have to guard against the
superblock buffer being reallocated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 20:45:15 -04:00
Kent Overstreet
99c87fe0f5 bcachefs: fix incorrect i_state usage
Reported-by: syzbot+95e40eae71609e40d851@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
9482f3b053 bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru
Reported-by: syzbot+510b0b28f8e6de64d307@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
075cabf324 bcachefs: Fix forgetting to pass trans to fsck_err()
Reported-by: syzbot+e3938cd6d761b78750e6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
c2f6e16a67 bcachefs: Increase size of cuckoo hash table on too many rehashes
Also, improve the calculation of the new table size, so that it can
shrink when needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
58474f76a7 bcachefs: bcachefs_metadata_version_disk_accounting_inum
This adds another disk accounting counter to track usage per inode
number (any snapshot ID).

This will be used for a couple things:

- It'll give us a way to tell the user how much space a given file ista
  consuming in all snapshots; i.e. how much extra space it's consuming
  due to snapshot versioning.

- It counts number of extents and total size of extents (both in btree
  keyspace sectors and actual disk usage), meaning it gives us average
  extent size: that is, it'll let us cheaply find fragmented files that
  should be defragmented.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
5132b99bb6 bcachefs: Kill __bch2_accounting_mem_mod()
The next patch will be adding a disk accounting counter type which is
not kept in the in-memory eytzinger tree.

As prep, fold __bch2_accounting_mem_mod() into
bch2_accounting_mem_mod_locked() so that we can check for that counter
type and bail out without calling bpos_to_disk_accounting_pos() twice.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
d97de0d017 bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
bkey_fsck_err() was added as an interface that looks like fsck_err(),
but previously all it did was ensure that the appropriate error counter
was incremented in the superblock.

This is a cleanup and bugfix patch that converts it to a wrapper around
fsck_err(). This is needed to fix an issue with the upgrade path to
disk_accounting_v3, where the "silent fix" error list now includes
bkey_fsck errors; fsck_err() handles this in a unified way, and since we
need to change printing of bkey fsck errors from the caller to the inner
bkey_fsck_err() calls, this ends up being a pretty big change.

Als,, rename .invalid() methods to .validate(), for clarity, while we're
changing the function signature anyways (to drop the printbuf argument).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
c99471024f bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
06a8693b89 bcachefs: Add a time_stat for blocked on key cache flush
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
790666c8ac bcachefs: Improve trans_blocked_journal_reclaim tracepoint
include information about the state of the btree key cache

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:34 -04:00
Kent Overstreet
7254555c44 bcachefs: Add hysteresis to waiting on btree key cache flush
This helps ensure key cache reclaim isn't contending with threads
waiting for the key cache to be helped, and fixes a severe performance
bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:34 -04:00
Kent Overstreet
968feb854a bcachefs: Convert for_each_btree_node() to lockrestart_do()
for_each_btree_node() now works similarly to for_each_btree_key(), where
the loop body is passed as an argument to be passed to lockrestart_do().

This now calls trans_begin() on every loop iteration - which fixes an
SRCU warning in backpointers fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
48d6cc1b48 bcachefs: Add missing downgrade table entry
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
486d920735 bcachefs: disk accounting: ignore unknown types
forward compat fix

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
d9e615762b bcachefs: bch2_accounting_invalid() fixup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
bd864bc2d9 bcachefs: Fix bch2_trigger_alloc when upgrading from old versions
bch2_trigger_alloc was assuming that the new key would always be newly
created and thus always an alloc_v4 key, but - not when called from
btree_gc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
a24e6e7146 bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()
bch2_btree_path_traverse_cached() was previously checking if it could
just relock the path, which is a common idiom in path traversal.

However, it was using btree_node_relock(), not btree_path_relock();
btree_path_relock() only succeeds if the path was in state
BTREE_ITER_NEED_RELOCK.

If the path was in state BTREE_ITER_NEED_TRAVERSE a full traversal is
needed; this led to a null ptr deref in
bch2_btree_path_traverse_cached().

And the short circuit check here isn't needed, since it was already done
in the main bch2_btree_path_traverse_one().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
8a2491db7b bcachefs: bcachefs_metadata_version_disk_accounting_v3
bcachefs_metadata_version_disk_accounting_v2 erroneously had padding
bytes in disk_accounting_key, which is a problem because we have to
guarantee that all unused bytes in disk_accounting_key are zeroed.

Fortunately 6.11 isn't out yet, so it's cheap to fix this by spinning a
new version.

Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 19:21:28 -04:00
Kent Overstreet
1a9e219db1 bcachefs: improve bch2_dev_usage_to_text()
Add a line for capacity

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 14:40:19 -04:00
Kent Overstreet
077e473723 bcachefs: bch2_accounting_invalid()
Implement bch2_accounting_invalid(); check for junk at the end, and
replicas accounting entries in particular need to be checked or we'll
pop asserts later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 14:40:19 -04:00
Kent Overstreet
f39bae2e02 bcachefs: Switch to .get_inode_acl()
.set_acl() requires a dentry, and if one isn't passed it marks the VFS
inode as not having an ACL.

This has been causing inodes with ACLs to have them "disappear" on
bcachefs filesystem, depending on which path those inodes get pulled
into the cache from.

Switching to .get_inode_acl(), like other local filesystems, fixes this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-08 15:14:02 -04:00
Kent Overstreet
73dc1656f4 bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
If the allocator gets stuck, we need to know why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Kent Overstreet
cecf72798b bcachefs: Make allocator stuck timeout configurable, ratelimit messages
Limit these messages to once every 2 minutes to avoid spamming logs;
with multiple devices the output can be quite significant.

Also, up the default timeout to 30 seconds from 10 seconds.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Kent Overstreet
6d496e02b4 bcachefs: Add missing path_traverse() to btree_iter_next_node()
This fixes a bug exposed by the next path - we pop an assert in
path_set_should_be_locked().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Qais Yousef
ae04f69de0 sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()
Some find the name realtime overloaded. Use rt_or_dl() as an
alternative, hopefully better, name.

Suggested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
2024-08-07 18:32:38 +02:00
Qais Yousef
130fd056dd sched/rt: Clean up usage of rt_task()
rt_task() checks if a task has RT priority. But depends on your
dictionary, this could mean it belongs to RT class, or is a 'realtime'
task, which includes RT and DL classes.

Since this has caused some confusion already on discussion [1], it
seemed a clean up is due.

I define the usage of rt_task() to be tasks that belong to RT class.
Make sure that it returns true only for RT class and audit the users and
replace the ones required the old behavior with the new realtime_task()
which returns true for RT and DL classes. Introduce similar
realtime_prio() to create similar distinction to rt_prio() and update
the users that required the old behavior to use the new function.

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to realtime_task_policy() as the old name is
confusing against the new realtime_task().

No functional changes were intended.

[1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
2024-08-07 18:32:37 +02:00
Kent Overstreet
2caca9fb16 bcachefs: ec should not allocate from ro devs
This fixes a device removal deadlock when using erasure coding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
c1e4446247 bcachefs: Improved allocator debugging for ec
chasing down a device removal deadlock with erasure coding

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
02026e8931 bcachefs: Add missing bch2_trans_begin() call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
90b211fa2d bcachefs: Add a comment for bucket helper types
We've had bugs in the past with incorrect integer conversions in disk
accounting code, which is why bucket helpers now always return s64s; add
a comment explaining this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
7442b5cdf2 bcachefs: Don't rely on implicit unsigned -> signed integer conversion
implicit integer conversion is a fertile source of bugs, and we really
would rather not have the min()/max() macros doing it implicitly.
bcachefs appears to be the only place in the kernel where this happens,
so let's fix it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Matthew Wilcox (Oracle)
1da86618bd
fs: Convert aops->write_begin to take a folio
Convert all callers from working on a page to working on one page
of a folio (support for working on an entire folio can come later).
Removes a lot of folio->page->folio conversions.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:33:21 +02:00
Matthew Wilcox (Oracle)
a225800f32
fs: Convert aops->write_end to take a folio
Most callers have a folio, and most implementations operate on a folio,
so remove the conversion from folio->page->folio to fit through this
interface.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:32:02 +02:00
Kent Overstreet
e61dd67860 bcachefs: Fix double free of ca->buckets_nouse
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: ffcbec6076 ("bcachefs: Kill opts.buckets_nouse")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-30 20:43:29 -04:00
Linus Torvalds
dd018c238b bcachefs fixes for 6.11-rc1
- another fix for fsck getting stuck, from marcin
 - small syzbot fix
 - another undefined shift fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmaekrUACgkQE6szbY3K
 bnZoLg//Sbdo0JsUJIDU3gnmyEMmWx1mvUT7MJS2EwwLpLR2t1oNcE5CB9aNdv2d
 fhJxjhuBnc9Z1yO83eQSinUcMGdpM3QIS3BH1dwMa2kY5jE0cfxvdoXX+2CDVzPe
 6/0SgX+0Ce2X7MxQSI8Nu9RhNWtEwFqdtOoOanBWHzjBPNDC5+ZuocZAnqMyM0cu
 KYmtZ1lyGwa23ILwaKWuZopXj8jd62FU+X49PWpzLvOLM+1BWwYqOrL0ZgNokkLc
 LoalSWmdVDTXCs6dteDO++nwLoPJbcYgwv7fbB6yiHj0xs/bwRgu0FHVzd/1tjMM
 O8VY/9/el+EbVE2VkA6JMHe6IMRJS6Gf6c78MfwwPPtcOyinM8AVlLNa1WLH6Sjw
 XfCKzENnM6CHSfefY3IcYrKHQ3htdZNw7nvnz2PTwP8KejHBkOgmde3Dqhn6khKa
 R3U3nR9hc5lOvN6Y7EuLmw/tLvab6NjNdm5xSIo9Tpg2GjpsnITZL7hSKOAG1ua/
 7JxWGND+SrDgU/bEv+kRsOTRDgx81uOMQ1IUMmW5CwFPj61K1X3q4SBwjxeopC3Q
 CQ9IpkK/twLai8ENSy37HqeSzqbLCsJFILR3q68SlyE7KSuGFdR6ySHX0NYQFY1L
 TDTJcajUB0O23xlL7WEIyeH3pGx6+adS5YYsk0dR9s5o7UEn84g=
 =CKAY
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-07-22' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:

 - another fix for fsck getting stuck, from marcin

 - small syzbot fix

 - another undefined shift fix

* tag 'bcachefs-2024-07-22' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: Fix printbuf usage while atomic
  bcachefs: More informative error message in reattach_inode()
  bcachefs: kill btree_trans_too_many_iters() in bch2_bucket_alloc_freelist()
  bcachefs: mean_and_variance: Avoid too-large shift amounts
2024-07-22 10:59:08 -07:00
Kent Overstreet
737759fc09 bcachefs: Fix printbuf usage while atomic
Reported-by: syzbot+f765e51170cf13493f0b@syzkaller.appspotmail.com
Fixes: f12410bb7d ("bcachefs: Add an error message for insufficient rw journal devs")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-22 11:27:15 -04:00
Kent Overstreet
7a086baad0 bcachefs: More informative error message in reattach_inode()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-22 11:27:15 -04:00
Linus Torvalds
527eff227d - In the series "treewide: Refactor heap related implementation",
Kuan-Wei Chiu has significantly reworked the min_heap library code and
   has taught bcachefs to use the new more generic implementation.
 
 - Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
   reworks the cpumask and nodemask headers to make things generally more
   rational.
 
 - Kuan-Wei Chiu has sent along some maintenance work against our sorting
   library code in the series "lib/sort: Optimizations and cleanups".
 
 - More library maintainance work from Christophe Jaillet in the series
   "Remove usage of the deprecated ida_simple_xx() API".
 
 - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
   series "nilfs2: eliminate the call to inode_attach_wb()".
 
 - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB
   command error".
 
 - Plus the usual shower of singleton patches all over the place.  Please
   see the relevant changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2GvwAKCRDdBJ7gKXxA
 jlf/AP48xP5ilIHbtpAKm2z+MvGuTxJQ5VSC0UXFacuCbc93lAEA+Yo+vOVRmh6j
 fQF2nVKyKLYfSz7yqmCyAaHWohIYLgg=
 =Stxz
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - In the series "treewide: Refactor heap related implementation",
   Kuan-Wei Chiu has significantly reworked the min_heap library code
   and has taught bcachefs to use the new more generic implementation.

 - Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
   reworks the cpumask and nodemask headers to make things generally
   more rational.

 - Kuan-Wei Chiu has sent along some maintenance work against our
   sorting library code in the series "lib/sort: Optimizations and
   cleanups".

 - More library maintainance work from Christophe Jaillet in the series
   "Remove usage of the deprecated ida_simple_xx() API".

 - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
   series "nilfs2: eliminate the call to inode_attach_wb()".

 - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
   GDB command error".

 - Plus the usual shower of singleton patches all over the place. Please
   see the relevant changelogs for details.

* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
  ia64: scrub ia64 from poison.h
  watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
  tsacct: replace strncpy() with strscpy()
  lib/bch.c: use swap() to improve code
  test_bpf: convert comma to semicolon
  init/modpost: conditionally check section mismatch to __meminit*
  init: remove unused __MEMINIT* macros
  nilfs2: Constify struct kobj_type
  nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
  math: rational: add missing MODULE_DESCRIPTION() macro
  lib/zlib: add missing MODULE_DESCRIPTION() macro
  fs: ufs: add MODULE_DESCRIPTION()
  lib/rbtree.c: fix the example typo
  ocfs2: add bounds checking to ocfs2_check_dir_entry()
  fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
  coredump: simplify zap_process()
  selftests/fpu: add missing MODULE_DESCRIPTION() macro
  compiler.h: simplify data_race() macro
  build-id: require program headers to be right after ELF header
  resource: add missing MODULE_DESCRIPTION()
  ...
2024-07-21 17:56:22 -07:00
Kent Overstreet
2fa88b1919 bcachefs: kill btree_trans_too_many_iters() in bch2_bucket_alloc_freelist()
When we're called via
trans commit -> btree split -> allocator

We may have already arbitrarily many btree_paths, for the transaction
commit we're trying to do; when this happens, the
btree_trans_too_many_iters() call causes us to livelock.

Since the allocator calls btree_iter_dontneed to release paths as it
iterates, this shouldn't cause any problems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 21:06:03 -04:00
Linus Torvalds
720261cfc7 bcachefs changes for 6.11-rc1 (version 2)
Additional fixes on top of the original 6.11 pull request:
 - undefined behaviour fixes, originally noted as breaking userspace LTO
   builds
 - fix a spurious warning in fsck_err, reported by Marcin
 - fix an integer overflow on trans->nr_updates, also reported by Marcin;
   this broke during deletion of highly fragmented indirect extents
 - Add comments for lockdep functions
 
 ======
 
 - Metadata version 1.8: Stripe sectors accounting, BCH_DATA_unstriped
 
 This splits out the accounting of dirty sectors and stripe sectors in
 alloc keys; this lets us see stripe buckets that still have unstriped
 data in them.
 
 This is needed for ensuring that erasure coding is working correctly, as
 well as completing stripe creation after a crash.
 
 - Metadata version 1.9: Disk accounting rewrite
 
 The previous disk accounting scheme relied heavily on percpu counters
 that were also sharded by outstanding journal buffer; it was fast but
 not extensible or scalable, and meant that all accounting counters were
 recorded in every journal entry.
 
 The new disk accounting scheme stores accounting as normal btree keys;
 updates are deltas until they are flushed by the btree write buffer.
 
 This means we have no practical limit on the number of counters, and a
 new tagged union format that's easy to extend.
 
 We now have counters for compression type/ratio, per-snapshot-id usage,
 per-btree-id usage, and pending rebalance work.
 
 - Self healing on read IO/checksum error
 
 data is now automatically rewritten if we get a read error and then a
 successful retry
 
 - Mount API conversion (thanks to Thomas Bertschinger)
 
 - Better lockdep coverage
 
 Previously, btree node locks were tracked individually by lockdep, like
 any other lock. But we may take _many_ btree node locks simultaneously,
 we easily blow through the limit of 48 locks that lockdep can track,
 leading to lockdep turning itself off.
 
 Tracking each btree node lock individually isn't really necessary since
 we have our own cycle detector for deadlock avoidance and centralized
 tracking of btree node locks, so we now have a single lockdep_map in
 btree_trans for "any btree nodes are locked".
 
 - some more small incremental work towards online check_allocations
 
 - lots more debugging improvements, fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmaZmVYACgkQE6szbY3K
 bnayLBAAr/RB75neEVzNqzE/qLoz9HBfPs7NrGNZg3bLzie9BvcEKGf3VUs2mm83
 6qN/bCJRBhd2vdMcFvIrYYqvD+F2qFFFrBjTzY20toReCwO6q5o8Ihfzv+uSu1qn
 Lg/6AbwwDwHPgSLcFM6yVfNsNBpOZ4tru7xble5/89RGp5uhbU38tsFwkUcngN/t
 4GWjtUYC+rFZaJSbt+wtGtM++nSURhMu1rxbR3MnVLT3JNW0wiG+FnRymCRQOYwx
 eslI6wP3JArTKMvACJqXTDivmjUPNdvsJ26LW6b5KBLi411OgV219ek0mJlkc5gl
 lOOZ5LPmPqn0BxsIdqOd+/OGOpvYkfWEyDj6/46K8KKFt+i++vCTzqIeL06HGQFS
 oCZajoHseLTlWPZTnoi3/KPWkKSJyaROrvFPSHT5/9sJfeeFt88hNSMP9g4+S4GI
 QSXK70GzEVjxr7YzUZwZUmRGWHt7YcS/qkTKJZRkLwUbr2BnrKEJhOa0t/i3RopN
 glFP/hP3dObTcWUF4xH90htfD2E/wHbP7nPwNCL6367n18Uj4TJD09vtM8xHsXSR
 YXECCjfskVDHAQJw/aV5jF1NpaeSW6g/bILi8tQhvRpfzLLvsaVGxA1xp5v+VZEQ
 w3WBepPofEE1EaVfKNeHCiE7STAnSVbWYWTatmjPdqVCyT7C9S0=
 =tnhh
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-07-18.2' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs updates from Kent Overstreet:

 - Metadata version 1.8: Stripe sectors accounting, BCH_DATA_unstriped

   This splits out the accounting of dirty sectors and stripe sectors in
   alloc keys; this lets us see stripe buckets that still have unstriped
   data in them.

   This is needed for ensuring that erasure coding is working correctly,
   as well as completing stripe creation after a crash.

 - Metadata version 1.9: Disk accounting rewrite

   The previous disk accounting scheme relied heavily on percpu counters
   that were also sharded by outstanding journal buffer; it was fast but
   not extensible or scalable, and meant that all accounting counters
   were recorded in every journal entry.

   The new disk accounting scheme stores accounting as normal btree
   keys; updates are deltas until they are flushed by the btree write
   buffer.

   This means we have no practical limit on the number of counters, and
   a new tagged union format that's easy to extend.

   We now have counters for compression type/ratio, per-snapshot-id
   usage, per-btree-id usage, and pending rebalance work.

 - Self healing on read IO/checksum error

   Data is now automatically rewritten if we get a read error and then a
   successful retry

 - Mount API conversion (thanks to Thomas Bertschinger)

 - Better lockdep coverage

   Previously, btree node locks were tracked individually by lockdep,
   like any other lock. But we may take _many_ btree node locks
   simultaneously, we easily blow through the limit of 48 locks that
   lockdep can track, leading to lockdep turning itself off.

   Tracking each btree node lock individually isn't really necessary
   since we have our own cycle detector for deadlock avoidance and
   centralized tracking of btree node locks, so we now have a single
   lockdep_map in btree_trans for "any btree nodes are locked".

 - Some more small incremental work towards online check_allocations

 - Lots more debugging improvements

 - Fixes, including:
    - undefined behaviour fixes, originally noted as breaking userspace
      LTO builds
    - fix a spurious warning in fsck_err, reported by Marcin
    - fix an integer overflow on trans->nr_updates, also reported by
      Marcin; this broke during deletion of highly fragmented indirect
      extents

* tag 'bcachefs-2024-07-18.2' of https://evilpiepirate.org/git/bcachefs: (120 commits)
  lockdep: Add comments for lockdep_set_no{validate,track}_class()
  bcachefs: Fix integer overflow on trans->nr_updates
  bcachefs: silence silly kdoc warning
  bcachefs: Fix fsck warning about btree_trans not passed to fsck error
  bcachefs: Add an error message for insufficient rw journal devs
  bcachefs: varint: Avoid left-shift of a negative value
  bcachefs: darray: Don't pass NULL to memcpy()
  bcachefs: Kill bch2_assert_btree_nodes_not_locked()
  bcachefs: Rename BCH_WRITE_DONE -> BCH_WRITE_SUBMITTED
  bcachefs: __bch2_read(): call trans_begin() on every loop iter
  bcachefs: show none if label is not set
  bcachefs: drop packed, aligned from bkey_inode_buf
  bcachefs: btree node scan: fall back to comparing by journal seq
  bcachefs: Add lockdep support for btree node locks
  lockdep: lockdep_set_notrack_class()
  bcachefs: Improve copygc_wait_to_text()
  bcachefs: Convert clock code to u64s
  bcachefs: Improve startup message
  bcachefs: Self healing on read IO error
  bcachefs: Make read_only a mount option again, but hidden
  ...
2024-07-18 17:27:43 -07:00
Tavian Barnes
73f88592dd bcachefs: mean_and_variance: Avoid too-large shift amounts
Shifting a value by the width of its type or more is undefined.

Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Kent Overstreet
6f719cbe0c bcachefs: Fix integer overflow on trans->nr_updates
We can't have more updates than paths, so btree_path_idx_t is the
correct type to use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Kent Overstreet
f05a0b9c73 bcachefs: silence silly kdoc warning
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Kent Overstreet
2c4c17fefc bcachefs: Fix fsck warning about btree_trans not passed to fsck error
If a btree_trans is in use it's supposed to be passed to fsck_err so
that it can be unlocked if we're waiting on userspace input; but the
btree IO paths do call fsck errors where a btree_trans exists on the
stack but it's not passed through.

But it's ok, because it's unlocked while doing IO.

Fixes: a850bde649 ("bcachefs: fsck_err() may now take a btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Kent Overstreet
f12410bb7d bcachefs: Add an error message for insufficient rw journal devs
This causes us to go read-only - need an error message saying why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Tavian Barnes
ee1b8dc17a bcachefs: varint: Avoid left-shift of a negative value
Shifting a negative value left is undefined.

Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Linus Torvalds
2aae1d67fd vfs-6.11.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
 ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
 Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
 =eDC8
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode / dentry updates from Christian Brauner:
 "This contains smaller performance improvements to inodes and dentries:

  inode:

   - Add rcu based inode lookup variants.

     They avoid one inode hash lock acquire in the common case thereby
     significantly reducing contention. We already support RCU-based
     operations but didn't take advantage of them during inode
     insertion.

     Callers of iget_locked() get the improvement without any code
     changes. Callers that need a custom callback can switch to
     iget5_locked_rcu() as e.g., did btrfs.

     With 20 threads each walking a dedicated 1000 dirs * 1000 files
     directory tree to stat(2) on a 32 core + 24GB ram vm:

        before: 3.54s user 892.30s system 1966% cpu 45.549 total
        after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

     Long-term we should pick up the effort to introduce more
     fine-grained locking and possibly improve on the currently used
     hash implementation.

   - Start zeroing i_state in inode_init_always() instead of doing it in
     individual filesystems.

     This allows us to remove an unneeded lock acquire in new_inode()
     and not burden individual filesystems with this.

  dcache:

   - Move d_lockref out of the area used by RCU lookup to avoid
     cacheline ping poing because the embedded name is sharing a
     cacheline with d_lockref.

   - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
     up with 128 bytes in total"

* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix dentry size
  vfs: move d_lockref out of the area used by RCU lookup
  bcachefs: remove now spurious i_state initialization
  xfs: remove now spurious i_state initialization in xfs_inode_alloc
  vfs: partially sanitize i_state zeroing on inode creation
  xfs: preserve i_state around inode_init_always in xfs_reinit_inode
  btrfs: use iget5_locked_rcu
  vfs: add rcu-based find_inode variants for iget ops
2024-07-15 11:39:44 -07:00
Tavian Barnes
2e118ba36d bcachefs: darray: Don't pass NULL to memcpy()
memcpy's second parameter must not be NULL, even if size is zero.

Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 21:52:37 -04:00
Kent Overstreet
efb2018e4d bcachefs: Kill bch2_assert_btree_nodes_not_locked()
We no longer track individual btree node locks with lockdep, so this
will never be enabled.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:59:12 -04:00
Kent Overstreet
ae46905631 bcachefs: Rename BCH_WRITE_DONE -> BCH_WRITE_SUBMITTED
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:59:12 -04:00
Kent Overstreet
1d18b5cabc bcachefs: __bch2_read(): call trans_begin() on every loop iter
perusal of /sys/kernel/debug/bcachefs/*/btree_transaction_stats shows
that the read path has been acculumalating unneeded paths on the reflink
btree, which we don't want.

The solution is to call bch2_trans_begin(), which drops paths not used
on previous loop iteration.

bch2_readahead:
  Max mem used: 0
  Transaction duration:
    count:      194235
                           since mount        recent
    duration of events
      min:                      150 ns
      max:                        9 ms
      total:                    838 ms
      mean:                       4 us          6 us
      stddev:                    34 us          7 us
    time between events
      min:                       10 ns
      max:                       15 h
      mean:                       2 s          12 s
      stddev:                     2 s           3 ms
  Maximum allocated btree paths (193):
    path: idx  2 ref 0:0 P   btree=extents l=0 pos 270943112:392:U32_MAX locks 0
    path: idx  3 ref 1:0   S btree=extents l=0 pos 270943112:24578:U32_MAX locks 1
    path: idx  4 ref 0:0 P   btree=reflink l=0 pos 0:24773509:0 locks 0
    path: idx  5 ref 0:0 P S btree=reflink l=0 pos 0:24773631:0 locks 1
    path: idx  6 ref 0:0 P S btree=reflink l=0 pos 0:24773759:0 locks 1
    path: idx  7 ref 0:0 P S btree=reflink l=0 pos 0:24773887:0 locks 1
    path: idx  8 ref 0:0 P S btree=reflink l=0 pos 0:24774015:0 locks 1
    path: idx  9 ref 0:0 P S btree=reflink l=0 pos 0:24774143:0 locks 1
    path: idx 10 ref 0:0 P S btree=reflink l=0 pos 0:24774271:0 locks 1
<many more reflink paths>

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Hongbo Li
114f530e1e bcachefs: show none if label is not set
If label is not set, the Label tag in superblock info show '(none)'.

```
[Before]
Device index:                               0
Label:
Version:                                    1.4: member_seq

[After]
Device index:                               0
Label:                                      (none)
Version:                                    1.4: member_seq
```

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
7b6dda7282 bcachefs: drop packed, aligned from bkey_inode_buf
Unnecessary here, and this broke the rust bindings:

error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
     --> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29025:1
      |
29025 | pub struct bkey_i_inode_v3 {
      | ^^^^^^^^^^^^^^^^^^^^^^^^^^
      |
note: `bch_inode_v3` has a `#[repr(align)]` attribute
     --> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
      |
8949  | pub struct bch_inode_v3 {
      | ^^^^^^^^^^^^^^^^^^^^^^^

error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
     --> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32826:1
      |
32826 | pub struct bkey_inode_buf {
      | ^^^^^^^^^^^^^^^^^^^^^^^^^
      |
note: `bch_inode_v3` has a `#[repr(align)]` attribute
     --> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
      |
8949  | pub struct bch_inode_v3 {
      | ^^^^^^^^^^^^^^^^^^^^^^^
note: `bkey_inode_buf` contains a field of type `bkey_i_inode_v3`
     --> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32827:9
      |
32827 |     pub inode: bkey_i_inode_v3,
      |         ^^^^^
note: ...which contains a field of type `bch_inode_v3`
     --> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29027:9
      |
29027 |     pub v: bch_inode_v3,
      |         ^

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
6ec8623f7c bcachefs: btree node scan: fall back to comparing by journal seq
highly damaged filesystems, or filesystems that have been damaged and
repair and damaged again, may have sequence numbers we can't fully trust
- which in itself is something we need to debug.

Add a journal_seq fallback so that repair doesn't get stuck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
375476c414 bcachefs: Add lockdep support for btree node locks
This adds lockdep tracking for held btree locks with a single dep_map in
btree_trans, i.e. tracking all held btree locks as one object.

This is more practical and more useful than having lockdep track held
btree locks individually, because
 - we can take more locks than lockdep can track (unbounded, now that we
   have dynamically resizable btree paths)
 - there's no lock ordering between btree locks for lockdep to track (we
   do cycle detection)
 - and this makes it easy to teach lockdep that btree locks are not safe
   to hold while invoking memory reclaim.

The last rule is one that lockdep would never learn, because we only do
trylock() from within shrinkers - but we very much do not want to be
invoking memory reclaim while holding btree node locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
1a616c2fe9 lockdep: lockdep_set_notrack_class()
Add a new helper to disable lockdep tracking entirely for a given class.

This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more since given that we have
centralized lock management and a cycle detector.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
8f523d425e bcachefs: Improve copygc_wait_to_text()
printing the raw values can occasionally be very useful

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
27d033df35 bcachefs: Convert clock code to u64s
Eliminate possible integer truncation bugs on 32 bit

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
ec8bf491a9 bcachefs: Improve startup message
We're not always mounting when we start the filesystem

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
a2cb8a6236 bcachefs: Self healing on read IO error
This repurposes the promote path, which already knows how to call
data_update() after a read: we now automatically rewrite bad data when
we get a read error and then successfully retry from a different
replica.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
b1d63b06e8 bcachefs: Make read_only a mount option again, but hidden
fsck passes read_only as a mount option, and it's required for
nochanges, which it also uses.

Usually read_only is handled by the VFS, but we need to be able to
handle it too; we just don't want to print it out twice, so mark it as a
hidden option.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
9d9d212e26 bcachefs: bch2_extent_crc_unpacked_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
5e3c208325 bcachefs: Ratelimit checksum error messages
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
0f3372dcee bcachefs: spelling fix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
d2cb6b219d bcachefs: Simplify btree key cache fill path
Don't allocate the new bkey_cached until after we've done the btree
lookup; this means we can kill bkey_cached.valid.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
39d5d8290c bcachefs: Improve "unable to allocate journal write" message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
e0d5bc6a66 bcachefs: Fix missing BTREE_TRIGGER_bucket_invalidate flag
This fixes an accounting mismatch for cached data.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
7554a8bb6d bcachefs: Ensure buffered writes write as much as they can
This adds a new helper, bch2_folio_reservation_get_partial(), which
reserves as many blocks as possible and may return partial success.

__bch2_buffered_write() is switched to the new helper - this fixes
fstests generic/275, the write until -ENOSPC test.

generic/230 now fails: this appears to be a test bug, where xfs_io isn't
looping after a partial write to get the error code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Hongbo Li
95924420b0 bcachefs: support STATX_DIOALIGN for statx file
Add support for STATX_DIOALIGN to bcachefs, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.

[Before]
```
./statx_test /mnt/bcachefs/test
statx(/mnt/bcachefs/test) = 0
dio mem align:0
dio offset align:0
```

[After]
```
./statx_test /mnt/bcachefs/test
statx(/mnt/bcachefs/test) = 0
dio mem align:1
dio offset align:512
```

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
7aa7183e00 bcachefs: split out lru_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Kent Overstreet
789566da25 bcachefs: bch2_btree_key_cache_drop() now evicts
As part of improving btree key cache coherency, the bkey_cached.valid
flag is going away.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Pankaj Raghav
febc33cb35 bcachefs: set fgf order hint before starting a buffered write
Set the preferred folio order in the fgp_flags by calling
fgf_set_order(). Page cache will try to allocate large folio of the
preferred order whenever possible instead of allocating multiple 0 order
folios.

This improves the buffered write performance up to 1.25x with default
mount options and up to 1.57x when mounted with no_data_io option with
the following fio workload:

fio --name=bcachefs --filename=/mnt/test  --size=100G \
     --ioengine=io_uring --iodepth=16 --rw=write --bs=128k

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Pankaj Raghav
2b02b9552c bcachefs: use FGP_WRITEBEGIN instead of combining individual flags
Use FGP_WRITEBEGIN to avoid repeating the individual FGP flags before
starting a buffered write.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
b0d3ab531f bcachefs: Reduce the scope of gc_lock
gc_lock is now only for synchronization between check_alloc_info and
interior btree updates - nothing else

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
132e1a2380 bcachefs: per_cpu_sum()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
385f0c05d6 bcachefs: kill key cache arg to bch2_assert_pos_locked()
this is an internal implementation detail - and we're improving key
cache coherency

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
c30402e548 bcachefs: btree_path_cached_set()
new helper - small refactoring

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
71fdc0b5a6 bcachefs: btree_node_unlock() assert
we have a separate helper for releasing write locks

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
dd3995a6a4 bcachefs: bch2_gc_pos_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
11169d9983 bcachefs: bch2_btree_id_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
ae4fb17e86 bcachefs: Kill gc_pos_btree_node()
gc_pos is now based on keys, not nodes, for invariantness w.r.t. splits
and merges

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
820b9efeb1 bcachefs: Fix bch2_gc_accounting_done() locking
The transaction commit path takes mark_lock, so we shouldn't be holding
it; use a bpos as an iterator so that we can drop and retake.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
f73e6bb6d6 bcachefs: bch2_accounting_mem_gc()
Add a new helper to free zeroed out accounting entries, and use it in
bch2_replicas_gc2(); bch2_replicas_gc2() was killing superblock replicas
entries if their corresponding accounting counters were nonzero, but
that's incorrect - the superblock replicas entry needs to exist if the
accounting entry exists, not if it's nonzero, because we check and
create the replicas entry when creating the new accounting entry - we
don't know when it's becoming nonzero.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
2574e95a8b bcachefs: Refactor disk accounting data structures
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replica entries to for the percpu allocator's max
practical size.

Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Brian Foster
b5597347a5 bcachefs: fix smatch data leak warning in fs usage ioctl
smatch warns that the copy of arg to userspace is a potential data
leak by virtue of arg.pad not being checked or zeroed. This was
introduced by the commit referenced below that switched arg from
being a zeroed runtime allocation to living on the stack. Fix by
simply zero initializing the structure.

Fixes: cde738a61e65 ("bcachefs: Convert bch2_ioctl_fs_usage() to new accounting")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
f295920bc4 bcachefs: Fix race in bch2_accounting_mem_insert()
bch2_accounting_mem_insert() drops and retakes mark_lock; thus, we need
to check if the entry in question has already been inserted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Ariel Miculas
49858d869b bcachefs: bch2_btree_insert() - add btree iter flags
The commit 65bd442397 ("bcachefs: bch2_btree_insert_trans() no longer
specifies BTREE_ITER_cached") removes BTREE_ITER_cached from
bch2_btree_insert_trans, which causes the update_inode function from
bcachefs-tools to take a long time (~20s).  Add an iter_flags parameter
to bch2_btree_insert, so the users can specify iter update trigger
flags, such as BTREE_ITER_cached.

Signed-off-by: Ariel Miculas <ariel.miculas@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
8863d1e092 bcachefs: BCH_IOCTL_QUERY_ACCOUNTING
Add a new ioctl that can return the new accounting counter types; it
takes as input a bitmask of accounting types to return.

This will be used for returning e.g. compression accounting and
rebalance_work accounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Reed Riley
7f3dc6c98b bcachefs: support REMAP_FILE_DEDUP in bch2_remap_file_range
By removing the early-exit when REMAP_FILE_DEDUP is set, we should be
able to support the fideduperange ioctl, albeit less efficiently than if
we handled some of the extent locking and comparison logic inside
bcachefs.  Extent comparison logic already exists inside of
`__generic_remap_file_range_prep`.

Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Hongbo Li
7a254053a5 bcachefs: support FS_IOC_SETFSLABEL
Implement support for FS_IOC_SETFSLABEL ioctl to set filesystem
label.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Hongbo Li
81bce3cf2b bcachefs: support get fs label
Implement support for FS_IOC_GETFSLABEL ioctl to read filesystem
label.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Hongbo Li
8a4ef7e28a bcachefs: implement FS_IOC_GETVERSION to support lsattr
In this patch we add the FS_IOC_GETVERSION ioctl for getting
i_generation from inode, after that, users can list file's
generation number by using "lsattr".

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
889fb3dc5d bcachefs: Unlock trans when waiting for user input in fsck
We can't hold locks while waiting for user input, that's a deadlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Youling Tang
747d1d6c7e bcachefs: Add tracepoints for bch2_sync_fs() and bch2_fsync()
Add trace_bch2_sync_fs() and trace_bch2_fsync() implementations.

The output in trace is as follows:
  sync-29779   [000] .....   193.700935: bch2_sync_fs: dev 254,16 wait 1
  <...>-40027  [002] .....   342.535227: bch2_fsync: dev 254,32 ino 4099 parent 4096 datasync 1

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Youling Tang
8b0882505d bcachefs: track writeback errors using the generic tracking infrastructure
We already using mapping_set_error() in bch2_writepage_io_done(), so all
we need to do is to use file_check_and_advance_wb_err() when handling
fsync() requests in bch2_fsync().

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Ariel Miculas
f8b0147364 bcachefs: bch2_dir_emit() - fix directory reads in the fuse driver
Commit 0c0cbfdb84 dropped the ctx->pos
update before the call to dir_emit. This breaks the userspace
implementation, causing the directory reads to be stuck in an infinite
loop. This doesn't happen in the kernel because the vfs handles the
updates to ctx->pos, but in the fuse implementation nobody updates
it.

Signed-off-by: Ariel Miculas <ariel.miculas@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
7ed122aea2 bcachefs: twf: delete dead struct fields
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
d37dd9b604 bcachefs: bch2_stdio_redirect_readline_timeout()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
0c97c437e3 bcachefs: twf: convert bch2_stdio_redirect_readline() to darray
We now read the line from the buffer atomically, which means we have to
allow the buffer to grow past STDIO_REDIRECT_BUFSIZE if we're waiting
for a full line - this behaviour is necessary for
stdio_redirect_readline_timeout() in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
36008d5d01 bcachefs: Plumb more logging through stdio redirect
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
a850bde649 bcachefs: fsck_err() may now take a btree_trans
fsck_err() now optionally takes a btree_trans; if the current thread has
one, it is required that it be passed.

The next patch will use this to unlock when waiting for user input.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
38e3ca275c bcachefs: btree_types bitmask cleanups
Make things more consistent and ensure that we're using u64 bitfields -
key types and btree ids are already around 32 bits.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
174722de55 bcachefs: Delete old assertion for online fsck
the order in which btree_gc walks keys have changed, so we no longer
have the sort of issues with online fsck this assertion was warning
about.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
38ad9dc8c6 bcachefs: Initialize gc buckets in alloc trigger
Needed for online fsck; we need the trigger to initialize newly
allocated buckets and generation number changes while gc is running.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
9ab55df599 bcachefs: Walk leaf to root in btree_gc
Next change will move gc_alloc_start initialization into the alloc
trigger, so we have to mark those first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
86d46471d5 bcachefs: Don't block journal when finishing check_allocations()
Blocking the journal was needed to finish checking old style accounting,
but that code is gone and it's not needed in the alloc rewrite,
mark_lock is sufficient for synchronization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
5645c32ccf bcachefs: bch2_fs_get_tree() cleanup
- improve error paths
- call bch2_fs_start() separately, after applying late-parsed options

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
25ee25e637 bcachefs: Kill bch2_mount()
Fold into bch2_fs_get_tree()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
b9efa9673e bcachefs: Eytzinger accumulation for accounting keys
The btree write buffer takes as input keys from the journal, sorts them,
deduplicates them, and flushes them back to the btree in sorted order.

The disk space accounting rewrite is moving accounting to normal btree
keys, with update (in this case deltas) accumulated in the write buffer
and then flushed to the btree; but this is going to increase the number
of keys handled by the write buffer by perhaps as much as a factor of
3x-5x.

The overhead from copying around and sorting this many keys would cause
a significant performance regression, but: there is huge locality in
updates to accounting keys that we can take advantage of.

Instead of appending accounting keys to the list of keys to be sorted,
this patch adds an eytzinger search tree of recently seen accounting
keys. We look up the accounting key in the eytzinger search tree and
apply the delta directly, adding it if it doesn't exist, and
periodically prune the eytzinger tree of unused entries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
20ac515a9c bcachefs: bch_acct_rebalance_work
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
6af91147b6 bcachefs: bch_acct_btree
Add counters for how much disk space we're using per btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
6675c37662 bcachefs: bch_acct_snapshot
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
72c2778780 bcachefs: bch2_fs_usage_base_to_text()
Helper to show raw accounting in sysfs, mainly for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
f93bb76ba2 bcachefs: bch2_fs_accounting_to_text()
Helper to show raw accounting in sysfs, mainly for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
91f44781d5 bcachefs: Convert bch2_compression_stats_to_text() to new accounting
We no longer have to walk the whole btree to calculate compression
stats.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
bfcaa9079d bcachefs: bch_acct_compression
This adds per-compression-type accounting of compressed and uncompressed
size as well as number of extents - meaning we can now see compression
ratio (without walking the whole filesystem).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5668e5deec bcachefs: bch2_verify_accounting_clean()
Verify that the in-memory accounting verifies the on-disk accounting
after a clean shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
00839addfc bcachefs: Convert bch2_replicas_gc2() to new accounting
bch2_replicas_gc2() is used for garbage collection superblock replicas
entries that are empty - this converts it to the new accounting scheme.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
fb23d57a6d bcachefs: Convert gc to new accounting
Rewrite fsck/gc for the new accounting scheme.

This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all trigger in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
4c4a7d48bd bcachefs: Kill replicas_journal_res
More dead code deletion

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
66a57684c6 bcachefs: Kill fs_usage_online
More dead code deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
fe5eddc0d0 bcachefs: Kill bch2_fs_usage_to_text()
Dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
8bb8d683a4 bcachefs: Delete journal-buf-sharded old style accounting
More deletion of dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5b9bc272e6 bcachefs: Kill writing old accounting to journal
More ripping out of the old disk space accounting.

Note that the new disk space accounting is incompatible with the old,
and writing out old style disk space accounting with the new code is
infeasible.

This means upgrading and downgrading past this version requires
regenerating accounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
3afb8dbf03 bcachefs: kill bch2_fs_usage_read()
With bch2_ioctl_fs_usage(), this is now dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
6b39638b84 bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
This converts bch2_ioctl_fs_usage() to read from the new disk
accounting, via bch2_fs_replicas_usage_read().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
72a6bb098c bcachefs: Kill bch2_fs_usage_initialize()
Deleting code for the old disk accounting scheme.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
f5095b9f85 bcachefs: dev_usage updated by new accounting
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.

This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a deice
counter update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2e8d686a4a bcachefs: Coalesce accounting keys before journal replay
This fixes a performance regression in journal replay; without
colaescing accounting keys we have multiple keys at the same position,
which means journal_keys_peek_upto() has to skip past many overwritten
keys - turning journal replay into an O(n^2) algorithm.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
1d16c605cc bcachefs: Disk space accounting rewrite
Main part of the disk accounting rewrite.

This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.

With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.

Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.

An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5d9667d1d6 bcachefs: btree write buffer knows how to accumulate bch_accounting keys
Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.

Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
9dec2a473b bcachefs: Accumulate accounting keys in journal replay
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.

Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.

That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.

There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.

To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2744e5c9eb bcachefs: KEY_TYPE_accounting
New key type for the disk space accounting rewrite.

 - Holds a variable sized array of u64s (may be more than one for
   accounting e.g. compressed and uncompressed size, or buckets and
   sectors for a given data type)

 - Updates are deltas, not new versions of the key: this means updates
   to accounting can happen via the btree write buffer, which we'll be
   teaching to accumulate deltas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Thomas Bertschinger
929d954330 bcachefs: use new mount API
This updates bcachefs to use the new mount API:

- Update the file_system_type to use the new init_fs_context()
  function.

- Define the new fs_context_operations functions.

- No longer register bch2_mount() and bch2_remount(); these are now
  called via the new fs_context functions.

- Define a new helper type, bch2_opts_parse that includes a struct
  bch_opts and additionally a printbuf used to save options that can't
  be parsed until after the FS is opened. This enables us to parse as
  many options as possible prior to opening the filesystem while saving
  those options that need the open FS for later parsing.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
1c12d1caf8 bcachefs: Add error code to defer option parsing
This introduces a new error code, option_needs_open_fs, which is used to
indicate that an attempt was made to parse a mount option prior to
opening a filesystem, when that mount option requires an open filesystem
in order to be validated.

Returning this error results in bch2_parse_one_mount_opt() saving that
option for later parsing, after the filesystem is opened.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
9b7f0b5d3d bcachefs: add printbuf arg to bch2_parse_mount_opts()
Mount options that take the name of a device that may be part of a
filesystem, for example "metadata_target", cannot be validated until
after the filesystem has been opened. However, an attempt to parse those
options may be made prior to the filesystem being opened.

This change adds a printbuf parameter to bch2_parse_mount_opts() which
will be used to save those mount options, when they are supplied prior
to the FS being opened, so that they can be parsed later.

This functionality is not currently needed, but will be used after
bcachefs starts using the new mount API to parse mount options. This is
because using the new mount API, we will process mount options prior to
opening the FS, but the new API doesn't provide a convenient way to
"replay" mount option parsing. So we save these options ourselves to
accomplish this.

This change also splits out the code to parse a single option into
bch2_parse_one_mount_opt(), which will be useful when using the new
mount API which deals with a single mount option at a time.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
7773df19c3 bcachefs: metadata version bucket_stripe_sectors
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.

Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
2612e29142 bcachefs: BCH_DATA_unstriped
Add a new pseudo data type, to track buckets that are members of a
stripe, but have unstriped data in them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
55f7962da3 bcachefs: bch_alloc->stripe_sectors
Add a separate counter to bch_alloc_v4 for amount of striped data; this
lets us separately track striped and unstriped data in a bucket, which
lets us see when erasure coding has failed to update extents with stripe
pointers, and also find buckets to continue updating if we crash mid way
through creating a new stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00