check_snapshot() copies the bch_snapshot to a temporary to easily handle
older versions that don't have all the fields of the current version,
but it lacked a min() to correctly handle keys newer and larger than the
current version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_direct_IO_read() checks the request offset and size for sector
alignment and then falls through to a couple calculations to shrink
the size of the request based on the inode size. The problem is that
these checks round up to the fs block size, which runs the risk of
underflowing iter->count if the block size happens to be large
enough. This is triggered by fstest generic/361 with a 4k block
size, which subsequently leads to a crash. To avoid this crash,
check that the shorten length doesn't exceed the overall length of
the iter.
Fixes:
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.
Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).
But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the
easy one in read_single_folio() (where wa can return an error) was
missed - oops.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When also downgrading, check_version_upgrade() could pick a new version
greater than the latest supported version.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This prevents going emergency read only when the user has specified
replicas_required > replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
During xfstest tests, there are some kmemleak reports e.g. generic/051 with
if USE_KMEMLEAK=yes:
====================================================================
EXPERIMENTAL kmemleak reported some memory leaks! Due to the way kmemleak
works, the leak might be from an earlier test, or something totally unrelated.
unreferenced object 0xffff9ef905aaf778 (size 8):
comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s)
hex dump (first 8 bytes):
a5 cc cc cc cc cc cc cc ........
backtrace:
[<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0
[<ffffffff87f49b66>] kmalloc_trace+0x26/0xb0
[<ffffffffc0a3fefe>] __bch2_read_super+0xfe/0x4e0 [bcachefs]
[<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs]
[<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs]
[<ffffffff88080c90>] legacy_get_tree+0x30/0x60
[<ffffffff8802c748>] vfs_get_tree+0x28/0xf0
[<ffffffff88061fe5>] path_mount+0x475/0xb60
[<ffffffff880627e5>] __x64_sys_mount+0x105/0x140
[<ffffffff88932642>] do_syscall_64+0x42/0xf0
[<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
unreferenced object 0xffff9ef96cdc4fc0 (size 32):
comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s)
hex dump (first 32 bytes):
2f 64 65 76 2f 6d 61 70 70 65 72 2f 74 65 73 74 /dev/mapper/test
2d 31 00 cc cc cc cc cc cc cc cc cc cc cc cc cc -1..............
backtrace:
[<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0
[<ffffffff87f4a081>] __kmalloc_node_track_caller+0x51/0x150
[<ffffffff87f3adc2>] kstrdup+0x32/0x60
[<ffffffffc0a3ff1a>] __bch2_read_super+0x11a/0x4e0 [bcachefs]
[<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs]
[<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs]
[<ffffffff88080c90>] legacy_get_tree+0x30/0x60
[<ffffffff8802c748>] vfs_get_tree+0x28/0xf0
[<ffffffff88061fe5>] path_mount+0x475/0xb60
[<ffffffff880627e5>] __x64_sys_mount+0x105/0x140
[<ffffffff88932642>] do_syscall_64+0x42/0xf0
[<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
====================================================================
The leak happens if bdev_open_by_path() failed to open a block device
then it goes label 'out' directly without call of bch2_free_super().
Fix it by going to label 'err' instead of 'out' if bdev_open_by_path()
fails.
Signed-off-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Calling fd_install() makes a file reachable for userland, including the
possibility to close the file descriptor, which leads to calling its
'release' hook. If that happens before the code had a chance to bump the
reference of the newly created task struct, the release callback will
call put_task_struct() too early, leading to the premature destruction
of the kernel thread.
Avoid that race by calling fd_install() later, after all the setup is
done.
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Parent dir is locked by user_path_locked_at() before validating the
required dentry. It should be unlocked if we can not perform the
deletion.
This fixes the problem:
$ bcachefs subvolume delete not-exist-entry
BCH_IOCTL_SUBVOLUME_DESTROY ioctl error: No such file or directory
$ bcachefs subvolume delete not-exist-entry
the second will stuck because the parent dir is locked in the previous
deletion.
Signed-off-by: Guoyu Ou <benogy@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The gcc compiler on paric does support the __int128 type, although the
architecture does not have native 128-bit support.
The effect is, that the bcachefs u128_square() function will pull in the
libgcc __multi3() helper, which breaks the kernel build when bcachefs is
built as module since this function isn't currently exported in
arch/parisc/kernel/parisc_ksyms.c.
The build failure can be seen in the latest debian kernel build at:
https://buildd.debian.org/status/fetch.php?pkg=linux&arch=hppa&ver=6.7.1-1%7Eexp1&stamp=1706132569&raw=0
We prefer to not export that symbol, so fall back to the optional 64-bit
implementation provided by bcachefs and thus avoid usage of __multi3().
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper, bch2_hash_lookup_in_snapshot(), for when we're not
operating in a subvolume and already have a snapshot ID, and then use it
in lookup_lostfound() -> __lookup_dirent().
This is a bugfix - lookup_lostfound() doesn't take a subvolume ID, we
were passing a nonsense subvolume ID before, and don't have one to pass
since we may be operating in an interior snapshot node that doesn't have
a subvolume ID.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some (bad) devices can have really terrible discard latency; we don't
want them blocking memory reclaim and causing warnings.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
REQ_OP_FLUSH is only for internal use in the blk-mq and request based
drivers. File systems and other block layer consumers must use
REQ_OP_WRITE | REQ_PREFLUSH as documented in
Documentation/block/writeback_cache_control.rst.
While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
get the proper flush state machine handling, and completely fails
for any bio based drivers, including all the stacking drivers. The
block layer will also get a check in 6.8 to reject this use case
entirely.
[Note: completely untested, but as this never got fixed since the
original bug report in November:
https://bugzilla.kernel.org/show_bug.cgi?id=218184
and the the discussion in December:
https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/
this seems to be best way to force it]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on multithreaded
workloads: fstests (in addition to ktest tests) are now checking
slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWtjOsACgkQE6szbY3K
bnbyXRAAsx+yM81TFqsLzRRqf8oocRwf2dj5XzExz9Ig/lYQS5LIVROS2OxwDsAc
DeaYQSTcph9dkOswCrNR96bBnEgmmZ1ClfVI6WRXvm6vs4rjhSMNbNaVyySrMUVn
5p/Lsn1/RKl0lWMYlHrdryo+106zRcr6z1Hiv9QCXkXhzdkV8wFYDkfbMveShUsu
KobC29wvd2EfZr04nqsIXS/y/iRIXhtZqJmFCiAguN70UWrwUwArpELHI5Ve+WPZ
9VjgFXW6Ka3QxJs/20tX+t24DrC+eDXR44DzQmxwG5mPBBpXkcSk5UgRw/EUag5U
5+mDZQ5Ei3gvZvUwrilMosVy3pIw0IuvqeqwDGFoFXs1cce01QCMN+NG/dBTQw9i
KGGxJw5sOrZ8fIiFnypk1M+r9NVtA8MjriLNR5bJjCWPSpWqzkT2HzxFXc6HmTZu
vsE/AxwC1RLA6B2HZlDEqLOdHE3cofkDiIzWM5ABvb4p118iyk9hE6HhAufk5UdE
HaG646kGB8pUY/sCxBIOD6K2pgthDFv+fftTM7X+uIazD3bovvPQCEInu48/KAHn
/KmslSPO0txyjnRFMbXFJvd4Fgfo44GcBCeqGpy3B79aEJ3nroyRZ0qNnnsqj0Gl
picUWjTn4W561Q1zBXuE/6cLWEp+sfaqYQcM8L3CCitRTVDPaCQ=
=yd+F
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"Some fixes, Some refactoring, some minor features:
- Assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on
multithreaded workloads: fstests (in addition to ktest tests) are
now checking slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes"
* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
bcachefs: Improve inode_to_text()
bcachefs: logged_ops_format.h
bcachefs: reflink_format.h
bcachefs; extents_format.h
bcachefs: ec_format.h
bcachefs: subvolume_format.h
bcachefs: snapshot_format.h
bcachefs: alloc_background_format.h
bcachefs: xattr_format.h
bcachefs: dirent_format.h
bcachefs: inode_format.h
bcachefs; quota_format.h
bcachefs: sb-counters_format.h
bcachefs: counters.c -> sb-counters.c
bcachefs: comment bch_subvolume
bcachefs: bch_snapshot::btime
bcachefs: add missing __GFP_NOWARN
bcachefs: opts->compression can now also be applied in the background
bcachefs: Prep work for variable size btree node buffers
bcachefs: grab s_umount only if snapshotting
...
Add a field to bch_snapshot for creation time; this will be important
when we start exposing the snapshot tree to userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analagous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.
Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>