The add_inode_ref() function does not initialize the "name" struct when
it is declared. If any of the following calls to "read_one_inode()
returns NULL,
dir = read_one_inode(root, parent_objectid);
if (!dir) {
ret = -ENOENT;
goto out;
}
inode = read_one_inode(root, inode_objectid);
if (!inode) {
ret = -EIO;
goto out;
}
then "name.name" would be freed on "out" before being initialized.
out:
...
kfree(name.name);
This issue was reported by Coverity with CID 1526744.
Fixes: e43eec81c5 ("btrfs: use struct qstr instead of name and namelen pairs")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Roi Martin <jroi.martin@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are using the logical address ("bytenr") of an extent as the key for
qgroup records in the dirty extents xarray. This is a problem because the
xarrays use "unsigned long" for keys/indices, meaning that on a 32 bits
platform any extent starting at or beyond 4G is truncated, which is a too
low limitation as virtually everyone is using storage with more than 4G of
space. This means a "bytenr" of 4G gets truncated to 0, and so does 8G and
16G for example, resulting in incorrect qgroup accounting.
Fix this by using sector numbers as keys instead, that is, using keys that
match the logical address right shifted by fs_info->sectorsize_bits, which
is what we do for the fs_info->buffer_radix that tracks extent buffers
(radix trees also use an "unsigned long" type for keys). This also makes
the index space more dense which helps optimize the xarray (as mentioned
at Documentation/core-api/xarray.rst).
Fixes: 3cce39a8ca ("btrfs: qgroup: use xarray to track dirty extents in transaction")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcH4qAACgkQxWXV+ddt
WDtDig//czntfO+iRvDERZWTIB6vdVExfLd3r3ZNYlO1pIvgCuvqx3iYva+0ZhGW
8A+gcRax7cz0jCaxDp/+5lIRfdNxZH6/LwjZsDgU8Ly7himeRmwhtn2fCgNeiH/K
bUl92+ZMo2vwqTKXYa3xF1g3Hz6cRXVW7gJrMwNhb1hpPTGx+lgYJU02m/Io/vjK
1jcrZ84OEPIOY5uiAoDyO2hgsT/zVEeuuOiSTpKSzrghPbo0vmjLiYJ5T+CE5Uw3
u3w7/Fqnw49NwucqtncvyFFDXY9EWNuQhowi3hqJgOYTInqwwJigIpQV0hDDwYxb
ohGUGjazGfAEf/cy1jZXMbwCVgg8/Nj9x0eDKKhfs19VYUbMkEYQ8LKRTUlCeBwS
H/2AmqpqHEEO+tPY3P+w6MVwkNho8JNpWPdP5OzJs7XrD067IViOjD06HPM/k5ci
TU3zp9NYvgHVtmfZK1Aqsg9OYVhI1klVXejmlAzOLxejRPWXK/1hBw3kXbC6I+k1
50l0Yh1dgEnclMI3yWsKoj8IYUAkh2eudt0pNsot4a5vICMY++NVS2eukdz5UcEz
ix7hcpYcCcmzoOaelyEgmdAncWVGJT5w2Nzy85YaOp+Z1C65Ywb41utU+sSY+swB
kZfwl9vrsfu754vX7UKBherCvvYo+Lnj3GeX8Oe+1LoT2BP0TPk=
=lTqc
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- update fstrim loop and add more cancellation points, fix reported
delayed or blocked suspend if there's a huge chunk queued
- fix error handling in recent qgroup xarray conversion
- in zoned mode, fix warning printing device path without RCU
protection
- again fix invalid extent xarray state (6252690f7e), lost due to
refactoring
* tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
btrfs: zoned: fix missing RCU locking in error message when loading zone info
btrfs: fix missing error handling when adding delayed ref with qgroups enabled
btrfs: add cancellation points to trim loops
btrfs: split remaining space to discard in chunks
Jeff Layton <jlayton@kernel.org> says:
The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g backup
applications).
If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.
To remedy this, keep a global monotonic atomic64_t value that acts as a
timestamp floor. When we go to stamp a file, we first get the latter of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.
If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).
* patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org:
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.
Note that this also drops the IS_I_VERSION check and unconditionally
bumps the change attribute, since SB_I_VERSION is always set on btrfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-11-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This commit is a replay of commit 6252690f7e ("btrfs: fix invalid
mapping of extent xarray state"). We need to call
btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that
xarray DIRTY tag is cleared.
With a refactoring commit 8189197425 ("btrfs: refactor
__extent_writepage_io() to do sector-by-sector submission"), it screwed
up and the order is reversed and causing the same hang. Fix the ordering
now in submit_one_sector().
Fixes: 8189197425 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_load_zone_info() we have an error path that is dereferencing
the name of a device which is a RCU string but we are not holding a RCU
read lock, which is incorrect.
Fix this by using btrfs_err_in_rcu() instead of btrfs_err().
The problem is there since commit 08e11a3db0 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").
Fixes: 08e11a3db0 ("btrfs: zoned: load zone's allocation offset")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
if we fail to insert the qgroup record we don't error out, we ignore it.
In fact we treat it as if there was no error and there was already an
existing record - we don't distinguish between the cases where
btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
existed and we can free the given record, and the case where it returns
a negative error value, meaning the insertion into the xarray that is
used to track records failed.
Effectively we end up ignoring that we are lacking qgroup record in the
dirty extents xarray, resulting in incorrect qgroup accounting.
Fix this by checking for errors and return them to the callers.
Fixes: 3cce39a8ca ("btrfs: qgroup: use xarray to track dirty extents in transaction")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are reports that system cannot suspend due to running trim because
the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).
Since trime isn't a critical call it can be interrupted at any time,
in such cases we stop the trim, report the amount of discarded bytes and
return an error.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a large delay.
Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
preparation of introduction of cancellation points to trim. The value
of the chunk size is arbitrary, it can be higher or derived from actual
device capabilities but we can't easily read that using
bio_discard_limit().
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmb/9agACgkQxWXV+ddt
WDtKFA/5AW88osk/1k/NVhOvOa0xPr5XyDLq1n8Gaxfy8uHlHAc8wdsvJzCDMS0M
qUOD/tOPhRI0HGXPiKD767erwbyXiZAcCTkSd8x5jlXy1hVjUHQSKO//JxD0vtAZ
jOscoUA1wJJutopCXcppnoUUFE2753edEg0w2EtUXMpfqivqOmMCR+1ZtKkfaNJo
oRuZCq3Oi8hu7Wsvmh4Etq/9MvGM+xovXAMAji6Op8nsP1jJlzWztpEUogLOQH2S
IhDFFxP9shBV9JjV+HSyXcAYr8VArH6HtYjapR9oajCH0pvSjLYQQPq/qQ0//8Hb
SHr4YP6RBbpEIvvaQeA7vwsckBHNBOrSAbEgRQ2+zdmiwha6SRIEVyXYh5LUwcYT
WnVozbk7ZX9rD1jOVhvgouFG6vUz8A6/qt3BD028bVcyMvBXW4gsEduCMVlFGmlN
D6+hNY6J08j4HUEGnPk7fAYi/lk5OEK1p5yarUgsOQ3GqWWS0ywkrfbmbWwyC+Ff
AxggFTl9YodU5RMs7EU2GeHkLU6LnXgevk6FFm0JzsLtT/BbEP7pj6tOot4Msl1e
2ovqFiSbuPNg5Wr70ZBRO9LDIAYtTZy1UrVR/YCSLzm+wZsXMelDIHMQfE7+Yodp
O5Cud23AanRTwjErvNl3X4rhut8rrI1FsR89gDyKK1EPG+mwofo=
=yslr
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in incremental send, fix invalid clone operation for file that got
its size decreased
- fix __counted_by() annotation of send path cache entries, we do not
store the terminating NUL
- fix a longstanding bug in relocation (and quite hard to hit by
chance), drop back reference cache that can get out of sync after
transaction commit
- wait for fixup worker kthread before finishing umount
- add missing raid-stripe-tree extent for NOCOW files, zoned mode
cannot have NOCOW files but RST is meant to be a standalone feature
- handle transaction start error during relocation, avoid potential
NULL pointer dereference of relocation control structure (reported by
syzbot)
- disable module-wide rate limiting of debug level messages
- minor fix to tracepoint definition (reported by checkpatch.pl)
* tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: disable rate limiting when debug enabled
btrfs: wait for fixup workers before stopping cleaner kthread during umount
btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
btrfs: send: fix invalid clone operation for file that got its size decreased
btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
btrfs: drop the backref cache during relocation if we commit
btrfs: also add stripe entries for NOCOW writes
btrfs: send: fix buffer overflow detection when copying path to cache entry
We are close to removing the private_2 flag, so switch btrfs to using
owner_2 for its ordered flag. This is mostly used by buffer head
filesystems, so btrfs can use it because it doesn't use buffer heads.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241002040111.1023018-5-willy@infradead.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
Disable ratelimiting for btrfs_printk when CONFIG_BTRFS_DEBUG is
enabled. This allows for more verbose output which is often needed by
functions like btrfs_dump_space_info().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a NULL pointer dereference with the following crash:
FAULT_INJECTION: forcing a failure.
start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
...
BTRFS info (device loop0): balance: ended with status: -12
Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
Call Trace:
<TASK>
commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
del_balance_item fs/btrfs/volumes.c:3678 [inline]
reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.
Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.
However during the small window between set_reloc_contrl() and
unset_reloc_control(), we can have a subvolume tree update and created a
reloc_root for that subvolume.
Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().
That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer
dereference.
[FIX]
Just add extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.
That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during merge stage, but since there is no reloc_ctl at all,
we do not need to bother that.
Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an incremental send we may end up sending an invalid clone
operation, for the last extent of a file which ends at an unaligned offset
that matches the final i_size of the file in the send snapshot, in case
the file had its initial size (the size in the parent snapshot) decreased
in the send snapshot. In this case the destination will fail to apply the
clone operation because its end offset is not sector size aligned and it
ends before the current size of the file.
Sending the truncate operation always happens when we finish processing an
inode, after we process all its extents (and xattrs, names, etc). So fix
this by ensuring the file has a valid size before we send a clone
operation for an unaligned extent that ends at the final i_size of the
file. The size we truncate to matches the start offset of the clone range
but it could be any value between that start offset and the final size of
the file since the clone operation will expand the i_size if the current
size is smaller than the end offset. The start offset of the range was
chosen because it's always sector size aligned and avoids a truncation
into the middle of a page, which results in dirtying the page due to
filling part of it with zeroes and then making the clone operation at the
receiver trigger IO.
The following test reproduces the issue:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with a size of 256K + 5 bytes, having two extents, one
# with a size of 128K and another one with a size of 128K + 5 bytes.
last_ext_size=$((128 * 1024 + 5))
xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \
-c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \
$MNT/foo
# Another file which we will later clone foo into, but initially with
# a larger size than foo.
xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar
btrfs subvolume snapshot -r $MNT/ $MNT/snap1
# Now resize bar and clone foo into it.
xfs_io -c "truncate 0" \
-c "reflink $MNT/foo" $MNT/bar
btrfs subvolume snapshot -r $MNT/ $MNT/snap2
rm -f /tmp/send-full /tmp/send-inc
btrfs send -f /tmp/send-full $MNT/snap1
btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -f /tmp/send-full $MNT
btrfs receive -f /tmp/send-inc $MNT
umount $MNT
Running it before this patch:
$ ./test.sh
(...)
At subvol snap1
At snapshot snap2
ERROR: failed to clone extents to bar: Invalid argument
A test case for fstests will be sent soon.
Reported-by: Ben Millwood <thebenmachine@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/
Fixes: 46a6e10a1a ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
CC: stable@vger.kernel.org # 6.11
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.
This works as long as we're only ever modifying blocks, not changing the
structure of the tree.
However relocation does in fact change the structure of the tree. For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent. We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.
But, a key feature of do_relocation() is the pending list. This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes, which
calls do_relocation() on all of the nodes that led up to that leaf.
The purpose of this is to make sure we don't break sharing unless we
absolutely have to. Consider the case that we have 3 snapshots that all
point to this leaf through the same nodes, the initial COW would have
created a whole new path. If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally. To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.
Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.
The problem with this is that we will never reflect this in the backref
cache. If we do this btrfs_drop_snapshot() for a node that is in the
backref tree, we will leave the node in the backref tree. This becomes
a problem when we change the transid, as now the backref cache has
entire subtrees that no longer exist, but exist as if they still are
pointed to by the same roots.
In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.
Worst case you will double free some random block and re-use it when
there's still references to the block.
This is extremely subtle, and the consequences are quite bad. There
isn't a way to make sure our backref cache is consistent between
transid's.
In order to fix this we need to simply evict the entire backref cache
anytime we cross transid's. This reduces performance in that we have to
rebuild this backref cache every time we change transid's, but fixes the
bug.
This has existed since relocation was added, and is a pretty critical
bug. There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.
Followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.
We have a reproducer that reproduced the corruption within a few minutes
of running. With this patch it survives several iterations/hours of
running the reproducer.
Fixes: 3fd0a5585e ("Btrfs: Metadata ENOSPC handling for balance")
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmbxUdkACgkQxWXV+ddt
WDtAVQ//SCg5XtExxtol1emzZ+AGQjwRnRfUPo/x32h9SmaynaHa/sLsG2EwePKs
1lrkW8gEx3NF1bfeCubhoVX2eAo/1rwGqtPEbweE7XaYtmSnxT8jXeH2fQcMwQMc
PkYfnCMIOdJzwoVS8wS3kLmuDep+9DJrbeI9oN5tUgugkTTbW7g576uv/SXjp46D
Dl4b1uvVOCowBbY2Bz1pg0fQpBzJcLzvynGElSi85uoQ520JuA8PP/3Pszg8BTxm
6MO99kF0MhVSBnKSvmlIgxmlnGhlW/AlZakxywRYYKsiSM/eCWHpyUV0p4mMcpWW
QM8yeJcAhugTDIV3VdRpGx4NcJSo1PPaXxRrMr/vnnuOPF4VQ2gSw+S4p44YCsML
VpyNJIjeXNO86A6feQybxwczMzdpkc5UzdfJ+l3CDSxcGiQGRU3WWPIHjte90e38
ZNjXknc96EwOmxsx8ojGlfi7Lh9yHklMGslxI64488PTa+2RRGITUSziAla29nrd
E4U6bh+bLeh2a11u+OjvSqIjdDfoJZD40Abnqe6DVA9pboPaLvf8vAVZa1FOJxsI
oVJgkdhEBGbn26KqlghlnbkYBdjuGxtBoyuCvUAI8ybOTVnp423d+JYXkZOnSq9A
EdL3UGII4LWQ71p+QxF3tm5nuKfbulyibfoBNj57zk0hM2OVNdg=
=wGXc
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix dangling pointer to rb-tree of defragmented inodes after cleanup
- a followup fix to handle concurrent lseek on the same fd that could
leak memory under some conditions
- fix wrong root id reported in tree checker when verifying dref
* tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag
btrfs: tree-checker: fix the wrong output of data backref objectid
btrfs: fix race setting file private on concurrent lseek using same fd
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
=HUl5
-----END PGP SIGNATURE-----
Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called
during remount and unmount, we are freeing every node from the rbtree
that tracks inodes for auto defrag using
rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree
itself. So once we unlock the lock that protects the rbtree, we have a
tree pointing to a root that was freed (and a root pointing to freed
nodes, and their children pointing to other freed nodes, and so on).
This makes further access to the tree result in a use-after-free with
unpredictable results.
Fix this by initializing the rbtree to an empty root after the call to
rbtree_postorder_for_each_entry_safe() and before unlocking.
Fixes: 276940915f ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()")
Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There are some reports about invalid data backref objectids, the report
looks like this:
BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543
The data ref objectid is the inode number inside the subvolume.
But in above case, the value is completely sane, not really showing the
problem.
[CAUSE]
The root cause of the problem is the deprecated feature, inode cache.
This feature results a special inode number, -12ULL, and it's no longer
recognized by tree-checker, triggering the error.
The direct problem here is the output of data ref objectid. The value
shown is in fact the dref_root (subvolume id), not the dref_objectid
(inode number).
[FIX]
Fix the output to use dref_objectid instead.
Reported-by: Neil Parton <njparton@gmail.com>
Reported-by: Archange <archange@archlinux.org>
Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/
Fixes: f333a3c7e8 ("btrfs: tree-checker: validate dref root and objectid")
CC: stable@vger.kernel.org # 6.11
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.
The race happens like this:
1) A program opens a file descriptor for a file and then spawns two
threads (with the pthreads library for example), lets call them
task A and task B;
2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
file.c:find_desired_extent() while holding a read lock on the inode;
3) At the start of find_desired_extent(), it extracts the file's
private_data pointer into a local variable named 'private', which has
a value of NULL;
4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
in shared mode and enters file.c:find_desired_extent(), where it also
extracts file->private_data into its local variable 'private', which
has a NULL value;
5) Because it saw a NULL file private, task A allocates a private
structure and assigns to the file structure;
6) Task B also saw a NULL file private so it also allocates its own file
private and then assigns it to the same file structure, since both
tasks are using the same file descriptor.
At this point we leak the private structure allocated by task A.
Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
end ups being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.
Fix this by protecting the private assignment and check of a file while
holding the inode's spinlock and keep track of the task that allocated
the private, so that it's used only by that task in order to prevent
user-after-free issues with the cached state record as well as potentially
using it incorrectly in the future.
Fixes: 3c32c7212f ("btrfs: use cached state when looking for delalloc ranges with lseek")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
/BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
6UlPcJzgYg==
=uuLl
-----END PGP SIGNATURE-----
Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- MD changes via Song:
- md-bitmap refactoring (Yu Kuai)
- raid5 performance optimization (Artur Paszkiewicz)
- Other small fixes (Yu Kuai, Chen Ni)
- Add a sysfs entry 'new_level' (Xiao Ni)
- Improve information reported in /proc/mdstat (Mateusz Kusiak)
- NVMe changes via Keith:
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
- A syntax cleanup (Shen)
- Fix a Kconfig linking error (Arnd)
- New queue-depth quirk (Keith)
- Add missing unplug trace event (Keith)
- blk-iocost fixes (Colin, Konstantin)
- t10-pi modular removal and fixes (Alexey)
- Fix for potential BLKSECDISCARD overflow (Alexey)
- bio splitting cleanups and fixes (Christoph)
- Deal with folios rather than rather than pages, speeding up how the
block layer handles bigger IOs (Kundan)
- Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)
- Reduce zoned device overhead in ublk (Ming)
- Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)
- Fix regression in partition error pointer checking (Riyan)
- Add support for write zeroes and rotational status in nbd (Wouter)
- Add Yu Kuai as new BFQ maintainer. The scheduler has been
unmaintained for quite a while.
- Various sets of fixes for BFQ (Yu Kuai)
- Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
Yang)
* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
nvme-pci: qdepth 1 quirk
block: fix potential invalid pointer dereference in blk_add_partition
blk_iocost: make read-only static array vrate_adj_pct const
block: unpin user pages belonging to a folio at once
mm: release number of pages of a folio
block: introduce folio awareness and add a bigger size from folio
block: Added folio-ized version of bio_add_hw_page()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove bfq_log_bfqg()
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: fix uaf for accessing waker_bfqq after splitting
blk-throttle: support prioritized processing of metadata
blk-throttle: remove last_low_overflow_time
drbd: Add NULL check for net_conf to prevent dereference in state validation
nvme-tcp: fix link failure for TCP auth
blk-mq: add missing unplug trace event
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
...
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_folio_unlock_writer() is already calling
btrfs_folio_end_writer_lock() to do the heavy lifting work, the only
missing 0 writer check.
Thus there is no need to keep two different functions, move the 0 writer
check into btrfs_folio_end_writer_lock(), and remove
btrfs_folio_unlock_writer().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:
- btrfs_insert_orphan_item()
- btrfs_del_orphan_item()
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:
- calculate_emulated_zone_size()
- calculate_alloc_pointer()
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a DEFINE_FREE for struct btrfs_path. This defines a function that
can be called using the __free attribute. Define a macro
BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path
very clear.
The intended use is to define the auto free of path in cases where the
path is allocated somewhere at the beginning and freed either on all
error paths or at the end of the function.
int func() {
BTRFS_PATH_AUTO_FREE(path);
if (...)
return -ERROR;
path = alloc_path();
...
if (...)
return -ERROR;
...
return 0;
}
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_folio_end_all_writers() is only utilized in
extent_writepage() as a way to unlock all subpage range (for both
successful submission and error handling).
Meanwhile we have a similar function, btrfs_folio_end_writer_lock().
The difference is, btrfs_folio_end_writer_lock() expects a range that is
a subset of the already locked range.
This limit on btrfs_folio_end_writer_lock() is a little overkilled,
preventing it from being utilized for error paths.
So here we enhance btrfs_folio_end_writer_lock() to accept a superset of
the locked range, and only end the locked subset.
This means we can replace btrfs_folio_end_all_writers() with
btrfs_folio_end_writer_lock() instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Continue adding const to parameters. This is for clarity and minor
addition to safety. There are some minor effects, in the assembly code
and .ko measured on release config.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently BTRFS_I is a static inline function that takes a const inode
and returns btrfs inode, dropping the 'const' qualifier. This can break
assumptions of compiler though it seems there's no real case.
To make the parameter and return type consistent regardint const we can
use the container_of_const() that preserves it. However this would not
check the parameter type. To fix that use the same _Generic construct
but implement only the two expected types.
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places that check if we have the inode locked by doing:
ASSERT(inode_is_locked(vfs_inode));
This actually proved to be useful several times as if assertions are
enabled (and by default they are in many distros) it immediately triggers
a crash which is impossible for users to miss.
However that doesn't check if the lock is held by the calling task, so
the check passes if some other task locked the inode.
Using one of the lockdep functions to check the lock is held, like
lockdep_assert_held() for example, does check that the calling task
holds the lock, and if that's not the case it produces a warning and
stack trace in dmesg. However, despite the misleading "assert" in the
name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
a warning and splat in dmesg, which is easy to get unnoticed by users
who may have lockdep enabled.
So add a helper that does the ASSERT() and calls lockdep_assert_held()
immediately after and use it every where we check the inode is locked.
Like this if the lock is held by some other task we get the warning
in dmesg which is caught by fstests, very helpful during development,
and may also be occassionaly noticed by users with lockdep enabled.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even in case of failure we could've discarded some data and userspace
should be made aware of it, so copy fstrim_range to userspace
regardless.
Also make sure to update the trimmed bytes amount even if
btrfs_trim_free_extents fails.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover find_or_create_page() is compatible API, and it can
replaced with __filemap_get_folio(). Some interfaces have been converted
to use folio before, so the conversion operation from page can be
eliminated here.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And use folio_pos instead of page_offset, which is more
consistent with folio usage. At the same time, folio_test_private() can
handle folio directly without converting from page to folio first.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Use folio_pos instead of page_offset, which is more
consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Now clear_page_extent_mapped() can deal with a folio
directly, so change its name to clear_folio_extent_mapped().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs compression path is not really subpage compatible, every
thing is still done in page unit.
That's fine for regular sector size and subpage routine. As even for
subpage routine compression is only enabled if the whole range is page
aligned, so reading the page cache in page unit is totally fine.
However in preparation for the future subpage perfect compression
support, we need to change the compression routine to properly handle a
subpage range.
This patch would prepare both zlib and zstd to only read the subpage
range for compression.
Lzo is already doing subpage aware read, as lzo's on-disk format is
already sectorsize dependent.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only two differences between the two functions:
- btrfs_orig_bbio_end_io() does extra error propagation
This is mostly to allow tolerance for write errors.
- btrfs_orig_bbio_end_io() does extra pending_ios check
This check can handle both the original bio, or the cloned one.
(All accounting happens in the original one).
This makes btrfs_orig_bbio_end_io() a much safer call.
In fact we already had a double freeing error due to usage of
btrfs_bio_end_io() in the error path of btrfs_submit_chunk().
So just move the whole content of btrfs_orig_bbio_end_io() into
btrfs_bio_end_io().
For normal paths this brings no change, because they are already calling
btrfs_orig_bbio_end_io() in the first place.
For error paths (not only inside bio.c but also external callers), this
change will introduce extra checks, especially for external callers, as
they will error out without submitting the btrfs bio.
But considering it's already in the error path, such slower but much
safer checks are still an overall win.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we've held the extent lock throughout the entire read.
There's been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock, as long as we're
reading things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads. This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we hold the extent lock for the entire duration of a read.
This isn't really necessary in the buffered case, we're protected by the
page lock, however it's necessary for O_DIRECT.
For O_DIRECT reads, if we only locked the extent for the part where we
get the extent, we could potentially race with an O_DIRECT write in the
same region. This isn't really a problem, unless the read is delayed so
much that the write does the COW, unpins the old extent, and some other
application re-allocates the extent before the read is actually able to
be submitted. At that point at best we'd have a checksum mismatch, but
at worse we could read data that doesn't belong to us.
To address this potential race we need to make sure we don't have
overlapping, concurrent direct io reads and writes.
To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO
case in the same spot as the current extent lock. The writes will take
this while they're creating the ordered extent, which is also used to
make sure concurrent buffered reads or concurrent direct reads are not
allowed to occur, and drop it after the ordered extent is taken. For
reads it will act as the current read behavior for the EXTENT_LOCKED
bit, we set it when we're starting the read, we clear it in the end_io
to allow other direct writes to continue.
This still has the drawback of disallowing concurrent overlapping direct
reads from occurring, but that exists with the current extent locking.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to support dropping the extent lock during a read we need a way
to make sure that direct reads and direct writes for overlapping ranges
are protected from each other. To accomplish this introduce another
lock bit specifically for direct io. Subsequent patches will utilize
this to protect direct IO operations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Defrag ioctl passes readahead from the file, but autodefrag does not
have a file so the readahead state is allocated when needed.
The autodefrag loop in cleaner thread iterates over inodes so we can
simply provide an on-stack readahead state and will not need to allocate
it in btrfs_defrag_file(). The size is 32 bytes which is acceptable.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller inode_should_defrag() that passes NULL to
btrfs_add_inode_defrag() so we can drop it an simplify the code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The potential memory allocation failure is not a fatal error, skipping
autodefrag is fine and the caller inode_should_defrag() does not care
about the errors. Further writes can attempt to add the inode back to
the defragmentation list again.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate first node, remove it,
potentially rebalance tree until it's done. This allows to do a
conditional reschedule.
For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient but we can't reschedule and restart iteration because some of
the tree nodes would be already freed.
The cleanup operation is kmem_cache_free() which will likely take the
fast path for most objects so rescheduling should not be necessary.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Also update the misleading comment, the passed item is not freed, that's
what the caller does.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comparator function does not change its parameters, make them const.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previous patch freed the function name btrfs_submit_bio() so we can use
it for a helper that submits struct bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member btrfs_fs_info::subpage_info stores the cached bitmap start
position inside the merged bitmap.
However in reality there is only one thing depending on the sectorsize,
bitmap_nr_bits, which records the number of sectors that fit inside a
page.
The sequence of sub-bitmaps have fixed order, thus it's just a quick
multiplication to calculate the start position of each sub-bitmaps.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a few obvious grammar mistakes: a -> an, then -> than.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use xarray to track dirty extents to reduce the size of the struct
btrfs_qgroup_extent_record from 64 bytes to 40 bytes. The xarray is
more cache line friendly, it also reduces the complexity of insertion
and search code compared to rb tree.
Another change introduced is about error handling. Before this patch,
the result of btrfs_qgroup_trace_extent_nolock() is always a success. In
this patch, because of this function calls the function xa_store() which
has the possibility to fail, so mark qgroup as inconsistent if error
happened and then free preallocated memory. Also we preallocate memory
before spin_lock(), if memory preallcation failed, error handling is the
same the existing code.
Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clean up resources using goto to get rid of repeated code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for
debug") an internal macro GET_SUBPAGE_BITMAP() is introduced to grab the
bitmap of each attribute.
But that commit is using bitmap_cut() which will do the left shift of
the larger bitmap, causing incorrect values.
Thankfully this bitmap_cut() is only called for debug usage, and so far
it's not yet causing problem.
Fix it to use bitmap_read() to only grab the desired sub-bitmap.
Fixes: 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for debug")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in
btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup
to be able to calculate the number of copies.
So split out the code getting the number of copies from btrfs_num_copies()
into a helper called btrfs_chunk_map_num_copies() and directly call it
from btrfs_map_block() and btrfs_num_copies().
This saves us one rbtree lookup per btrfs_map_block() invocation.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it
does any pending work (subvolume deletion, delayed iputs, etc), however
it is waking up the transaction kthread, which in turn wakes up the
cleaner. Since we don't have any transaction to commit, as any ongoing
transaction was already committed when it called btrfs_sync_fs() and
the goal is just to wake up the cleaner thread, directly wake up the
cleaner instead of the transaction kthread.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only supports sectorsize 4K, 8K, 16K, 32K, 64K for now, thus for
systems with 4K page size, there is no way the fs is subpage (sectorsize
< PAGE_SIZE).
So here we define btrfs_is_subpage() different according to the
PAGE_SIZE:
- PAGE_SIZE > 4K
We may hit real subpage cases, define btrfs_is_subpage() as a regular
function and do the usual checks.
- PAGE_SIZE == 4K (no smaller PAGE_SIZE support AFAIK)
There is no way the fs is subpage, so just define btrfs_is_subpage()
as an inline function which always return false.
This saves about 7K bytes for x86_64 debug builds:
text data bss dec hex filename
Before: 1484452 168693 25776 1678921 199e49 fs/btrfs/btrfs.ko
After: 1476605 168445 25776 1670826 197eaa fs/btrfs/btrfs.ko
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that RAID stripe-tree lookup failures are not treated as a fatal issue
any more, change the RAID stripe-tree lookup error message to debug level.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Set rst_search_commit_root in the btrfs_io_stripe we're passing to
btrfs_map_block() in case we're doing data relocation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While
'is_scrub' describes the state of the io_stripe (it is a stripe submitted
by scrub) it does not describe the purpose, namely looking at the commit
root when searching RAID stripe-tree entries.
Renaming the stripe to rst_search_commit_root describes this purpose.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This just creates unnecessary noise and doesn't provide any insights into
debugging RAID stripe-tree related issues.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline. The specifically tricky part is that a caller
just up the stack converts ret == 0 to ret == 1 and then another
caller far up the callstack handles ret == 1 as a success, AND returns
without cleanup in that case, both of which "feel" unnatural and led to
the original bug.
Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When iterating the chunk maps when a device replace finishes we are doing
a full rbtree search for each chunk map, which is not the most efficient
thing to do, wasting CPU time. As we are holding a write lock on the tree
during the whole iteration, we can simply start from the first node in the
tree and then move to the next chunk map by doing a rb_next() call - the
only exception is when we need to reschedule, in which case we have to do
a full rbtree search since we dropped the write lock and the tree may have
changed (chunk maps may have been removed and the tree got rebalanced).
So just do that.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and
we never reschedule, which can result in monopolizing a CPU for too long
and blocking readers for too long (it's a rw lock, non-blocking).
So improve on this by rescheduling if necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can
be allocated and therefore we don't risk missing a new chunk map that
covers a range behind the last one we processed before rescheduling.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only use a page to copy in the data for the inline extent. Use a
folio for this instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a lot of functions here that use folios, update the
function to use __filemap_get_folio instead of find_get_page and then
use the folio directly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a folio some in this function, replace all page usage
with the folio and update the function to take the folio as an argument.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_get_extent takes a folio, update __get_extent_map to
take a folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this into read_inline_extent, change it to take a folio and
update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update uncompress_inline to take a folio and update it's usage
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now the fixup creator and consumer use folios, change this to use a
folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of a page, use a folio for btrfs_writepage_cow_fixup. We
already have a folio at the only caller, and the fixup worker uses
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function heavily messes with pages, instead update it to use a
folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios already, update it to take a folio and update
the rest of the function to use the folio instead of the page.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing in the page for ->locked_page, make it hold a
locked_folio and then update the users of async_chunk to act
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This just passes the page into the compressed machinery to keep track of
the locked page. Update this to take a folio and convert it to a page
where appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_cleanup_ordered_extents is operating mostly with folios,
update it to use a folio instead of a page, and the update the function
and the callers as appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We walk through pages in this function and clear ordered, and the
function for this uses folios. Update the function to use a folio for
this whole operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now all of the functions that use locked_page in run_delalloc_nocow take
a folio, update it to take a folio and update the caller.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With this we can pass the folio directly into cow_file_range().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert this to take a folio and pass it into all of the various cleanup
functions. Update the callers to pass in a folio instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we want the folio in this function, convert it to take a folio
directly and use that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass the folio into extent_write_locked_range, go ahead and take a
folio to pass along, and update the callers to pass in a folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that this mostly uses folios, update it to take folios, use the
folios that are passed in, and rename from process_one_page =>
process_one_folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This operates mostly on folios, update it to take a folio for the locked
folio instead of the page, rename from __process_pages_contig =>
__process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of the callers have a folio at this point, update
__unlock_for_delalloc to take a folio so that it's consistent with its
callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also rename lock_delalloc_pages => lock_delalloc_folios in the process,
now that it exclusively works on folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing in a page for locked_page, pass in the folio instead.
We only use the folio itself to validate some range assumptions, and
then pass it into other functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a folio heavily in this function, pass the folio in
directly and use it everywhere, only passing the page down to functions
that do not take a folio yet.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only need a folio now, make it take a folio as an argument and update
all of the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callers and callee's of this now all use folios, update it to take a
folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass in a folio instead, and use a folio instead of a page.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have a folio that we're using in btrfs_page_mkwrite, update
the rest of the function to use folio everywhere else. This will make
it easier on Willy when he drops page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Willy is going to get rid of page->index, and add_ra_bio_pages uses
page->index. Make his life easier by converting add_ra_bio_pages to use
folios so that we are no longer using page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've gotten most of the helpers updated to only take a folio,
update __extent_writepage to only deal in folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using pages for everything, find a folio and use that. This
makes things a bit cleaner as a lot of the functions calls here all take
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage_io uses page everywhere, but a lot of these functions
take a folio. Convert it to use the folio based helpers, and then
change it to take a folio as an argument and update its callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Willy is wanting to get rid of page->index, convert the writepage
tracepoint to take a folio so we can do folio->index instead of
page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the callers and helpers mostly use folio, convert
btrfs_do_readpage to take a folio, and rename it to btrfs_do_read_folio.
Update all of the page stuff to use the folio based helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callers of this helper are going to be converted to using a folio,
so adjust submit_extent_page to become submit_extent_folio and update it
to use all the relevant folio helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This already uses a folio internally, change it to take a folio as an
argument instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this helper function to set the page range uptodate once we're
done reading it, as well as run fsverity against it. Half of these
functions already take a folio, just rename this to end_folio_read and
then rework it to take a folio instead, and update everything
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're using the page for everything here. Convert this to use
the folio helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're the only user of readahead_page_batch(). Convert
btrfs_readahead() to use the folio based helpers to do readahead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ENHANCEMENT]
When mounting a btrfs filesystem, the filesystem opens the block device,
and if this fails, there is no message about it. Print a message about
it to help debugging.
[TEST]
I have a btrfs filesystem on three block devices, one of which is
write-protected, so regular mounts fail, but there is no message in
dmesg.
/dev/vdb normal
/dev/vdc write protected
/dev/vdd normal
Before patch:
$ sudo mount /dev/vdb /mnt/
mount: mount(2) failed: no such file or directory
$ sudo dmesg # Show only messages about missing block devices
....
[ 352.947196] BTRFS error (device vdb): devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533 missing
....
After patch:
$ sudo mount /dev/vdb /mnt/
mount: mount(2) failed: no such file or directory
$ sudo dmesg # Show bdev_file_open_by_path failed.
....
[ 352.944328] BTRFS error: failed to open device for path /dev/vdc with flags 0x3: -13
[ 352.947196] BTRFS error (device vdb): missing devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533
....
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions btrfs_uuid_scan_kthread() and btrfs_create_uuid_tree() are for
UUID tree rescan and creation, it's not suitable for volumes.[ch].
Move them to uuid-tree.[ch] instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At extent_map_block_end() we are calling the inline functions
extent_map_block_start() and extent_map_block_len() multiple times, which
results in expanding their code multiple times, increasing the compiled
code size and repeating the computations those functions do.
Improve this by caching their results in local variables.
The size of the module before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1755770 163800 16920 1936490 1d8c6a fs/btrfs/btrfs.ko
And after this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1755656 163800 16920 1936376 1d8bf8 fs/btrfs/btrfs.ko
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_delete_raid_extent() was written under the assumption, that it's
call-chain always passes a start, length tuple that matches a single
extent. But btrfs_delete_raid_extent() is called by
do_free_extent_accounting() which in turn is called by
__btrfs_free_extent().
But this call-chain passes in a start address and a length that can
possibly match multiple on-disk extents.
To make this possible, we have to adjust the start and length of each
btree node lookup, to not delete beyond the requested range.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update a stripe extent in case of an already existing logical address,
but with different physical addresses and/or device id instead of
bailing out with EEXIST.
This can happen i.e. in case of a device replace operation, where data
extents get rewritten to a new disk.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:
1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;
2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.
The race happens like this:
1) A user space program opens a file descriptor with O_DIRECT;
2) The program spawns 2 threads using libpthread for example;
3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.
4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;
5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;
6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;
7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;
8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.
The stack trace produced is the following:
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.
This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.
Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
The following C program triggers the issue:
$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
static int fd;
static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;
ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}
static void *fsync_loop(void *arg)
{
while (1) {
int ret;
ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}
int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;
if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}
ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}
ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}
while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}
return 0;
}
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.
Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8 ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs rejects to mount a FS if it finds a block group with a broken write
pointer (e.g, unequal write pointers on two zones of RAID1 block group).
Since such case can happen easily with a power-loss or crash of a system,
we need to handle the case more gently.
Handle such block group by making it unallocatable, so that there will be
no writes into it. That can be done by setting the allocation pointer at
the end of allocating region (= block_group->zone_capacity). Then, existing
code handle zone_unusable properly.
Having proper zone_capacity is necessary for the change. So, set it as fast
as possible.
We cannot handle RAID0 and RAID10 case like this. But, they are anyway
unable to read because of a missing stripe.
Fixes: 265f7237dd ("btrfs: zoned: allow DUP on meta-data block groups")
Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.1+
Reported-by: HAN Yuwei <hrx@bupt.moe>
Cc: Xuefer <xuefer@gmail.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The local extent changeset is passed to clear_record_extent_bits() where
it may have some additional memory dynamically allocated for ulist. When
qgroup is disabled, the memory is leaked because in this case the
changeset is not released upon __btrfs_qgroup_release_data() return.
Since the recorded contents of the changeset are not used thereafter, just
don't pass it.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Reported-by: syzbot+81670362c283f3dd889c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000aa8c0c060ade165e@google.com
Fixes: af0e2aab3b ("btrfs: qgroup: flush reservations during quota disable")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: David Sterba <dsterba@suse.com>
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.
Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.
To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if
none of the space infos is reclaimable (for example if periodic reclaim
is disabled, which is the default), so we return an undefined value.
This can be fixed my making btrfs_reclaim_sweep() not return any value
as well as do_reclaim_sweep() because:
1) do_reclaim_sweep() always returns 0, so we can make it return void;
2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't
care about its return value, and in its context there's nothing to do
about any errors anyway.
Therefore remove the return value from btrfs_reclaim_sweep() and
do_reclaim_sweep().
Fixes: e4ca3932ae ("btrfs: periodic block_group reclaim")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is an internal report that KASAN is reporting use-after-free, with
the following backtrace:
BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs]
Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45
CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
dump_stack_lvl+0x61/0x80
print_address_description.constprop.0+0x5e/0x2f0
print_report+0x118/0x216
kasan_report+0x11d/0x1f0
btrfs_check_read_bio+0xa68/0xb70 [btrfs]
process_one_work+0xce0/0x12a0
worker_thread+0x717/0x1250
kthread+0x2e3/0x3c0
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x11/0x20
Allocated by task 20917:
kasan_save_stack+0x37/0x60
kasan_save_track+0x10/0x30
__kasan_slab_alloc+0x7d/0x80
kmem_cache_alloc_noprof+0x16e/0x3e0
mempool_alloc_noprof+0x12e/0x310
bio_alloc_bioset+0x3f0/0x7a0
btrfs_bio_alloc+0x2e/0x50 [btrfs]
submit_extent_page+0x4d1/0xdb0 [btrfs]
btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
btrfs_readahead+0x29a/0x430 [btrfs]
read_pages+0x1a7/0xc60
page_cache_ra_unbounded+0x2ad/0x560
filemap_get_pages+0x629/0xa20
filemap_read+0x335/0xbf0
vfs_read+0x790/0xcb0
ksys_read+0xfd/0x1d0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Freed by task 20917:
kasan_save_stack+0x37/0x60
kasan_save_track+0x10/0x30
kasan_save_free_info+0x37/0x50
__kasan_slab_free+0x4b/0x60
kmem_cache_free+0x214/0x5d0
bio_free+0xed/0x180
end_bbio_data_read+0x1cc/0x580 [btrfs]
btrfs_submit_chunk+0x98d/0x1880 [btrfs]
btrfs_submit_bio+0x33/0x70 [btrfs]
submit_one_bio+0xd4/0x130 [btrfs]
submit_extent_page+0x3ea/0xdb0 [btrfs]
btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
btrfs_readahead+0x29a/0x430 [btrfs]
read_pages+0x1a7/0xc60
page_cache_ra_unbounded+0x2ad/0x560
filemap_get_pages+0x629/0xa20
filemap_read+0x335/0xbf0
vfs_read+0x790/0xcb0
ksys_read+0xfd/0x1d0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[CAUSE]
Although I cannot reproduce the error, the report itself is good enough
to pin down the cause.
The call trace is the regular endio workqueue context, but the
free-by-task trace is showing that during btrfs_submit_chunk() we
already hit a critical error, and is calling btrfs_bio_end_io() to error
out. And the original endio function called bio_put() to free the whole
bio.
This means a double freeing thus causing use-after-free, e.g.:
1. Enter btrfs_submit_bio() with a read bio
The read bio length is 128K, crossing two 64K stripes.
2. The first run of btrfs_submit_chunk()
2.1 Call btrfs_map_block(), which returns 64K
2.2 Call btrfs_split_bio()
Now there are two bios, one referring to the first 64K, the other
referring to the second 64K.
2.3 The first half is submitted.
3. The second run of btrfs_submit_chunk()
3.1 Call btrfs_map_block(), which by somehow failed
Now we call btrfs_bio_end_io() to handle the error
3.2 btrfs_bio_end_io() calls the original endio function
Which is end_bbio_data_read(), and it calls bio_put() for the
original bio.
Now the original bio is freed.
4. The submitted first 64K bio finished
Now we call into btrfs_check_read_bio() and tries to advance the bio
iter.
But since the original bio (thus its iter) is already freed, we
trigger the above use-after free.
And even if the memory is not poisoned/corrupted, we will later call
the original endio function, causing a double freeing.
[FIX]
Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(),
which has the extra check on split bios and do the proper refcounting
for cloned bios.
Furthermore there is already one extra btrfs_cleanup_bio() call, but
that is duplicated to btrfs_orig_bbio_end_io() call, so remove that
label completely.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a warning (probably on some older compiler version):
fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]: => 822:19
Initialize the variable to 0 although it's not necessary as it's either
properly set or not used after an error. The called function is in the
same file so this is a false alert but we want to fix all
-Wmaybe-uninitialized reports.
Link: https://lore.kernel.org/all/20240819070639.2558629-1-geert@linux-m68k.org/
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Sterba <dsterba@suse.com>
We have transient failures with btrfs/301, specifically in the part
where we do
for i in $(seq 0 10); do
write 50m to file
rm -f file
done
Sometimes this will result in a transient quota error, and it's because
sometimes we start writeback on the file which results in a delayed
iput, and thus the rm doesn't actually clean the file up. When we're
flushing the quota space we need to run the delayed iputs to make sure
all the unlinks that we think have completed have actually completed.
This removes the small window where we could fail to find enough space
in our quota.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_add_free_space_zoned() references and modifies bg's alloc_offset,
ro, and zone_unusable, but without taking the lock. It is mostly safe
because they monotonically increase (at least for now) and this function is
mostly called by a transaction commit, which is serialized by itself.
Still, taking the lock is a safer and correct option and I'm going to add a
change to reset zone_unusable while a block group is still alive. So, add
locking around the operations.
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[REPORT]
There is a corruption report that btrfs refused to mount a fs that has
overlapping dev extents:
BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272
BTRFS error (device sdc): failed to verify dev extents against chunks: -117
BTRFS error (device sdc): open_ctree failed
[CAUSE]
The direct cause is very obvious, there is a bad dev extent item with
incorrect length.
With btrfs check reporting two overlapping extents, the second one shows
some clue on the cause:
ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272
ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144
ERROR: errors found in extent allocation tree or chunk allocation
The second one looks like a bitflip happened during new chunk
allocation:
hex(2257707008000) = 0x20da9d30000
hex(2257707270144) = 0x20da9d70000
diff = 0x00000040000
So it looks like a bitflip happened during new dev extent allocation,
resulting the second overlap.
Currently we only do the dev-extent verification at mount time, but if the
corruption is caused by memory bitflip, we really want to catch it before
writing the corruption to the storage.
Furthermore the dev extent items has the following key definition:
(<device id> DEV_EXTENT <physical offset>)
Thus we can not just rely on the generic key order check to make sure
there is no overlapping.
[ENHANCEMENT]
Introduce dedicated dev extent checks, including:
- Fixed member checks
* chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3)
* chunk_objectid should always be
BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256)
- Alignment checks
* chunk_offset should be aligned to sectorsize
* length should be aligned to sectorsize
* key.offset should be aligned to sectorsize
- Overlap checks
If the previous key is also a dev-extent item, with the same
device id, make sure we do not overlap with the previous dev extent.
Reported: Stefan N <stefannnau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.
According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:
"Upon successful completion, unlink() shall mark for update the last data
modification and last file status change timestamps of the parent
directory. Also, if the file's link count is not 0, the last file status
change timestamp of the file shall be marked for update."
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add the __counted_by compiler attribute to the flexible array member
name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we a find that an extent is shared but its end offset is not sector
size aligned, then we don't clone it and issue write operations instead.
This is because the reflink (remap_file_range) operation does not allow
to clone unaligned ranges, except if the end offset of the range matches
the i_size of the source and destination files (and the start offset is
sector size aligned).
While this is not incorrect because send can only guarantee that a file
has the same data in the source and destination snapshots, it's not
optimal and generates confusion and surprising behaviour for users.
For example, running this test:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Use a file size not aligned to any possible sector size.
file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
dd if=/dev/random of=$MNT/foo bs=$file_size count=1
cp --reflink=always $MNT/foo $MNT/bar
btrfs subvolume snapshot -r $MNT/ $MNT/snap
rm -f /tmp/send-test
btrfs send -f /tmp/send-test $MNT/snap
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -vv -f /tmp/send-test $MNT
xfs_io -r -c "fiemap -v" $MNT/snap/bar
umount $MNT
Gives the following result:
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
write bar - offset=0 length=49152
write bar - offset=49152 length=49152
write bar - offset=98304 length=49152
write bar - offset=147456 length=49152
write bar - offset=196608 length=49152
write bar - offset=245760 length=49152
write bar - offset=294912 length=49152
write bar - offset=344064 length=49152
write bar - offset=393216 length=49152
write bar - offset=442368 length=49152
write bar - offset=491520 length=49152
write bar - offset=540672 length=49152
write bar - offset=589824 length=49152
write bar - offset=638976 length=49152
write bar - offset=688128 length=49152
write bar - offset=737280 length=49152
write bar - offset=786432 length=49152
write bar - offset=835584 length=49152
write bar - offset=884736 length=49152
write bar - offset=933888 length=49152
write bar - offset=983040 length=49152
write bar - offset=1032192 length=16389
chown bar - uid=0, gid=0
chmod bar - mode=0644
utimes bar
utimes
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2055]: 26624..28679 2056 0x1
There's no clone operation to clone extents from the file foo into file
bar and fiemap confirms there's no shared flag (0x2000).
So update send_write_or_clone() so that it proceeds with cloning if the
source and destination ranges end at the i_size of the respective files.
After this changes the result of the test is:
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
clone bar - source=foo source offset=0 offset=0 length=1048581
chown bar - uid=0, gid=0
chmod bar - mode=0644
utimes bar
utimes
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2055]: 26624..28679 2056 0x2001
A test case for fstests will also follow up soon.
Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the extent map shrinker can be run by any task when attempting
to allocate memory and there's enough memory pressure to trigger it.
To avoid too much latency we stop iterating over extent maps and removing
them once the task needs to reschedule. This logic was introduced in commit
b3ebb9b7e9 ("btrfs: stop extent map shrinker if reschedule is needed").
While that solved high latency problems for some use cases, it's still
not enough because with a too high number of tasks entering the extent map
shrinker code, either due to memory allocations or because they are a
kswapd task, we end up having a very high level of contention on some
spin locks, namely:
1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
roots to iterate over their inodes;
2) The spin lock of the xarray used to track open inodes for a root
(struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
be a red black tree and the spin lock was root->inode_lock;
3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
delayed iputs (calls btrfs_add_delayed_iput()).
Instead of allowing the extent map shrinker to be run by any task, make
it run only by kswapd tasks. This still solves the problem of running
into OOM situations due to an unbounded extent map creation, which is
simple to trigger by direct IO writes, as described in the changelog
of commit 956a17d9d0 ("btrfs: add a shrinker for extent maps"), and
by a similar case when doing buffered IO on files with a very large
number of holes (keeping the file open and creating many holes, whose
extent maps are only released when the file is closed).
Reported-by: kzd <kzd@56709.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
Reported-by: Octavia Togami <octavia.togami@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
Fixes: 956a17d9d0 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Tested-by: kzd <kzd@56709.net>
Tested-by: Octavia Togami <octavia.togami@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[REPORT]
There is a bug report that kernel is rejecting a mismatching inode mode
and its dir item:
[ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with
dir: inode mode=040700 btrfs type=2 dir type=0
[CAUSE]
It looks like the inode mode is correct, while the dir item type
0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all.
This may be caused by a memory bit flip.
[ENHANCEMENT]
Although tree-checker is not able to do any cross-leaf verification, for
this particular case we can at least reject any dir type with
BTRFS_FT_UNKNOWN.
So here we enhance the dir type check from [0, BTRFS_FT_MAX), to
(0, BTRFS_FT_MAX).
Although the existing corruption can not be fixed just by such enhanced
checking, it should prevent the same 0x2->0x0 bitflip for dir type to
reach disk in the future.
Reported-by: Kota <nospam@kota.moe>
Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the patch 78c52d9eb6 ("btrfs: check for refs on snapshot delete
resume") I added some code to handle file systems that had been
corrupted by a bug that incorrectly skipped updating the drop progress
key while dropping a snapshot. This code would check to see if we had
already deleted our reference for a child block, and skip the deletion
if we had already.
Unfortunately there is a bug, as the check would only check the on-disk
references. I made an incorrect assumption that blocks in an already
deleted snapshot that was having the deletion resume on mount wouldn't
be modified.
If we have 2 pending deleted snapshots that share blocks, we can easily
modify the rules for a block. Take the following example
subvolume a exists, and subvolume b is a snapshot of subvolume a. They
share references to block 1. Block 1 will have 2 full references, one
for subvolume a and one for subvolume b, and it belongs to subvolume a
(btrfs_header_owner(block 1) == subvolume a).
When deleting subvolume a, we will drop our full reference for block 1,
and because we are the owner we will drop our full reference for all of
block 1's children, convert block 1 to FULL BACKREF, and add a shared
reference to all of block 1's children.
Then we will start the snapshot deletion of subvolume b. We look up the
extent info for block 1, which checks delayed refs and tells us that
FULL BACKREF is set, so sets parent to the bytenr of block 1. However
because this is a resumed snapshot deletion, we call into
check_ref_exists(). Because check_ref_exists() only looks at the disk,
it doesn't find the shared backref for the child of block 1, and thus
returns 0 and we skip deleting the reference for the child of block 1
and continue. This orphans the child of block 1.
The fix is to lookup the delayed refs, similar to what we do in
btrfs_lookup_extent_info(). However we only care about whether the
reference exists or not. If we fail to find our reference on disk, go
look up the bytenr in the delayed refs, and if it exists look for an
existing ref in the delayed ref head. If that exists then we know we
can delete the reference safely and carry on. If it doesn't exist we
know we have to skip over this block.
This bug has existed since I introduced this fix, however requires
having multiple deleted snapshots pending when we unmount. We noticed
this in production because our shutdown path stops the container on the
system, which deletes a bunch of subvolumes, and then reboots the box.
This gives us plenty of opportunities to hit this issue. Looking at the
history we've seen this occasionally in production, but we had a big
spike recently thanks to faster machines getting jobs with multiple
subvolumes in the job.
Chris Mason wrote a reproducer which does the following
mount /dev/nvme4n1 /btrfs
btrfs subvol create /btrfs/s1
simoop -E -f 4k -n 200000 -z /btrfs/s1
while(true) ; do
btrfs subvol snap /btrfs/s1 /btrfs/s2
simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
btrfs subvol snap /btrfs/s2 /btrfs/s3
btrfs balance start -dusage=80 /btrfs
btrfs subvol del /btrfs/s2 /btrfs/s3
umount /btrfs
btrfsck /dev/nvme4n1 || exit 1
mount /dev/nvme4n1 /btrfs
done
On the second loop this would fail consistently, with my patch it has
been running for hours and hasn't failed.
I also used dm-log-writes to capture the state of the failure so I could
debug the problem. Using the existing failure case to test my patch
validated that it fixes the problem.
Fixes: 78c52d9eb6 ("btrfs: check for refs on snapshot delete resume")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmazbGYACgkQxWXV+ddt
WDsQAw//Z3XjjylTZPuHNk/AiMe5oochxB5T9ZracQOzG0o70gj1w/UQIZBkSzp1
66g8I4YdbwvEXKDg9Oi/GPDSON3GuhAiLXp+0Y/reeD/totgvrhROuJ3mIk5CZ0H
B4fIKH3xCKLQan26Opgju4qjum7+AR7ekFveM6GicxXXb3eAYALgoEFt63eZjZVu
7myak78gmBuK5QdGHH+onEhn+HfC57UTGBqu1bJsSOQC7dANkU+WzmgbH6FeOHqx
2T/lN/tu2tBBoF4zMvC472Zjmj4PnubNQnwwv0oJ8Z2Y0yIY95joZV0uXIXO72oz
BMQC6s0cltiTn1Tfe4iIWn+ZNjcfGAZO7aoD5NcJb2F/Mz5mZuMhtq3BND0wJ8/+
SuYk7PxX7tcOaFrDAn3Ne7XHsD7r5lLkFICXkzcNG4dqkBUxR3dN4Oi8KKMFrgCP
kTpf0/lAkYKYSrU86Yn1zhwRLaH8jm3fulVUY8i/p0tJnpsW8AmeME1Sk97Bhxbb
y6zt8+MPscPuQi3jPsBevaYd8Q8BIT34vVtU/jNmGAyqEv1wVYQRRK2Ma4sJJfNk
EEQV9i4VqXWLc17dcWVZrG6lEsVRIIBE/2Adhth6Myq73bkunpB9ZNNNZApf1T2j
Wn5gPzkuQd6FWrXe8V7oSN9rT1OPn9uHL6BmSUWhHUCuFeATgoc=
=03Gf
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix double inode unlock for direct IO sync writes (reported by
syzbot)
- fix root tree id/name map definitions, don't use fixed size buffers
for name (reported by -Werror=unterminated-string-initialization)
- fix qgroup reserve leaks in bufferd write path
- update scrub status structure more often so it can be reported in
user space more accurately and let 'resume' not repeat work
- in preparation to remove space cache v1 in the future print a warning
if it's detected
* tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid using fixed char array size for tree names
btrfs: fix double inode unlock for direct IO sync writes
btrfs: emit a warning about space cache v1 being deprecated
btrfs: fix qgroup reserve leaks in cow_file_range
btrfs: implement launder_folio for clearing dirty page reserve
btrfs: scrub: update last_physical after scrubbing one stripe
btrfs: factor out stripe length calculation into a helper
[BUG]
There is a bug report that using the latest trunk GCC 15, btrfs would cause
unterminated-string-initialization warning:
linux-6.6/fs/btrfs/print-tree.c:29:49: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization]
29 | { BTRFS_BLOCK_GROUP_TREE_OBJECTID, "BLOCK_GROUP_TREE" },
|
^~~~~~~~~~~~~~~~~~
[CAUSE]
To print tree names we have an array of root_name_map structure, which
uses "char name[16];" to store the name string of a tree.
But the following trees have names exactly at 16 chars length:
- "BLOCK_GROUP_TREE"
- "RAID_STRIPE_TREE"
This means we will have no space for the terminating '\0', and can lead
to unexpected access when printing the name.
[FIX]
Instead of "char name[16];" use "const char *" instead.
Since the name strings are all read-only data, and are all NULL
terminated by default, there is not much need to bother the length at
all.
Reported-by: Sam James <sam@gentoo.org>
Reported-by: Alejandro Colomar <alx@kernel.org>
Fixes: edde81f1ab ("btrfs: add raid stripe tree pretty printer")
Fixes: 9c54e80ddc ("btrfs: add code to support the block group root")
CC: stable@vger.kernel.org # 6.1+
Suggested-by: Alejandro Colomar <alx@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Alejandro Colomar <alx@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode logging or we get an error starting a transaction or an error when
flushing delalloc, we end up unlocking the inode when we shouldn't under
the 'out_release_extents' label, and then unlock it again at
btrfs_direct_write().
Fix that by checking if we have to skip inode unlocking under that label.
Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
Fixes: 939b656bc8 ("btrfs: fix corruption after buffer fault in during direct IO append write")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've been wanting to get rid of this for a while, add a message to
indicate that this feature is going away and when so we can finally have
a date when we're going to remove it. The output looks like this
BTRFS warning (device nvme0n1): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the buffered write path, the dirty page owns the qgroup reserve until
it creates an ordered_extent.
Therefore, any errors that occur before the ordered_extent is created
must free that reservation, or else the space is leaked. The fstest
generic/475 exercises various IO error paths, and is able to trigger
errors in cow_file_range where we fail to get to allocating the ordered
extent. Note that because we *do* clear delalloc, we are likely to
remove the inode from the delalloc list, so the inodes/pages to not have
invalidate/launder called on them in the commit abort path.
This results in failures at the unmount stage of the test that look like:
BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
------------[ cut here ]------------
WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W 6.10.0-rc7-gab56fde445b8 #21
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
Call Trace:
<TASK>
? close_ctree+0x222/0x4d0 [btrfs]
? __warn.cold+0x8e/0xea
? close_ctree+0x222/0x4d0 [btrfs]
? report_bug+0xff/0x140
? handle_bug+0x3b/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? close_ctree+0x222/0x4d0 [btrfs]
generic_shutdown_super+0x70/0x160
kill_anon_super+0x11/0x40
btrfs_kill_super+0x11/0x20 [btrfs]
deactivate_locked_super+0x2e/0xa0
cleanup_mnt+0xb5/0x150
task_work_run+0x57/0x80
syscall_exit_to_user_mode+0x121/0x130
do_syscall_64+0xab/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f916847a887
---[ end trace 0000000000000000 ]---
BTRFS error (device dm-8 state EA): qgroup reserved space leaked
Cases 2 and 3 in the out_reserve path both pertain to this type of leak
and must free the reserved qgroup data. Because it is already an error
path, I opted not to handle the possible errors in
btrfs_free_qgroup_data.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
In the buffered write path, dirty pages can be said to "own" the qgroup
reservation until they create an ordered_extent. It is possible for
there to be outstanding dirty pages when a transaction is aborted, in
which case there is no cancellation path for freeing this reservation
and it is leaked.
We do already walk the list of outstanding delalloc inodes in
btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.
This does *not* call btrfs_invalidate_folio(), as one might guess, but
rather calls launder_folio() and release_folio(). Since this is a
reservation associated with dirty pages only, rather than something
associated with the private bit (ordered_extent is cancelled separately
already in the cleanup transaction path), implementing this release
should be done via launder_folio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently sctx->stat.last_physical only got updated in the following
cases:
- When the last stripe of a non-RAID56 chunk is scrubbed
This implies a pitfall, if the last stripe is at the chunk boundary,
and we finished the scrub of the whole chunk, we won't update
last_physical at all until the next chunk.
- When a P/Q stripe of a RAID56 chunk is scrubbed
This leads the following two problems:
- sctx->stat.last_physical is not updated for a almost full chunk
This is especially bad, affecting scrub resume, as the resume would
start from last_physical, causing unnecessary re-scrub.
- "btrfs scrub status" will not report any progress for a long time
Fix the problem by properly updating @last_physical after each stripe is
scrubbed.
And since we're here, for the sake of consistency, use spin lock to
protect the update of @last_physical, just like all the remaining
call sites touching sctx->stat.
Reported-by: Michel Palleau <michel.palleau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMFk-+igFTv2E8svg=cQ6o3e6CrR5QwgQ3Ok9EyRaEvvthpqCQ@mail.gmail.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there are two locations which need to calculate the real
length of a stripe (which can be at the end of a chunk, and the chunk
size may not always be 64K aligned).
Factor them into a helper as we're going to have a third user soon.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmapmOQACgkQxWXV+ddt
WDsXVhAAi4X+xt3o4jcN3IAu08JCQAAyXnFWC3lvn7sqYjSrcccI6ZT4/gAbHss+
qrifakRGoYQ7fAjYBmhw48HqPmHtI2OQjIUDaIqHQOS68aXShBo9HiE460HRY4GT
QV/KT0w37E2/R0EDR9gyjLq3ZA3/raxN1n+LNCFhRWmtsAEZrk4XzsADWb05YkIq
1QBa92DzEhVpd04X8YHIYBgRidWbcYST6xhoWdyL9VZ1pzZsISq5LH67D4f/J1KU
gXNf+ZnF9DXsQnptJrMsjhx61seJ2F0/vozFZ+l6SjRr0jeysmrJI0dxqQc/hUga
gbLmdha6ztKdn03JOIL+lfdZYzICFl/2fekSWI2SNcag+TYszACjlFOyHusOgKsa
3qQwzVB699FheWO5nrOOvOtgq0ZqGsrIvhIXLhA7/bVpNavPnUB7IQCcs8n89ImQ
hUIebfX1FZnYXTrB6Hhm92LUb0lyLSlW1we3SSmaAMiy1TiXHG7hO2G/sIbOPAJC
5VzdHf0DEjzEdjmTrGOV7JBfy5JmMK56oN8viZS95p70DYxNGvEOhLs/8n5twpri
MWV8GElcOjjC+KnGnUH72spsnEKONpdzyccG9kiZEgkEi4csgHSxrkSmAehYD6i6
MFYk+i7jvZ1VsbOulmdGOLbHS7whxi9pWb/CT3KKF1Ei5/v07bU=
=JdOX
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix regression in extent map rework when handling insertion of
overlapping compressed extent
- fix unexpected file length when appending to a file using direct io
and buffer not faulted in
- in zoned mode, fix accounting of unusable space when flipping
read-only block group back to read-write
- fix page locking when COWing an inline range, assertion failure found
by syzbot
- fix calculation of space info in debugging print
- tree-checker, add validation of data reference item
- fix a few -Wmaybe-uninitialized build warnings
* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
btrfs: fix corruption after buffer fault in during direct IO append write
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
btrfs: do not subtract delalloc from avail bytes
btrfs: make cow_file_range_inline() honor locked_page on error
btrfs: fix corrupt read due to bad offset of a compressed extent map
btrfs: tree-checker: validate dref root and objectid
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5
m68k-gcc8/m68k-allmodconfig
mips-gcc8/mips-allmodconfig
powerpc-gcc5/powerpc-all{mod,yes}config
powerpc-gcc5/ppc64_defconfig
Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.
The problem happens like this:
1) We have an empty file, with size 0, for example;
2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
buffer is not currently faulted in;
3) We enter btrfs_direct_write(), lock the inode and call
generic_write_checks(), which calls generic_write_checks_count(), and
that function sets the iocb position to 0 with the following code:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
4) We call btrfs_dio_write() and enter into iomap, which will end up
calling btrfs_dio_iomap_begin() and that calls
btrfs_get_blocks_direct_write(), where we update the i_size of the
inode to 4096 bytes;
5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
the page of the write input buffer (at iomap_dio_bio_iter(), with a
call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
fault in the write buffer and then goto to the label 'relock';
7) We lock again the inode, do all the necessary checks again and call
again generic_write_checks(), which calls generic_write_checks_count()
again, and there we set the iocb's position to 4K, which is the current
i_size of the inode, with the following code pointed above:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
8) Then we go again to btrfs_dio_write() and enter iomap and the write
succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
the [0, 4K) range and an i_size of 8K, which goes against the
expectations of having the data written to the range [0, 4K) and get an
i_size of 4K.
Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit
51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").
A sample reproducer provided by a reporter is the following:
$ cat test.c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
if (argc < 2) {
fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
return 1;
}
int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
O_APPEND, 0644);
if (fd < 0) {
perror("creating test file");
return 1;
}
char *buf = mmap(NULL, 4096, PROT_READ,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ssize_t ret = write(fd, buf, 4096);
if (ret < 0) {
perror("pwritev2");
return 1;
}
struct stat stbuf;
ret = fstat(fd, &stbuf);
if (ret < 0) {
perror("stat");
return 1;
}
printf("size: %llu\n", (unsigned long long)stbuf.st_size);
return stbuf.st_size == 4096 ? 0 : 1;
}
A test case for fstests will be sent soon.
Reported-by: Hanna Czenczek <hreitz@redhat.com>
Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
Fixes: 8184620ae2 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Hanna Czenczek <hreitz@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.
Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.
This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.
Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block group's avail bytes printed when dumping a space info subtract
the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
btrfs_free_reserved_bytes(), it is added or subtracted along with
"reserved" for the delalloc case, which means the "delalloc_bytes" is a
part of the "reserved" bytes. So, excluding it to calculate the avail space
counts delalloc_bytes twice, which can lead to an invalid result.
Fixes: e50b122b83 ("btrfs: print available space for a block group when dumping a space info")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs buffered write path runs through __extent_writepage() which
has some tricky return value handling for writepage_delalloc().
Specifically, when that returns 1, we exit, but for other return values
we continue and end up calling btrfs_folio_end_all_writers(). If the
folio has been unlocked (note that we check the PageLocked bit at the
start of __extent_writepage()), this results in an assert panic like
this one from syzbot:
BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
------------[ cut here ]------------
kernel BUG at fs/btrfs/subpage.c:871!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
6.10.0-syzkaller-05505-gb1bc554e009e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 06/27/2024
RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
FS: 0000000000000000(0000) GS:ffff8880b9500000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__extent_writepage fs/btrfs/extent_io.c:1597 [inline]
extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
do_writepages+0x359/0x870 mm/page-writeback.c:2656
filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
__filemap_fdatawrite_range mm/filemap.c:430 [inline]
__filemap_fdatawrite mm/filemap.c:436 [inline]
filemap_flush+0xdf/0x130 mm/filemap.c:463
btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
__fput+0x24a/0x8a0 fs/file_table.c:422
task_work_run+0x24f/0x310 kernel/task_work.c:222
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0xa2f/0x27f0 kernel/exit.c:877
do_group_exit+0x207/0x2c0 kernel/exit.c:1026
__do_sys_exit_group kernel/exit.c:1037 [inline]
__se_sys_exit_group kernel/exit.c:1035 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
x64_sys_call+0x2634/0x2640
arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5f075b70c9
Code: Unable to access opcode bytes at
0x7f5f075b709f.
I was hitting the same issue by doing hundreds of accelerated runs of
generic/475, which also hits IO errors by design.
I instrumented that reproducer with bpftrace and found that the
undesirable folio_unlock was coming from the following callstack:
folio_unlock+5
__process_pages_contig+475
cow_file_range_inline.constprop.0+230
cow_file_range+803
btrfs_run_delalloc_range+566
writepage_delalloc+332
__extent_writepage # inlined in my stacktrace, but I added it here
extent_write_cache_pages+622
Looking at the bisected-to patch in the syzbot report, Josef realized
that the logic of the cow_file_range_inline error path subtly changing.
In the past, on error, it jumped to out_unlock in cow_file_range(),
which honors the locked_page, so when we ultimately call
folio_end_all_writers(), the folio of interest is still locked. After
the change, we always unlocked ignoring the locked_page, on both success
and error. On the success path, this all results in returning 1 to
__extent_writepage(), which skips the folio_end_all_writers() call,
which makes it OK to have unlocked.
Fix the bug by wiring the locked_page into cow_file_range_inline() and
only setting locked_page to NULL on success.
Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
Fixes: 0586d0a89e ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
We only had a couple of array[] declarations, and changing them to just
use 'MAX()' instead of 'max()' fixes the issue.
This will allow us to simplify our min/max macros enormously, since they
can now unconditionally use temporary variables to avoid using the
argument values multiple times.
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we attempt to insert a compressed extent map that has a range that
overlaps another extent map we have in the inode's extent map tree, we
can end up with an incorrect offset after adjusting the new extent map at
merge_extent_mapping() because we don't update the extent map's offset.
For example consider the following scenario:
1) We have a file extent item for a compressed extent covering the file
range [108K, 144K) and currently there's no corresponding extent map
in the inode's extent map tree;
2) The inode's size is 141K;
3) We have an encoded write (compressed) into the file range [120K, 128K),
which overlaps the existing file extent item. The encoded write creates
a matching extent map, adds it to the inode's extent map tree and
creates an ordered extent for it.
Note that the corresponding file extent item is added to the subvolume
tree only when the ordered extent completes (when executing
btrfs_finish_one_ordered());
4) We have a write into the file range [160K, 164K).
This writes increases the i_size of the file, and there's a hole
between the current i_size (141K) and the start offset of this write,
and since the old i_size is in the middle of the block [140K, 144K),
we have to write zeroes to the range [141K, 144K) (3072 bytes) and
therefore dirty that page.
We then call btrfs_set_extent_delalloc() with a start offset of 140K.
We then end up at btrfs_find_new_delalloc_bytes() which will call
btrfs_get_extent() for the range [140K, 144K);
5) The btrfs_get_extent() doesn't find any extent map in the inode's
extent map tree covering the range [140K, 144K), so it searches the
subvolume tree for any file extent items covering that range.
There it finds the file extent item for the range [108K, 144K),
creates a compressed extent map for that range and then calls
btrfs_add_extent_mapping() with that extent map and passes the
range [140K, 144K) via the "start" and "len" parameters;
6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
fails with -EEXIST because there's an extent map, created at step 2
for the [120K, 128K) range, that covers that overlaps with the range
of the given extent map ([108K, 144K)).
Then it does a lookup for extent map from step 2 add calls
merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
That adjust the extent map to a start offset of 128K and a length
of 16K (starting just after the extent map from step 2), but it does
not update the offset field of the extent map, leaving it with a value
of zero instead of updating to a value of 20K (128K - 108K = 20K).
As a result any read for the range [128K, 144K) can return
incorrect data since we read from a wrong section of the extent (unless
both the correct and incorrect ranges happen to have the same data).
So fix this by changing merge_extent_mapping() to update the extent map's
offset even if it's compressed. Also add a test case to the self tests.
This didn't happen before the patchset that does big changes in the extent
map structure (which includes the commit in the Fixes tag below) because
we kept track of the original start offset in the extent map (member
"orig_start") so we could always calculate the correct offset by
subtracting that offset from the start offset.
A test case for fstests that triggered this problem using send/receive
with compressed writes will be added soon.
Fixes: 3d2ac99224 ("btrfs: introduce new members for extent_map")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[CORRUPTION]
There is a bug report that btrfs flips RO due to a corruption in the
extent tree, the involved dumps looks like this:
item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
extent refs 3 gen 3678544 flags 1
ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
ref#1: shared data backref parent 1947073626112 count 1
ref#2: shared data backref parent 1156030103552 count 1
BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2
[CAUSE]
The corrupted entry is ref#0 of item 188.
The root number 13835058055282163977 is beyond the upper limit for root
items (the current limit is 1 << 48), and the objectid also looks
suspicious.
Only the offset and count is correct.
[ENHANCEMENT]
Although it's still unknown why we have such many bytes corrupted
randomly, we can still enhance the tree-checker for data backrefs by:
- Validate the root value
For now there should only be 3 types of roots can have data backref:
* subvolume trees
* data reloc trees
* root tree
Only for v1 space cache
- validate the objectid value
The objectid should be a valid inode number.
Hopefully we can catch such problem in the future with the new checkers.
Reported-by: Kai Krakow <hurikhan77@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
walkers") is known to cause a performance regression
(https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
Yu has a fix which I'll send along later via the hotfixes branch.
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability of
cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of the
zeropage in MAP_SHARED mappings. I don't see any runtime effects here -
more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Is anyone reading this stuff? If so, email me!
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large folio
userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's self
testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are
"mm: memcg: separate legacy cgroup v1 code and put under config
option" and
"mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of excessive
correctable memory errors. In order to permit userspace to monitor and
handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from migrate
folio" teaches the kernel to appropriately handle migration from
poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
refcount increments. So these paes can first be moved aside if they
reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
for much faster reading of vma information. The series is "query VMAs
from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance Yang
improves the kernel's presentation of developer information related to
multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and not
very useful feature from slab fault injection.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
=z0Lf
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these paes can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaak1cACgkQxWXV+ddt
WDvtDw/9GLN/pzAhH+rg0v51b4Ofxqa5QXuz7GIHb+UOqOdiZIjun+UmBst8794t
oZ0rhO1sjXVNwci5TG2oWUHcoAkjoPrVgJ/X5nzBHOzHJCQlV1kNzgsDyY+HFL9F
FBkJ6u1pCXUIxry6zp55NLWLMcjtFdMphjLXENh+GzUx5aAXl7VamK41B3+pHGfW
gDCNJ41vaaHqTGN8+9AADpZ2Oo4CUCtr8eNxGXZ0O5/0s5CdRJrjphtBTcGwbGch
SYVceySM4ElLmFi+hKFxx68MkWgnuTnBhZhbOM4V2fPFv0hLzXyQ7OrAk+MY8O2m
AHLx2jHKfCdONFeTUdt1cY5/wM6Afhy+N/iKtv3t2+Wi70C/LEHQRY1Op3rfx1rn
vOIbR9IXEHVi6ncO/E9c4LacSqLd/KSOaZn2Z/6i5wN+NY86CrCMuPwr6Pv0LL6x
aSHka47SFFTQLvHUdwmzexJ2YuosFdI0BhpiEu8ylAZTJ17yDJatk8wM+FB6Rfh1
vdPMRi93nVfrCwkU63Y2gqtJ3ncb3mbk/0uUdtMMflZJgjL0qkxTmcu3pSEdzIYR
gHFLVlns4cljL5PB9yMH/JjYjYn+Y6bCVvVyuhQZ3FAanUVOSFin/YWfo1bIi1et
ENNP+lhUKYvKLcz3QcnQpX3a6PkFPrmFi5wniAvxymrmVKJ3g3E=
=ROO7
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix for build breakage on 32bit platforms"
* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: change BTRFS_MOUNT_* flags to 64bit type
Currently the BTRFS_MOUNT_* flags are already beyond 32 bits, this is
going to cause compilation errors for some 32 bit systems, as their
unsigned long is only 32 bits long, thus flag
BTRFS_MOUNT_IGNORESUPERFLAGS overflows and can lead to errors.
Fix the problem by:
- Migrate all existing BTRFS_MOUNT_* flags to unsigned long long
- Migrate all mount option related variables to unsigned long long
* btrfs_fs_info::mount_opt
* btrfs_fs_context::mount_opt
* mount_opt parameter of btrfs_check_options()
* old_opts parameter of btrfs_remount_begin()
* old_opts parameter of btrfs_remount_cleanup()
* mount_opt parameter of btrfs_check_mountopts_zoned()
* mount_opt and opt parameters of check_ro_option()
Fixes: 32e6216512 ("btrfs: introduce new "rescue=ignoresuperflags" mount option")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaVN3MACgkQxWXV+ddt
WDtpIRAAl+1NjsEj8e5V/UYn8Jr06ujTOnrkR3PCTICxDHbUaMLkQEw21H0K/ogQ
3fOiEVpSlZOfKdYXtXaMQbC0jd/Af2eA10Uht96nAEjAtxu1uJ4cFZGu2meNdXZP
xUioivJ/CElMPH2aluG6FaQvUTqmhrEr8tSoYbxzQmUd434q9kqqyjtw1tfzYDG1
VDn2f7ykhpB/8P0aoqgWSshWTmaCzG0GkuI28o1o0iZUIF/P9TKdzxlLRW6BVHE7
T2oGLEQjN1GQbCH75L4IeNJDkCBVfcDcbZkUDJ/ae4Pt/jJQTFY53YIP9wXFZQnd
mdfHmK7Atpsk75ATftYSq+ENkbQ5fsuut5CD63u54gAqA4M1FncDXTAWS1Y30F76
P8juSCmsSy0o3gTflDIo/IMdntoh/JmncwwStF6oKzmyUZZzzarsqM8mc1P03ZNt
3ttlnbY7lC1TDAlD5J2wXE0INCT2pN+4C9IToWdRypeuLu6qrI7cQ0oylyp9OVQM
t9umTXm0B6s1cyqEDjJf0xJZS/JTHYwu7S4EmAJwicgiLpOjABVTmO8021rVmDJy
TAUu6yEhSsrTT6Dxm7/2Et1EEOKFF5hhsG1SiGD9oUIZK6B5+0waT+rbkEWl7osR
4/TAv2zX6tuCc7HIW0fQloM/6/Gyd5wcDVaQNDUzFA075uKstwY=
=k5d3
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights are new logic behind background block group reclaim,
automatic removal of qgroup after removing a subvolume and new
'rescue=' mount options.
The rest is optimizations, cleanups and refactoring.
User visible features:
- dynamic block group reclaim:
- tunable framework to avoid situations where eager data
allocations prevent creating new metadata chunks due to lack of
unallocated space
- reuse sysfs knob bg_reclaim_threshold (otherwise used only in
zoned mode) for a fixed value threshold
- new on/off sysfs knob "dynamic_reclaim" calculating the value
based on heuristics, aiming to keep spare working space for
relocating chunks but not to needlessly relocate partially
utilized block groups or reclaim newly allocated ones
- stats are exported in sysfs per block group type, files
"reclaim_*"
- this may increase IO load at unexpected times but the corner
case of no allocatable block groups is known to be worse
- automatically remove qgroup of deleted subvolumes:
- adjust qgroup removal conditions, make sure all related
subvolume data are already removed, or return EBUSY, also take
into account setting of sysfs drop_subtree_threshold
- also works in squota mode
- mount option updates: new modes of 'rescue=' that allow to mount
images (read-only) that could have been partially converted by user
space tools
- ignoremetacsums - invalid metadata checksums are ignored
- ignoresuperflags - super block flags that track conversion in
progress (like UUID or checksums)
Core:
- size of struct btrfs_inode is now below 1024 (on a release config),
improved memory packing and other secondary effects
- switch tracking of open inodes from rb-tree to xarray, minor
performance improvement
- reduce number of empty transaction commits when there are no dirty
data/metadata
- memory allocation optimizations (reduced numbers, reordering out of
critical sections)
- extent map structure optimizations and refactoring, more sanity
checks
- more subpage in zoned mode preparations or fixes
- general snapshot code cleanups, improvements and documentation
- tree-checker updates: more file extent ram_bytes fixes, continued
- raid-stripe-tree update (not backward compatible):
- remove extent encoding field from the structure, can be inferred
from other information
- requires btrfs-progs 6.9.1 or newer
- cleanups and refactoring
- error message updates
- error handling improvements
- return type and parameter cleanups and improvements"
* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits)
btrfs: fix extent map use-after-free when adding pages to compressed bio
btrfs: fix bitmap leak when loading free space cache on duplicate entry
btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
btrfs: move extent_range_clear_dirty_for_io() into inode.c
btrfs: enhance compression error messages
btrfs: fix data race when accessing the last_trans field of a root
btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
btrfs: introduce new "rescue=ignoresuperflags" mount option
btrfs: introduce new "rescue=ignoremetacsums" mount option
btrfs: output the unrecognized super block flags as hex
btrfs: remove unused Opt enums
btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
btrfs: fix the ram_bytes assignment for truncated ordered extents
btrfs: make validate_extent_map() catch ram_bytes mismatch
btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
btrfs: fix typo in error message in btrfs_validate_super()
btrfs: move the direct IO code into its own file
btrfs: pass a btrfs_inode to btrfs_set_prop()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
5iv+EdRSNA==
=wVi8
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)
- MD updates via Song
- sync_action fix and refactoring (Yu Kuai)
- Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)
- Fix loop detach/open race (Gulam)
- Fix lower control limit for blk-throttle (Yu)
- Add module descriptions to various drivers (Jeff)
- Add support for atomic writes for block devices, and statx reporting
for same. Includes SCSI and NVMe (John, Prasad, Alan)
- Add IO priority information to block trace points (Dongliang)
- Various zone improvements and tweaks (Damien)
- mq-deadline tag reservation improvements (Bart)
- Ignore direct reclaim swap writes in writeback throttling (Baokun)
- Block integrity improvements and fixes (Anuj)
- Add basic support for rust based block drivers. Has a dummy null_blk
variant for now (Andreas)
- Series converting driver settings to queue limits, and cleanups and
fixes related to that (Christoph)
- Cleanup for poking too deeply into the bvec internals, in preparation
for DMA mapping API changes (Christoph)
- Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
Ming, Zhu, Damien, Christophe, Chaitanya)
* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
floppy: add missing MODULE_DESCRIPTION() macro
loop: add missing MODULE_DESCRIPTION() macro
ublk_drv: add missing MODULE_DESCRIPTION() macro
xen/blkback: add missing MODULE_DESCRIPTION() macro
block/rnbd: Constify struct kobj_type
block: take offset into account in blk_bvec_map_sg again
block: fix get_max_segment_size() warning
loop: Don't bother validating blocksize
virtio_blk: Don't bother validating blocksize
null_blk: Don't bother validating blocksize
block: Validate logical block size in blk_validate_limits()
virtio_blk: Fix default logical block size fallback
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
block: pass a phys_addr_t to get_max_segment_size
block: add a bvec_phys helper
blk-lib: check for kill signal in ioctl BLKZEROOUT
block: limit the Write Zeroes to manually writing zeroes fallback
block: refacto blkdev_issue_zeroout
block: move read-only and supported checks into (__)blkdev_issue_zeroout
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
=eDC8
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode / dentry updates from Christian Brauner:
"This contains smaller performance improvements to inodes and dentries:
inode:
- Add rcu based inode lookup variants.
They avoid one inode hash lock acquire in the common case thereby
significantly reducing contention. We already support RCU-based
operations but didn't take advantage of them during inode
insertion.
Callers of iget_locked() get the improvement without any code
changes. Callers that need a custom callback can switch to
iget5_locked_rcu() as e.g., did btrfs.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Long-term we should pick up the effort to introduce more
fine-grained locking and possibly improve on the currently used
hash implementation.
- Start zeroing i_state in inode_init_always() instead of doing it in
individual filesystems.
This allows us to remove an unneeded lock acquire in new_inode()
and not burden individual filesystems with this.
dcache:
- Move d_lockref out of the area used by RCU lookup to avoid
cacheline ping poing because the embedded name is sharing a
cacheline with d_lockref.
- Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
up with 128 bytes in total"
* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix dentry size
vfs: move d_lockref out of the area used by RCU lookup
bcachefs: remove now spurious i_state initialization
xfs: remove now spurious i_state initialization in xfs_inode_alloc
vfs: partially sanitize i_state zeroing on inode creation
xfs: preserve i_state around inode_init_always in xfs_reinit_inode
btrfs: use iget5_locked_rcu
vfs: add rcu-based find_inode variants for iget ops
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaRcQgACgkQxWXV+ddt
WDvAGxAAknJAiREp/AmzhSwkhr+nSnqex0t+VVgsOaMTu0BEHO0xhoXc3l0QuSwS
u2AIqmOYyzr/UQVXCuatBqAE+5T4njtYAYIWwE825yquAtHNyuok9+Sjhfvxrwgs
HmNAN4Vvl2Fwds7xbWE8ug18QlssuRTIX8hk7ZtS6xo49g0tsbRX9KlzIPpsULD3
BOZa+2NJwC1PGVeNPf3p06rfiUkKfmFYgdDybe2zJ17uwsRz1CFSsaEEB35ys1f0
xYOS4epfcie03EGyZmYctuNxatUkk/J/1lTH4Z9JHwvPBvLK1U97SyJ11Wz2VQC/
8ar8gUDRYtjWdf6vn6AWBM4MseaYm9LDMlPhbSfvpDcWiclGTE64IOP4gKKr3mCh
WzlNSIR9I+tYgrhvcsCEzd7lvrSVHa7clwfooYgkEx0wl5lgbN0llAdtJWG3eeLn
3stxje2FqqXsFNj5N9SrPy7f7t6xF2i8vwk4qh6EpRuT4yuatb+nWzDm9EuTT/Bc
P+zM1KFp7Blk7Zw/Tpw0O9qjt1whStY2xrqcMzg539WVo45MmuFEFzmGBRwZsH55
QPGLIjXPpt728AgMdhBFEG0DtWaiA3AOI/C5nYOtLu92aZVBmbaX7/d/GpJv3Vvd
Ihvr9s1c49YvTZsIS0T0tkq/7LXZi/SToRJDjhP5HCrRGf7A30Y=
=gtsF
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fix a regression in extent map shrinker behaviour.
In the past weeks we got reports from users that there are huge
latency spikes or freezes. This was bisected to newly added shrinker
of extent maps (it was added to fix a build up of the structures in
memory).
I'm assuming that the freezes would happen to many users after release
so I'd like to get it merged now so it's in 6.10. Although the diff
size is not small the changes are relatively straightforward, the
reporters verified the fixes and we did testing on our side.
The fixes:
- adjust behaviour under memory pressure and check lock or scheduling
conditions, bail out if needed
- synchronize tracking of the scanning progress so inode ranges are
not skipped or work duplicated
- do a delayed iput when scanning a root so evicting an inode does
not slow things down in case of lots of dirty data, also fix
lockdep warning, a deadlock could happen when writing the dirty
data would need to start a transaction"
* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid races when tracking progress for extent map shrinking
btrfs: stop extent map shrinker if reschedule is needed
btrfs: use delayed iput during extent map shrinking
We store the progress (root and inode numbers) of the extent map shrinker
in fs_info without any synchronization but we can have multiple tasks
calling into the shrinker during memory allocations when there's enough
memory pressure for example.
This can result in a task A reading fs_info->extent_map_shrinker_last_ino
after another task B updates it, and task A reading
fs_info->extent_map_shrinker_last_root before task B updates it, making
task A see an odd state that isn't necessarily harmful but may make it
skip certain inode ranges or do more work than necessary by going over
the same inodes again. These unprotected accesses would also trigger
warnings from tools like KCSAN.
So add a lock to protect access to these progress fields.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent map shrinker can be called in a variety of contexts where we
are under memory pressure, and of them is when a task is trying to
allocate memory. For this reason the shrinker is typically called with a
value of struct shrink_control::nr_to_scan that is much smaller than what
we return in the nr_cached_objects callback of struct super_operations
(fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
not take a long time and cause high latencies. However we can still take
a lot of time in the shrinker even for a limited amount of nr_to_scan:
1) When traversing the red black tree that tracks open inodes in a root,
as for example with millions of open inodes we get a deep tree which
takes time searching for an inode;
2) Iterating over the extent map tree, which is a red black tree, of an
inode when doing the rb_next() calls and when removing an extent map
from the tree, since often that requires rebalancing the red black
tree;
3) When trying to write lock an inode's extent map tree we may wait for a
significant amount of time, because there's either another task about
to do IO and searching for an extent map in the tree or inserting an
extent map in the tree, and we can have thousands or even millions of
extent maps for an inode. Furthermore, there can be concurrent calls
to the shrinker so the lock might be busy simply because there is
already another task shrinking extent maps for the same inode;
4) We often reschedule if we need to, which further increases latency.
So improve on this by stopping the extent map shrinking code whenever we
need to reschedule and make it skip an inode if we can't immediately lock
its extent map tree.
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we failed to link a free space entry because there's already a
conflicting entry for the same offset, we free the free space entry but
we don't free the associated bitmap that we had just allocated before.
Fix that by freeing the bitmap before freeing the entry.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(), as
we expected all involved folios to be still locked, thus no folio should be
missing.
However for extent_range_clear_dirty_for_io() itself, we can skip the
missing folio and handle the remaining ones, and return an error if
there is anything wrong.
Remove the BUG_ON() and let the caller to handle the error.
In the caller we do not have a quick way to cleanup the error, but all
the compression routines would handle the missing folio as an error and
properly error out, so we only need to do an ASSERT() for developers,
while for non-debug build the compression routine would handle the
error correctly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add more verbose and specific messages to all main error points in
compression code for all algorithms. Currently there's no way to know
which inode is affected or where in the data errors happened.
The messages follow a common format:
- what happened
- error code if relevant
- root and inode
- additional data like offsets or lengths
There's no helper for the messages as they differ in some details and
that would be cumbersome to generalize to a single function. As all the
errors are "almost never happens" there are the unlikely annotations
done as compression is hot path.
Signed-off-by: David Sterba <dsterba@suse.com>
KCSAN complains about a data race when accessing the last_trans field of a
root:
[ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
[ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
[ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
[ 199.555999] start_transaction+0x154/0xcd8 [btrfs]
[ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.559123] touch_atime+0x16c/0x1e0
[ 199.559151] pipe_read+0x6a8/0x7d0
[ 199.559179] vfs_read+0x466/0x498
[ 199.559204] ksys_read+0x108/0x150
[ 199.559230] __s390x_sys_read+0x68/0x88
[ 199.559257] do_syscall+0x1c6/0x210
[ 199.559286] __do_syscall+0xc8/0xf0
[ 199.559318] system_call+0x70/0x98
[ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
[ 199.559464] record_root_in_trans+0x196/0x228 [btrfs]
[ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
[ 199.561097] start_transaction+0x154/0xcd8 [btrfs]
[ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.564277] file_update_time+0xb8/0xf0
[ 199.564301] pipe_write+0x8ac/0xab8
[ 199.564326] vfs_write+0x33c/0x588
[ 199.564349] ksys_write+0x108/0x150
[ 199.564372] __s390x_sys_write+0x68/0x88
[ 199.564397] do_syscall+0x1c6/0x210
[ 199.564424] __do_syscall+0xc8/0xf0
[ 199.564452] system_call+0x70/0x98
This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless and in the
worst case it can make us do extra locking (btrfs_record_root_in_trans())
trigger some warnings at ctree.c or do extra work during relocation - this
would probably only happen in case of load or store tearing.
So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array(). And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.
Rename the @extra_gfp parameter to @nofail to indicate that.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_alloc_folio_array() is only utilized in
btrfs_submit_compressed_read() and no other location, and the only
caller is not utilizing the @extra_gfp parameter.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new mount option allows the kernel to skip the super flags check,
it's mostly to allow the kernel to do a rescue mount of an interrupted
checksum conversion.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce "rescue=ignoremetacsums" to ignore metadata csums, all the
other metadata sanity checks are still kept as is.
This new mount option is mostly to allow the kernel to mount an
interrupted checksum conversion (at the metadata csum overwrite stage).
And since the main part of metadata sanity checks is inside
tree-checker, we shouldn't lose much safety, and the new mount option is
rescue mount option it requires full read-only mount.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most of the extra super block flags are beyond 32bits (from
CHANGING_FSID_V2 to CHANGING_*_CSUMS), thus using %llu is not only too
long and pretty hard to read.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following three Opt_* enums haven't been utilized since the port to
new mount API:
- Opt_ignorebadroots
- Opt_ignoredatacsums
- Opt_rescue_all
All those enums are from the old day where we have dedicated mount
options, nowadays they have been moved to "rescue=" mount option
groups, and no more global tokens for them.
So we can safely remove them now.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is to ensure non-compressed file extents (both regular and
prealloc) should have matching ram_bytes and disk_num_bytes.
This is only for CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT case,
furthermore this will not return error, but just a kernel warning to
inform developers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[HICCUP]
After adding extra checks on btrfs_file_extent_item::ram_bytes to
tree-checker, running fsstress leads to tree-checker warning at write time,
as we created file extent items with an invalid ram_bytes.
All those offending file extents have offset 0, and ram_bytes matching
num_bytes, and smaller than disk_num_bytes.
This would also trigger the recently enhanced btrfs-check, which catches
such mismatches and report them as minor errors.
[CAUSE]
When a folio/page is invalidated and it is part of a submitted OE, we
mark the OE truncated just to the beginning of the folio/page.
And for truncated OE, we insert the file extent item with incorrect
value for ram_bytes (using num_bytes instead of the usual value).
This is not a big deal for end users, as we do not utilize the ram_bytes
field for regular non-compressed extents.
This mismatch is just a small violation against on-disk format.
[FIX]
Fix it by removing the override on btrfs_file_extent_item::ram_bytes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously validate_extent_map() is only to catch bugs related to
extent_map member cleanups.
But with recent btrfs-check enhancement to catch ram_bytes mismatch with
disk_num_bytes, it would be much better to catch such extent maps
earlier.
So this patch adds extra ram_bytes validation for extent maps.
Please note that, older filesystems with such mismatch won't trigger this error:
- extent_map::ram_bytes is already fixed
Previous patch has already fixed the ram_bytes for affected file
extents.
So this enhanced sanity check should not affect end users.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[HICCUP]
Kernels can create file extent items with incorrect ram_bytes like this:
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 13631488 nr 32768
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Thankfully kernel can handle them properly, as in that case ram_bytes is
not utilized at all.
[ENHANCEMENT]
Since the hiccup is not going to cause any data-loss and is only a minor
violation of on-disk format, here we only need to ignore the incorrect
ram_bytes value, and use the correct one from
btrfs_file_extent_item::disk_num_bytes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[HICCUP]
Before commit 85de2be7129c ("btrfs: remove extent_map::block_start
member"), we utilized @bytenr variable inside
btrfs_extent_item_to_extent_map() to calculate block_start.
But that commit removed block_start completely, we have no need to
advance @bytenr at all.
[ENHANCEMENT]
- Rename @bytenr as @disk_bytenr
- Only declare @disk_bytenr inside the if branch
- Make @disk_bytenr const and remove the modification on it
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a typo in an error message when checking the block group tree
feature, it mentions fres-space-tree instead of free-space-tree. Fix
that.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>