In btrfs_ref_tree_mod(), when !parent 're' was allocated through
kmalloc(). In the following code, if an error occurs, the execution will
be redirected to 'out' or 'out_unlock' and the function will be exited.
However, on some of the paths, 're' are not deallocated and may lead to
memory leaks.
For example: lookup_block_entry() for 'be' returns NULL, the out label
will be invoked. During that flow ref and 'ra' are freed but not 're',
which can potentially lead to a memory leak.
CC: stable@vger.kernel.org # 5.10+
Reported-and-tested-by: syzbot+d66de4cbf532749df35f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d66de4cbf532749df35f
Signed-off-by: Bragatheswaran Manickavel <bragathemanick0908@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a feature request to add dmesg output when unmounting a btrfs.
There are several alternative methods to do the same thing, but with
their own problems:
- Use eBPF to watch btrfs_put_super()/open_ctree()
Not end user friendly, they have to dip their head into the source
code.
- Watch for directory /sys/fs/<uuid>/
This is way more simple, but still requires some simple device -> uuid
lookups. And a script needs to use inotify to watch /sys/fs/.
Compared to all these, directly outputting the information into dmesg
would be the most simple one, with both device and UUID included.
And since we're here, also add the output when mounting a filesystem for
the first time for parity. A more fine grained monitoring of subvolume
mounts should be done by another layer, like audit.
Now mounting a btrfs with all default mkfs options would look like this:
[81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
[81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm
[81.908258] BTRFS info (device dm-8): using free space tree
[81.912644] BTRFS info (device dm-8): auto enabling async discard
[81.913277] BTRFS info (device dm-8): checking UUID tree
[91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
CC: stable@vger.kernel.org # 5.4+
Link: https://github.com/kdave/btrfs-progs/issues/689
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a regression that after commit 6ed05643dd ("btrfs:
create qgroup earlier in snapshot creation") we can trigger transaction
abort during snapshot creation:
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
Modules linked in:
CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g305230142ae0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
Call Trace:
<TASK>
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967
btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440
create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041
__btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294
btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321
btrfs_ioctl+0xbbf/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f2f791127b9
</TASK>
[CAUSE]
The error number is -EEXIST, which can happen for qgroup if there is
already an existing qgroup and then we're trying to create a snapshot
for it.
[FIX]
In that case, we can continue creating the snapshot, although it may
lead to qgroup inconsistency, it's not so critical to abort the current
transaction.
So in this case, we can just ignore the non-critical errors, mostly -EEXIST
(there is already a qgroup).
Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com
Fixes: 6ed05643dd ("btrfs: create qgroup earlier in snapshot creation")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that ntfs2btrfs had a bug that it can lead to
transaction abort and the filesystem flips to read-only.
[CAUSE]
For inline backref items, kernel has a strict requirement for their
ordered, they must follow the following rules:
- All btrfs_extent_inline_ref::type should be in an ascending order
- Within the same type, the items should follow a descending order by
their sequence number
For EXTENT_DATA_REF type, the sequence number is result from
hash_extent_data_ref().
For other types, their sequence numbers are
btrfs_extent_inline_ref::offset.
Thus if there is any code not following above rules, the resulted
inline backrefs can prevent the kernel to locate the needed inline
backref and lead to transaction abort.
[FIX]
Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
ability to detect such problems.
For kernel, let's be more noisy and be more specific about the order, so
that the next time kernel hits such problem we would reject it in the
first place, without leading to transaction abort.
Link: https://github.com/kdave/btrfs-progs/pull/622
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using simple quotas we are not supposed to allocate qgroup records
when adding delayed references. However we allocate them if either mode
of quotas is enabled (the new simple one or the old one), but then we
never free them because running the accounting, which frees the records,
is only run when using the old quotas (at btrfs_qgroup_account_extents()),
resulting in a memory leak of the records allocated when adding delayed
references.
Fix this by allocating the records only if the old quotas mode is enabled.
Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
mode is not enabled - meaning the caller has to free the record.
Fixes: 182940f4f4 ("btrfs: qgroup: add new quota mode for simple quotas")
Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing qgroup accounting for an extent, we take the spinlock
fs_info->qgroup_lock and then add qgroups to the local list (iterator)
named "qgroups". These qgroups are found in the fs_info->qgroup_tree
rbtree. After we're done, we unlock fs_info->qgroup_lock and then call
qgroup_iterator_nested_clean(), which will iterate over all the qgroups
added to the local list "qgroups" and then delete them from the list.
Deleting a qgroup from the list can however result in a use-after-free
if a qgroup remove operation happens after we unlock fs_info->qgroup_lock
and before or while we are at qgroup_iterator_nested_clean().
Fix this by calling qgroup_iterator_nested_clean() while still holding
the lock fs_info->qgroup_lock - we don't need it under the 'out' label
since before taking the lock the "qgroups" list is always empty. This
guarantees safety because btrfs_remove_qgroup() takes that lock before
removing a qgroup from the rbtree fs_info->qgroup_tree.
This was reported by syzbot with the following stack traces:
BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48
CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4f3ff #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Workqueue: btrfs-qgroup-rescan btrfs_work_helper
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x163/0x540 mm/kasan/report.c:475
kasan_report+0x175/0x1b0 mm/kasan/report.c:588
__list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
__list_del_entry_valid include/linux/list.h:124 [inline]
__list_del_entry include/linux/list.h:215 [inline]
list_del_init include/linux/list.h:287 [inline]
qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline]
btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883
qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline]
btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604
btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703
worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
</TASK>
Allocated by task 6355:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383
kmalloc include/linux/slab.h:600 [inline]
kzalloc include/linux/slab.h:721 [inline]
btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209
btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Freed by task 6355:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522
____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1800 [inline]
slab_free_freelist_hook mm/slub.c:1826 [inline]
slab_free mm/slub.c:3809 [inline]
__kmem_cache_free+0x263/0x3a0 mm/slub.c:3822
btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787
btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2667 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
Second to last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2667 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
The buggy address belongs to the object at ffff888027e42000
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 176 bytes inside of
freed 512-byte region [ffff888027e42000, ffff888027e42200)
The buggy address belongs to the physical page:
page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40
head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536
prep_new_page mm/page_alloc.c:1543 [inline]
get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170
__alloc_pages+0x255/0x670 mm/page_alloc.c:4426
alloc_slab_page+0x6a/0x160 mm/slub.c:1870
allocate_slab mm/slub.c:2017 [inline]
new_slab+0x84/0x2f0 mm/slub.c:2070
___slab_alloc+0xc85/0x1310 mm/slub.c:3223
__slab_alloc mm/slub.c:3322 [inline]
__slab_alloc_node mm/slub.c:3375 [inline]
slab_alloc_node mm/slub.c:3468 [inline]
__kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517
kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098
kmalloc include/linux/slab.h:600 [inline]
kzalloc include/linux/slab.h:721 [inline]
kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670
do_dentry_open+0x8fd/0x1590 fs/open.c:948
do_open fs/namei.c:3622 [inline]
path_openat+0x2845/0x3280 fs/namei.c:3779
do_filp_open+0x234/0x490 fs/namei.c:3809
do_sys_openat2+0x13e/0x1d0 fs/open.c:1440
do_sys_open fs/open.c:1455 [inline]
__do_sys_openat fs/open.c:1471 [inline]
__se_sys_openat fs/open.c:1466 [inline]
__x64_sys_openat+0x247/0x290 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1136 [inline]
free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312
free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405
discard_slab mm/slub.c:2116 [inline]
__unfreeze_partials+0x1dc/0x220 mm/slub.c:2655
put_cpu_partial+0x17b/0x250 mm/slub.c:2731
__slab_free+0x2b6/0x390 mm/slub.c:3679
qlink_free mm/kasan/quarantine.c:166 [inline]
qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185
kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292
__kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762
slab_alloc_node mm/slub.c:3478 [inline]
slab_alloc mm/slub.c:3486 [inline]
__kmem_cache_alloc_lru mm/slub.c:3493 [inline]
kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502
getname_flags+0xbc/0x4f0 fs/namei.c:140
do_sys_openat2+0xd2/0x1d0 fs/open.c:1434
do_sys_open fs/open.c:1455 [inline]
__do_sys_openat fs/open.c:1471 [inline]
__se_sys_openat fs/open.c:1466 [inline]
__x64_sys_openat+0x247/0x290 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Memory state around the buggy address:
ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com
Fixes: dce28769a3 ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()")
CC: stable@vger.kernel.org # 6.6
Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At device_list_add() we allocate a btrfs_fs_devices structure and then
before checking if the allocation failed (pointer is ERR_PTR(-ENOMEM)),
we dereference the error pointer in a memcpy() argument if the feature
BTRFS_FEATURE_INCOMPAT_METADATA_UUID is enabled.
Fix this by checking for an allocation error before trying the memcpy().
Fixes: f7361d8c3f ("btrfs: sipmlify uuid parameters of alloc_fs_devices()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a compilation warning reported on commit ae76d8e3e1 ("btrfs:
scrub: fix grouping of read IO"), where gcc (14.0.0 20231022 experimental)
is reporting the following uninitialized variable:
fs/btrfs/scrub.c: In function ‘scrub_simple_mirror.isra’:
fs/btrfs/scrub.c:2075:29: error: ‘found_logical’ may be used uninitialized [-Werror=maybe-uninitialized[https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmaybe-uninitialized]]
2075 | cur_logical = found_logical + BTRFS_STRIPE_LEN;
fs/btrfs/scrub.c:2040:21: note: ‘found_logical’ was declared here
2040 | u64 found_logical;
| ^~~~~~~~~~~~~
[CAUSE]
This is a false alert, as @found_logical is passed as parameter
@found_logical_ret of function queue_scrub_stripe().
As long as queue_scrub_stripe() returned 0, we would update
@found_logical_ret. And if queue_scrub_stripe() returned >0 or <0, the
caller would not utilized @found_logical, thus there should be nothing
wrong.
Although the triggering gcc is still experimental, it looks like the
extra check on "if (found_logical_ret)" can sometimes confuse the
compiler.
Meanwhile the only caller of queue_scrub_stripe() is always passing a
valid pointer, there is no need for such check at all.
[FIX]
Although the report itself is a false alert, we can still make it more
explicit by:
- Replace the check for @found_logical_ret with ASSERT()
- Initialize @found_logical to U64_MAX
- Add one extra ASSERT() to make sure @found_logical got updated
Link: https://lore.kernel.org/linux-btrfs/87fs1x1p93.fsf@gentoo.org/
Fixes: ae76d8e3e1 ("btrfs: scrub: fix grouping of read IO")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dave reported a bug where we were aborting the transaction while trying
to cleanup the squota reservation for an extent.
This turned out to be because we're doing btrfs_header_owner(next) in
do_walk_down when we decide to free the block. However in this code
block we haven't explicitly read next, so it could be stale. We would
then get whatever garbage happened to be in the pages at this point.
The commit that introduced that is "btrfs: track owning root in
btrfs_ref".
Fix this by saving the owner_root when we do the
btrfs_lookup_extent_info(). We always do this in do_walk_down, it is
how we make the decision of whether or not to delete the block. This is
cheap because we've already done the extent item lookup at this point,
so it's straightforward to just grab the owner root as well.
Then we can use this when deleting the metadata block without needing to
force a read of the extent buffer to find the owner.
This fixes the problem that Dave reported.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running the fio command below on a ZNS device results in "Resource
temporarily unavailable" error.
$ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \
--rw=write --ioengine=libaio --iodepth=128 --direct=1
fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216
fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216
...
This happens because -EAGAIN error returned from btrfs_reserve_extent()
called from btrfs_new_extent_direct() is spilling over to the userland.
btrfs_reserve_extent() returns -EAGAIN when there is no active zone
available. Then, the caller should wait for some other on-going IO to
finish a zone and retry the allocation.
This logic is already implemented for buffered write in cow_file_range(),
but it is missing for the direct IO counterpart. Implement the same logic
for it.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2ce543f478 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a check of the write pointer vs the zone size to reject an invalid
write pointer. However, as of now, we have RAID0/RAID10 on the zoned
mode, we can have a block group whose size is larger than the zone size.
As an equivalent check against the block group's zone_capacity is already
there, we can just drop this invalid check.
Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's more obvious to return a literal zero instead of "return ret;".
Plus Smatch complains that ret could be uninitialized if the
ordered_extent->bioc_list list is empty and this silences that warning.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the tree search v2 ioctl we use the type size_t, which is an unsigned
long, to track the buffer size in the local variable 'buf_size'. An
unsigned long is 32 bits wide on a 32 bits architecture. The buffer size
defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later
try to copy the local variable 'buf_size' to the argument struct, when
the search returns -EOVERFLOW, we copy only 32 bits which will be a
problem on big endian systems.
Fix this by using a u64 type for the buffer sizes, not only at
btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain
so that we can use the u64 at btrfs_ioctl_tree_search_v2().
Fixes: cc68a8a5a4 ("btrfs: new ioctl TREE_SEARCH_V2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type of timespec64::tv_nsec is 'unsigned long', while we have only
u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
Add separate members for sec and nsec with the corresponding type width.
This creates a 4 byte hole in btrfs_inode which can be utilized in the
future.
Signed-off-by: David Sterba <dsterba@suse.com>
During log syncing, when we start updating the log root tree we compute
an index value, stored in variable 'index2', once we lock the log root
tree's mutex. This value depends on the log root's log_transid. And
shortly after we compute again the same value for 'index2' - the value
is exactly the same since we haven't released the mutex and therefore
the log_transid of the log root is the same as before.
This second 'index2' computation became pointless after commit
a93e01682e ("btrfs: remove no longer needed use of log_writers for the
log root tree"). So remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable dirty is initialized with a value that is never read, it
is being re-assigned later on. Remove the redundant initialization.
Cleans up clang scan build warning:
fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its
initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds sysfs objects to indicate temp_fsid feature support and
its status.
/sys/fs/btrfs/features/temp_fsid
/sys/fs/btrfs/<UUID>/temp_fsid
For example:
Consider two cloned and mounted devices.
$ blkid /dev/sdc[1-2]
/dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
/dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
One gets actual fsid, and the other gets the temp_fsid when
mounted.
$ btrfs filesystem show -m
Label: none uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5
Total devices 1 FS bytes used 54.14MiB
devid 1 size 300.00MiB used 144.00MiB path /dev/sdc1
Label: none uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a
Total devices 1 FS bytes used 54.14MiB
devid 1 size 300.00MiB used 144.00MiB path /dev/sdc2
Their sysfs as below.
$ cat /sys/fs/btrfs/features/temp_fsid
0
$ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid
0
$ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid
1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device addition operation will transform the cloned temp-fsid mounted
device into a multi-device filesystem. Therefore, it is marked as
unsupported.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A seed device is an integral component of the sprout device, which
functions as a multi-device filesystem. Therefore, temp-fsid feature
is not supported.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the comment to explain the relationship between temp_fsid, fsid,
and metadata_uuid.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing the log, if we get an error when updating the log root, we
check first if the log root tree context is in a log context list, and if
so it deletes from the log root tree context from the list. This check
however is pointless because at this moment the context is always in a
list, he have just added it to a context list. The check became pointless
after commit a93e01682e ("btrfs: remove no longer needed use of
log_writers for the log root tree"). So remove this now pointless empty
list check.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the comment for the lock named "lock" in struct btrfs_inode because
it does not mention that the fields "delalloc_bytes", "defrag_bytes",
"csum_bytes", "outstanding_extents" and "disk_i_size" are also protected
by that lock.
Also add a comment on top of each field protected by this lock to mention
that the lock protects them.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant
now that fs_info->last_trans_committed is read using READ_ONCE(), with the
helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE()
with the helper btrfs_set_last_trans_committed().
This barrier was introduced in 2011, by commit a4abeea41a ("Btrfs: kill
trans_mutex"), but even back then it was not correct since the writer side
(in btrfs_commit_transaction()), did not issue a pairing memory barrier
after it updated fs_info->last_trans_committed.
So remove this barrier.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.
In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_trans_committed field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.
However there are other readers that are neither holding the lock nor
holding a transaction handle open:
1) When reading an inode from disk, at btrfs_read_locked_inode();
2) When reading the generation to expose it to sysfs, at
btrfs_generation_show();
3) Early in the fsync path, at skip_inode_logging();
4) When creating a hole at btrfs_cont_expand(), during write paths,
truncate and reflinking;
5) In the fs_info ioctl (btrfs_ioctl_fs_info());
6) While mounting the filesystem, in the open_ctree() path. In these
cases it's safe to directly read fs_info->generation as no one
can concurrently start a transaction and update fs_info->generation.
In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.
Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the last_log_commit of a root can be accessed concurrently
without any lock protection. Readers can be calling btrfs_inode_in_log()
early in a fsync call, which reads a root's last_log_commit, while a
writer can change the last_log_commit while a log tree if being synced,
at btrfs_sync_log(). Any races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_log_commit field of a root using READ_ONCE() and
WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Guilherme's previous work [1] aimed at the mounting of cloned devices
using a superblock flag SINGLE_DEV during mkfs.
[1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/
Building upon this work, here is in memory only approach. As it mounts
we determine if the same fsid is already mounted if then we generate a
random temp fsid which shall be used the mount, in memory only not
written to the disk. We distinguish devices by devt.
Example:
$ fallocate -l 300m ./disk1.img
$ mkfs.btrfs -f ./disk1.img
$ cp ./disk1.img ./disk2.img
$ cp ./disk1.img ./disk3.img
$ mount -o loop ./disk1.img /btrfs
$ mount -o ./disk2.img /btrfs1
$ mount -o ./disk3.img /btrfs2
$ btrfs fi show -m
Label: none uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop0
Label: none uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop1
Label: none uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop2
Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for adding support to mount multiple single-disk
btrfs filesystems with the same FSID, wrap find_fsid() into
find_fsid_by_disk().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.
So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Space for block group item updates, necessary after allocating or
deallocating an extent from a block group, is reserved in the delayed
refs block reserve. Currently we do this by incrementing the transaction
handle's delayed_ref_updates counter and then calling
btrfs_update_delayed_refs_rsv(), which will increase the size of the
delayed refs block reserve by an amount that corresponds to the same
amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_metadata_size(), since we only need to
update an existing block group item in the extent tree (or block group
tree if this feature is enabled). By using btrfs_calc_metadata_size() we
will need to reserve 4 times less space when using the free space tree
and 2 times less space when not using it, putting less pressure on space
reservation.
So use helpers to reserve and release space for block group item updates
that use btrfs_calc_metadata_size() for calculation of the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previous commit created a hole in struct btrfs_inode, we can move
outstanding_extents there. This reduces size by 8 bytes from 1120 to
1112 on a release config.
Signed-off-by: David Sterba <dsterba@suse.com>
The structure btrfs_ordered_inode_tree is used only in one place, in
btrfs_inode. The structure itself has a 4 byte hole which is wasted
space.
Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
prefix 'ordered_tree_' where the hole can be utilized and shrink inode
size.
Signed-off-by: David Sterba <dsterba@suse.com>
A user reported some unpleasant behavior with very small file systems.
The reproducer is this
$ mkfs.btrfs -f -m single -b 8g /dev/vdb
$ mount /dev/vdb /mnt/test
$ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20
This will result in usage that looks like this
Overall:
Device size: 8.00GiB
Device allocated: 8.00GiB
Device unallocated: 1.00MiB
Device missing: 0.00B
Device slack: 2.00GiB
Used: 5.47GiB
Free (estimated): 2.52GiB (min: 2.52GiB)
Free (statfs, df): 0.00B
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 5.50MiB (used: 0.00B)
Multiple profiles: no
Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
/dev/vdb 7.99GiB
Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
/dev/vdb 8.00MiB
System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
/dev/vdb 4.00MiB
Unallocated:
/dev/vdb 1.00MiB
As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.
On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here. Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data. This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.
Address this by adjusting the overcommit logic. Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation. This will allow us to stop overcommitting before we
potentially lose this space to a data allocation. With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My overcommit patch exposed a bug with btrfs/177 [1]. The problem here is
that when we grow the device we're not adding to ->free_chunk_space, so
subsequent allocations can cause ->free_chunk_space to wrap, which
causes problems in can_overcommit because we add this to ->total_bytes,
which causes the counter to wrap and gives us an unexpected ENOSPC.
Fix this by properly updating ->free_chunk_space with the new available
space in btrfs_grow_device.
[1] First version of the fix:
https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two bugs in how we adjust ->free_chunk_space in
btrfs_shrink_device. First we're removing the entire diff between
new_size and old_size from ->free_chunk_space. This only works if we're
reducing the free area, which we could potentially not be. So adjust
the math to only subtract the diff in the free space from
->free_chunk_space.
Additionally in the error case we're unconditionally adding the diff
back into ->free_chunk_space, which we need to only do if this device is
writeable.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, at find_first_extent_bit(), when we are given a cached extent
state that happens to have its end offset match the desired range start,
we find the next extent state using that cached state, with next_state()
calls, and then return it.
We then try to cache that next state by calling cache_state_if_flags(),
but that will not cache the state because we haven't reset *cached_state
to NULL, so we end up with the cached_state unchanged, and if the caller
is iterating over extent states in the io tree, its next call to
find_first_extent_bit() will not use the current cached state as its end
offset does not match the minimum start range offset, therefore the cached
state is reset and we have to search the rbtree to find the next suitable
extent state record.
So fix this by resetting the cached state to NULL (and dropping our ref
on it) when we have a suitable cached state and we found a next state by
using next_state() starting from the cached state. This makes use cases
of calling find_first_extent_bit() to go over all ranges in the io tree
to do a single rbtree full search, only on the first call, and the next
calls will just do next_state() (rb_next() wrapper) calls, which is more
efficient.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.
So use extent_io_tree_release() instead of clear_extent_bits().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently extent_io_tree_release() is a loop that keeps getting the first
node in the io tree, using rb_first() which is a loop that gets to the
leftmost node of the rbtree, and then for each node it calls rb_erase(),
which often requires rebalancing the rbtree.
We can make this more efficient by using
rbtree_postorder_for_each_entry_safe() to free each node without having
to delete it from the rbtree and without looping to get the first node.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wait_on_state() function is very short and has a single caller, which
is wait_extent_bit(), so remove the function and put its code into the
caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The memory barrier at extent_io_tree_release() is redundant. Holding
spin_lock here is not enough to drop the barrier completely. We only
change the waitqueue of an extent state record while holding the tree
lock - see wait_on_state().
The update to waitqueue state will not become stale because there will
be an spin_unlock/spin_lock sequence between the change and waiting,
this implies a full memory barrier.
So remove the explicit smp_mb() barrier.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reword reasoning ]
Signed-off-by: David Sterba <dsterba@suse.com>
The function wait_extent_bit() is not used outside extent-io-tree.c so
make it static. Furthermore the function doesn't have the 'btrfs_' prefix.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's this comment at extent_io_tree_release() that mentions io btrees,
but this function is no longer used only for io btrees. Originally it was
added as a static function named clear_btree_io_tree() at transaction.c,
in commit 663dfbb077 ("Btrfs: deal with convert_extent_bit errors to
avoid fs corruption"), as it was used only for cleaning one of the io
trees that track dirty extent buffers, the dirty_log_pages io tree of a
a root and the dirty_pages io tree of a transaction. Later it was renamed
and exported and now it's used to cleanup other io trees such as the
allocation state io tree of a device or the csums range io tree of a log
root.
So remove that comment and replace it with one at the top of the function
that is more complete, mentioning what the function does and that it's
expected to be called only when a task is sure no one else will need to
use the tree anymore, as well as there should be no locked ranges in the
tree and therefore no waiters on its extent state records. Also add an
assertion to check that there are no locked extent state records in the
tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a new extent state record into an io tree that happens to
be mergeable, we currently do the following:
1) Insert the extent state record in the io tree's rbtree. This requires
going down the tree to find where to insert it, and during the
insertion we often need to balance the rbtree;
2) We then check if the previous node is mergeable, so we call rb_prev()
to find it, which requires some looping to find the previous node;
3) If the previous node is mergeable, we adjust our node to include the
range of the previous node and then delete the previous node from the
rbtree, which again may need to balance the rbtree;
4) Then we check if the next node is mergeable with the node we inserted,
so we call rb_next(), which requires some looping too. If the next node
is indeed mergeable, we expand the range of our node to include the
next node's range and then delete the next node from the rbtree, which
again may need to balance the tree.
So these are quite of lot of iterations and looping over the rbtree, and
some of the operations may need to rebalance the rb tree. This can be made
a bit more efficient by:
1) When iterating the rbtree, once we find a node that is mergeable with
the node we want to insert, we can just adjust that node's range with
the range of the node to insert - this avoids continuing iterating
over the tree and deleting a node from the rbtree;
2) If we expand the range of a mergeable node, then we find the next or
the previous node, depending on other we merged a range to the right or
to the left of the node we are currently at during the iteration. This
merging is as before, we find the next or previous node with rb_next()
or rb_prev() and if that other node is mergeable with the current one,
we adjust the range of the current node and remove the other node from
the rbtree;
3) Whenever we need to insert the new extent state record it's because
we don't have any extent state record in the rbtree which can be
merged, so we can remove the call to merge_state() after the insertion,
saving rb_next() and rb_prev() calls, which require some looping.
So update the insertion function insert_state() to have this behaviour.
Running dbench for 120 seconds and capturing the execution times of
set_extent_bit() at pin_down_extent(), resulted in the following data
(time values are in nanoseconds):
Before this change:
Count: 2278299
Range: 0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952
Percentiles: 90th: 1187.000; 95th: 1350.000; 99th: 1724.000
0.000 - 7.534: 5 |
7.534 - 35.418: 36 |
35.418 - 154.403: 273 |
154.403 - 662.138: 1244016 #####################################################
662.138 - 2828.745: 1031335 ############################################
2828.745 - 12074.102: 1395 |
12074.102 - 51525.930: 806 |
51525.930 - 219874.955: 162 |
219874.955 - 938254.688: 22 |
938254.688 - 4003728.000: 3 |
After this change:
Count: 2275862
Range: 0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785
Percentiles: 90th: 1105.000; 95th: 1245.000; 99th: 1590.000
0.000 - 10.219: 10 |
10.219 - 40.957: 36 |
40.957 - 155.907: 262 |
155.907 - 585.789: 1127214 ####################################################
585.789 - 2193.431: 1145134 #####################################################
2193.431 - 8205.578: 1648 |
8205.578 - 30689.378: 1039 |
30689.378 - 114772.699: 362 |
114772.699 - 429221.537: 52 |
429221.537 - 1605175.000: 10 |
Maximum duration (range), average duration, percentiles and standard
deviation are all better.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.
Signed-off-by: David Sterba <dsterba@suse.com>
The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence. By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.
There's no caller that uses the cached state pointer so this reduces the
argument count further.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_realloc_node() is only used by the defrag code. Nowadays we have a
defrag.c file, so move it, and its helper close_blocks(), into defrag.c.
During the move also do a few minor cosmetic changes:
1) Change the return value of close_blocks() from int to bool;
2) Use SZ_32K instead of 32768 at close_blocks();
3) Make some variables const in btrfs_realloc_node(), 'blocksize' and
'end_slot';
4) Get rid of 'parent_nritems' variable, in both places where it was
used it could be replaced by calling btrfs_header_nritems(parent);
5) Change the type of a couple variables from int to bool;
6) Rename variable 'err' to 'ret', as that's the most common name we
use to track the return value of a function;
7) Move some variables from the top scope to the scope of the for loop
where they are used.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a
later patch we can move out defrag specific code from ctree.c into
defrag.c.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is
to allow to move defrag specific code out of ctree.c and into defrag.c in
one of the next patches.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_cow_block() we can use round_down() to align the extent buffer's
logical offset to the start offset of a metadata block group, instead of
the less easy to read set of bitwise operations (two plus one subtraction).
So replace the bitwise operations with a round_down() call.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>