linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-26 22:21:42 +00:00

Author	SHA1	Message	Date
Naohiro Aota	332581bde2	btrfs: zoned: do not zone finish data relocation block group When multiple writes happen at once, we may need to sacrifice a currently active block group to be zone finished for a new allocation. We choose a block group with the least free space left, and zone finish it. To do the finishing, we need to send IOs for the already allocated region and wait for them and on-going IOs. Otherwise, these IOs fail because the zone is already finished at the time the IO reaches a device. However, if a block group dedicated to the data relocation is zone finished, there is a chance of finishing it before an ongoing write IO reaches the device. That is because there is a timing gap between when an allocation is done (block_group->reservations == 0, as pre-allocation is done) and when an ordered extent is created as the relocation IO starts. Thus, if we finish the zone between them, we can fail the IOs. We cannot simply use "fs_info->data_reloc_bg == block_group->start" to avoid the zone finishing, because the data_reloc_bg may already have switched to a new block group while there are still ongoing write IOs to the old data_reloc_bg. So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to indicate there is a data relocation allocation and/or ongoing write to the block group. The bit is set on allocation and cleared in end_io function of the last IO for the currently allocated region. Changing the timing of the bit setting also solves the issue of the bit being left set even after there is no IO going on. With the current code, if the data_reloc_bg switches after the last IO to the current data_reloc_bg, the bit is set at this timing and there is no one clearing that bit. As a result, that block group is kept unallocatable for anything. 
Fixes: `343d8a3085` ("btrfs: zoned: prevent allocation from previous data relocation BG") Fixes: `74e91b12b1` ("btrfs: zoned: zone finish unused block group") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:47 +02:00
Josef Bacik	e7f1326cc2	btrfs: set page extent mapped after read_folio in relocate_one_page One of the CI runs triggered the following panic assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229 ------------[ cut here ]------------ kernel BUG at fs/btrfs/subpage.c:229! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3+ #1 pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : btrfs_subpage_assert+0xbc/0xf0 lr : btrfs_subpage_assert+0xbc/0xf0 sp : ffff800093213720 x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000 x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880 x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028 x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000 x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8 x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f Call trace: btrfs_subpage_assert+0xbc/0xf0 btrfs_subpage_set_dirty+0x38/0xa0 btrfs_page_set_dirty+0x58/0x88 relocate_one_page+0x204/0x5f0 relocate_file_extent_cluster+0x11c/0x180 relocate_data_extent+0xd0/0xf8 relocate_block_group+0x3d0/0x4e8 btrfs_relocate_block_group+0x2d8/0x490 btrfs_relocate_chunk+0x54/0x1a8 btrfs_balance+0x7f4/0x1150 btrfs_ioctl+0x10f0/0x20b8 __arm64_sys_ioctl+0x120/0x11d8 invoke_syscall.constprop.0+0x80/0xd8 do_el0_svc+0x6c/0x158 el0_svc+0x50/0x1b0 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x194/0x198 Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000) This is the same problem outlined in `17b17fcd6d` ("btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand") , and the fix is the same. 
I originally looked for the same pattern elsewhere in our code, but mistakenly skipped over this code because I saw the page cache readahead before we set_page_extent_mapped, not realizing that this was only in the !page case, that we can still end up with a !uptodate page and then do the btrfs_read_folio further down. The fix here is the same as the above mentioned patch, move the set_page_extent_mapped call to after the btrfs_read_folio() block to make sure that we have the subpage blocksize stuff setup properly before using the page. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:47 +02:00
Josef Bacik	cd361199ff	btrfs: wait on uncached block groups on every allocation loop My initial fix for the generic/475 hangs was related to metadata, but our CI testing uncovered another case where we hang for similar reasons. We again have a task with a plug that is holding an outstanding request that is keeping the dm device from finishing its suspend, and that task is stuck in the allocator. This time it is stuck trying to allocate data, but we do not have a block group that matches the size class. The larger loop in the allocator looks like this (simplified of course) find_free_extent for_each_block_group { ffe_ctl->cached == btrfs_block_group_cache_done(bg) if (!ffe_ctl->cached) ffe_ctl->have_caching_bg = true; do_allocation() btrfs_wait_block_group_cache_progress(); } if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg) go search again; In my earlier fix we were trying to allocate from the block group, but we weren't waiting for the progress because we were only waiting for the free space to be >= the amount of free space we wanted. My fix made it so we waited for forward progress to be made as well, so we would be sure to wait. This time however we did not have a block group that matched our size class, so what was happening was this find_free_extent for_each_block_group { ffe_ctl->cached == btrfs_block_group_cache_done(bg) if (!ffe_ctl->cached) ffe_ctl->have_caching_bg = true; if (size_class_doesn't_match()) goto loop; do_allocation() btrfs_wait_block_group_cache_progress(); loop: release_block_group(block_group); } if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg) go search again; The size_class_doesn't_match() part was true, so we'd just skip this block group and never wait for caching, and then because we found a caching block group we'd just go back and do the loop again. We never sleep and thus never flush the plug and we have the same deadlock. 
Fix the logic for waiting on the block group caching to instead do it unconditionally when we goto loop. This takes the logic out of the allocation step, so now the loop looks more like this find_free_extent for_each_block_group { ffe_ctl->cached == btrfs_block_group_cache_done(bg) if (!ffe_ctl->cached) ffe_ctl->have_caching_bg = true; if (size_class_doesn't_match()) goto loop; do_allocation() btrfs_wait_block_group_cache_progress(); loop: if (loop > LOOP_CACHING_NOWAIT && !ffe_ctl->retry_uncached && !ffe_ctl->cached) { ffe_ctl->retry_uncached = true; btrfs_wait_block_group_cache_progress(); } release_block_group(block_group); } if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg) go search again; This simplifies the logic a lot, and makes sure that if we're hitting uncached block groups we're always waiting on them at some point. I ran this through 100 iterations of generic/475, as this particular case was harder to hit than the previous one. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:47 +02:00
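The reworked loop body described in the commit above can be modelled in user space roughly as follows. This is only a sketch that follows the simplified pseudocode from the message, not the kernel code: `bg_sketch`, `ffe_ctl_sketch` and `scan_block_groups()` are hypothetical names, the allocation step is omitted, and a counter stands in for `btrfs_wait_block_group_cache_progress()`.

```c
#include <stdbool.h>
#include <stddef.h>

enum { LOOP_CACHING_NOWAIT, LOOP_CACHING_WAIT };

/* Hypothetical, reduced view of a block group. */
struct bg_sketch {
	bool cached;             /* free space caching has finished */
	bool size_class_matches; /* group fits the requested size class */
};

/* Hypothetical subset of the allocator control state. */
struct ffe_ctl_sketch {
	bool cached;
	bool have_caching_bg;
	bool retry_uncached;
	int loop;
};

/*
 * One pass over the block groups. Returns how many times the pass
 * would have waited for caching progress.
 */
int scan_block_groups(struct ffe_ctl_sketch *ctl,
		      const struct bg_sketch *bgs, size_t nr)
{
	int waits = 0;

	for (size_t i = 0; i < nr; i++) {
		ctl->cached = bgs[i].cached;
		if (!ctl->cached)
			ctl->have_caching_bg = true;

		if (bgs[i].size_class_matches) {
			/* do_allocation() would run here; omitted. */
		}

		/* Corresponds to the "loop:" label in the pseudocode. */
		if (ctl->loop > LOOP_CACHING_NOWAIT &&
		    !ctl->retry_uncached && !ctl->cached) {
			ctl->retry_uncached = true;
			waits++; /* stands in for the real wait */
		}
	}
	return waits;
}
```

With a single uncached block group whose size class does not match, a pass in the `LOOP_CACHING_WAIT` stage now records one wait, which is the sleep that lets the plug be flushed; in the `LOOP_CACHING_NOWAIT` stage it records none.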
Ruan Jinjie	84af994b85	btrfs: use LIST_HEAD() to initialize the list_head Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:46 +02:00
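For readers unfamiliar with the idiom, the difference between the open-coded form and `LIST_HEAD()` can be sketched in user space as below. The macro and helper definitions are simplified re-creations of the kernel's list helpers, included only so the example compiles on its own; the function and variable names are hypothetical.

```c
#include <stdbool.h>

/* Simplified user-space copies of the kernel list helpers (assumption:
 * close enough to the kernel definitions for illustration). */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static inline bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Open-coded form: declare first, initialize in a second step. */
bool open_coded_is_empty(void)
{
	struct list_head splice;

	INIT_LIST_HEAD(&splice);
	return list_empty(&splice);
}

/* LIST_HEAD() form: declaration and initialization in one statement. */
bool list_head_macro_is_empty(void)
{
	LIST_HEAD(splice);

	return list_empty(&splice);
}
```

Both functions end up with an empty, self-referencing list head; the macro form simply removes the window in which the variable is declared but not yet initialized.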
Qu Wenruo	257614301a	btrfs: handle errors properly in update_inline_extent_backref() [PROBLEM] Inside function update_inline_extent_backref(), we have several BUG_ON()s along with some ASSERT()s which can be triggered by a corrupted filesystem. [ANALYSIS] Most of those BUG_ON()s and ASSERT()s are just a way of handling unexpected on-disk data. Although we have tree-checker to rule out obviously incorrect extent tree blocks, it's not enough for these ones. Thus we need proper error handling for them. [FIX] Thankfully all the callers of update_inline_extent_backref() would eventually handle the error by aborting the current transaction. So this patch would do the proper error handling by: - Make update_inline_extent_backref() return int The return value would be either 0 or -EUCLEAN. - Replace BUG_ON()s and ASSERT()s with proper error handling This includes: * Dump the bad extent tree leaf * Output an error message for the cause This would include the extent bytenr, num_bytes (if needed), the bad values and expected good values. * Return -EUCLEAN Note here we remove all the WARN_ON()s, as eventually the transaction would be aborted, thus a backtrace would be triggered anyway. - Better comments on why we expect refs == 1 and refs_to_mode == -1 for tree blocks Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:20 +02:00
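The general pattern of replacing an assertion with a reported error can be sketched as follows. This is a user-space illustration under stated assumptions, not the patch itself: `check_tree_block_backref()` and its parameters are hypothetical, the message wording is invented, and the fallback value for `EUCLEAN` assumes the generic Linux errno numbering.

```c
#include <errno.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* fallback for non-Linux builds (assumption) */
#endif

/*
 * Hypothetical check for a tree block backref: the message above says
 * such an entry is expected to have refs == 1 and a modification of -1.
 * Instead of crashing on a mismatch, report it and return -EUCLEAN so
 * the caller can abort the transaction.
 */
int check_tree_block_backref(unsigned long long bytenr,
			     unsigned long long refs, int mod)
{
	if (refs != 1 || mod != -1) {
		fprintf(stderr,
			"bad backref for tree block %llu: refs %llu (expected 1), mod %d (expected -1)\n",
			bytenr, refs, mod);
		return -EUCLEAN;
	}
	return 0;
}
```

A caller would propagate the non-zero return value upward rather than continuing with corrupted data.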
Naohiro Aota	5b135b382a	btrfs: zoned: re-enable metadata over-commit for zoned mode Now we can re-enable metadata over-commit. As we moved the activation from the reservation time to the write time, we no longer need to ensure all the reserved bytes are properly activated. Without the metadata over-commit, it suffers from lower performance because it needs to flush the delalloc items more often and allocate more block groups. Re-enabling metadata over-commit will solve the issue. Fixes: `79417d040f` ("btrfs: zoned: disable metadata overcommit for zoned") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	5a7d107e5e	btrfs: zoned: don't activate non-DATA BG on allocation Now that a non-DATA block group is activated at write time, don't activate it on allocation time. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	6a8ebc773e	btrfs: zoned: no longer count fresh BG region as zone unusable Now that we switched to write time activation, we no longer need to (and must not) count the fresh region as zone unusable. This commit is similar to revert of commit `fa2068d7e9` ("btrfs: zoned: count fresh BG region as zone unusable"). Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	13bb483d32	btrfs: zoned: activate metadata block group on write time In the current implementation, block groups are activated at reservation time to ensure that all reserved bytes can be written to an active metadata block group. However, this approach has proven to be less efficient, as it activates block groups more frequently than necessary, putting pressure on the active zone resource and leading to potential issues such as early ENOSPC or hung_task. Another drawback of the current method is that it hampers metadata over-commit, and necessitates additional flush operations and block group allocations, resulting in decreased overall performance. To address these issues, this commit introduces a write-time activation of metadata and system block group. This involves reserving at least one active block group specifically for a metadata and system block group. Since metadata write-out is always allocated sequentially, when we need to write to a non-active block group, we can wait for the ongoing IOs to complete, activate a new block group, and then proceed with writing to the new block group. Fixes: `b093151391` ("btrfs: zoned: activate metadata block group on flush_space") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	a7e1ac7bdc	btrfs: zoned: reserve zones for an active metadata/system block group Ensure a metadata and system block group can be activated on write time, by leaving a certain number of active zones when trying to activate a data block group. Zones for two metadata block groups (normal and tree-log) and one system block group are reserved, according to the profile type: two zones per block group on the DUP profile and one zone per block group otherwise. The reservation must be freed once a non-data block group is allocated. If not, we over-reserve the active zones and data block group activation will suffer. For the dynamic reservation count, we need to manage the reservation count per device. The reservation count variable is protected by fs_info->zone_active_bgs_lock. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
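The reservation arithmetic stated in the commit above (two metadata block groups plus one system block group, two zones each on DUP and one otherwise) can be worked through with a small sketch. The function names are hypothetical, and the per-device bookkeeping and the release of the reservation once a non-data block group is allocated are deliberately left out.

```c
#include <stdbool.h>

/* Zones needed to keep one block group active: two on DUP, one otherwise. */
static unsigned int zones_per_block_group(bool is_dup)
{
	return is_dup ? 2 : 1;
}

/*
 * Total active zones to hold back from data block groups, given the
 * profiles used for metadata and system chunks.
 */
unsigned int reserved_active_zones(bool metadata_is_dup, bool system_is_dup)
{
	/* Two metadata block groups: normal and tree-log. */
	unsigned int meta_zones = 2 * zones_per_block_group(metadata_is_dup);
	/* One system block group. */
	unsigned int sys_zones = 1 * zones_per_block_group(system_is_dup);

	return meta_zones + sys_zones;
}
```

For example, with DUP for both metadata and system the sketch reserves 2 * 2 + 1 * 2 = 6 zones, and with non-DUP profiles for both it reserves 2 + 1 = 3.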
Naohiro Aota	c1c3c2bc29	btrfs: zoned: update meta write pointer on zone finish On finishing a zone, the meta_write_pointer should be set to the end of the zone to reflect the actual write pointer position. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	0356ad41e0	btrfs: zoned: defer advancing meta write pointer We currently advance the meta_write_pointer in btrfs_check_meta_write_pointer(). That makes it necessary to revert it when locking the buffer failed. Instead, we can advance it just before sending the buffer. Also, this is necessary for the following commit. In the commit, it needs to release the zoned_meta_io_lock to allow IOs to come in and wait for them to fill the currently active block group. If we advance the meta_write_pointer before locking the extent buffer, the following extent buffer can pass the meta_write_pointer check, resulting in an unaligned write failure. Advancing the pointer is still thread-safe as the extent buffer is locked. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	2ad8c0510a	btrfs: zoned: return int from btrfs_check_meta_write_pointer Now that we have writeback_control passed to btrfs_check_meta_write_pointer(), we can move the wbc condition in submit_eb_page() to btrfs_check_meta_write_pointer() and return int. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	7db94301a9	btrfs: zoned: introduce block group context to btrfs_eb_write_context For metadata write out on the zoned mode, we call btrfs_check_meta_write_pointer() to check if an extent buffer to be written is aligned to the write pointer. We look up a block group containing the extent buffer for every extent buffer, which takes unnecessary effort as the writing extent buffers are mostly contiguous. Introduce "zoned_bg" to cache the block group working on. Also, while at it, rename "cache" to "block_group". Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Naohiro Aota	861093eff4	btrfs: introduce struct to consolidate extent buffer write context Introduce btrfs_eb_write_context to consolidate writeback_control and the extent buffer context. This will help adding a block group context as well. While at it, move the eb context setting before btrfs_check_meta_write_pointer(). We can set it here because we anyway need to skip pages in the same eb if that eb is rejected by btrfs_check_meta_write_pointer(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Filipe Manana	9c93c238c1	btrfs: avoid start and commit empty transaction when flushing qgroups When flushing qgroups, we try to join a running transaction, with btrfs_join_transaction(), and then commit the transaction. However using btrfs_join_transaction() will result in creating a new transaction in case there isn't any running or if there's an existing one already committing. This is pointless as we only need to attach to an existing one that is not committing and in case there's an existing one committing, wait for its commit to complete. Creating and committing an empty transaction is wasteful, pointless IO and unnecessary rotation of the backup roots. So use btrfs_attach_transaction_barrier() instead, to avoid creating and committing empty transactions. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	6705b48a50	btrfs: avoid start and commit empty transaction when starting qgroup rescan When starting a qgroup rescan, we try to join a running transaction, with btrfs_join_transaction(), and then commit the transaction. However using btrfs_join_transaction() will result in creating a new transaction in case there isn't any running or if there's an existing one already committing. This is pointless as we only need to attach to an existing one that is not committing and in case there's an existing one committing, wait for its commit to complete. Creating and committing an empty transaction is wasteful, pointless IO and unnecessary rotation of the backup roots. So use btrfs_attach_transaction_barrier() instead, to avoid creating and committing empty transactions. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	2ee70ed19c	btrfs: avoid starting and committing empty transaction when flushing space When flushing space and we are in the COMMIT_TRANS state, we join a transaction with btrfs_join_transaction() and then commit the returned transaction. However btrfs_join_transaction() starts a new transaction if there is none currently open, which is pointless since committing a new, empty transaction doesn't achieve anything; it only wastes time and IO and creates an unnecessary rotation of the backup roots. So use btrfs_attach_transaction_barrier() to avoid starting a new transaction. This also waits for any ongoing transaction that is committing (state >= TRANS_STATE_COMMIT_DOING) to fully complete, and therefore wait for all the extents that were pinned during the transaction's lifetime to be unpinned. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	2391245ac2	btrfs: avoid starting new transaction when flushing delayed items and refs When flushing space we join a transaction to flush delayed items and delayed references, in order to try to release space. However btrfs_join_transaction() not only joins an existing transaction but also starts a new transaction if there is none open. If there is no transaction open, we have neither delayed items nor delayed references, so creating a new transaction is a waste of time and IO, and creates an unnecessary rotation of the backup roots without gaining any benefits (including releasing space). So use btrfs_join_transaction_nostart() when attempting to flush delayed items and references. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	ed8947bc73	btrfs: merge find_free_dev_extent() and find_free_dev_extent_start() There is no point in having find_free_dev_extent() because it's just a simple wrapper around find_free_dev_extent_start() which always passes a value of 0 for the search_start argument. Since there are no other callers of find_free_dev_extent_start(), remove find_free_dev_extent() and rename find_free_dev_extent_start() to find_free_dev_extent(), removing its search_start argument because it's always 0. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	883647f4b5	btrfs: make find_free_dev_extent() static The function find_free_dev_extent() is only used within volumes.c, so make it static and remove its prototype from volumes.h. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	504b1596bd	btrfs: make btrfs_cleanup_fs_roots() static btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static, remove its prototype from disk-io.h and move its definition above where it's used in disk-io.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	7e3bfd146e	btrfs: fail priority metadata ticket with real fs error At priority_reclaim_metadata_space(), if we were not able to satisfy the ticket after going through the various flushing states and we notice the fs went into an error state, likely due to a transaction abort during the flushing, set the ticket's error to the error that caused the transaction abort instead of an unconditional -EROFS. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	a7f8de500e	btrfs: return real error when orphan cleanup fails due to a transaction abort During mount we will call btrfs_orphan_cleanup() to remove any inodes that were previously deleted (have a link count of 0) but for which we were not able before to remove their items from the subvolume tree. The removal of the items will happen by triggering eviction, when we do the final iput() on them at btrfs_orphan_cleanup(), which will end in the loop at btrfs_evict_inode() that truncates inode items. In a dire situation we may have a transaction abort due to -ENOSPC when attempting to truncate the inode items, and in that case the orphan item (key type BTRFS_ORPHAN_ITEM_KEY) will remain in the subvolume tree and when we hit the next iteration of the while loop at btrfs_orphan_cleanup() we will find the same orphan item as before, and then we will return -EINVAL from btrfs_orphan_cleanup() through the following if statement: if (found_key.offset == last_objectid) { btrfs_err(fs_info, "Error removing orphan entry, stopping orphan cleanup"); ret = -EINVAL; goto out; } This makes the mount operation fail with -EINVAL, when it should have been -ENOSPC. This is confusing because -EINVAL might lead a user into thinking it provided invalid mount options for example. An example where this happens: $ mount test.img /mnt mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error. 
$ dmesg [ 2542.356934] BTRFS: device fsid 977fff75-1181-4d2b-a739-384fa710d16e devid 1 transid 47409973 /dev/loop0 scanned by mount (4459) [ 2542.357451] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm [ 2542.357461] BTRFS info (device loop0): disk space caching is enabled [ 2542.742287] BTRFS info (device loop0): auto enabling async discard [ 2542.764554] BTRFS info (device loop0): checking UUID tree [ 2551.743065] ------------[ cut here ]------------ [ 2551.743068] BTRFS: Transaction aborted (error -28) [ 2551.743149] WARNING: CPU: 7 PID: 215 at fs/btrfs/block-group.c:3494 btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743311] Modules linked in: btrfs blake2b_generic (...) [ 2551.743353] CPU: 7 PID: 215 Comm: kworker/u24:5 Not tainted 6.4.0-rc6-btrfs-next-134+ #1 [ 2551.743356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 2551.743357] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [ 2551.743405] RIP: 0010:btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743449] Code: 8b 43 0c (...) 
[ 2551.743451] RSP: 0018:ffff982c005a7c40 EFLAGS: 00010286 [ 2551.743452] RAX: 0000000000000000 RBX: ffff88fc6e44b400 RCX: 0000000000000000 [ 2551.743453] RDX: 0000000000000002 RSI: ffffffff8dff0878 RDI: 00000000ffffffff [ 2551.743454] RBP: ffff88fc51817208 R08: 0000000000000000 R09: ffff982c005a7ae0 [ 2551.743455] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88fc43d2e570 [ 2551.743456] R13: ffff88fc43d2e400 R14: ffff88fc8fb08ee0 R15: ffff88fc6e44b530 [ 2551.743457] FS: 0000000000000000(0000) GS:ffff89035fbc0000(0000) knlGS:0000000000000000 [ 2551.743458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2551.743459] CR2: 00007fa8cdf2f6f4 CR3: 0000000124850003 CR4: 0000000000370ee0 [ 2551.743462] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2551.743463] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2551.743464] Call Trace: [ 2551.743472] <TASK> [ 2551.743474] ? __warn+0x80/0x130 [ 2551.743478] ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743520] ? report_bug+0x1f4/0x200 [ 2551.743523] ? handle_bug+0x42/0x70 [ 2551.743526] ? exc_invalid_op+0x14/0x70 [ 2551.743528] ? asm_exc_invalid_op+0x16/0x20 [ 2551.743532] ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743574] ? _raw_spin_unlock+0x15/0x30 [ 2551.743576] ? btrfs_run_delayed_refs+0x1bd/0x200 [btrfs] [ 2551.743609] commit_cowonly_roots+0x1e9/0x260 [btrfs] [ 2551.743652] btrfs_commit_transaction+0x42e/0xfa0 [btrfs] [ 2551.743693] ? __pfx_autoremove_wake_function+0x10/0x10 [ 2551.743697] flush_space+0xf1/0x5d0 [btrfs] [ 2551.743743] ? _raw_spin_unlock+0x15/0x30 [ 2551.743745] ? finish_task_switch+0x91/0x2a0 [ 2551.743748] ? _raw_spin_unlock+0x15/0x30 [ 2551.743750] ? btrfs_get_alloc_profile+0xc9/0x1f0 [btrfs] [ 2551.743793] btrfs_async_reclaim_metadata_space+0xe1/0x230 [btrfs] [ 2551.743837] process_one_work+0x1d9/0x3e0 [ 2551.743844] worker_thread+0x4a/0x3b0 [ 2551.743847] ? 
__pfx_worker_thread+0x10/0x10 [ 2551.743849] kthread+0xee/0x120 [ 2551.743852] ? __pfx_kthread+0x10/0x10 [ 2551.743854] ret_from_fork+0x29/0x50 [ 2551.743860] </TASK> [ 2551.743861] ---[ end trace 0000000000000000 ]--- [ 2551.743863] BTRFS info (device loop0: state A): dumping space info: [ 2551.743866] BTRFS info (device loop0: state A): space_info DATA has 126976 free, is full [ 2551.743868] BTRFS info (device loop0: state A): space_info total=13458472960, used=13458137088, pinned=143360, reserved=0, may_use=0, readonly=65536 zone_unusable=0 [ 2551.743870] BTRFS info (device loop0: state A): space_info METADATA has -51625984 free, is full [ 2551.743872] BTRFS info (device loop0: state A): space_info total=771751936, used=770146304, pinned=1605632, reserved=0, may_use=51625984, readonly=0 zone_unusable=0 [ 2551.743874] BTRFS info (device loop0: state A): space_info SYSTEM has 14663680 free, is not full [ 2551.743875] BTRFS info (device loop0: state A): space_info total=14680064, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 [ 2551.743877] BTRFS info (device loop0: state A): global_block_rsv: size 53231616 reserved 51544064 [ 2551.743878] BTRFS info (device loop0: state A): trans_block_rsv: size 0 reserved 0 [ 2551.743879] BTRFS info (device loop0: state A): chunk_block_rsv: size 0 reserved 0 [ 2551.743880] BTRFS info (device loop0: state A): delayed_block_rsv: size 0 reserved 0 [ 2551.743881] BTRFS info (device loop0: state A): delayed_refs_rsv: size 786432 reserved 0 [ 2551.743886] BTRFS: error (device loop0: state A) in btrfs_write_dirty_block_groups:3494: errno=-28 No space left [ 2551.743911] BTRFS info (device loop0: state EA): forced readonly [ 2551.743951] BTRFS warning (device loop0: state EA): could not allocate space for delete; will truncate on mount [ 2551.743962] BTRFS error (device loop0: state EA): Error removing orphan entry, stopping orphan cleanup [ 2551.743973] BTRFS warning (device loop0: state EA): Skipping commit of 
aborted transaction. [ 2551.743989] BTRFS error (device loop0: state EA): could not do orphan cleanup -22 So make the btrfs_orphan_cleanup() return the value of BTRFS_FS_ERROR(), if it's set, and -EINVAL otherwise. For that same example, after this change, the mount operation fails with -ENOSPC: $ mount test.img /mnt mount: /mnt: mount(2) system call failed: No space left on device. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
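The choice of return value described in the commit above can be reduced to a few lines. This is a sketch only: `fs_info_sketch` is a hypothetical stand-in for the real filesystem structure, its `fs_error` field is assumed to hold 0 or a negative errno, and `orphan_cleanup_stuck_error()` is an invented helper name rather than a kernel function.

```c
#include <errno.h>

/* Hypothetical, reduced stand-in for the filesystem state. */
struct fs_info_sketch {
	int fs_error; /* 0 when healthy, negative errno after an abort */
};

/*
 * Error to return when orphan cleanup finds the same orphan item twice:
 * prefer the error that put the filesystem into an error state, and
 * fall back to -EINVAL when there is none.
 */
int orphan_cleanup_stuck_error(const struct fs_info_sketch *fs_info)
{
	if (fs_info->fs_error)
		return fs_info->fs_error;
	return -EINVAL;
}
```

In the scenario from the log above, the stored error would be -ENOSPC (errno 28), so the mount failure reports "No space left on device" instead of the misleading -EINVAL.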
Filipe Manana	ae3364e521	btrfs: store the error that turned the fs into error state Currently when we turn the fs into an error state, typically after a transaction abort, we don't store the error anywhere, we just set a bit (BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the error state. There are cases where it would be useful to have access to the specific error in order to provide a more meaningful error to users/applications. This change adds a member to struct btrfs_fs_info to store the error and removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new member (fs_error) has a value of 0, otherwise its value is a negative errno value. Followup changes will make use of this new member. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	1b6948acb8	btrfs: don't steal space from global rsv after a transaction abort When doing a priority metadata space reclaim, while we are going through the flush states and running their respective operations, it's possible that a transaction abort happened, for example when running delayed refs we hit -ENOSPC or in the critical section of transaction commit we failed with -ENOSPC or some other error. In these cases a transaction was aborted and the fs turned into error state. If that happened, then it makes no sense to steal from the global block reserve and return success to the caller if the stealing was successful - the caller will later get an error when attempting to modify the fs. Instead make the ticket fail if we have the fs in error state and don't attempt to steal from the global rsv: stealing is pointless at that point, and failing the ticket also simplifies debugging some -ENOSPC problems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Filipe Manana	1ff9fee3bd	btrfs: print available space across all block groups when dumping space info When dumping a space info also sum the available space for all block groups and then print it. This is often useful for debugging -ENOSPC related problems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Filipe Manana	e50b122b83	btrfs: print available space for a block group when dumping a space info When dumping a space info, we iterate over all its block groups and then print their size and the amounts of bytes used, reserved, pinned, etc. When debugging -ENOSPC problems it's also useful to know how much space is available (free), so calculate that and print it as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Filipe Manana	b92e8f5472	btrfs: print block group super and delalloc bytes when dumping space info When dumping a space info's block groups, also print the number of bytes used for super blocks and delalloc. This is often useful for debugging -ENOSPC problems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Filipe Manana	4d2024e90d	btrfs: print target number of bytes when dumping free space When dumping free space, with btrfs_dump_free_space(), we pass a bytes argument in order to count how many free space entries in the block group have a size greater than or equal to that number of bytes. We then print how many suitable entries we found, but we don't print the target number of bytes, we just say "bytes". Change the message to actually print the number of bytes, which makes debugging -ENOSPC issues a bit easier. Also slightly change the odd grammar and terminology: the sentence is ending with 'is', which doesn't make sense, and the term 'blocks' is confusing as we are referring to free space entries within the block group's free space cache. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Filipe Manana	19288951ff	btrfs: update comment for btrfs_join_transaction_nostart() Update the comment for btrfs_join_transaction_nostart() to be more clear about how it works and how it's different from btrfs_attach_transaction(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Filipe Manana	4490e803e1	btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a running transaction we end up creating one. This goes against the purpose of TRANS_JOIN_NOSTART which is to join a running transaction if its state is at or below the state TRANS_STATE_COMMIT_START, otherwise return an -ENOENT error and don't start a new transaction. So fix this to not create a new transaction if there's no running transaction at or below that state. CC: stable@vger.kernel.org # 4.14+ Fixes: `a6d155d2e3` ("Btrfs: fix deadlock between fiemap and transaction commits") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Qu Wenruo	096d230165	btrfs: refactor main loop in memmove_extent_buffer() [BACKGROUND] Currently memmove_extent_buffer() does a loop where it stops at any page boundary inside [dst_offset, dst_offset + len) or [src_offset, src_offset + len). This is mostly allowing us to do copy_pages(), but if we're going to use folios we will need to handle multi-page (the old behavior) or single folio (the new optimization). The current code would be a burden for future changes. [ENHANCEMENT] Instead of sticking with copy_pages(), here we utilize the new __write_extent_buffer() helper to handle the writes. Unlike the refactoring in memcpy_extent_buffer(), we can not just rely on the write_extent_buffer() and only handle page boundaries inside src range. The function write_extent_buffer() itself is still doing forward writing, thus it cannot handle the following case: (already in the extent buffer memory operation tests, cross page overlapping run 2) Src Page boundary |///////| |///|////| Dst In the above case, if we just follow page boundary in the src range, we have no need to do any split, just one __write_extent_buffer() with use_memmove = true. But __write_extent_buffer() would split the dst range into two, so it first copies the beginning part of the src range into the first half of the dst range. After this operation, the beginning of the dst range is already updated, causing corruption. So we have to follow the old behavior of handling both page boundaries. And since we're the last caller of copy_pages(), we can remove it completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Qu Wenruo	13840f3f28	btrfs: refactor main loop in memcpy_extent_buffer() [BACKGROUND] Currently memcpy_extent_buffer() does a loop where it would stop at any page boundary inside [dst_offset, dst_offset + len) or [src_offset, src_offset + len). This is mostly allowing us to do copy_pages(), but if we're going to use folios we will need to handle multi-page (the old behavior) or single folio (the new optimization). The current code would be a burden for future changes. [ENHANCEMENT] There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike regular memcpy(), this function can handle overlapping ranges. So here we extract write_extent_buffer() into a new internal helper, __write_extent_buffer(), and add a new parameter @use_memmove, to indicate whether we should use memmove() or regular memcpy(). Now we can use __write_extent_buffer() to handle writing into the dst range, with proper overlapping detection. This has a tiny change to the chance of calling memmove(). As the split only happens at the source range page boundaries, the memcpy/memmove() range would be slightly larger than the old code, thus slightly increasing the chance we call memmove() rather than memcpy(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Qu Wenruo	682a0bc557	btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer() btrfs_clone_extent_buffer() calls copy_page() at each iteration but we can copy all pages at the end in one go if there were no errors. This would make later conversion to folios easier. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Qu Wenruo	54948681c2	btrfs: refactor main loop in copy_extent_buffer_full() [BACKGROUND] copy_extent_buffer_full() currently does different handling for regular and subpage cases, for regular cases it does a page by page copying. For subpage cases, it just copies the content. This is fine for the page based extent buffer code, but for the incoming folio conversion, it can be a burden to add a new branch just to handle all the different combinations (subpage vs regular, one single folio vs multi pages). [ENHANCE] Instead of handling the different combinations, just use one single handling for all cases, utilizing write_extent_buffer() to do the copying. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Qu Wenruo	730c374e5b	btrfs: use write_extent_buffer() to implement write_extent_buffer_*id() Helpers write_extent_buffer_chunk_tree_uuid() and write_extent_buffer_fsid() can be implemented by write_extent_buffer(). These two helpers are not that frequently used, they only get called during initialization of a new tree block. There is not much need for those slightly optimized versions. And since they can be easily converted to one write_extent_buffer() call, define them as inline helpers. This would make later page/folio switch much easier, as all changes only need to happen in write_extent_buffer(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:17 +02:00
Qu Wenruo	cb22964f1d	btrfs: refactor extent buffer bitmaps operations [BACKGROUND] Currently we handle extent bitmaps manually in extent_buffer_bitmap_set() and extent_buffer_bitmap_clear(). Although with various helpers like eb_bitmap_offset() it's still a little messy to read. The code seems to be a copy of bitmap_set(), but with all the cross-page handling embedded into the code. [ENHANCEMENT] This patch would enhance the readability by introducing two helpers: - memset_extent_buffer() To handle the byte aligned range, thus all the cross-page handling is done there. - extent_buffer_get_byte() This is for the first and the last byte operations, which only need to grab one byte, thus no need for any cross-page handling. So we can split both extent_buffer_bitmap_set() and extent_buffer_bitmap_clear() into 3 parts: - Handle the first byte If the range fits inside the first byte, we can exit early. - Handle the byte aligned part This is the part which can have cross-page operations, and it would be handled by memset_extent_buffer(). - Handle the last byte This refactoring does not only make the code a little easier to read, but also makes later folio/page switch much easier, as the switch only needs to be done inside memset_extent_buffer() and extent_buffer_get_byte(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Qu Wenruo	5864f1da6b	btrfs: tests: add self tests for extent buffer memory operations The new self tests would populate a memory range with random bytes, then copy it to the extent buffer, so that we can verify if the extent buffer memory operation and memmove()/memcpy() result in the same contents. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Qu Wenruo	257deed2a9	btrfs: tests: enhance extent buffer bitmap tests Enhance extent bitmap tests for the following aspects: - Remove unnecessary @len from __test_eb_bitmaps() We can fetch the length from extent buffer - Explicitly distinguish bit and byte length Now every start/len inside bitmap tests would have either "byte_" or "bit_" prefix to make it more explicit. - Better error reporting If we have mismatched bits, the error report would dump the following contents: * start bytenr * bit number * the full byte from bitmap * the full byte from the extent This is to save developers time so obvious problems can be found immediately - Extract bitmap set/clear and check operation into two helpers This is to save some code lines, as we will have more tests to do. - Add new tests The following tests are added, mostly for the incoming extent bitmap accessor refactoring: * Set bits inside the same byte * Clear bits inside the same byte * Cross byte boundary set * Cross byte boundary clear * Cross multi-byte boundary set * Cross multi-byte boundary clear Those new tests have already saved my backend for the incoming extent buffer bitmap refactoring. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Josef Bacik	b9d97cff25	btrfs: move comments to btrfs_loop_type definition Some of these loop types aren't described, and they should be with the definitions to make it easier to tell what each of them do. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Anand Jain	7f9879eb60	btrfs: print name and pid when device scanning processes race There is a race between systemd and mount, as both of them try to register the device in the kernel. When systemd loses the race, it prints the following message: BTRFS error: device /dev/sdb7 belongs to fsid 1b3bacbf-14db-49c9-a3ef-547998aacc4e, and the fs is already mounted. The 'btrfs dev scan' registers one device at a time, so there is no way for the mount thread to wait in the kernel for all the devices to have registered as it won't know if all the devices are discovered. For now, improve the error log by printing the command name and process ID along with the error message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Christoph Hellwig	256b0cf90d	btrfs: fix zoned handling in submit_uncompressed_range For zoned file systems we need to use run_delalloc_zoned to submit writeback, as we need to write out partial allocations when running into zone active limits. submit_uncompressed_range currently always calls cow_file_range to allocate blocks and thus misses the active zone limits handling. Fix this by passing the pages_dirty argument to run_delalloc_zoned and always using it from submit_uncompressed_range as it does the right thing for zoned and non-zoned file systems. To account for the fact that run_delalloc_zoned is now also used for non-zoned file systems rename it to run_delalloc_cow, and add comment describing it. Fixes: `42c0110009` ("btrfs: zoned: introduce dedicated data write path for zoned filesystems") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Christoph Hellwig	778b878543	btrfs: don't redirty locked_page in run_delalloc_zoned extent_write_locked_range currently expects that either all or no pages are dirty when it is called. But run_delalloc_zoned is called directly in the writepages path, and has the dirty bit cleared only for the locked_page on which extent_write_cache_pages currently operates. It currently works around this by redirtying locked_page, but that is a bit inefficient and cumbersome. Pass a locked_page argument to run_delalloc_zoned so that clearing the dirty bit can be skipped on just that page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Christoph Hellwig	6e144bf16b	btrfs: refactor the zoned device handling in cow_file_range Handling of the done_offset to cow_file_range is a bit confusing, as it is not updated at all when the function succeeds, and the -EAGAIN status is used both for the case where we need to wait for a zone finish and the one where the allocation was partially successful. Change the calling convention so that done_offset is always updated, and 0 is returned if some allocation was successful (partial allocation can still only happen for zoned devices), and waiting for a zone finish is done internally in cow_file_range instead of the caller. Also write a comment explaining the logic. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:16 +02:00
Christoph Hellwig	44962ca37c	btrfs: don't redirty pages in compress_file_range compress_file_range needs to clear the dirty bit before handing off work to the compression worker threads to prevent processes coming in through mmap and changing the file contents while the compression is accessing the data (See commit `4adaa61102` ("Btrfs: fix race between mmap writes and compression")). But when compress_file_range decides to not compress the data, it falls back to submit_uncompressed_range which uses extent_write_locked_range to write the uncompressed data. extent_write_locked_range currently expects all pages to be marked dirty so that it can clear the dirty bit itself, and thus compress_file_range has to redirty the page range. Redirtying the page range is rather inefficient and also pointless, so instead pass a pages_dirty parameter to extent_write_locked_range and skip the redirty game entirely. Note that compress_file_range was even redirtying the locked_page twice given that extent_range_clear_dirty_for_io already redirties all pages in the range, which must include locked_page if there is one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:15 +02:00
Christoph Hellwig	f778b6b8e0	btrfs: share the code to free the page array in compress_file_range compress_file_range has two code blocks to free the page array for the compressed data. Share the code using a goto label. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:15 +02:00
Christoph Hellwig	184aa1ffa5	btrfs: use a separate label for the incompressible case in compress_file_range compress_file_range can fail to compress either because of resource or alignment constraints or because the data is incompressible. In the latter case the inode is marked so that compression isn't tried again. Currently that check is based on the condition that the pages array has been allocated which is rather cryptic. Use a separate label to clearly distinguish this case. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:15 +02:00
Christoph Hellwig	6a7167bf9c	btrfs: further simplify the compress or not logic in compress_file_range Currently the logic whether to compress or not in compress_file_range is a bit convoluted because it tries to share code for creating inline extents for the compressible [1] path and the bail to uncompressed path. But the latter isn't needed at all, because cow_file_range as called by submit_uncompressed_range will already create inline extents as needed, so there is no need to have special handling for it if we can live with the fact that it will be called a bit later in the ->ordered_func of the workqueue instead of right now. [1] there is undocumented logic that creates an uncompressed inline extent outside of the shall not compress logic if total_in is too small. This logic isn't explained in comments or any commit log I could find, so I've preserved it. Documentation explaining it would be appreciated if anyone understands this code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:15 +02:00
Christoph Hellwig	e94e54e89b	btrfs: streamline compress_file_range Reorder compress_file_range so that the main compression flow happens straight line and not in branches. To do this ensure that pages is always zeroed before a page allocation happens, which allows the cleanup_and_bail_uncompressed label to clean up the page allocations as needed. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:15 +02:00

82963 Commits