linux

Author	SHA1	Message	Date
Josef Bacik	1b7ec85ef4	btrfs: pass root owner to read_tree_block In order to properly set the lockdep class of a newly allocated block we need to know the owner of the block. For non-refcounted trees this is straightforward, we always know in advance what tree we're reading from. For refcounted trees we don't necessarily know, however all refcounted trees share the same lockdep class name, tree-<level>. Fix all the callers of read_tree_block() to pass in the root objectid we're using. In places like relocation and backref we could probably unconditionally use 0, but just in case use the root when we have it, otherwise use 0 in the cases we don't have the root as it's going to be a refcounted tree anyway. This is a preparation patch for further changes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:07 +01:00
Josef Bacik	bfb484d922	btrfs: cleanup extent buffer readahead We're going to pass around more information when we allocate extent buffers, in order to make that cleaner how we do readahead. Most of the callers have the parent node that we're getting our blockptr from, with the sole exception of relocation which simply has the bytenr it wants to read. Add a helper that takes the current arguments that we need (bytenr and gen), and add another helper for simply reading the slot out of a node. In followup patches the helper that takes all the extra arguments will be expanded, and the simpler helper won't need to have it's arguments adjusted. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:05 +01:00
Josef Bacik	e747853cae	btrfs: load free space cache asynchronously While documenting the usage of the commit_root_sem, I noticed that we do not actually take the commit_root_sem in the case of the free space cache. This is problematic because we're supposed to hold that sem while we're reading the commit roots, which is what we do for the free space cache. The reason I did it inline when I originally wrote the code was because there's the case of unpinning where we need to make sure that the free space cache is loaded if we're going to use the free space cache. But we can accomplish the same thing by simply waiting for the cache to be loaded. Rework this code to load the free space cache asynchronously. This allows us to greatly cleanup the caching code because now it's all shared by the various caching methods. We also are now in a position to have the commit_root semaphore held while we're loading the free space cache. And finally our modification of ->last_byte_to_unpin is removed because it can be handled in the proper way on commit. Some care must be taken when replaying the log, when we expect that the free space cache will be read entirely before we start excluding space to replay. This could lead to overwriting space during replay. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:03 +01:00
Josef Bacik	2ca08c56e8	btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range Currently unpin_extent_range happens in the transaction commit context, so we are protected from ->last_byte_to_unpin changing while we're unpinning, because any new transactions would have to wait for us to complete before modifying ->last_byte_to_unpin. However in the future we may want to change how this works, for instance with async unpinning or other such TODO items. To prepare for that future explicitly protect ->last_byte_to_unpin with the commit_root_sem so we are sure it won't change while we're doing our work. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:02 +01:00
Josef Bacik	27d56e62e4	btrfs: update last_byte_to_unpin in switch_commit_roots While writing an explanation for the need of the commit_root_sem for btrfs_prepare_extent_commit, I realized we have a slight hole that could result in leaked space if we have to do the old style caching. Consider the following scenario commit root +----+----+----+----+----+----+----+ \|\\\\\| \|\\\\\|\\\\\| \|\\\\\|\\\\\| +----+----+----+----+----+----+----+ 0 1 2 3 4 5 6 7 new commit root +----+----+----+----+----+----+----+ \| \| \| \|\\\\\| \| \|\\\\\| +----+----+----+----+----+----+----+ 0 1 2 3 4 5 6 7 Prior to this patch, we run btrfs_prepare_extent_commit, which updates the last_byte_to_unpin, and then we subsequently run switch_commit_roots. In this example lets assume that caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which means that cache->last_byte_to_unpin == 1. Then we go and do the switch_commit_roots(), but in the meantime the caching thread has made some more progress, because we drop the commit_root_sem and re-acquired it. Now caching_ctl->progress == 3. We swap out the commit root and carry on to unpin. The race can happen like: 1) The caching thread was running using the old commit root when it found the extent for [2, 3); 2) Then it released the commit_root_sem because it was in the last item of a leaf and the semaphore was contended, and set ->progress to 3 (value of 'last'), as the last extent item in the current leaf was for the extent for range [2, 3); 3) Next time it gets the commit_root_sem, will start using the new commit root and search for a key with offset 3, so it never finds the hole for [2, 3). So the caching thread never saw [2, 3) as free space in any of the commit roots, and by the time finish_extent_commit() was called for the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the subrange [0, 1) to the free space cache, skipping [2, 3). In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1), but do not unpin [2,3). However because caching_ctl->progress == 3 we do not see the newly freed section of [2,3), and thus do not add it to our free space cache. This results in us missing a chunk of free space in memory (on disk too, unless we have a power failure before writing the free space cache to disk). Fix this by making sure the ->last_byte_to_unpin is set at the same time that we swap the commit roots, this ensures that we will always be consistent. CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ update changelog with Filipe's review comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:02 +01:00
Josef Bacik	9076dbd5ee	btrfs: do not shorten unpin len for caching block groups While fixing up our ->last_byte_to_unpin locking I noticed that we will shorten len based on ->last_byte_to_unpin if we're caching when we're adding back the free space. This is correct for the free space, as we cannot unpin more than ->last_byte_to_unpin, however we use len to adjust the ->bytes_pinned counters and such, which need to track the actual pinned usage. This could result in WARN_ON(space_info->bytes_pinned) triggering at unmount time. Fix this by using a local variable for the amount to add to free space cache, and leave len untouched in this case. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:02 +01:00
Josef Bacik	b9729ce014	btrfs: locking: rip out path->leave_spinning We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:02 +01:00
Josef Bacik	ac5887c8e0	btrfs: locking: remove all the blocking helpers Now that we're using a rw_semaphore we no longer need to indicate if a lock is blocking or not, nor do we need to flip the entire path from blocking to spinning. Remove these helpers and all the places they are called. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:01 +01:00
David Sterba	fe5ecbe818	btrfs: precalculate checksums per leaf once btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a candidate for optimizations. After the 64bit division has been replaced by shift, there's still a calculation done each time the function is called: checksums per leaf. As this is a constant value for the entire filesystem lifetime, we can calculate it once at mount time and reuse. This also allows to reduce the division to 64bit/32bit as we know the constant will always fit the 32bit type. Replace the open-coded rounding up with a macro that internally handles the 64bit division and as it's now a short function, make it static inline (slight code increase, slight stack usage reduction). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:53:58 +01:00
David Sterba	ab108d992b	btrfs: use precalculated sectorsize_bits from fs_info We do a lot of calculations where we divide or multiply by sectorsize. We also know and make sure that sectorsize is a power of two, so this means all divisions can be turned to shifts and avoid eg. expensive u64/u32 divisions. The type is u32 as it's more register friendly on x86_64 compared to u8 and the resulting assembly is smaller (movzbl vs movl). There's also superblock s_blocksize_bits but it's usually one more pointer dereference farther than fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:53:57 +01:00
David Sterba	c842268458	btrfs: add set/get accessors for root_item::drop_level The drop_level member is used directly unlike all the other int types in root_item. Add the definition and use it everywhere. The type is u8 so there's no conversion necessary and the helpers are properly inlined, this is for consistency. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:53:52 +01:00
Pujin Shi	cad69d1396	btrfs: tree-checker: fix incorrect printk format This patch addresses a compile warning: fs/btrfs/extent-tree.c: In function '__btrfs_free_extent': fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=] Fixes: `1c2a07f598` ("btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Pujin Shi <shipujin.t@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-10-26 15:02:30 +01:00
Josef Bacik	9631e4cc1a	btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks When we COW a block we are holding a lock on the original block, and then we lock the new COW block. Because our lockdep maps are based on root + level, this will make lockdep complain. We need a way to indicate a subclass for locking the COW'ed block, so plumb through our btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer, and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks. The reason I've added all this extra infrastructure is because there will be need of different nesting classes in follow up patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-10-07 12:12:16 +02:00
Qu Wenruo	07cce5cf3b	btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref() [BUG] With a crafted image, btrfs can panic at insert_inline_extent_backref(): kernel BUG at fs/btrfs/extent-tree.c:1857! invalid opcode: 0000 [#1] SMP PTI CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9 RIP: 0010:insert_inline_extent_backref+0xcc/0xe0 RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001 RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0 Call Trace: ? _cond_resched+0x1a/0x50 __btrfs_inc_extent_ref.isra.64+0x7e/0x240 ? btrfs_merge_delayed_refs+0xa5/0x330 __btrfs_run_delayed_refs+0x653/0x1120 btrfs_run_delayed_refs+0xdb/0x1b0 btrfs_commit_transaction+0x52/0x950 ? start_transaction+0x94/0x450 transaction_kthread+0x163/0x190 kthread+0x105/0x140 ? btrfs_cleanup_transaction+0x560/0x560 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 Modules linked in: ---[ end trace 2ad8b3de903cf825 ]--- [CAUSE] Due to extent tree corruption (still valid by itself, but bad cross ref), we can allocate an extent which is still in extent tree. The offending tree block of that case is from csum tree. The newly allocated tree block is also for csum tree. Then we will try to insert a tree block ref for the existing tree block ref. For a tree extent item, tree block can never be shared directly by the same tree twice. We have such BUG_ON() to prevent such problem, but this is not a proper error handling. [FIX] Replace that BUG_ON() with proper error message and leaf dump for debug build. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-10-07 12:12:14 +02:00
Qu Wenruo	1c2a07f598	btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent() __btrfs_free_extent() is doing two things: 1. Reduce the refs number of an extent backref Either it's an inline extent backref (inside EXTENT/METADATA item) or a keyed extent backref (SHARED_* item). We only need to locate that backref line, either reduce the number or remove the backref line completely. 2. Update the refs count in EXTENT/METADATA_ITEM During step 1), we will try to locate the EXTENT/METADATA_ITEM without triggering another btrfs_search_slot() as fast path. Only when we fail to locate that item, we will trigger another btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we updated/deleted the backref line. And we have a lot of strict checks on things like refs_to_drop against extent refs and special case checks for single ref extents. There are 7 BUG_ON()s, although they're doing correct checks, they can be triggered by crafted images. This patch improves the function: - Introduce two examples to show what __btrfs_free_extent() is doing One inline backref case and one keyed case. Should cover most cases. - Kill all BUG_ON()s with proper error message and optional leaf dump - Add comment to show the overall flow Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819 [ The report triggers one BUG_ON() in __btrfs_free_extent() ] Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-10-07 12:12:14 +02:00
Josef Bacik	2732798c9b	btrfs: call btrfs_try_granting_tickets when unpinning anything When unpinning we were only calling btrfs_try_granting_tickets() if global_rsv->space_info == space_info, which is problematic because we use ticketing for SYSTEM chunks, and want to use it for DATA as well. Fix this by moving this call outside of that if statement. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-10-07 12:06:51 +02:00
Qu Wenruo	437490fed3	btrfs: tracepoints: output proper root owner for trace_find_free_extent() The current trace event always output result like this: find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA) find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA) find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA) find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA) T's saying we're allocating data extent for EXTENT tree, which is not even possible. It's because we always use EXTENT tree as the owner for trace_find_free_extent() without using the @root from btrfs_reserve_extent(). This patch will change the parameter to use proper @root for trace_find_free_extent(): Now it looks much better: find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA\|DUP) find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA\|DUP) find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA\|DUP) find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA\|DUP) find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA\|DUP) Reported-by: Hans van Kranenburg <hans@knorrie.org> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-10-07 12:06:49 +02:00
Qu Wenruo	ea57788eb7	btrfs: require only sector size alignment for parent eb bytenr [BUG] A completely sane converted fs will cause kernel warning at balance time: [ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data [ 1563.358078] BTRFS info (device sda7): found 11722 extents [ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2 [ 1563.358280] item 0 key (7984947200 169 0) itemoff 16250 itemsize 33 [ 1563.358281] extent refs 1 gen 90 flags 2 [ 1563.358282] ref#0: tree block backref root 4 [ 1563.358285] item 1 key (7985602560 169 0) itemoff 16217 itemsize 33 [ 1563.358286] extent refs 1 gen 93 flags 258 [ 1563.358287] ref#0: shared block backref parent 7985602560 [ 1563.358288] (parent 7985602560 is NOT ALIGNED to nodesize 16384) [ 1563.358290] item 2 key (7985635328 169 0) itemoff 16184 itemsize 33 ... [ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182 [ 1563.358996] ------------[ cut here ]------------ [ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766 Then with transaction abort, and obviously failed to balance the fs. [CAUSE] That mentioned inline ref type 182 is completely sane, it's BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to believe it's invalid. Commit `64ecdb647d` ("Btrfs: add one more sanity check for shared ref type") introduced extra checks for backref type. One of the requirement is, parent bytenr must be aligned to node size, which is not correct. One example is like this: 0 1G 1G+4K 2G 2G+4K \| \|///////////////////\|//\| <- A chunk starts at 1G+4K \| \| <- A tree block get reserved at bytenr 1G+4K Then we have a valid tree block at bytenr 1G+4K, but not aligned to nodesize (16K). Such chunk is not ideal, but current kernel can handle it pretty well. We may warn about such tree block in the future, but should not reject them. [FIX] Change the alignment requirement from node size alignment to sector size alignment. Also, to make our lives a little easier, also output @iref when btrfs_get_extent_inline_ref_type() failed, so we can locate the item easier. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475 Fixes: `64ecdb647d` ("Btrfs: add one more sanity check for shared ref type") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ update comments and messages ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-09-07 14:51:05 +02:00
Josef Bacik	ad24466588	btrfs: set the correct lockdep class for new nodes When flipping over to the rw_semaphore I noticed I'd get a lockdep splat in replace_path(), which is weird because we're swapping the reloc root with the actual target root. Turns out this is because we're using the root->root_key.objectid as the root id for the newly allocated tree block when setting the lockdep class, however we need to be using the actual owner of this new block, which is saved in owner. The affected path is through btrfs_copy_root as all other callers of btrfs_alloc_tree_block (which calls init_new_buffer) have root_objectid == root->root_key.objectid . CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-08-27 14:08:52 +02:00
Boris Burkov	a84d5d429f	btrfs: detect nocow for swap after snapshot delete can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for detecting a must cow condition which is not exactly accurate, but saves unnecessary tree traversal. The incorrect assumption is that if the extent was created in a generation smaller than the last snapshot generation, it must be referenced by that snapshot. That is true, except the snapshot could have since been deleted, without affecting the last snapshot generation. The original patch claimed a performance win from this check, but it also leads to a bug where you are unable to use a swapfile if you ever snapshotted the subvolume it's in. Make the check slower and more strict for the swapon case, without modifying the general cow checks as a compromise. Turning swap on does not seem to be a particularly performance sensitive operation, so incurring a possibly unnecessary btrfs_search_slot seems worthwhile for the added usability. Note: Until the snapshot is competely cleaned after deletion, check_committed_refs will still cause the logic to think that cow is necessary, so the user must until 'btrfs subvolu sync' finished before activating the swapfile swapon. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2020-08-21 12:21:23 +02:00
Qu Wenruo	c57dd1f2f6	btrfs: trim: fix underflow in trim length to prevent access beyond device boundary [BUG] The following script can lead to tons of beyond device boundary access: mkfs.btrfs -f $dev -b 10G mount $dev $mnt trimfs $mnt btrfs filesystem resize 1:-1G $mnt trimfs $mnt [CAUSE] Since commit `929be17a9b` ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit"), we try to avoid trimming ranges that's already trimmed. So we check device->alloc_state by finding the first range which doesn't have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. But if we shrunk the device, that bits are not cleared, thus we could easily got a range starts beyond the shrunk device size. This results the returned @start and @end are all beyond device size, then we call "end = min(end, device->total_bytes -1);" making @end smaller than device size. Then finally we goes "len = end - start + 1", totally underflow the result, and lead to the beyond-device-boundary access. [FIX] This patch will fix the problem in two ways: - Clear CHUNK_TRIMMED \| CHUNK_ALLOCATED bits when shrinking device This is the root fix - Add extra safety check when trimming free device extents We check and warn if the returned range is already beyond current device. Link: https://github.com/kdave/btrfs-progs/issues/282 Fixes: `929be17a9b` ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-08-12 10:15:58 +02:00
Qu Wenruo	f3e3d9cc35	btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree [BUG] There is a bug report about bad signal timing could lead to read-only fs during balance: BTRFS info (device xvdb): balance: start -d -m -s BTRFS info (device xvdb): relocating block group 73001861120 flags metadata BTRFS info (device xvdb): found 12236 extents, stage: move data extents BTRFS info (device xvdb): relocating block group 71928119296 flags data BTRFS info (device xvdb): found 3 extents, stage: move data extents BTRFS info (device xvdb): found 3 extents, stage: update data pointers BTRFS info (device xvdb): relocating block group 60922265600 flags metadata BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown BTRFS info (device xvdb): forced readonly BTRFS info (device xvdb): balance: ended with status: -4 [CAUSE] The direct cause is the -EINTR from the following call chain when a fatal signal is pending: relocate_block_group() \|- clean_dirty_subvols() \|- btrfs_drop_snapshot() \|- btrfs_start_transaction() \|- btrfs_delayed_refs_rsv_refill() \|- btrfs_reserve_metadata_bytes() \|- __reserve_metadata_bytes() \|- wait_reserve_ticket() \|- prepare_to_wait_event(); \|- ticket->error = -EINTR; Normally this behavior is fine for most btrfs_start_transaction() callers, as they need to catch any other error, same for the signal, and exit ASAP. However for balance, especially for the clean_dirty_subvols() case, we're already doing cleanup works, getting -EINTR from btrfs_drop_snapshot() could cause a lot of unexpected problems. From the mentioned forced read-only report, to later balance error due to half dropped reloc trees. [FIX] Fix this problem by using btrfs_join_transaction() if btrfs_drop_snapshot() is called from relocation context. Since btrfs_join_transaction() won't get interrupted by signal, we can continue the cleanup. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>3 Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:45 +02:00
Qu Wenruo	a3cf0e4342	btrfs: qgroup: free per-trans reserved space when a subvolume gets dropped [BUG] Sometime fsstress could lead to qgroup warning for case like generic/013: BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920 ------------[ cut here ]------------ WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 6c341cdf9b6cc3c1 ]--- BTRFS error (device dm-3): qgroup reserved space leaked While that subvolume 259 is no longer in that filesystem. [CAUSE] Normally per-trans qgroup reserved space is freed when a transaction is committed, in commit_fs_roots(). However for completely dropped subvolume, that subvolume is completely gone, thus is no longer in the fs_roots_radix, and its per-trans reserved qgroup will never be freed. Since the subvolume is already gone, leaked per-trans space won't cause any trouble for end users. [FIX] Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is completely dropped. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:44 +02:00
Qu Wenruo	92a7cc4252	btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE The name BTRFS_ROOT_REF_COWS is not very clear about the meaning. In fact, that bit can only be set to those trees: - Subvolume roots - Data reloc root - Reloc roots for above roots All other trees won't get this bit set. So just by the result, it is obvious that, roots with this bit set can have tree blocks shared with other trees. Either shared by snapshots, or by reloc roots (an special snapshot created by relocation). This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and update all comment mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:35 +02:00
Filipe Manana	6b7304af62	btrfs: rename member 'trimming' of block group to a more generic name Back in 2014, commit `04216820fe` ("Btrfs: fix race between fs trimming and block group remove/allocation"), I added the 'trimming' member to the block group structure. Its purpose was to prevent races between trimming and block group deletion/allocation by pinning the block group in a way that prevents its logical address and device extents from being reused while trimming is in progress for a block group, so that if another task deletes the block group and then another task allocates a new block group that gets the same logical address and device extents while the trimming task is still in progress. After the previous fix for scrub (patch "btrfs: fix a race between scrub and block group removal/allocation"), scrub now also has the same needs that trimming has, so the member name 'trimming' no longer makes sense. Since there is already a 'pinned' member in the block group that refers to space reservations (pinned bytes), rename the member to 'frozen', add a comment on top of it to describe its general purpose and rename the helpers to increment and decrement the counter as well, to match the new member name. The next patch in the series will move the helpers into a more suitable file (from free-space-cache.c to block-group.c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:29 +02:00
YueHaibing	eec5b6e097	btrfs: remove unused function heads_to_leaves There's no callers in-tree anymore since commit `64403612b7` ("btrfs: rework btrfs_check_space_for_delayed_refs") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:28 +02:00
David Sterba	7c09c03091	btrfs: don't force read-only after error in drop snapshot Deleting a subvolume on a full filesystem leads to ENOSPC followed by a forced read-only. This is not a transaction abort and the filesystem is otherwise ok, so the error should be just propagated to the callers. This is caused by unnecessary call to btrfs_handle_fs_error for all errors, except EAGAIN. This does not make sense as the standard transaction abort mechanism is in btrfs_drop_snapshot so all relevant failures are handled. Originally in commit `cb1b69f450` ("Btrfs: forced readonly when btrfs_drop_snapshot() fails") there was no return value at all, so the btrfs_std_error made some sense but once the error handling and propagation has been implemented we don't need it anymore. Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:23 +02:00
Jules Irenge	c142c6a449	btrfs: add missing annotation for btrfs_lock_cluster() Sparse reports a warning at btrfs_lock_cluster() warning: context imbalance in btrfs_lock_cluster() - wrong count The root cause is the missing annotation at btrfs_lock_cluster() Add the missing __acquires(&cluster->refill_lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:16 +02:00
Josef Bacik	cd22a51c66	btrfs: do not use readahead for running delayed refs Readahead will generate a lot of extra reads for adjacent nodes, but when running delayed refs we have no idea if the next ref is going to be adjacent or not, so this potentially just generates a lot of extra IO. To make matters worse each ref is truly just looking for one item, it doesn't generally search forward, so we simply don't need it here. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Josef Bacik	8c38938c7b	btrfs: move the root freeing stuff into btrfs_put_root There are a few different ways to free roots, either you allocated them yourself and you just do free_extent_buffer(root->node); free_extent_buffer(root->commit_node); btrfs_put_root(root); Which is the pattern for log roots. Or for snapshots/subvolumes that are being dropped you simply call btrfs_free_fs_root() which does all the cleanup for you. Unify this all into btrfs_put_root(), so that we don't free up things associated with the root until the last reference is dropped. This makes the root freeing code much more significant. The only caveat is at close_ctree() time we have to free the extent buffers for all of our main roots (extent_root, chunk_root, etc) because we have to drop the btree_inode and we'll run into issues if we hold onto those nodes until ->kill_sb() time. This will be addressed in the future when we kill the btree_inode. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Nikolay Borisov	0078a9f941	btrfs: Remove block_rsv parameter from btrfs_drop_snapshot It's no longer used following `30d40577e3` ("btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Nikolay Borisov	65cd6d9e30	btrfs: Open code insert_extent_backref No need to add a level of indirection for hiding a simple 'if'. Open code insert_extent_backref in its sole caller. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Naohiro Aota	7e89540942	btrfs: factor out prepare_allocation() for extent allocation This function finally factor out prepare_allocation() form find_free_extent(). This function is called before the allocation loop and a specific allocator function like prepare_allocation_clustered() should initialize their private information and can set proper hint_byte to indicate where to start the allocation with. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	45d8e033b2	btrfs: skip LOOP_NO_EMPTY_SIZE if not clustered allocation LOOP_NO_EMPTY_SIZE is solely dedicated for clustered allocation. So, we can skip this stage and give up the allocation. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	c70e2139dc	btrfs: factor out chunk_allocation_failed() for extent allocation Factor out chunk_allocation_failed() from find_free_extent_update_loop(). This function is called when it failed to allocate a chunk. The function can modify "ffe_ctl->loop" and return 0 to continue with the next stage. Or, it can return -ENOSPC to give up here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	15b7ee6584	btrfs: drop unnecessary arguments from find_free_extent_update_loop() Now that, we don't use last_ptr and use_cluster in the function. Drop these arguments from it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	0ab9724bf5	btrfs: factor out found_extent() for extent allocation Factor out found_extent() from find_free_extent_update_loop(). This function is called when a proper extent is found and before returning from find_free_extent(). Hook functions like found_extent_clustered() should save information for a next allocation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	baba50624f	btrfs: factor out release_block_group() Factor out release_block_group() from find_free_extent(). This function is called when it gives up an allocation from a block group. Each allocation policy should reset its information for an allocation in the next block group. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	897cae7948	btrfs: drop unnecessary arguments from clustered allocation functions Now that, find_free_extent_clustered() and find_free_extent_unclustered() can access "last_ptr" from the "clustered" variable, we can drop it from the arguments. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	c668690dc0	btrfs: factor out do_allocation() for extent allocation Factor out do_allocation() from find_free_extent(). This function do an actual extent allocation in a given block group. The ffe_ctl->policy is used to determine the actual allocator function to use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	c10859be9b	btrfs: move variables for clustered allocation into find_free_extent_ctl Move "last_ptr" and "use_cluster" into struct find_free_extent_ctl, so that hook functions for clustered allocator can use these variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	ea544149a4	btrfs: move hint_byte into find_free_extent_ctl This commit moves hint_byte into find_free_extent_ctl, so that we can modify the hint_byte in the other functions. This will help us split find_free_extent further. This commit also renames the function argument "hint_byte" to "hint_byte_orig" to avoid misuse. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	cb2f96f8ab	btrfs: introduce extent allocation policy This commit introduces extent allocation policy for btrfs. This policy controls how btrfs allocate an extents from block groups. There is no functional change introduced with this commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	52d40aba68	btrfs: change full_search to bool in find_free_extent_update_loop While the "full_search" variable defined in find_free_extent() is bool, but the full_search argument of find_free_extent_update_loop() is defined as int. Let's trivially fix the argument type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
David Sterba	42c9d0b524	btrfs: simplify parameters of btrfs_set_disk_extent_flags All callers pass extent buffer start and length so the extent buffer itself should work fine. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:45 +01:00
Nikolay Borisov	dcc3eb9638	btrfs: convert snapshot/nocow exlcusion to drew lock This patch removes all haphazard code implementing nocow writers exclusion from pending snapshot creation and switches to using the drew lock to ensure this invariant still holds. 'Readers' are snapshot creators from create_snapshot and 'writers' are nocow writers from buffered write path or btrfs_setsize. This locking scheme allows for multiple snapshots to happen while any nocow writers are blocked, since writes to page cache in the nocow path will make snapshots inconsistent. So for performance reasons we'd like to have the ability to run multiple concurrent snapshots and also favors readers in this case. And in case there aren't pending snapshots (which will be the majority of the cases) we rely on the percpu's writers counter to avoid cacheline contention. The main gain from using the drew lock is it's now a lot easier to reason about the guarantees of the locking scheme and whether there is some silent breakage lurking. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:44 +01:00
Josef Bacik	ab9b2c7b32	btrfs: handle logged extent failure properly If we're allocating a logged extent we attempt to insert an extent record for the file extent directly. We increase space_info->bytes_reserved, because the extent entry addition will call btrfs_update_block_group(), which will convert the ->bytes_reserved to ->bytes_used. However if we fail at any point while inserting the extent entry we will bail and leave space on ->bytes_reserved, which will trigger a WARN_ON() on umount. Fix this by pinning the space if we fail to insert, which is what happens in every other failure case that involves adding the extent entry. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Nikolay Borisov	fe119a6eeb	btrfs: switch to per-transaction pinned extents This commit flips the switch to start tracking/processing pinned extents on a per-transaction basis. It mostly replaces all references from btrfs_fs_info::(pinned_extents\|freed_extents[]) to btrfs_transaction::pinned_extents. Two notable modifications that warrant explicit mention are changing clean_pinned_extents to get a reference to the previously running transaction. The other one is removal of call to btrfs_destroy_pinned_extent since transactions are going to be cleaned in btrfs_cleanup_one_transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Nikolay Borisov	f2fb72983b	btrfs: Mark pinned log extents as excluded In preparation to making pinned extents per-transaction ensure that log such extents are always excluded from caching. To achieve this in addition to marking them via btrfs_pin_extent_for_log_replay they also need to be marked with btrfs_add_excluded_extent to prevent log tree extent buffer being loaded by the free space caching thread. That's required since log tree blocks are not recorded in the extent tree, hence they always look free. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	6690d07126	btrfs: Make pin_down_extent take transaction handle All callers have a reference to a transaction handle so pass it to pin_down_extent. This is the final step before switching pinned extent tracking to a per-transaction basis. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00

1 2 3 4 5 ...

1607 Commits