linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 12:11:40 +00:00

Author	SHA1	Message	Date
Dr. David Alan Gilbert	b628c13951	btrfs: remove unused btrfs_try_tree_write_lock() btrfs_try_tree_write_lock() has been unused since commit `50b21d7a06` ("btrfs: submit a writeback bio per extent_buffer"). Remove it as we don't need it anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:14 +01:00
Dr. David Alan Gilbert	441ffe8a98	btrfs: remove unused btrfs_is_parity_mirror() btrfs_is_parity_mirror() has been unused since commit `4886ff7b50` ("btrfs: introduce a new helper to submit write bio for repair"). Remove it as the code was refactored and we don't need the helper anymore. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:14 +01:00
Dr. David Alan Gilbert	004641bd06	btrfs: remove unused btrfs_free_squota_rsv() btrfs_free_squota_rsv() was added in commit `e85a0adacf` ("btrfs: ensure releasing squota reserve on head refs") but has remained unused since then. Remove it as we don't seem to need it and was probably a leftover. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:14 +01:00
Johannes Thumshirn	506be4d565	btrfs: tests: add selftests for raid-stripe-tree Add first stash of very basic self tests for the RAID stripe-tree. More test cases will follow exercising the tree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:14 +01:00
Shen Lichuan	2144e1f23f	btrfs: correct typos in multiple comments across various files Fix some confusing spelling errors that were currently identified, the details are as follows: block-group.c: 2800: uncompressible ==> incompressible extent-tree.c: 3131: EXTEMT ==> EXTENT extent_io.c: 3124: utlizing ==> utilizing extent_map.c: 1323: ealier ==> earlier extent_map.c: 1325: possiblity ==> possibility fiemap.c: 189: emmitted ==> emitted fiemap.c: 197: emmitted ==> emitted fiemap.c: 203: emmitted ==> emitted transaction.h: 36: trasaction ==> transaction volumes.c: 5312: filesysmte ==> filesystem zoned.c: 1977: trasnsaction ==> transaction Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:14 +01:00
Youling Tang	fa984c9e62	btrfs: remove unused page_to_inode and page_to_fs_info macros This macro is no longer used after the "btrfs: Cleaned up folio->page conversion" series patch [1] was applied, so remove it. [1]: https://patchwork.kernel.org/project/linux-btrfs/cover/20240828182908.3735344-1-lizetao1@huawei.com/ Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:14 +01:00
Riyan Dhiman	522945b342	btrfs: remove redundant stop_loop variable in scrub_stripe() The variable stop_loop was originally introduced in commit `625f1c8dc6` ("Btrfs: improve the loop of scrub_stripe"). It was initialized to 0 in commit `3b080b2564` ("Btrfs: scrub raid56 stripes in the right way"). However, in a later commit `18d30ab961` ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56 data stripe scrub"), the code that modified stop_loop was removed, making the variable redundant. Currently, stop_loop is only initialized with 0 and is never used or modified within the scrub_stripe() function. As a result, this patch removes the stop_loop variable to clean up the code and eliminate unnecessary redundancy. This change has no impact on functionality, as stop_loop was never utilized in any meaningful way in the final version of the code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Filipe Manana	287d1cf303	btrfs: remove pointless initialization at btrfs_qgroup_trace_extent() The qgroup record was allocated with kzalloc(), so it's pointless to set its old_roots member to NULL. Remove the assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Filipe Manana	db58e152a2	btrfs: always use delayed_refs local variable at btrfs_qgroup_trace_extent() Instead of dereferencing the delayed refs from the transaction multiple times, store it early in the local variable and then always use the variable. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Filipe Manana	c5e2680224	btrfs: remove unnecessary delayed refs locking at btrfs_qgroup_trace_extent() There's no need to hold the delayed refs spinlock when calling btrfs_qgroup_trace_extent_nolock() from btrfs_qgroup_trace_extent(), since it doesn't change anything in delayed refs and it only changes the xarray used to track qgroup extent records, which is protected by the xarray's lock. Holding the lock is only adding unnecessary lock contention with other tasks that actually need to take the lock to add/remove/change delayed references. So remove the locking. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Filipe Manana	fad884b0c8	btrfs: store fs_info in a local variable at btrfs_qgroup_trace_extent_post() Instead of extracting fs_info from the transaction multiples times, store it in a local variable and use it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Filipe Manana	c28b97f53b	btrfs: qgroups: remove bytenr field from struct btrfs_qgroup_extent_record Now that we track qgroup extent records in a xarray we don't need to have a "bytenr" field in struct btrfs_qgroup_extent_record, since we can get it from the index of the record in the xarray. So remove the field and grab the bytenr from either the index key or any other place where it's available (delayed refs). This reduces the size of struct btrfs_qgroup_extent_record from 40 bytes down to 32 bytes, meaning that we now can store 128 instances of this structure instead of 102 per 4K page. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Johannes Thumshirn	2206265f41	btrfs: remove code duplication in ordered extent finishing Remove the duplicated transaction joining, block reserve setting and raid extent inserting in btrfs_finish_ordered_extent(). While at it, also abort the transaction in case inserting a RAID stripe-tree entry fails. Suggested-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Qu Wenruo	7e06de7c83	btrfs: canonicalize the device path before adding it [PROBLEM] Currently btrfs accepts any file path for its device, resulting some weird situation: # ./mount_by_fd /dev/test/scratch1 /mnt/btrfs/ The program has the following source code: #include <fcntl.h> #include <stdio.h> #include <sys/mount.h> int main(int argc, char *argv[]) { int fd = open(argv[1], O_RDWR); char path[256]; snprintf(path, sizeof(path), "/proc/self/fd/%d", fd); return mount(path, argv[2], "btrfs", 0, NULL); } Then we can have the following weird device path: BTRFS: device fsid 2378be81-fe12-46d2-a9e8-68cf08dd98d5 devid 1 transid 7 /proc/self/fd/3 (253:2) scanned by mount_by_fd (18440) Normally it's not a big deal, and later udev can trigger a device path rename. But if udev didn't trigger, the device path "/proc/self/fd/3" will show up in mtab. [CAUSE] For filename "/proc/self/fd/3", it means the opened file descriptor 3. In above case, it's exactly the device we want to open, aka points to "/dev/test/scratch1" which is another symlink pointing to "/dev/dm-2". Inside kernel we solve the mount source using LOOKUP_FOLLOW, which follows the symbolic link and grab the proper block device. But inside btrfs we also save the filename into btrfs_device::name, and utilize that member to report our mount source, which leads to the above situation. [FIX] Instead of unconditionally trust the path, check if the original file (not following the symbolic link) is inside "/dev/", if not, then manually lookup the path to its final destination, and use that as our device path. This allows us to still use symbolic links, like "/dev/mapper/test-scratch" from LVM2, which is required for fstests runs with LVM2 setup. And for really weird names, like the above case, we solve it to "/dev/dm-2" instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641 Reported-by: Fabian Vogt <fvogt@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Qu Wenruo	2e8b6bc0ab	btrfs: avoid unnecessary device path update for the same device [PROBLEM] It is very common for udev to trigger device scan, and every time a mounted btrfs device got re-scan from different soft links, we will get some of unnecessary device path updates, this is especially common for LVM based storage: # lvs scratch1 test -wi-ao---- 10.00g scratch2 test -wi-a----- 10.00g scratch3 test -wi-a----- 10.00g scratch4 test -wi-a----- 10.00g scratch5 test -wi-a----- 10.00g test test -wi-a----- 10.00g # mkfs.btrfs -f /dev/test/scratch1 # mount /dev/test/scratch1 /mnt/btrfs # dmesg -c [ 205.705234] BTRFS: device fsid 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 devid 1 transid 6 /dev/mapper/test-scratch1 (253:4) scanned by mount (1154) [ 205.710864] BTRFS info (device dm-4): first mount of filesystem 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 [ 205.711923] BTRFS info (device dm-4): using crc32c (crc32c-intel) checksum algorithm [ 205.713856] BTRFS info (device dm-4): using free-space-tree [ 205.722324] BTRFS info (device dm-4): checking UUID tree So far so good, but even if we just touched any soft link of "dm-4", we will get quite some unnecessary device path updates. # touch /dev/mapper/test-scratch1 # dmesg -c [ 469.295796] BTRFS info: devid 1 device path /dev/mapper/test-scratch1 changed to /dev/dm-4 scanned by (udev-worker) (1221) [ 469.300494] BTRFS info: devid 1 device path /dev/dm-4 changed to /dev/mapper/test-scratch1 scanned by (udev-worker) (1221) Such device path rename is unnecessary and can lead to random path change due to the udev race. [CAUSE] Inside device_list_add(), we are using a very primitive way checking if the device has changed, strcmp(). Which can never handle links well, no matter if it's hard or soft links. So every different link of the same device will be treated as a different device, causing the unnecessary device path update. [FIX] Introduce a helper, is_same_device(), and use path_equal() to properly detect the same block device. So that the different soft links won't trigger the rename race. Reviewed-by: Filipe Manana <fdmanana@suse.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641 Reported-by: Fabian Vogt <fvogt@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Qu Wenruo	1d2fbb7f1f	btrfs: allow compression even if the range is not page aligned Previously for btrfs with sector size smaller than page size (subpage), we only allow compression if the range is fully page aligned. This is to work around the asynchronous submission of compressed range, which delayed the page unlock and writeback into a workqueue, furthermore asynchronous submission can lock multiple sector range across page boundary. Such asynchronous submission makes it very hard to co-operate with other regular writes. With the recent changes to the subpage folio unlock path, now asynchronous submission of compressed pages can co-operate with regular submission, so enable sector perfect compression if it's an experimental build. The ETA for moving this feature out of experimental is 6.15, and I hope all remaining corner cases can be exposed before that. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Qu Wenruo	c96d0e3921	btrfs: mark all dirty sectors as locked inside writepage_delalloc() Currently we only mark sectors as locked if there is a NEW delalloc range for it. But NEW delalloc range is not the same as dirty sectors we want to submit, e.g: 0 32K 64K 96K 128K \| \|////////\|\|///////\| \|////\| 120K For above 64K page size case, writepage_delalloc() for page 0 will find and lock the delalloc range [32K, 96K), which is beyond the page boundary. Then when writepage_delalloc() is called for the page 64K, since [64K, 96K) is already locked, only [120K, 128K) will be locked. This means, although range [64K, 96K) is dirty and will be submitted later by extent_writepage_io(), it will not be marked as locked. This is fine for now, as we call btrfs_folio_end_writer_lock_bitmap() to free every non-compressed sector, and compression is only allowed for full page range. But this is not safe for future sector perfect compression support, as this can lead to double folio unlock: Thread A \| Thread B ---------------------------------------+-------------------------------- \| submit_one_async_extent() \| \|- extent_clear_unlock_delalloc() extent_writepage() \| \|- btrfs_folio_end_writer_lock() \|- btrfs_folio_end_writer_lock_bitmap()\| \|- btrfs_subpage_end_and_test_writer() \| \| \| \|- atomic_sub_and_test() \| \| \| /* Now the atomic value is 0 */ \|- if (atomic_read() == 0) \| \| \|- folio_unlock() \| \|- folio_unlock() The root cause is the above range [64K, 96K) is dirtied and should also be locked but it isn't. So to make everything more consistent and prepare for the incoming sector perfect compression, mark all dirty sectors as locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:13 +01:00
Qu Wenruo	2bca8eb077	btrfs: move the delalloc range bitmap search into extent_io.c Currently for subpage (sector size < page size) cases, we reuse subpage locked bitmap to find out all delalloc ranges we have locked, and run all those found ranges. However such reuse is not perfect, e.g.: 0 32K 64K 96K 128K \| \|////////\|\|///////\| \|////\| 120K For above range, writepage_delalloc() for page 0 will handle the range [32K, 96k), note delalloc range can be beyond the page boundary. But writepage_delalloc() for page 64K will only handle range [120K, 128K), as the previous run on page 0 has already handled range [64K, 96K). Meanwhile for the writeback we should expect range [64K, 96K) to also be locked, this leads to the mismatch from locked bitmap and delalloc range. This is not causing problems yet, but it's still an inconsistent behavior. So instead of relying on the subpage locked bitmap, move the delalloc range search using local @delalloc_bitmap, so that we can remove the existing btrfs_folio_find_writer_locked(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	928b4de66e	btrfs: do not assume the full page range is not dirty in extent_writepage_io() The function extent_writepage_io() will submit the dirty sectors inside the page for the write. But recently to co-operate with the incoming subpage compression enhancement, a new bitmap is introduced to btrfs_bio_ctrl::submit_bitmap, to only avoid a subset of the dirty range. This is because we can have the following cases with 64K page size: 0 16K 32K 48K 64K \| \|/////////\| \|/\| 52K For range [16K, 32K), we queue the dirty range for compression, which is ran in a delayed workqueue. Then for range [48K, 52K), we go through the regular submission path. In that case, our btrfs_bio_ctrl::submit_bitmap will exclude the range [16K, 32K). The dirty flags for the range [16K, 32K) is only cleared when the compression is done, by the extent_clear_unlock_delalloc() call inside submit_one_async_extent(). This patch fix the false alert by removing the btrfs_folio_assert_not_dirty() check, since it's no longer correct for subpage compression cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	a4ef54dbb5	btrfs: make extent_range_clear_dirty_for_io() to handle sector size < page size cases For btrfs with sector size < page size (e.g. 4K sector size, 64K page size), and enable the sector perfect compression support, then the following dirty range can lead to problems: 0 32K 64K 96K 128K \| \|///////\|\|//////\| \|/\| 124K In above case, if we start writeback for that inode, the last dirty range [124K, 128K) will not be submitted and cause reserved space leakage: - Start writeback for page 0 We find the range [32K, 96K) is suitable for compression, and queue it into a workqueue to do the delayed compression and submission. - Compression happens for range [32K, 96K) Function extent_range_clear_dirty_for_io() is called, however it is only doing full page handling, not considering any the extra bitmaps for subpage cases. That function will clear page dirty for both page 0 and page 64K. - Writeback for the inode is done Because page 64K has its dirty flag cleared, it will not be considered as a writeback target. This means the range [124K, 128K) will not be submitted, and reserved space for it will be leaked. Fix this problem by using the subpage helper to clear the dirty flag. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	a8706d0271	btrfs: wait for writeback if sector size is smaller than page size [PROBLEM] If sector perfect compression is enabled for sector size < page size case, the following case can lead dirty ranges not being written back: 0 32K 64K 96K 128K \| \|///////\|\|//////\| \|/\| 124K In above example, the page size is 64K, and we need to write back above two pages. - Submit for page 0 (main thread) We found delalloc range [32K, 96K), which can be compressed. So we queue an async range for [32K, 96K). This means, the page unlock/clearing dirty/setting writeback will all happen in a workqueue context. - The compression is done, and compressed range is submitted (workqueue) Since the compression is done in asynchronously, the compression can be done before the main thread to submit for page 64K. Now the whole range [32K, 96K), involving two pages, will be marked writeback. - Submit for page 64K (main thread) extent_write_cache_pages() got its wbc->sync_mode is WB_SYNC_NONE, so it skips the writeback wait. And unlock the page and exit. This means the dirty range [124K, 128K) will never be submitted, until next writeback happens for page 64K. This will never happen for previous kernels because: - For sector size == page size case Since one page is one sector, if a page is marked writeback it will not have dirty flags. So this corner case will never hit. - For sector size < page size case We never do subpage compression, a range can only be submitted for compression if the range is fully page aligned. This change makes the subpage behavior mostly the same as non-subpage cases. [ENHANCEMENT] Instead of relying WB_SYNC_NONE check only, if it's a subpage case, then always wait for writeback flags. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	dd5e276254	btrfs: compression: add an ASSERT() to ensure the read-in length is sane There are already two bugs (one in zlib, one in zstd) that involved compression path is not handling sector size < page size cases well. So it makes more sense to make sure that btrfs_compress_folios() returns Since we already have two bugs (one in zlib, one in zstd) in the compression path resulting the @total_in be to larger than the to-be-compressed range length, there is enough reason to add an ASSERT() to make sure the total read-in length doesn't exceed the input length. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	90275a7762	btrfs: zstd: make the compression path to handle sector size < page size Inside zstd_compress_folios(), after exhausted one input page, we need to switch to the next page as input. However when counting the total input bytes (@tot_in), we always increase it by PAGE_SIZE. For the following case, it can cause incorrect value: 0 32K 64K 96K \| \|///////////\|\|///////////\| After compressing range [32K, 64K), we switch to the next page, and increasing @tot_in by 64K, while we only read 32K. This will cause the @total_in to return a value larger than the input length. Fix it by only increase @tot_in by the input size. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	f6ebedb09b	btrfs: zlib: make the compression path to handle sector size < page size Inside zlib_compress_folios(), each time we switch the input page cache, the @start is increased by PAGE_SIZE. But for the incoming compression support for sector size < page size (previously we support compression only when the range is fully page aligned), this is not going to handle the following case: 0 32K 64K 96K \| \|///////////\|\|///////////\| @start has the initial value 32K, indicating the start filepos of the to-be-compressed range. And when grabbing the first page as input, we always call "start += PAGE_SIZE;". But since @start is starting at 32K, it will be increased by 64K, resulting it to be 96K for the next range, causing incorrect input range and corruption for the future subpage compression. Fix it by only increase @start by the input size. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	67cd3f2217	btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG Currently CONFIG_BTRFS_EXPERIMENTAL is not only for the extra debugging output, but also for experimental features. This is not ideal to distinguish planned but not yet stable features from those purely designed for debugging. This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL: - Extent map shrinker This seems to be the first one to exit experimental. - Extent tree v2 This seems to be the last one to graduate from experimental. - Raid stripe tree - Csum offload mode - Send protocol v3 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	c186345a6b	btrfs: make assert_rbio() to only check CONFIG_BTRFS_ASSERT According to the description, CONFIG_BTRFS_DEBUG is only for extra debug info, meanwhile sanity checks should be managed by CONFIG_BTRFS_ASSERT. There is no need to check both to enable assert_rbio(). Just remove the check for CONFIG_BTRFS_DEBUG. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Johannes Thumshirn	8cca35cb29	btrfs: don't take dev_replace rwsem on task already holding it Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of the RAID stripe-tree, we get the following splat from lockdep: BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started ============================================ WARNING: possible recursive locking detected 6.11.0-rc3-btrfs-for-next #599 Not tainted -------------------------------------------- btrfs/2326 is trying to acquire lock: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250 but task is already holding lock: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs_info->dev_replace.rwsem); lock(&fs_info->dev_replace.rwsem); * DEADLOCK * May be due to missing lock nesting notation 1 lock held by btrfs/2326: #0: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250 stack backtrace: CPU: 1 UID: 0 PID: 2326 Comm: btrfs Not tainted 6.11.0-rc3-btrfs-for-next #599 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x5b/0x80 __lock_acquire+0x2798/0x69d0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 lock_acquire+0x19d/0x4a0 ? btrfs_map_block+0x39f/0x2250 ? __pfx_lock_acquire+0x10/0x10 ? find_held_lock+0x2d/0x110 ? lock_is_held_type+0x8f/0x100 down_read+0x8e/0x440 ? btrfs_map_block+0x39f/0x2250 ? __pfx_down_read+0x10/0x10 ? do_raw_read_unlock+0x44/0x70 ? _raw_read_unlock+0x23/0x40 btrfs_map_block+0x39f/0x2250 ? btrfs_dev_replace_by_ioctl+0xd69/0x1d00 ? btrfs_bio_counter_inc_blocked+0xd9/0x2e0 ? __kasan_slab_alloc+0x6e/0x70 ? __pfx_btrfs_map_block+0x10/0x10 ? __pfx_btrfs_bio_counter_inc_blocked+0x10/0x10 ? kmem_cache_alloc_noprof+0x1f2/0x300 ? mempool_alloc_noprof+0xed/0x2b0 btrfs_submit_chunk+0x28d/0x17e0 ? __pfx_btrfs_submit_chunk+0x10/0x10 ? bvec_alloc+0xd7/0x1b0 ? bio_add_folio+0x171/0x270 ? __pfx_bio_add_folio+0x10/0x10 ? __kasan_check_read+0x20/0x20 btrfs_submit_bio+0x37/0x80 read_extent_buffer_pages+0x3df/0x6c0 btrfs_read_extent_buffer+0x13e/0x5f0 read_tree_block+0x81/0xe0 read_block_for_search+0x4bd/0x7a0 ? __pfx_read_block_for_search+0x10/0x10 btrfs_search_slot+0x78d/0x2720 ? __pfx_btrfs_search_slot+0x10/0x10 ? lock_is_held_type+0x8f/0x100 ? kasan_save_track+0x14/0x30 ? __kasan_slab_alloc+0x6e/0x70 ? kmem_cache_alloc_noprof+0x1f2/0x300 btrfs_get_raid_extent_offset+0x181/0x820 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_btrfs_get_raid_extent_offset+0x10/0x10 ? down_read+0x194/0x440 ? __pfx_down_read+0x10/0x10 ? do_raw_read_unlock+0x44/0x70 ? _raw_read_unlock+0x23/0x40 btrfs_map_block+0x5b5/0x2250 ? __pfx_btrfs_map_block+0x10/0x10 scrub_submit_initial_read+0x8fe/0x11b0 ? __pfx_scrub_submit_initial_read+0x10/0x10 submit_initial_group_read+0x161/0x3a0 ? lock_release+0x20e/0x710 ? __pfx_submit_initial_group_read+0x10/0x10 ? __pfx_lock_release+0x10/0x10 scrub_simple_mirror.isra.0+0x3eb/0x580 scrub_stripe+0xe4d/0x1440 ? lock_release+0x20e/0x710 ? __pfx_scrub_stripe+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? do_raw_read_unlock+0x44/0x70 ? _raw_read_unlock+0x23/0x40 scrub_chunk+0x257/0x4a0 scrub_enumerate_chunks+0x64c/0xf70 ? __mutex_unlock_slowpath+0x147/0x5f0 ? __pfx_scrub_enumerate_chunks+0x10/0x10 ? bit_wait_timeout+0xb0/0x170 ? __up_read+0x189/0x700 ? scrub_workers_get+0x231/0x300 ? up_write+0x490/0x4f0 btrfs_scrub_dev+0x52e/0xcd0 ? create_pending_snapshots+0x230/0x250 ? __pfx_btrfs_scrub_dev+0x10/0x10 btrfs_dev_replace_by_ioctl+0xd69/0x1d00 ? lock_acquire+0x19d/0x4a0 ? __pfx_btrfs_dev_replace_by_ioctl+0x10/0x10 ? lock_release+0x20e/0x710 ? btrfs_ioctl+0xa09/0x74f0 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x11e/0x240 ? __pfx_do_raw_spin_lock+0x10/0x10 btrfs_ioctl+0xa14/0x74f0 ? lock_acquire+0x19d/0x4a0 ? find_held_lock+0x2d/0x110 ? __pfx_btrfs_ioctl+0x10/0x10 ? lock_release+0x20e/0x710 ? do_sigaction+0x3f0/0x860 ? __pfx_do_vfs_ioctl+0x10/0x10 ? do_raw_spin_lock+0x11e/0x240 ? lockdep_hardirqs_on_prepare+0x270/0x3e0 ? _raw_spin_unlock_irq+0x28/0x50 ? do_sigaction+0x3f0/0x860 ? __pfx_do_sigaction+0x10/0x10 ? __x64_sys_rt_sigaction+0x18e/0x1e0 ? __pfx___x64_sys_rt_sigaction+0x10/0x10 ? __x64_sys_close+0x7c/0xd0 __x64_sys_ioctl+0x137/0x190 do_syscall_64+0x71/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f0bd1114f9b Code: Unable to access opcode bytes at 0x7f0bd1114f71. RSP: 002b:00007ffc8a8c3130 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f0bd1114f9b RDX: 00007ffc8a8c35e0 RSI: 00000000ca289435 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc8a8c6c85 R13: 00000000398e72a0 R14: 0000000000004361 R15: 0000000000000004 </TASK> This happens because on RAID stripe-tree filesystems we recurse back into btrfs_map_block() on scrub to perform the logical to device physical mapping. But as the device replace task is already holding the dev_replace::rwsem we deadlock. So don't take the dev_replace::rwsem in case our task is the task performing the device replace. Suggested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Linus Torvalds	9183e033ec	for-6.12-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E 2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx 3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/ P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY= =TQzP -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more one-liners that fix some user visible problems: - use correct range when clearing qgroup reservations after COW - properly reset freed delayed ref list head - fix ro/rw subvolume mounts to be backward compatible with old and new mount API" * tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix the length of reserved qgroup to free btrfs: reinitialize delayed ref list after deleting it from the list btrfs: fix per-subvolume RO/RW flags with new mount API	2024-11-08 07:31:03 -10:00
Haisu Wang	2b084d8205	btrfs: fix the length of reserved qgroup to free The dealloc flag may be cleared and the extent won't reach the disk in cow_file_range when errors path. The reserved qgroup space is freed in commit `30479f31d4` ("btrfs: fix qgroup reserve leaks in cow_file_range"). However, the length of untouched region to free needs to be adjusted with the correct remaining region size. Fixes: `30479f31d4` ("btrfs: fix qgroup reserve leaks in cow_file_range") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Haisu Wang <haisuwang@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-07 02:08:29 +01:00
Filipe Manana	c9a75ec45f	btrfs: reinitialize delayed ref list after deleting it from the list At insert_delayed_ref() if we need to update the action of an existing ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's ref_add_list using list_del(), which leaves the ref's add_list member not reinitialized, as list_del() sets the next and prev members of the list to LIST_POISON1 and LIST_POISON2, respectively. If later we end up calling drop_delayed_ref() against the ref, which can happen during merging or when destroying delayed refs due to a transaction abort, we can trigger a crash since at drop_delayed_ref() we call list_empty() against the ref's add_list, which returns false since the list was not reinitialized after the list_del() and as a consequence we call list_del() again at drop_delayed_ref(). This results in an invalid list access since the next and prev members are set to poison pointers, resulting in a splat if CONFIG_LIST_HARDENED and CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences otherwise. So fix this by deleting from the list with list_del_init() instead. Fixes: `1d57ee9416` ("btrfs: improve delayed refs iterations") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-07 02:07:53 +01:00
Qu Wenruo	cda7163d4e	btrfs: fix per-subvolume RO/RW flags with new mount API [BUG] With util-linux 2.40.2, the 'mount' utility is already utilizing the new mount API. e.g: # strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/ ... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? /, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 But this leads to a new problem, that per-subvolume RO/RW mount no longer works, if the initial mount is RO: # mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test # mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch # mount \| grep mnt /dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1) /dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2) # touch /mnt/scratch/foobar touch: cannot touch '/mnt/scratch/foobar': Read-only file system This is a common use cases on distros. [CAUSE] We have a workaround for remount to handle the RO->RW change, but if the mount is using the new mount API, we do not do that, and rely on the mount tool NOT to set the ro flag. But that's not how the mount tool is doing for the new API: fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 / MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 This means we will set the super block RO at the first mount. Later RW mount will not try to reconfigure the fs to RW because the mount tool is already using the new API. This totally breaks the per-subvolume RO/RW mount behavior. [FIX] Do not skip the reconfiguration even if using the new API. The old comments are just expecting any mount tool to properly skip the RO flag set even if we specify "ro", which is not the reality. Update the comments regarding the backward compatibility on the kernel level so it works with old and new mount utilities. CC: stable@vger.kernel.org # 6.8+ Fixes: `f044b31867` ("btrfs: handle the ro->rw transition for mounting different subvolumes") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-07 02:07:45 +01:00
Al Viro	6348be02ee	fdget(), trivial conversions fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Linus Torvalds	6b4926494e	for-6.12-rc5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmck8eQACgkQxWXV+ddt WDu05g/6AwrnvPkivC4iVOv4Wkzrpk4gm76smx91Y9B8tSDLI1pHaS27CvJz9iWl vBKXPN3PQVQHwo6SPn+NjsFOSMkXlbBOVKpPU+MlZwH9Tuw66qcC+EnUCK2wEuAy 3TN7cUGIA4r/j+SkhgIz+Irlr5pjdb1KkPIMBEVGcVFqDIuvDaTEGBqTn2i/V5aa dMn+gK+9rfngTOJ68t/pEFaX7SEWCvgMIcBpBB4/vs1gHm3ve2bcc1sBAdMxb1Se SrxgZfq+Rc5tkMn540JaWGwkb0rLzwXlurK6ygTKDKCpH0IMX+pBvDkexh9Zj0ux jejlRxiuDzTx3z2a7FjHDyp2sdZWMpq3sPsowpJ1Dsgi5EtSxTy4irmQuSAZY1Uj /uo6YwV9aTGeiNDwZeKqKc/wOuAttaMZLr14s37pro9KxndFJ/XZBxeyB+euUCOw B8AvAQVVIJAYQLyWINWruNKppqlgiO2RaN15RvvT2pX01d0TOx1KX1XFQku7YFxb M/8ZNXzJ96XtkeyHL3euo3zj7N5jWtnCvPINugUG1ADQa+bc8aX336gld1neD6fs QqIFIgzZG0l4N95viJilACrI6tW9zFnBqMyNFRhucKiX9aP9glOvhSfxfjcpDuQ/ i/LIyxVLwp8M3hPNvv8tC345+1C2ug9AD0OyhWjjIYPuiOxtTWs= =alpB -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more stability fixes. There's one patch adding export of MIPS cmpxchg helper, used in the error propagation fix. - fix error propagation from split bios to the original btrfs bio - fix merging of adjacent extents (normal operation, defragmentation) - fix potential use after free after freeing btrfs device structures" * tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix defrag not merging contiguous extents due to merged extent maps btrfs: fix extent map merging not happening for adjacent extents btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids() btrfs: fix error propagation of split bios MIPS: export __cmpxchg_small()	2024-11-01 07:31:47 -10:00
Filipe Manana	77b0d113ee	btrfs: fix defrag not merging contiguous extents due to merged extent maps When running defrag (manual defrag) against a file that has extents that are contiguous and we already have the respective extent maps loaded and merged, we end up not defragging the range covered by those contiguous extents. This happens when we have an extent map that was the result of merging multiple extent maps for contiguous extents and the length of the merged extent map is greater than or equals to the defrag threshold length. The script below reproduces this scenario: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create a 256K file with 4 extents of 64K each. xfs_io -f -c "falloc 0 64K" \ -c "pwrite 0 64K" \ -c "falloc 64K 64K" \ -c "pwrite 64K 64K" \ -c "falloc 128K 64K" \ -c "pwrite 128K 64K" \ -c "falloc 192K 64K" \ -c "pwrite 192K 64K" \ $MNT/foo umount $MNT echo -n "Initial number of file extent items: " btrfs inspect-internal dump-tree -t 5 $DEV \| grep EXTENT_DATA \| wc -l mount $DEV $MNT # Read the whole file in order to load and merge extent maps. cat $MNT/foo > /dev/null btrfs filesystem defragment -t 128K $MNT/foo umount $MNT echo -n "Number of file extent items after defrag with 128K threshold: " btrfs inspect-internal dump-tree -t 5 $DEV \| grep EXTENT_DATA \| wc -l mount $DEV $MNT # Read the whole file in order to load and merge extent maps. cat $MNT/foo > /dev/null btrfs filesystem defragment -t 256K $MNT/foo umount $MNT echo -n "Number of file extent items after defrag with 256K threshold: " btrfs inspect-internal dump-tree -t 5 $DEV \| grep EXTENT_DATA \| wc -l Running it: $ ./test.sh Initial number of file extent items: 4 Number of file extent items after defrag with 128K threshold: 4 Number of file extent items after defrag with 256K threshold: 4 The 4 extents don't get merged because we have an extent map with a size of 256K that is the result of merging the individual extent maps for each of the four 64K extents and at defrag_lookup_extent() we have a value of zero for the generation threshold ('newer_than' argument) since this is a manual defrag. As a consequence we don't call defrag_get_extent() to get an extent map representing a single file extent item in the inode's subvolume tree, so we end up using the merged extent map at defrag_collect_targets() and decide not to defrag. Fix this by updating defrag_lookup_extent() to always discard extent maps that were merged and call defrag_get_extent() regardless of the minimum generation threshold ('newer_than' argument). A test case for fstests will be sent along soon. CC: stable@vger.kernel.org # 6.1+ Fixes: `199257a78b` ("btrfs: defrag: don't use merged extent map for their generation check") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-31 16:46:41 +01:00
Filipe Manana	a0f0625390	btrfs: fix extent map merging not happening for adjacent extents If we have 3 or more adjacent extents in a file, that is, consecutive file extent items pointing to adjacent extents, within a contiguous file range and compatible flags, we end up not merging all the extents into a single extent map. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -d -c "pwrite -b 64K 0 64K" \ -c "pwrite -b 64K 64K 64K" \ -c "pwrite -b 64K 128K 64K" \ -c "pwrite -b 64K 192K 64K" \ /mnt/sdc/foo After all the ordered extents complete we unpin the extent maps and try to merge them, but instead of getting a single extent map we get two because: 1) When the first ordered extent completes (file range [0, 64K)) we unpin its extent map and attempt to merge it with the extent map for the range [64K, 128K), but we can't because that extent map is still pinned; 2) When the second ordered extent completes (file range [64K, 128K)), we unpin its extent map and merge it with the previous extent map, for file range [0, 64K), but we can't merge with the next extent map, for the file range [128K, 192K), because this one is still pinned. The merged extent map for the file range [0, 128K) gets the flag EXTENT_MAP_MERGED set; 3) When the third ordered extent completes (file range [128K, 192K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [0, 128K), but we can't because that extent map has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false due to different flags) while the extent map for the range [128K, 192K) doesn't have that flag set. We also can't merge it with the next extent map, for file range [192K, 256K), because that one is still pinned. At this moment we have 3 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 192K). One for file range [192K, 256K) which is still pinned; 4) When the fourth and final extent completes (file range [192K, 256K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [128K, 192K), which succeeds since none of these extent maps have the EXTENT_MAP_MERGED flag set. So we end up with 2 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set. Since after merging extent maps we don't attempt to merge again, that is, merge the resulting extent map with the one that is now preceding it (and the one following it), we end up with those two extent maps, when we could have had a single extent map to represent the whole file. Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag. While this doesn't present any functional issue, it prevents the merging of extent maps which allows to save memory, and can make defrag not merging extents too (that will be addressed in the next patch). Fixes: `199257a78b` ("btrfs: defrag: don't use merged extent map for their generation check") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-31 16:45:16 +01:00
Zhihao Cheng	aec8e6bf83	btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids() Mounting btrfs from two images (which have the same one fsid and two different dev_uuids) in certain executing order may trigger an UAF for variable 'device->bdev_file' in __btrfs_free_extra_devids(). And following are the details: 1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs devices by ioctl(BTRFS_IOC_SCAN_DEV): / btrfs_device_1 → loop0 fs_device \ btrfs_device_2 → loop1 2. mount /dev/loop0 /mnt btrfs_open_devices btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0) btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree fail: btrfs_close_devices // -ENOMEM btrfs_close_bdev(btrfs_device_1) fput(btrfs_device_1->bdev_file) // btrfs_device_1->bdev_file is freed btrfs_close_bdev(btrfs_device_2) fput(btrfs_device_2->bdev_file) 3. mount /dev/loop1 /mnt btrfs_open_devices btrfs_get_bdev_and_sb(&bdev_file) // EIO, btrfs_device_1->bdev_file is not assigned, // which points to a freed memory area btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree btrfs_free_extra_devids if (btrfs_device_1->bdev_file) fput(btrfs_device_1->bdev_file) // UAF ! Fix it by setting 'device->bdev_file' as 'NULL' after closing the btrfs_device in btrfs_close_one_device(). Fixes: `1423881941` ("btrfs: do not background blkdev_put()") CC: stable@vger.kernel.org # 4.19+ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-29 21:59:25 +01:00
Christoph Hellwig	2f5a65ef30	block: add a bdev_limits helper Add a helper to get the queue_limits from the bdev without having to poke into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 09:15:00 -06:00
Pankaj Raghav	30dac24e14	fs/writeback: convert wbc_account_cgroup_owner to take a folio Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:26:54 +01:00
Linus Torvalds	4e46774408	for-6.12-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ= =sYxX -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - mount option fixes: - fix handling of compression mount options on remount - reject rw remount in case there are options that don't work in read-write mode (like rescue options) - fix zone accounting of unusable space - fix in-memory corruption when merging extent maps - fix delalloc range locking for sector < page - use more convenient default value of drop subtree threshold, clean more subvolumes without the fallback to marking quotas inconsistent - fix smatch warning about incorrect value passed to ERR_PTR * tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() btrfs: reject ro->rw reconfiguration if there are hard ro requirements btrfs: fix read corruption due to race with extent map merging btrfs: fix the delalloc range locking if sector size < page size btrfs: qgroup: set a more sane default value for subtree drop threshold btrfs: clear force-compress on remount when compress mount option is given btrfs: zoned: fix zone unusable accounting for freed reserved extent	2024-10-24 13:04:15 -07:00
Naohiro Aota	d48e1dea39	btrfs: fix error propagation of split bios The purpose of btrfs_bbio_propagate_error() shall be propagating an error of split bio to its original btrfs_bio, and tell the error to the upper layer. However, it's not working well on some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called enough fast, orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for earlier split btrfs_bio's completion. That will result in NULL pointer dereference. That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9000006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body.cold+0x19/0x26 ? page_fault_oops+0x13e/0x2b0 ? _printk+0x58/0x73 ? do_user_addr_fault+0x5f/0x750 ? exc_page_fault+0x76/0x240 ? asm_exc_page_fault+0x22/0x30 ? btrfs_bio_end_io+0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs] btrfs_orig_write_end_io+0x51/0x90 [btrfs] dm_submit_bio+0x5c2/0xa50 [dm_mod] ? find_held_lock+0x2b/0x80 ? blk_try_enter_queue+0x90/0x1e0 __submit_bio+0xe0/0x130 ? ktime_get+0x10a/0x160 ? lockdep_hardirqs_on+0x74/0x100 submit_bio_noacct_nocheck+0x199/0x410 btrfs_submit_bio+0x7d/0x150 [btrfs] btrfs_submit_chunk+0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on+0x74/0x100 ? __folio_start_writeback+0x10/0x2c0 btrfs_submit_bbio+0x1c/0x40 [btrfs] submit_one_bio+0x44/0x60 [btrfs] submit_extent_folio+0x13f/0x330 [btrfs] ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs] extent_writepage_io+0x18b/0x360 [btrfs] extent_write_locked_range+0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs] run_delalloc_cow+0x71/0xd0 [btrfs] btrfs_run_delalloc_range+0x176/0x500 [btrfs] ? find_lock_delalloc_range+0x119/0x260 [btrfs] writepage_delalloc+0x2ab/0x480 [btrfs] extent_write_cache_pages+0x236/0x7d0 [btrfs] btrfs_writepages+0x72/0x130 [btrfs] do_writepages+0xd4/0x240 ? find_held_lock+0x2b/0x80 ? wbc_attach_and_unlock_inode+0x12c/0x290 ? wbc_attach_and_unlock_inode+0x12c/0x290 __writeback_single_inode+0x5c/0x4c0 ? do_raw_spin_unlock+0x49/0xb0 writeback_sb_inodes+0x22c/0x560 __writeback_inodes_wb+0x4c/0xe0 wb_writeback+0x1d6/0x3f0 wb_workfn+0x334/0x520 process_one_work+0x1ee/0x570 ? lock_is_held_type+0xc6/0x130 worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and return BLK_STS_OK to the upper layer. [2] Actually, this is not true. Because we only increases orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() is not working well if un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily over-written by orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e, bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: `852eee62d3` ("btrfs: allow btrfs_submit_bio to split bios") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-23 18:17:43 +02:00
Yue Haibing	75f49c3dc7	btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() The ret may be zero in btrfs_search_dir_index_item() and should not passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0. This fixes smatch warnings: fs/btrfs/dir-item.c:353 btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR' Fixes: `9dcbe16fcc` ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:10:55 +02:00
Qu Wenruo	3c36a72c1d	btrfs: reject ro->rw reconfiguration if there are hard ro requirements [BUG] Syzbot reports the following crash: BTRFS info (device loop0 state MCS): disabling free space tree BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline] RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041 Call Trace: <TASK> btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530 btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312 btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012 btrfs_remount_rw fs/btrfs/super.c:1309 [inline] btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534 btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline] btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115 vfs_get_tree+0x90/0x2b0 fs/super.c:1800 do_new_mount+0x2be/0xb40 fs/namespace.c:3472 do_mount fs/namespace.c:3812 [inline] __do_sys_mount fs/namespace.c:4020 [inline] __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] To support mounting different subvolume with different RO/RW flags for the new mount APIs, btrfs introduced two workaround to support this feature: - Skip mount option/feature checks if we are mounting a different subvolume - Reconfigure the fs to RW if the initial mount is RO Combining these two, we can have the following sequence: - Mount the fs ro,rescue=all,clear_cache,space_cache=v1 rescue=all will mark the fs as hard read-only, so no v2 cache clearing will happen. - Mount a subvolume rw of the same fs. We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY because our new fc is RW, different from the original fs. Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag first so that we can grab the existing fs_info. Then we reconfigure the fs to RW. - During reconfiguration, option/features check is skipped This means we will restart the v2 cache clearing, and convert back to v1 cache. This will trigger fs writes, and since the original fs has "rescue=all" option, it skips the csum tree read. And eventually causing NULL pointer dereference in super block writeback. [FIX] For reconfiguration caused by different subvolume RO/RW flags, ensure we always run btrfs_check_options() to ensure we have proper hard RO requirements met. In fact the function btrfs_check_options() doesn't really do many complex checks, but hard RO requirement and some feature dependency checks, thus there is no special reason not to do the check for mount reconfiguration. Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/ Fixes: `f044b31867` ("btrfs: handle the ro->rw transition for mounting different subvolumes") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:10:51 +02:00
Boris Burkov	7a2339058e	btrfs: fix read corruption due to race with extent map merging In debugging some corrupt squashfs files, we observed symptoms of corrupt page cache pages but correct on-disk contents. Further investigation revealed that the exact symptom was a correct page followed by an incorrect, duplicate, page. This got us thinking about extent maps. commit `ac05ca913e` ("Btrfs: fix race between using extent maps and merging them") enforces a reference count on the primary `em` extent_map being merged, as that one gets modified. However, since, commit `3d2ac99224` ("btrfs: introduce new members for extent_map") both 'em' and 'merge' get modified, which started modifying 'merge' and thus introduced the same race. We were able to reproduce this by looping the affected squashfs workload in parallel on a bunch of separate btrfs-es while also dropping caches. We are still working on a simple enough reproducer to make into an fstest. The simplest fix is to stop modifying 'merge', which is not essential, as it is dropped immediately after the merge. This behavior is simply a consequence of the order of the two extent maps being important in computing the new values. Modify merge_ondisk_extents to take prev and next by const* and also take a third merged parameter that it puts the results in. Note that this introduces the rather odd behavior of passing 'em' to merge_ondisk_extents as a const * and as a regular ptr. Fixes: `3d2ac99224` ("btrfs: introduce new members for extent_map") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:10:13 +02:00
Qu Wenruo	f10f59f91a	btrfs: fix the delalloc range locking if sector size < page size Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - Set the writer locks without checking if the folio is still valid We call btrfs_folio_start_writer_lock() just like it's folio_lock(). But since the folio may not even be the folio of the current mapping, we can easily screw up the folio->private. - The range is not clamped inside the page This means we can over write other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to page end If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Lock and check the folio first, then call btrfs_folio_set_writer_lock() So that if we got a folio not belonging to the inode, we won't touch folio->private. - Properly truncate the range inside the page - Update @processed_end to the locked range end Fixes: `1e1de38792` ("btrfs: make process_one_page() to handle subpage locking") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:09:44 +02:00
Qu Wenruo	5f9062a48d	btrfs: qgroup: set a more sane default value for subtree drop threshold Since commit `011b46c304` ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scan at the cost of marking qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window to that that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drop at mount time won't cause huge qgroup workload. CC: stable@vger.kernel.org # 6.1 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:09:11 +02:00
Filipe Manana	3510e684b8	btrfs: clear force-compress on remount when compress mount option is given After the migration to use fs context for processing mount options we had a slight change in the semantics for remounting a filesystem that was mounted with compress-force. Before we could clear compress-force by passing only "-o compress[=algo]" during a remount, but after that change that does not work anymore, force-compress is still present and one needs to pass "-o compress-force=no,compress[=algo]" to the mount command. Example, when running on a kernel 6.8+: $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) On a 6.7 kernel (or older): $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) So update btrfs_parse_param() to clear "compress-force" when "compress" is given, providing the same semantics as kernel 6.7 and older. Reported-by: Roman Mamedov <rm@romanrm.net> Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/ CC: stable@vger.kernel.org # 6.8+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:07:53 +02:00
Naohiro Aota	bf9821ba47	btrfs: zoned: fix zone unusable accounting for freed reserved extent When btrfs reserves an extent and does not use it (e.g, by an error), it calls btrfs_free_reserved_extent() to free the reserved extent. In the process, it calls btrfs_add_free_space() and then it accounts the region bytes as block_group->zone_unusable. However, it leaves the space_info->bytes_zone_unusable side not updated. As a result, ENOSPC can happen while a space_info reservation succeeded. The reservation is fine because the freed region is not added in space_info->bytes_zone_unusable, leaving that space as "free". OTOH, corresponding block group counts it as zone_unusable and its allocation pointer is not rewound, we cannot allocate an extent from that block group. That will also negate space_info's async/sync reclaim process, and cause an ENOSPC error from the extent allocation process. Fix that by returning the space to space_info->bytes_zone_unusable. Ideally, since a bio is not submitted for this reserved region, we should return the space to free space and rewind the allocation pointer. But, it needs rework on extent allocation handling, so let it work in this way for now. Fixes: `169e0da91a` ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-17 16:16:46 +02:00
Linus Torvalds	667b1d41b2	for-6.12-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcPxtAACgkQxWXV+ddt WDu9lA//WfB88fwEKnqBYDRo6aiSMIAzLDuXkJ9i8d7rcjZO1OIZkEnMOsxhvTcZ KxgjNjkgzoTyUwoAUlG+ZpvMeSNMhBdr2NFkXmYzN9oanFE4zplpZiWx6tGSApRU 0ilngjXBsr8p03HmB88Yb05DVYQ2elMP6Jx3VETDBa0CNyp4//tGKzusNhZdA7KM XLZmkKRk3ZKabNo+p2J5t8UGJCl2L18U0o/EphfSkODKadUnsBbAPZUt2EGQCZwv uZhDFAUkgTFBkeRO7JwTfDrNi51M4zwmh+kEduzg4Ny4TdFb1UapU7K1N330WMru 4Qa953Met9I4NB/kKI+fZP1lN4NGuD2qEU6yoZVSy4UiqRp1gEg8kOUfVGFbNJa1 VFYcwdrBad0I4PjnQc5bpZVjzqJT5wWiZxjlWrB7VyIfdmnvQxe5h4DBwBhN5FJr +MEtuY2QNFygjDAZ5z0Ss8hegqI+FYi562Cjy9QRLhb3qGD8STF2BIChaILIn3oA UVJUlUP6CUmCu1RZRMFB4/WkeHO46FmZxJErGfFeXqJInThf0/rdSZOQgIP0JsUq N8FINEgXFAMCkK1PT7MNvAYkSP0tR7B0JjGKcSlGS3v3F0URCNGvHSiqbLedAtXT lc1MdXTZxub8h6xhIvgY1j7HRAFGrunn7LD6MIKRWX1SZPWwAGI= =DEUA -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fix: dirty extents tracked in xarray for qgroups must be adjusted for 32bit platforms - fix potentially freeing uninitialized name in fscrypt structure - fix warning about unneeded variable in a send callback * tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix uninitialized pointer free on read_alloc_one_name() error btrfs: send: cleanup unneeded return variable in changed_verity() btrfs: fix uninitialized pointer free in add_inode_ref() btrfs: use sector numbers as keys for the dirty extents xarray	2024-10-16 09:30:20 -07:00
Roi Martin	2ab5e243c2	btrfs: fix uninitialized pointer free on read_alloc_one_name() error The function read_alloc_one_name() does not initialize the name field of the passed fscrypt_str struct if kmalloc fails to allocate the corresponding buffer. Thus, it is not guaranteed that fscrypt_str.name is initialized when freeing it. This is a follow-up to the linked patch that fixes the remaining instances of the bug introduced by commit `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs"). Link: https://lore.kernel.org/linux-btrfs/20241009080833.1355894-1-jroi.martin@gmail.com/ Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Roi Martin <jroi.martin@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 19:55:04 +02:00
Christian Heusel	a0af4936e4	btrfs: send: cleanup unneeded return variable in changed_verity() As all changed_* functions need to return something, just return 0 directly here, as the verity status is passed via the context. Reported by LKP: fs/btrfs/send.c:6877:5-8: Unneeded variable: "ret". Return "0" on line 6883 Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202410092305.WbyqspH8-lkp@intel.com/ Signed-off-by: Christian Heusel <christian@heusel.eu> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 19:54:58 +02:00
Roi Martin	66691c6e2f	btrfs: fix uninitialized pointer free in add_inode_ref() The add_inode_ref() function does not initialize the "name" struct when it is declared. If any of the following calls to "read_one_inode() returns NULL, dir = read_one_inode(root, parent_objectid); if (!dir) { ret = -ENOENT; goto out; } inode = read_one_inode(root, inode_objectid); if (!inode) { ret = -EIO; goto out; } then "name.name" would be freed on "out" before being initialized. out: ... kfree(name.name); This issue was reported by Coverity with CID 1526744. Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Roi Martin <jroi.martin@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 19:54:52 +02:00
Filipe Manana	97420be7bd	btrfs: use sector numbers as keys for the dirty extents xarray We are using the logical address ("bytenr") of an extent as the key for qgroup records in the dirty extents xarray. This is a problem because the xarrays use "unsigned long" for keys/indices, meaning that on a 32 bits platform any extent starting at or beyond 4G is truncated, which is a too low limitation as virtually everyone is using storage with more than 4G of space. This means a "bytenr" of 4G gets truncated to 0, and so does 8G and 16G for example, resulting in incorrect qgroup accounting. Fix this by using sector numbers as keys instead, that is, using keys that match the logical address right shifted by fs_info->sectorsize_bits, which is what we do for the fs_info->buffer_radix that tracks extent buffers (radix trees also use an "unsigned long" type for keys). This also makes the index space more dense which helps optimize the xarray (as mentioned at Documentation/core-api/xarray.rst). Fixes: `3cce39a8ca` ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 18:33:35 +02:00
Linus Torvalds	eb952c47d1	for-6.12-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcH4qAACgkQxWXV+ddt WDtDig//czntfO+iRvDERZWTIB6vdVExfLd3r3ZNYlO1pIvgCuvqx3iYva+0ZhGW 8A+gcRax7cz0jCaxDp/+5lIRfdNxZH6/LwjZsDgU8Ly7himeRmwhtn2fCgNeiH/K bUl92+ZMo2vwqTKXYa3xF1g3Hz6cRXVW7gJrMwNhb1hpPTGx+lgYJU02m/Io/vjK 1jcrZ84OEPIOY5uiAoDyO2hgsT/zVEeuuOiSTpKSzrghPbo0vmjLiYJ5T+CE5Uw3 u3w7/Fqnw49NwucqtncvyFFDXY9EWNuQhowi3hqJgOYTInqwwJigIpQV0hDDwYxb ohGUGjazGfAEf/cy1jZXMbwCVgg8/Nj9x0eDKKhfs19VYUbMkEYQ8LKRTUlCeBwS H/2AmqpqHEEO+tPY3P+w6MVwkNho8JNpWPdP5OzJs7XrD067IViOjD06HPM/k5ci TU3zp9NYvgHVtmfZK1Aqsg9OYVhI1klVXejmlAzOLxejRPWXK/1hBw3kXbC6I+k1 50l0Yh1dgEnclMI3yWsKoj8IYUAkh2eudt0pNsot4a5vICMY++NVS2eukdz5UcEz ix7hcpYcCcmzoOaelyEgmdAncWVGJT5w2Nzy85YaOp+Z1C65Ywb41utU+sSY+swB kZfwl9vrsfu754vX7UKBherCvvYo+Lnj3GeX8Oe+1LoT2BP0TPk= =lTqc -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - update fstrim loop and add more cancellation points, fix reported delayed or blocked suspend if there's a huge chunk queued - fix error handling in recent qgroup xarray conversion - in zoned mode, fix warning printing device path without RCU protection - again fix invalid extent xarray state (`6252690f7e`), lost due to refactoring * tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix clear_dirty and writeback ordering in submit_one_sector() btrfs: zoned: fix missing RCU locking in error message when loading zone info btrfs: fix missing error handling when adding delayed ref with qgroups enabled btrfs: add cancellation points to trim loops btrfs: split remaining space to discard in chunks	2024-10-10 10:02:59 -07:00
Christian Brauner	b40508ca5d	Merge patch series "timekeeping/fs: multigrain timestamp redux" Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:57 +02:00
Jeff Layton	e2e801d6e6	btrfs: convert to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Beyond enabling the FS_MGTIME flag, this patch eliminates update_time_for_write, which goes to great pains to avoid in-memory stores. Just have it overwrite the timestamps unconditionally. Note that this also drops the IS_I_VERSION check and unconditionally bumps the change attribute, since SB_I_VERSION is always set on btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-11-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:53 +02:00
Naohiro Aota	e761be2a07	btrfs: fix clear_dirty and writeback ordering in submit_one_sector() This commit is a replay of commit `6252690f7e` ("btrfs: fix invalid mapping of extent xarray state"). We need to call btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that xarray DIRTY tag is cleared. With a refactoring commit `8189197425` ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission"), it screwed up and the order is reversed and causing the same hang. Fix the ordering now in submit_one_sector(). Fixes: `8189197425` ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-09 13:23:51 +02:00
Filipe Manana	fe4cd7ed12	btrfs: zoned: fix missing RCU locking in error message when loading zone info At btrfs_load_zone_info() we have an error path that is dereferencing the name of a device which is a RCU string but we are not holding a RCU read lock, which is incorrect. Fix this by using btrfs_err_in_rcu() instead of btrfs_err(). The problem is there since commit `08e11a3db0` ("btrfs: zoned: load zone's allocation offset"), back then at btrfs_load_block_group_zone_info() but then later on that code was factored out into the helper btrfs_load_zone_info() by commit `09a46725cc` ("btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info"). Fixes: `08e11a3db0` ("btrfs: zoned: load zone's allocation offset") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-09 13:23:51 +02:00
Filipe Manana	6ef8fbce01	btrfs: fix missing error handling when adding delayed ref with qgroups enabled When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(), if we fail to insert the qgroup record we don't error out, we ignore it. In fact we treat it as if there was no error and there was already an existing record - we don't distinguish between the cases where btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already existed and we can free the given record, and the case where it returns a negative error value, meaning the insertion into the xarray that is used to track records failed. Effectively we end up ignoring that we are lacking qgroup record in the dirty extents xarray, resulting in incorrect qgroup accounting. Fix this by checking for errors and return them to the callers. Fixes: `3cce39a8ca` ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-07 23:22:30 +02:00
Luca Stefani	69313850dc	btrfs: add cancellation points to trim loops There are reports that system cannot suspend due to running trim because the task responsible for trimming the device isn't able to finish in time, especially since we have a free extent discarding phase, which can trim a lot of unallocated space. There are no limits on the trim size (unlike the block group part). Since trime isn't a critical call it can be interrupted at any time, in such cases we stop the trim, report the amount of discarded bytes and return an error. Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-07 23:21:56 +02:00
Luca Stefani	a99fcb0158	btrfs: split remaining space to discard in chunks Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device, mostly empty although we will do the split according to our super block locations, the last super block ends at 256G, we can submit a huge discard for the range [256G, 8T), causing a large delay. Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in preparation of introduction of cancellation points to trim. The value of the chunk size is arbitrary, it can be higher or derived from actual device capabilities but we can't easily read that using bio_discard_limit(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-07 23:21:53 +02:00
Linus Torvalds	79eb2c07af	for-6.12-rc1-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmb/9agACgkQxWXV+ddt WDtKFA/5AW88osk/1k/NVhOvOa0xPr5XyDLq1n8Gaxfy8uHlHAc8wdsvJzCDMS0M qUOD/tOPhRI0HGXPiKD767erwbyXiZAcCTkSd8x5jlXy1hVjUHQSKO//JxD0vtAZ jOscoUA1wJJutopCXcppnoUUFE2753edEg0w2EtUXMpfqivqOmMCR+1ZtKkfaNJo oRuZCq3Oi8hu7Wsvmh4Etq/9MvGM+xovXAMAji6Op8nsP1jJlzWztpEUogLOQH2S IhDFFxP9shBV9JjV+HSyXcAYr8VArH6HtYjapR9oajCH0pvSjLYQQPq/qQ0//8Hb SHr4YP6RBbpEIvvaQeA7vwsckBHNBOrSAbEgRQ2+zdmiwha6SRIEVyXYh5LUwcYT WnVozbk7ZX9rD1jOVhvgouFG6vUz8A6/qt3BD028bVcyMvBXW4gsEduCMVlFGmlN D6+hNY6J08j4HUEGnPk7fAYi/lk5OEK1p5yarUgsOQ3GqWWS0ywkrfbmbWwyC+Ff AxggFTl9YodU5RMs7EU2GeHkLU6LnXgevk6FFm0JzsLtT/BbEP7pj6tOot4Msl1e 2ovqFiSbuPNg5Wr70ZBRO9LDIAYtTZy1UrVR/YCSLzm+wZsXMelDIHMQfE7+Yodp O5Cud23AanRTwjErvNl3X4rhut8rrI1FsR89gDyKK1EPG+mwofo= =yslr -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in incremental send, fix invalid clone operation for file that got its size decreased - fix __counted_by() annotation of send path cache entries, we do not store the terminating NUL - fix a longstanding bug in relocation (and quite hard to hit by chance), drop back reference cache that can get out of sync after transaction commit - wait for fixup worker kthread before finishing umount - add missing raid-stripe-tree extent for NOCOW files, zoned mode cannot have NOCOW files but RST is meant to be a standalone feature - handle transaction start error during relocation, avoid potential NULL pointer dereference of relocation control structure (reported by syzbot) - disable module-wide rate limiting of debug level messages - minor fix to tracepoint definition (reported by checkpatch.pl) * tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: disable rate limiting when debug enabled btrfs: wait for fixup workers before stopping cleaner kthread during umount btrfs: fix a NULL pointer dereference when failed to start a new trasacntion btrfs: send: fix invalid clone operation for file that got its size decreased btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class btrfs: drop the backref cache during relocation if we commit btrfs: also add stripe entries for NOCOW writes btrfs: send: fix buffer overflow detection when copying path to cache entry	2024-10-04 10:05:13 -07:00
Matthew Wilcox (Oracle)	a6752a6e7f	btrfs: Switch from using the private_2 flag to owner_2 We are close to removing the private_2 flag, so switch btrfs to using owner_2 for its ordered flag. This is mostly used by buffer head filesystems, so btrfs can use it because it doesn't use buffer heads. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241002040111.1023018-5-willy@infradead.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-04 09:24:25 +02:00
Al Viro	5f60d5f6bb	move asm/unaligned.h to linux/unaligned.h asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h	2024-10-02 17:23:23 -04:00
Leo Martins	d6e7ac65d4	btrfs: disable rate limiting when debug enabled Disable ratelimiting for btrfs_printk when CONFIG_BTRFS_DEBUG is enabled. This allows for more verbose output which is often needed by functions like btrfs_dump_space_info(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:29:41 +02:00
Filipe Manana	41fd1e9406	btrfs: wait for fixup workers before stopping cleaner kthread during umount During unmount, at close_ctree(), we have the following steps in this order: 1) Park the cleaner kthread - this doesn't destroy the kthread, it basically halts its execution (wake ups against it work but do nothing); 2) We stop the cleaner kthread - this results in freeing the respective struct task_struct; 3) We call btrfs_stop_all_workers() which waits for any jobs running in all the work queues and then free the work queues. Syzbot reported a case where a fixup worker resulted in a crash when doing a delayed iput on its inode while attempting to wake up the cleaner at btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread was already freed. This can happen during unmount because we don't wait for any fixup workers still running before we call kthread_stop() against the cleaner kthread, which stops and free all its resources. Fix this by waiting for any fixup workers at close_ctree() before we call kthread_stop() against the cleaner and run pending delayed iputs. The stack traces reported by syzbot were the following: BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065 Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52 CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.12.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Workqueue: btrfs-fixup btrfs_work_helper Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162 class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline] try_to_wake_up+0xb0/0x1480 kernel/sched/core.c:4154 btrfs_writepage_fixup_worker+0xc16/0xdf0 fs/btrfs/inode.c:2842 btrfs_work_helper+0x390/0xc50 fs/btrfs/async-thread.c:314 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Allocated by task 2: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 unpoison_slab_object mm/kasan/common.c:319 [inline] __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345 kasan_slab_alloc include/linux/kasan.h:247 [inline] slab_post_alloc_hook mm/slub.c:4086 [inline] slab_alloc_node mm/slub.c:4135 [inline] kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4187 alloc_task_struct_node kernel/fork.c:180 [inline] dup_task_struct+0x57/0x8c0 kernel/fork.c:1107 copy_process+0x5d1/0x3d50 kernel/fork.c:2206 kernel_clone+0x223/0x880 kernel/fork.c:2787 kernel_thread+0x1bc/0x240 kernel/fork.c:2849 create_kthread kernel/kthread.c:412 [inline] kthreadd+0x60d/0x810 kernel/kthread.c:765 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Freed by task 61: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:2343 [inline] slab_free mm/slub.c:4580 [inline] kmem_cache_free+0x1a2/0x420 mm/slub.c:4682 put_task_struct include/linux/sched/task.h:144 [inline] delayed_put_task_struct+0x125/0x300 kernel/exit.c:228 rcu_do_batch kernel/rcu/tree.c:2567 [inline] rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823 handle_softirqs+0x2c5/0x980 kernel/softirq.c:554 __do_softirq kernel/softirq.c:588 [inline] invoke_softirq kernel/softirq.c:428 [inline] __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637 irq_exit_rcu+0x9/0x30 kernel/softirq.c:649 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1037 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702 Last potentially related work creation: kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47 __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541 __call_rcu_common kernel/rcu/tree.c:3086 [inline] call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190 context_switch kernel/sched/core.c:5318 [inline] __schedule+0x184b/0x4ae0 kernel/sched/core.c:6675 schedule_idle+0x56/0x90 kernel/sched/core.c:6793 do_idle+0x56a/0x5d0 kernel/sched/idle.c:354 cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424 start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314 common_startup_64+0x13e/0x147 The buggy address belongs to the object at ffff8880272a8000 which belongs to the cache task_struct of size 7424 The buggy address is located 2584 bytes inside of freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x272a8 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0xfff00000000040(head\|node=0\|zone=1\|lastcpupid=0x7ff) page_type: f5(slab) raw: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000 raw: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000 head: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000 head: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000 head: 00fff00000000003 ffffea00009caa01 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO\|__GFP_FS\|__GFP_NOWARN\|__GFP_NORETRY\|__GFP_COMP\|__GFP_NOMEMALLOC), pid 2, tgid 2 (kthreadd), ts 71247381401, free_ts 71214998153 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537 prep_new_page mm/page_alloc.c:1545 [inline] get_page_from_freelist+0x3039/0x3180 mm/page_alloc.c:3457 __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4733 alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 alloc_slab_page+0x6a/0x120 mm/slub.c:2413 allocate_slab+0x5a/0x2f0 mm/slub.c:2579 new_slab mm/slub.c:2632 [inline] ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3819 __slab_alloc+0x58/0xa0 mm/slub.c:3909 __slab_alloc_node mm/slub.c:3962 [inline] slab_alloc_node mm/slub.c:4123 [inline] kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4187 alloc_task_struct_node kernel/fork.c:180 [inline] dup_task_struct+0x57/0x8c0 kernel/fork.c:1107 copy_process+0x5d1/0x3d50 kernel/fork.c:2206 kernel_clone+0x223/0x880 kernel/fork.c:2787 kernel_thread+0x1bc/0x240 kernel/fork.c:2849 create_kthread kernel/kthread.c:412 [inline] kthreadd+0x60d/0x810 kernel/kthread.c:765 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 page last free pid 5230 tgid 5230 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1108 [inline] free_unref_page+0xcd0/0xf00 mm/page_alloc.c:2638 discard_slab mm/slub.c:2678 [inline] __put_partials+0xeb/0x130 mm/slub.c:3146 put_cpu_partial+0x17c/0x250 mm/slub.c:3221 __slab_free+0x2ea/0x3d0 mm/slub.c:4450 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329 kasan_slab_alloc include/linux/kasan.h:247 [inline] slab_post_alloc_hook mm/slub.c:4086 [inline] slab_alloc_node mm/slub.c:4135 [inline] kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4142 getname_flags+0xb7/0x540 fs/namei.c:139 do_sys_openat2+0xd2/0x1d0 fs/open.c:1409 do_sys_open fs/open.c:1430 [inline] __do_sys_openat fs/open.c:1446 [inline] __se_sys_openat fs/open.c:1441 [inline] __x64_sys_openat+0x247/0x2a0 fs/open.c:1441 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Memory state around the buggy address: ffff8880272a8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880272a8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880272a8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880272a8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880272a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Reported-by: syzbot+8aaf2df2ef0164ffe1fb@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/66fb36b1.050a0220.aab67.003b.GAE@google.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:29:33 +02:00
Qu Wenruo	c3b47f49e8	btrfs: fix a NULL pointer dereference when failed to start a new trasacntion [BUG] Syzbot reported a NULL pointer dereference with the following crash: FAULT_INJECTION: forcing a failure. start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676 prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642 relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678 ... BTRFS info (device loop0): balance: ended with status: -12 Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667] RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926 Call Trace: <TASK> commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496 btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430 del_balance_item fs/btrfs/volumes.c:3678 [inline] reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742 btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] The allocation failure happens at the start_transaction() inside prepare_to_relocate(), and during the error handling we call unset_reloc_control(), which makes fs_info->balance_ctl to be NULL. Then we continue the error path cleanup in btrfs_balance() by calling reset_balance_state() which will call del_balance_item() to fully delete the balance item in the root tree. However during the small window between set_reloc_contrl() and unset_reloc_control(), we can have a subvolume tree update and created a reloc_root for that subvolume. Then we go into the final btrfs_commit_transaction() of del_balance_item(), and into btrfs_update_reloc_root() inside commit_fs_roots(). That function checks if fs_info->reloc_ctl is in the merge_reloc_tree stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer dereference. [FIX] Just add extra check on fs_info->reloc_ctl inside btrfs_update_reloc_root(), before checking fs_info->reloc_ctl->merge_reloc_tree. That DEAD_RELOC_TREE handling is to prevent further modification to the reloc tree during merge stage, but since there is no reloc_ctl at all, we do not need to bother that. Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:22:37 +02:00
Filipe Manana	fa630df665	btrfs: send: fix invalid clone operation for file that got its size decreased During an incremental send we may end up sending an invalid clone operation, for the last extent of a file which ends at an unaligned offset that matches the final i_size of the file in the send snapshot, in case the file had its initial size (the size in the parent snapshot) decreased in the send snapshot. In this case the destination will fail to apply the clone operation because its end offset is not sector size aligned and it ends before the current size of the file. Sending the truncate operation always happens when we finish processing an inode, after we process all its extents (and xattrs, names, etc). So fix this by ensuring the file has a valid size before we send a clone operation for an unaligned extent that ends at the final i_size of the file. The size we truncate to matches the start offset of the clone range but it could be any value between that start offset and the final size of the file since the clone operation will expand the i_size if the current size is smaller than the end offset. The start offset of the range was chosen because it's always sector size aligned and avoids a truncation into the middle of a page, which results in dirtying the page due to filling part of it with zeroes and then making the clone operation at the receiver trigger IO. The following test reproduces the issue: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create a file with a size of 256K + 5 bytes, having two extents, one # with a size of 128K and another one with a size of 128K + 5 bytes. last_ext_size=$((128 * 1024 + 5)) xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \ -c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \ $MNT/foo # Another file which we will later clone foo into, but initially with # a larger size than foo. xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar btrfs subvolume snapshot -r $MNT/ $MNT/snap1 # Now resize bar and clone foo into it. xfs_io -c "truncate 0" \ -c "reflink $MNT/foo" $MNT/bar btrfs subvolume snapshot -r $MNT/ $MNT/snap2 rm -f /tmp/send-full /tmp/send-inc btrfs send -f /tmp/send-full $MNT/snap1 btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2 umount $MNT mkfs.btrfs -f $DEV mount $DEV $MNT btrfs receive -f /tmp/send-full $MNT btrfs receive -f /tmp/send-inc $MNT umount $MNT Running it before this patch: $ ./test.sh (...) At subvol snap1 At snapshot snap2 ERROR: failed to clone extents to bar: Invalid argument A test case for fstests will be sent soon. Reported-by: Ben Millwood <thebenmachine@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/ Fixes: `46a6e10a1a` ("btrfs: send: allow cloning non-aligned extent if it ends at i_size") CC: stable@vger.kernel.org # 6.11 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:15:12 +02:00
Josef Bacik	db7e68b522	btrfs: drop the backref cache during relocation if we commit Since the inception of relocation we have maintained the backref cache across transaction commits, updating the backref cache with the new bytenr whenever we COWed blocks that were in the cache, and then updating their bytenr once we detected a transaction id change. This works as long as we're only ever modifying blocks, not changing the structure of the tree. However relocation does in fact change the structure of the tree. For example, if we are relocating a data extent, we will look up all the leaves that point to this data extent. We will then call do_relocation() on each of these leaves, which will COW down to the leaf and then update the file extent location. But, a key feature of do_relocation() is the pending list. This is all the pending nodes that we modified when we updated the file extent item. We will then process all of these blocks via finish_pending_nodes, which calls do_relocation() on all of the nodes that led up to that leaf. The purpose of this is to make sure we don't break sharing unless we absolutely have to. Consider the case that we have 3 snapshots that all point to this leaf through the same nodes, the initial COW would have created a whole new path. If we did this for all 3 snapshots we would end up with 3x the number of nodes we had originally. To avoid this we will cycle through each of the snapshots that point to each of these nodes and update their pointers to point at the new nodes. Once we update the pointer to the new node we will drop the node we removed the link for and all of its children via btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(), but for an arbitrary point in the snapshot. The problem with this is that we will never reflect this in the backref cache. If we do this btrfs_drop_snapshot() for a node that is in the backref tree, we will leave the node in the backref tree. This becomes a problem when we change the transid, as now the backref cache has entire subtrees that no longer exist, but exist as if they still are pointed to by the same roots. In the best case scenario you end up with "adding refs to an existing tree ref" errors from insert_inline_extent_backref(), where we attempt to link in nodes on roots that are no longer valid. Worst case you will double free some random block and re-use it when there's still references to the block. This is extremely subtle, and the consequences are quite bad. There isn't a way to make sure our backref cache is consistent between transid's. In order to fix this we need to simply evict the entire backref cache anytime we cross transid's. This reduces performance in that we have to rebuild this backref cache every time we change transid's, but fixes the bug. This has existed since relocation was added, and is a pretty critical bug. There's a lot more cleanup that can be done now that this functionality is going away, but this patch is as small as possible in order to fix the problem and make it easy for us to backport it to all the kernels it needs to be backported to. Followup series will dismantle more of this code and simplify relocation drastically to remove this functionality. We have a reproducer that reproduced the corruption within a few minutes of running. With this patch it survives several iterations/hours of running the reproducer. Fixes: `3fd0a5585e` ("Btrfs: Metadata ENOSPC handling for balance") CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:10:26 +02:00
Johannes Thumshirn	97f9782276	btrfs: also add stripe entries for NOCOW writes NOCOW writes do not generate stripe_extent entries in the RAID stripe tree, as the RAID stripe-tree feature initially was designed with a zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW writes. But the RAID stripe-tree feature is independent from the zoned feature, so we must also do NOCOW writes for RAID stripe-tree filesystems. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:09:04 +02:00
Filipe Manana	96c6ca7157	btrfs: send: fix buffer overflow detection when copying path to cache entry Starting with commit `c0247d289e` ("btrfs: send: annotate struct name_cache_entry with __counted_by()") we annotated the variable length array "name" from the name_cache_entry structure with __counted_by() to improve overflow detection. However that alone was not correct, because the length of that array does not match the "name_len" field - it matches that plus 1 to include the NUL string terminator, so that makes a fortified kernel think there's an overflow and report a splat like this: strcpy: detected buffer overflow: 20 byte write of buffer size 19 WARNING: CPU: 3 PID: 3310 at __fortify_report+0x45/0x50 CPU: 3 UID: 0 PID: 3310 Comm: btrfs Not tainted 6.11.0-prnet #1 Hardware name: CompuLab Ltd. sbc-ihsw/Intense-PC2 (IPC2), BIOS IPC2_3.330.7 X64 03/15/2018 RIP: 0010:__fortify_report+0x45/0x50 Code: 48 8b 34 (...) RSP: 0018:ffff97ebc0d6f650 EFLAGS: 00010246 RAX: 7749924ef60fa600 RBX: ffff8bf5446a521a RCX: 0000000000000027 RDX: 00000000ffffdfff RSI: ffff97ebc0d6f548 RDI: ffff8bf84e7a1cc8 RBP: ffff8bf548574080 R08: ffffffffa8c40e10 R09: 0000000000005ffd R10: 0000000000000004 R11: ffffffffa8c70e10 R12: ffff8bf551eef400 R13: 0000000000000000 R14: 0000000000000013 R15: 00000000000003a8 FS: 00007fae144de8c0(0000) GS:ffff8bf84e780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fae14691690 CR3: 00000001027a2003 CR4: 00000000001706f0 Call Trace: <TASK> ? __warn+0x12a/0x1d0 ? __fortify_report+0x45/0x50 ? report_bug+0x154/0x1c0 ? handle_bug+0x42/0x70 ? exc_invalid_op+0x1a/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? __fortify_report+0x45/0x50 __fortify_panic+0x9/0x10 __get_cur_name_and_parent+0x3bc/0x3c0 get_cur_path+0x207/0x3b0 send_extent_data+0x709/0x10d0 ? find_parent_nodes+0x22df/0x25d0 ? mas_nomem+0x13/0x90 ? mtree_insert_range+0xa5/0x110 ? btrfs_lru_cache_store+0x5f/0x1e0 ? iterate_extent_inodes+0x52d/0x5a0 process_extent+0xa96/0x11a0 ? __pfx_lookup_backref_cache+0x10/0x10 ? __pfx_store_backref_cache+0x10/0x10 ? __pfx_iterate_backrefs+0x10/0x10 ? __pfx_check_extent_item+0x10/0x10 changed_cb+0x6fa/0x930 ? tree_advance+0x362/0x390 ? memcmp_extent_buffer+0xd7/0x160 send_subvol+0xf0a/0x1520 btrfs_ioctl_send+0x106b/0x11d0 ? __pfx___clone_root_cmp_sort+0x10/0x10 _btrfs_ioctl_send+0x1ac/0x240 btrfs_ioctl+0x75b/0x850 __se_sys_ioctl+0xca/0x150 do_syscall_64+0x85/0x160 ? __count_memcg_events+0x69/0x100 ? handle_mm_fault+0x1327/0x15c0 ? __se_sys_rt_sigprocmask+0xf1/0x180 ? syscall_exit_to_user_mode+0x75/0xa0 ? do_syscall_64+0x91/0x160 ? do_user_addr_fault+0x21d/0x630 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fae145eeb4f Code: 00 48 89 (...) RSP: 002b:00007ffdf1cb09b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fae145eeb4f RDX: 00007ffdf1cb0ad0 RSI: 0000000040489426 RDI: 0000000000000004 RBP: 00000000000078fe R08: 00007fae144006c0 R09: 00007ffdf1cb0927 R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffdf1cb1ce8 R13: 0000000000000003 R14: 000055c499fab2e0 R15: 0000000000000004 </TASK> Fix this by not storing the NUL string terminator since we don't actually need it for name cache entries, this way "name_len" corresponds to the actual size of the "name" array. This requires marking the "name" array field with __nonstring and using memcpy() instead of strcpy() as recommended by the guidelines at: https://github.com/KSPP/linux/issues/90 Reported-by: David Arendt <admin@prnet.org> Link: https://lore.kernel.org/linux-btrfs/cee4591a-3088-49ba-99b8-d86b4242b8bd@prnet.org/ Fixes: `c0247d289e` ("btrfs: send: annotate struct name_cache_entry with __counted_by()") CC: stable@vger.kernel.org # 6.11 Tested-by: David Arendt <admin@prnet.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-01 19:06:30 +02:00
Linus Torvalds	a1fb2fcbb6	for-6.12-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmbxUdkACgkQxWXV+ddt WDtAVQ//SCg5XtExxtol1emzZ+AGQjwRnRfUPo/x32h9SmaynaHa/sLsG2EwePKs 1lrkW8gEx3NF1bfeCubhoVX2eAo/1rwGqtPEbweE7XaYtmSnxT8jXeH2fQcMwQMc PkYfnCMIOdJzwoVS8wS3kLmuDep+9DJrbeI9oN5tUgugkTTbW7g576uv/SXjp46D Dl4b1uvVOCowBbY2Bz1pg0fQpBzJcLzvynGElSi85uoQ520JuA8PP/3Pszg8BTxm 6MO99kF0MhVSBnKSvmlIgxmlnGhlW/AlZakxywRYYKsiSM/eCWHpyUV0p4mMcpWW QM8yeJcAhugTDIV3VdRpGx4NcJSo1PPaXxRrMr/vnnuOPF4VQ2gSw+S4p44YCsML VpyNJIjeXNO86A6feQybxwczMzdpkc5UzdfJ+l3CDSxcGiQGRU3WWPIHjte90e38 ZNjXknc96EwOmxsx8ojGlfi7Lh9yHklMGslxI64488PTa+2RRGITUSziAla29nrd E4U6bh+bLeh2a11u+OjvSqIjdDfoJZD40Abnqe6DVA9pboPaLvf8vAVZa1FOJxsI oVJgkdhEBGbn26KqlghlnbkYBdjuGxtBoyuCvUAI8ybOTVnp423d+JYXkZOnSq9A EdL3UGII4LWQ71p+QxF3tm5nuKfbulyibfoBNj57zk0hM2OVNdg= =wGXc -----END PGP SIGNATURE----- Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix dangling pointer to rb-tree of defragmented inodes after cleanup - a followup fix to handle concurrent lseek on the same fd that could leak memory under some conditions - fix wrong root id reported in tree checker when verifying dref * tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag btrfs: tree-checker: fix the wrong output of data backref objectid btrfs: fix race setting file private on concurrent lseek using same fd	2024-09-23 11:49:02 -07:00
Linus Torvalds	f8ffbc365f	struct fd layout change (and conversion to accessor helpers) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d 592+iDgLsema/H/0/CqfqlaNtDNY8Q0= =HUl5 -----END PGP SIGNATURE----- Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.	2024-09-23 09:35:36 -07:00
Filipe Manana	7f1b63f981	btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called during remount and unmount, we are freeing every node from the rbtree that tracks inodes for auto defrag using rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree itself. So once we unlock the lock that protects the rbtree, we have a tree pointing to a root that was freed (and a root pointing to freed nodes, and their children pointing to other freed nodes, and so on). This makes further access to the tree result in a use-after-free with unpredictable results. Fix this by initializing the rbtree to an empty root after the call to rbtree_postorder_for_each_entry_safe() and before unlocking. Fixes: `276940915f` ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()") Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-17 17:35:53 +02:00
Qu Wenruo	b0b595e61d	btrfs: tree-checker: fix the wrong output of data backref objectid [BUG] There are some reports about invalid data backref objectids, the report looks like this: BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543 The data ref objectid is the inode number inside the subvolume. But in above case, the value is completely sane, not really showing the problem. [CAUSE] The root cause of the problem is the deprecated feature, inode cache. This feature results a special inode number, -12ULL, and it's no longer recognized by tree-checker, triggering the error. The direct problem here is the output of data ref objectid. The value shown is in fact the dref_root (subvolume id), not the dref_objectid (inode number). [FIX] Fix the output to use dref_objectid instead. Reported-by: Neil Parton <njparton@gmail.com> Reported-by: Archange <archange@archlinux.org> Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/ Fixes: `f333a3c7e8` ("btrfs: tree-checker: validate dref root and objectid") CC: stable@vger.kernel.org # 6.11 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-17 17:34:17 +02:00
Filipe Manana	7ee85f5515	btrfs: fix race setting file private on concurrent lseek using same fd When doing concurrent lseek(2) system calls against the same file descriptor, using multiple threads belonging to the same process, we have a short time window where a race happens and can result in a memory leak. The race happens like this: 1) A program opens a file descriptor for a file and then spawns two threads (with the pthreads library for example), lets call them task A and task B; 2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at file.c:find_desired_extent() while holding a read lock on the inode; 3) At the start of find_desired_extent(), it extracts the file's private_data pointer into a local variable named 'private', which has a value of NULL; 4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode in shared mode and enters file.c:find_desired_extent(), where it also extracts file->private_data into its local variable 'private', which has a NULL value; 5) Because it saw a NULL file private, task A allocates a private structure and assigns to the file structure; 6) Task B also saw a NULL file private so it also allocates its own file private and then assigns it to the same file structure, since both tasks are using the same file descriptor. At this point we leak the private structure allocated by task A. Besides the memory leak, there's also the detail that both tasks end up using the same cached state record in the private structure (struct btrfs_file_private::llseek_cached_state), which can result in a use-after-free problem since one task can free it while the other is still using it (only one task took a reference count on it). Also, sharing the cached state is not a good idea since it could result in incorrect results in the future - right now it should not be a problem because it end ups being used only in extent-io-tree.c:count_range_bits() where we do range validation before using the cached state. Fix this by protecting the private assignment and check of a file while holding the inode's spinlock and keep track of the task that allocated the private, so that it's used only by that task in order to prevent user-after-free issues with the cached state record as well as potentially using it incorrectly in the future. Fixes: `3c32c7212f` ("btrfs: use cached state when looking for delalloc ranges with lseek") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-17 17:31:48 +02:00
Linus Torvalds	26bb0d3f38	for-6.12/block-20240913 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs 6UlPcJzgYg== =uuLl -----END PGP SIGNATURE----- Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD changes via Song: - md-bitmap refactoring (Yu Kuai) - raid5 performance optimization (Artur Paszkiewicz) - Other small fixes (Yu Kuai, Chen Ni) - Add a sysfs entry 'new_level' (Xiao Ni) - Improve information reported in /proc/mdstat (Mateusz Kusiak) - NVMe changes via Keith: - Asynchronous namespace scanning (Stuart) - TCP TLS updates (Hannes) - RDMA queue controller validation (Niklas) - Align field names to the spec (Anuj) - Metadata support validation (Puranjay) - A syntax cleanup (Shen) - Fix a Kconfig linking error (Arnd) - New queue-depth quirk (Keith) - Add missing unplug trace event (Keith) - blk-iocost fixes (Colin, Konstantin) - t10-pi modular removal and fixes (Alexey) - Fix for potential BLKSECDISCARD overflow (Alexey) - bio splitting cleanups and fixes (Christoph) - Deal with folios rather than rather than pages, speeding up how the block layer handles bigger IOs (Kundan) - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike) - Reduce zoned device overhead in ublk (Ming) - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir) - Fix regression in partition error pointer checking (Riyan) - Add support for write zeroes and rotational status in nbd (Wouter) - Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained for quite a while. - Various sets of fixes for BFQ (Yu Kuai) - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail, Yang) * tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits) nvme-pci: qdepth 1 quirk block: fix potential invalid pointer dereference in blk_add_partition blk_iocost: make read-only static array vrate_adj_pct const block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation nvme-tcp: fix link failure for TCP auth blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() ...	2024-09-16 13:33:06 +02:00
Qu Wenruo	bd610c0937	btrfs: only unlock the to-be-submitted ranges inside a folio [SUBPAGE COMPRESSION LIMITS] Currently inside writepage_delalloc(), if a delalloc range is going to be submitted asynchronously (inline or compression, the page dirty/writeback/unlock are all handled in at different time, not at the submission time), then we return 1 and extent_writepage() will skip the submission. This is fine if every sector matches page size, but if a sector is smaller than page size (aka, subpage case), then it can be very problematic, for example for the following 64K page: 0 16K 32K 48K 64K \|/\| \|///////\| \|/\| \| \| 4K 52K Where \|/\| is the dirty range we need to submit. In the above case, we need the following different handling for the 3 ranges: - [0, 4K) needs to be submitted for regular write A single sector cannot be compressed. - [16K, 32K) needs to be submitted for compressed write - [48K, 52K) needs to be submitted for regular write. Above, if we try to submit [16K, 32K) for compressed write, we will return 1 and immediately, and without submitting the remaining [48K, 52K) range. Furthermore, since extent_writepage() will exit without unlocking any sectors, the submitted range [0, 4K) will not have sector unlocked. That's the reason why for now subpage is only allowed for full page range. [ENHANCEMENT] - Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap This records which sectors will be submitted by extent_writepage_io(). This allows us to track which sectors needs to be submitted thus later to be properly unlocked. For asynchronously submitted range (inline/compression), the corresponding bits will be cleared from that bitmap. - Only return 1 if no sector needs to be submitted in writepage_delalloc() - Only submit sectors marked by submission bitmap inside extent_writepage_io() So we won't touch the asynchronously submitted part. - Introduce btrfs_folio_end_writer_lock_bitmap() helper This will only unlock the involved sectors specified by @bitmap parameter, to avoid touching the range asynchronously submitted. Please note that, since subpage compression is still limited to page aligned range, this change is only a preparation for future sector perfect compression support for subpage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Qu Wenruo	49a9907368	btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock() The function btrfs_folio_unlock_writer() is already calling btrfs_folio_end_writer_lock() to do the heavy lifting work, the only missing 0 writer check. Thus there is no need to keep two different functions, move the 0 writer check into btrfs_folio_end_writer_lock(), and remove btrfs_folio_unlock_writer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Leo Martins	68f32b9c98	btrfs: BTRFS_PATH_AUTO_FREE in orphan.c All cleanup paths lead to btrfs_path_free so path can be defined with the automatic freeing callback in the following functions: - btrfs_insert_orphan_item() - btrfs_del_orphan_item() Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Leo Martins	45763a0cbb	btrfs: use btrfs_path auto free in zoned.c All cleanup paths lead to btrfs_path_free so path can be defined with the automatic freeing callback in the following functions: - calculate_emulated_zone_size() - calculate_alloc_pointer() Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Leo Martins	4c74a32ad3	btrfs: DEFINE_FREE for struct btrfs_path Add a DEFINE_FREE for struct btrfs_path. This defines a function that can be called using the __free attribute. Define a macro BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path very clear. The intended use is to define the auto free of path in cases where the path is allocated somewhere at the beginning and freed either on all error paths or at the end of the function. int func() { BTRFS_PATH_AUTO_FREE(path); if (...) return -ERROR; path = alloc_path(); ... if (...) return -ERROR; ... return 0; } Signed-off-by: Leo Martins <loemra.dev@gmail.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Qu Wenruo	ab6eac7c91	btrfs: remove btrfs_folio_end_all_writers() The function btrfs_folio_end_all_writers() is only utilized in extent_writepage() as a way to unlock all subpage range (for both successful submission and error handling). Meanwhile we have a similar function, btrfs_folio_end_writer_lock(). The difference is, btrfs_folio_end_writer_lock() expects a range that is a subset of the already locked range. This limit on btrfs_folio_end_writer_lock() is a little overkilled, preventing it from being utilized for error paths. So here we enhance btrfs_folio_end_writer_lock() to accept a superset of the locked range, and only end the locked subset. This means we can replace btrfs_folio_end_all_writers() with btrfs_folio_end_writer_lock() instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
David Sterba	ca283ea992	btrfs: constify more pointer parameters Continue adding const to parameters. This is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
David Sterba	070969f17d	btrfs: rework BTRFS_I as macro to preserve parameter const Currently BTRFS_I is a static inline function that takes a const inode and returns btrfs inode, dropping the 'const' qualifier. This can break assumptions of compiler though it seems there's no real case. To make the parameter and return type consistent regardint const we can use the container_of_const() that preserves it. However this would not check the parameter type. To fix that use the same _Generic construct but implement only the two expected types. Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Filipe Manana	1b6e068a0c	btrfs: add and use helper to verify the calling task has locked the inode We have a few places that check if we have the inode locked by doing: ASSERT(inode_is_locked(vfs_inode)); This actually proved to be useful several times as if assertions are enabled (and by default they are in many distros) it immediately triggers a crash which is impossible for users to miss. However that doesn't check if the lock is held by the calling task, so the check passes if some other task locked the inode. Using one of the lockdep functions to check the lock is held, like lockdep_assert_held() for example, does check that the calling task holds the lock, and if that's not the case it produces a warning and stack trace in dmesg. However, despite the misleading "assert" in the name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just a warning and splat in dmesg, which is easy to get unnoticed by users who may have lockdep enabled. So add a helper that does the ASSERT() and calls lockdep_assert_held() immediately after and use it every where we check the inode is locked. Like this if the lock is held by some other task we get the warning in dmesg which is caught by fstests, very helpful during development, and may also be occassionaly noticed by users with lockdep enabled. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:22 +02:00
Luca Stefani	3368597206	btrfs: always update fstrim_range on failure in FITRIM ioctl Even in case of failure we could've discarded some data and userspace should be made aware of it, so copy fstrim_range to userspace regardless. Also make sure to update the trimmed bytes amount even if btrfs_trim_free_extents fails. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	faad57ae20	btrfs: convert copy_inline_to_page() to use folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover find_or_create_page() is compatible API, and it can replaced with __filemap_get_folio(). Some interfaces have been converted to use folio before, so the conversion operation from page can be eliminated here. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	aeb6d88148	btrfs: convert btrfs_decompress() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Based on the previous patch, the compression path can be directly used in folio without converting to page. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	b70f3a4546	btrfs: convert zstd_decompress() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	9f9a4e43a8	btrfs: convert lzo_decompress() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	54c78d497b	btrfs: convert zlib_decompress() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	046c0d6596	btrfs: convert try_release_extent_mapping() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And page_to_inode() can be replaced with folio_to_inode() now. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	dd0a8df455	btrfs: convert try_release_extent_state() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use folio_pos() instead of page_offset(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	08dd8507b1	btrfs: convert submit_eb_page() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	135873258c	btrfs: convert submit_eb_subpage() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use folio_pos() instead of page_offset(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	884937793d	btrfs: convert read_key_bytes() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Li Zetao	b8ae2bfa68	btrfs: convert try_release_extent_buffer() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Li Zetao	0145aa38cb	btrfs: convert try_release_subpage_extent_buffer() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And use folio_pos instead of page_offset, which is more consistent with folio usage. At the same time, folio_test_private() can handle folio directly without converting from page to folio first. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Li Zetao	d4aeb5f7a7	btrfs: convert get_next_extent_buffer() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Use folio_pos instead of page_offset, which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Li Zetao	266a9361a4	btrfs: convert clear_page_extent_mapped() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Now clear_page_extent_mapped() can deal with a folio directly, so change its name to clear_folio_extent_mapped(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Qu Wenruo	fd1e75d010	btrfs: make compression path to be subpage compatible Currently btrfs compression path is not really subpage compatible, every thing is still done in page unit. That's fine for regular sector size and subpage routine. As even for subpage routine compression is only enabled if the whole range is page aligned, so reading the page cache in page unit is totally fine. However in preparation for the future subpage perfect compression support, we need to change the compression routine to properly handle a subpage range. This patch would prepare both zlib and zstd to only read the subpage range for compression. Lzo is already doing subpage aware read, as lzo's on-disk format is already sectorsize dependent. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Qu Wenruo	9ca0e58cb7	btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io() There are only two differences between the two functions: - btrfs_orig_bbio_end_io() does extra error propagation This is mostly to allow tolerance for write errors. - btrfs_orig_bbio_end_io() does extra pending_ios check This check can handle both the original bio, or the cloned one. (All accounting happens in the original one). This makes btrfs_orig_bbio_end_io() a much safer call. In fact we already had a double freeing error due to usage of btrfs_bio_end_io() in the error path of btrfs_submit_chunk(). So just move the whole content of btrfs_orig_bbio_end_io() into btrfs_bio_end_io(). For normal paths this brings no change, because they are already calling btrfs_orig_bbio_end_io() in the first place. For error paths (not only inside bio.c but also external callers), this change will introduce extra checks, especially for external callers, as they will error out without submitting the btrfs bio. But considering it's already in the error path, such slower but much safer checks are still an overall win. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Josef Bacik	ac325fc2aa	btrfs: do not hold the extent lock for entire read Historically we've held the extent lock throughout the entire read. There's been a few reasons for this, but it's mostly just caused us problems. For example, this prevents us from allowing page faults during direct io reads, because we could deadlock. This has forced us to only allow 4k reads at a time for io_uring NOWAIT requests because we have no idea if we'll be forced to page fault and thus have to do a whole lot of work. On the buffered side we are protected by the page lock, as long as we're reading things like buffered writes, punch hole, and even direct IO to a certain degree will get hung up on the page lock while the page is in flight. On the direct side we have the dio extent lock, which acts much like the way the extent lock worked previously to this patch, however just for direct reads. This protects direct reads from concurrent direct writes, while we're protected from buffered writes via the inode lock. Now that we're protected in all cases, narrow the extent lock to the part where we're getting the extent map to submit the reads, no longer holding the extent lock for the entire read operation. Push the extent lock down into do_readpage() so that we're only grabbing it when looking up the extent map. This portion was contributed by Goldwyn. Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Josef Bacik	07d399cb4e	btrfs: take the dio extent lock during O_DIRECT operations Currently we hold the extent lock for the entire duration of a read. This isn't really necessary in the buffered case, we're protected by the page lock, however it's necessary for O_DIRECT. For O_DIRECT reads, if we only locked the extent for the part where we get the extent, we could potentially race with an O_DIRECT write in the same region. This isn't really a problem, unless the read is delayed so much that the write does the COW, unpins the old extent, and some other application re-allocates the extent before the read is actually able to be submitted. At that point at best we'd have a checksum mismatch, but at worse we could read data that doesn't belong to us. To address this potential race we need to make sure we don't have overlapping, concurrent direct io reads and writes. To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO case in the same spot as the current extent lock. The writes will take this while they're creating the ordered extent, which is also used to make sure concurrent buffered reads or concurrent direct reads are not allowed to occur, and drop it after the ordered extent is taken. For reads it will act as the current read behavior for the EXTENT_LOCKED bit, we set it when we're starting the read, we clear it in the end_io to allow other direct writes to continue. This still has the drawback of disallowing concurrent overlapping direct reads from occurring, but that exists with the current extent locking. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
Josef Bacik	7e2a595084	btrfs: introduce EXTENT_DIO_LOCKED In order to support dropping the extent lock during a read we need a way to make sure that direct reads and direct writes for overlapping ranges are protected from each other. To accomplish this introduce another lock bit specifically for direct io. Subsequent patches will utilize this to protect direct IO operations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
David Sterba	df2825e985	btrfs: always pass readahead state to defrag Defrag ioctl passes readahead from the file, but autodefrag does not have a file so the readahead state is allocated when needed. The autodefrag loop in cleaner thread iterates over inodes so we can simply provide an on-stack readahead state and will not need to allocate it in btrfs_defrag_file(). The size is 32 bytes which is acceptable. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:20 +02:00
David Sterba	11e3107d47	btrfs: drop transaction parameter from btrfs_add_inode_defrag() There's only one caller inode_should_defrag() that passes NULL to btrfs_add_inode_defrag() so we can drop it an simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	91c9f2855e	btrfs: return void from btrfs_add_inode_defrag() The potential memory allocation failure is not a fatal error, skipping autodefrag is fine and the caller inode_should_defrag() does not care about the errors. Further writes can attempt to add the inode back to the defragmentation list again. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	276940915f	btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes() btrfs_cleanup_defrag_inodes() is not called frequently, only in remount or unmount, but the way it frees the inodes in fs_info->defrag_inodes is inefficient. Each time it needs to locate first node, remove it, potentially rebalance tree until it's done. This allows to do a conditional reschedule. For cleanups the rbtree_postorder_for_each_entry_safe() iterator is convenient but we can't reschedule and restart iteration because some of the tree nodes would be already freed. The cleanup operation is kmem_cache_free() which will likely take the fast path for most objects so rescheduling should not be necessary. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	ffc531652d	btrfs: rename __btrfs_run_defrag_inode() and drop double underscores The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	4225756902	btrfs: rename __btrfs_add_inode_defrag() and drop double underscores The function does not follow the pattern where the underscores would be justified, so rename it. Also update the misleading comment, the passed item is not freed, that's what the caller does. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	6d2f07e13c	btrfs: rename __need_auto_defrag() and drop double underscores The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	b7164d9ab0	btrfs: constify arguments of compare_inode_defrag() A comparator function does not change its parameters, make them const. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	a92914a80b	btrfs: rename __compare_inode_defrag() and drop double underscores The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	06de42c5a9	btrfs: rename __extent_writepage() and drop double underscores The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	22b4ef50dc	btrfs: rename __btrfs_submit_bio() and drop double underscores Previous patch freed the function name btrfs_submit_bio() so we can use it for a helper that submits struct bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
David Sterba	792e86ef31	btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio() The function name is a bit misleading as it submits the btrfs_bio (bbio), rename it so we can use btrfs_submit_bio() when an actual bio is submitted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:19 +02:00
Qu Wenruo	ce4a71ee15	btrfs: subpage: remove btrfs_fs_info::subpage_info member The member btrfs_fs_info::subpage_info stores the cached bitmap start position inside the merged bitmap. However in reality there is only one thing depending on the sectorsize, bitmap_nr_bits, which records the number of sectors that fit inside a page. The sequence of sub-bitmaps have fixed order, thus it's just a quick multiplication to calculate the start position of each sub-bitmaps. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Qu Wenruo	2c70fe16ea	btrfs: remove the nr_ret parameter from __extent_writepage_io() The parameter @nr_ret is used to tell the caller how many sectors have been submitted for IO. Then callers check @nr_ret value to determine if we need to manually clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all sectors are beyond i_size) there is no folio_start_writeback() called thus PAGECACHE_TAG_DIRTY tag will not be cleared. Remove this parameter by: - Moving the btrfs_folio_clear_writeback() call into __extent_writepage_io() So that if we didn't submit any IO, then manually call btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when the page is no longer dirty. - Use a bool to record if we have submitted any sector Instead of an int. - Use subpage compatible helpers to end folio writeback. This brings no change to the behavior, just for the sake of consistency. As for the call site inside __extent_writepage(), we're always called for the whole page, so the existing full page helper folio_(start\|end)_writeback() is totally fine. For the call site inside extent_write_locked_range(), although we can have subpage range, folio_start_writeback() will only clear PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio will still be dirty if there is any subpage dirty range. Only when the last dirty subpage sector is cleared, the folio_start_writeback() will clear PAGECACHE_TAG_DIRTY. So no matter if we call the full page or subpage helper, the result is still the same, then just use the subpage helpers for consistency. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Thorsten Blum	e39ba5dfd0	btrfs: send: fix grammar in comments Fix a few obvious grammar mistakes: a -> an, then -> than. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Junchao Sun	3cce39a8ca	btrfs: qgroup: use xarray to track dirty extents in transaction Use xarray to track dirty extents to reduce the size of the struct btrfs_qgroup_extent_record from 64 bytes to 40 bytes. The xarray is more cache line friendly, it also reduces the complexity of insertion and search code compared to rb tree. Another change introduced is about error handling. Before this patch, the result of btrfs_qgroup_trace_extent_nolock() is always a success. In this patch, because of this function calls the function xa_store() which has the possibility to fail, so mark qgroup as inconsistent if error happened and then free preallocated memory. Also we preallocate memory before spin_lock(), if memory preallcation failed, error handling is the same the existing code. Suggested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Junchao Sun	14ed830d10	btrfs: qgroup: use goto style to handle errors in add_delayed_ref() Clean up resources using goto to get rid of repeated code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Qu Wenruo	8189197425	btrfs: refactor __extent_writepage_io() to do sector-by-sector submission Unlike the bitmap usage inside raid56, for __extent_writepage_io() we handle the subpage submission not sector-by-sector, but for each dirty range we found. This is not a big deal normally, as the subpage complex code is already mostly optimized out by the compiler for x86_64. However for the sake of consistency and for the future of subpage sector-perfect compression support, this patch does: - Extract the sector submission code into submit_one_sector() - Add the needed code to extract the dirty bitmap for subpage case There is a small pitfall for non-subpage case, as we cleared page dirty before starting writeback, so we have to manually set the default dirty_bitmap to 1 for such case. - Use bitmap_and() to calculate the target sectors we need to submit This is done for both subpage and non-subpage cases, and will later be expanded to skip inline/compression ranges. For x86_64, the dirty bitmap will be fixed to 1, with the length of 1, so we're still doing the same workload per sector. For larger page sizes, the overhead will be a little larger, as previous we only need to do one extent_map lookup per-dirty-range, but now it will be one extent_map lookup per-sector. But that is the same frequency as x86_64, so we're just aligning the behavior to x86_64. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Qu Wenruo	77b0b98bb7	btrfs: subpage: fix the bitmap dump which can cause bitmap corruption In commit `75258f20fb` ("btrfs: subpage: dump extra subpage bitmaps for debug") an internal macro GET_SUBPAGE_BITMAP() is introduced to grab the bitmap of each attribute. But that commit is using bitmap_cut() which will do the left shift of the larger bitmap, causing incorrect values. Thankfully this bitmap_cut() is only called for debug usage, and so far it's not yet causing problem. Fix it to use bitmap_read() to only grab the desired sub-bitmap. Fixes: `75258f20fb` ("btrfs: subpage: dump extra subpage bitmaps for debug") CC: stable@vger.kernel.org # 6.6+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Johannes Thumshirn	0ae653fbec	btrfs: reduce chunk_map lookups in btrfs_map_block() Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup to be able to calculate the number of copies. So split out the code getting the number of copies from btrfs_num_copies() into a helper called btrfs_chunk_map_num_copies() and directly call it from btrfs_map_block() and btrfs_num_copies(). This saves us one rbtree lookup per btrfs_map_block() invocation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Filipe Manana	6d752cacae	btrfs: directly wake up cleaner kthread in the BTRFS_IOC_SYNC ioctl The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it does any pending work (subvolume deletion, delayed iputs, etc), however it is waking up the transaction kthread, which in turn wakes up the cleaner. Since we don't have any transaction to commit, as any ongoing transaction was already committed when it called btrfs_sync_fs() and the goal is just to wake up the cleaner thread, directly wake up the cleaner instead of the transaction kthread. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Qu Wenruo	efffb803bf	btrfs: make btrfs_is_subpage() to return false directly for 4K page size Btrfs only supports sectorsize 4K, 8K, 16K, 32K, 64K for now, thus for systems with 4K page size, there is no way the fs is subpage (sectorsize < PAGE_SIZE). So here we define btrfs_is_subpage() different according to the PAGE_SIZE: - PAGE_SIZE > 4K We may hit real subpage cases, define btrfs_is_subpage() as a regular function and do the usual checks. - PAGE_SIZE == 4K (no smaller PAGE_SIZE support AFAIK) There is no way the fs is subpage, so just define btrfs_is_subpage() as an inline function which always return false. This saves about 7K bytes for x86_64 debug builds: text data bss dec hex filename Before: 1484452 168693 25776 1678921 199e49 fs/btrfs/btrfs.ko After: 1476605 168445 25776 1670826 197eaa fs/btrfs/btrfs.ko Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:18 +02:00
Johannes Thumshirn	0c749585fc	btrfs: change RST lookup error message level to debug Now that RAID stripe-tree lookup failures are not treated as a fatal issue any more, change the RAID stripe-tree lookup error message to debug level. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Johannes Thumshirn	04915240e2	btrfs: don't readahead the relocation inode on RST On relocation we're doing readahead on the relocation inode, but if the filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to preallocated extents not being mapped in the RST) from the lookup. But readahead doesn't handle the error and submits invalid reads to the device, causing an assertion in the scatter-gather list code: BTRFS info (device nvme1n1): balance: start -d -m -s BTRFS info (device nvme1n1): relocating block group 6480920576 flags data\|raid0 BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0 ------------[ cut here ]------------ kernel BUG at include/linux/scatterlist.h:115! Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ #567 RIP: 0010:__blk_rq_map_sg+0x339/0x4a0 RSP: 0018:ffffc90001a43820 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802 RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000 RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8 R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000 FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0 Call Trace: <TASK> ? __die_body.cold+0x14/0x25 ? die+0x2e/0x50 ? do_trap+0xca/0x110 ? do_error_trap+0x65/0x80 ? __blk_rq_map_sg+0x339/0x4a0 ? exc_invalid_op+0x50/0x70 ? __blk_rq_map_sg+0x339/0x4a0 ? asm_exc_invalid_op+0x1a/0x20 ? __blk_rq_map_sg+0x339/0x4a0 nvme_prep_rq.part.0+0x9d/0x770 nvme_queue_rq+0x7d/0x1e0 __blk_mq_issue_directly+0x2a/0x90 ? blk_mq_get_budget_and_tag+0x61/0x90 blk_mq_try_issue_list_directly+0x56/0xf0 blk_mq_flush_plug_list.part.0+0x52b/0x5d0 __blk_flush_plug+0xc6/0x110 blk_finish_plug+0x28/0x40 read_pages+0x160/0x1c0 page_cache_ra_unbounded+0x109/0x180 relocate_file_extent_cluster+0x611/0x6a0 ? btrfs_search_slot+0xba4/0xd20 ? balance_dirty_pages_ratelimited_flags+0x26/0xb00 relocate_data_extent.constprop.0+0x134/0x160 relocate_block_group+0x3f2/0x500 btrfs_relocate_block_group+0x250/0x430 btrfs_relocate_chunk+0x3f/0x130 btrfs_balance+0x71b/0xef0 ? kmalloc_trace_noprof+0x13b/0x280 btrfs_ioctl+0x2c2e/0x3030 ? kvfree_call_rcu+0x1e6/0x340 ? list_lru_add_obj+0x66/0x80 ? mntput_no_expire+0x3a/0x220 __x64_sys_ioctl+0x96/0xc0 do_syscall_64+0x54/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fcc04514f9b Code: Unable to access opcode bytes at 0x7fcc04514f71. RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001 R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5 R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:__blk_rq_map_sg+0x339/0x4a0 RSP: 0018:ffffc90001a43820 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802 RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000 RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8 R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000 FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0 Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception ]--- So in case of a relocation on a RAID stripe-tree based file system, skip the readahead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Johannes Thumshirn	f4d39cf1ce	btrfs: set search_commit_root on stripe io in case of relocation Set rst_search_commit_root in the btrfs_io_stripe we're passing to btrfs_map_block() in case we're doing data relocation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Johannes Thumshirn	d6106f0dc5	btrfs: rename btrfs_io_stripe::is_scrub to rst_search_commit_root Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While 'is_scrub' describes the state of the io_stripe (it is a stripe submitted by scrub) it does not describe the purpose, namely looking at the commit root when searching RAID stripe-tree entries. Renaming the stripe to rst_search_commit_root describes this purpose. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Johannes Thumshirn	f8428360c8	btrfs: don't dump stripe-tree on lookup error This just creates unnecessary noise and doesn't provide any insights into debugging RAID stripe-tree related issues. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Boris Burkov	f8e9f4a76d	btrfs: add comment about locking in cow_file_range_inline() Add a comment to document the complicated locked_page unlock logic in cow_file_range_inline. The specifically tricky part is that a caller just up the stack converts ret == 0 to ret == 1 and then another caller far up the callstack handles ret == 1 as a success, AND returns without cleanup in that case, both of which "feel" unnatural and led to the original bug. Try to document that somewhat specific callstack logic here to explain the weird un-setting of locked_folio on success. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Filipe Manana	68a505bb87	btrfs: more efficient chunk map iteration when device replace finishes When iterating the chunk maps when a device replace finishes we are doing a full rbtree search for each chunk map, which is not the most efficient thing to do, wasting CPU time. As we are holding a write lock on the tree during the whole iteration, we can simply start from the first node in the tree and then move to the next chunk map by doing a rb_next() call - the only exception is when we need to reschedule, in which case we have to do a full rbtree search since we dropped the write lock and the tree may have changed (chunk maps may have been removed and the tree got rebalanced). So just do that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Filipe Manana	b79f1c2caa	btrfs: reschedule when updating chunk maps at the end of a device replace At the end of a device replace we must go over all the chunk maps and update their stripes to point to the target device instead of the source device. We iterate over the chunk maps while holding a write lock and we never reschedule, which can result in monopolizing a CPU for too long and blocking readers for too long (it's a rw lock, non-blocking). So improve on this by rescheduling if necessary. This is safe because at this point we are holding the chunk mutex, which means no new chunks can be allocated and therefore we don't risk missing a new chunk map that covers a range behind the last one we processed before rescheduling. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Josef Bacik	5fe1912449	btrfs: convert extent_range_clear_dirty_for_io() to use a folio Instead of getting a page and using that to clear dirty for io, use the folio helper and use the appropriate folio functions. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Josef Bacik	c86d3aac81	btrfs: convert insert_inline_extent() to use a folio We only use a page to copy in the data for the inline extent. Use a folio for this instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Josef Bacik	1bbf3a3aea	btrfs: convert btrfs_set_range_writeback() to use a folio We already use a lot of functions here that use folios, update the function to use __filemap_get_folio instead of find_get_page and then use the folio directly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:17 +02:00
Josef Bacik	dfc9e3017a	btrfs: convert wait_subpage_spinlock() to only use a folio Currently this already uses a folio for most things, update it to take a folio and update all the page usage with the corresponding folio usage. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	1a48259d9b	btrfs: convert find_next_dirty_byte() to take a folio We already use a folio some in this function, replace all page usage with the folio and update the function to take the folio as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	7ed07d1662	btrfs: convert __get_extent_map() to take a folio Now that btrfs_get_extent takes a folio, update __get_extent_map to take a folio as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	dce9ef9412	btrfs: convert btrfs_get_extent() to take a folio We only pass this into read_inline_extent, change it to take a folio and update the callers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	220e77c412	btrfs: convert read_inline_extent() to use a folio Instead of using a page, use a folio instead, take a folio as an argument, and update the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	752965824b	btrfs: convert uncompress_inline() to take a folio Update uncompress_inline to take a folio and update it's usage accordingly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	1b5125bbd4	btrfs: convert struct btrfs_writepage_fixup to use a folio Now the fixup creator and consumer use folios, change this to use a folio as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	d71b53c3cb	btrfs: convert btrfs_writepage_cow_fixup() to use folio Instead of a page, use a folio for btrfs_writepage_cow_fixup. We already have a folio at the only caller, and the fixup worker uses folios. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	7d003cc2b3	btrfs: convert btrfs_writepage_fixup_worker() to use a folio This function heavily messes with pages, instead update it to use a folio. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	0d11706810	btrfs: convert submit_uncompressed_range() to take a folio This mostly uses folios already, update it to take a folio and update the rest of the function to use the folio instead of the page. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:16 +02:00
Josef Bacik	3ed984b5d0	btrfs: convert struct async_chunk to hold a folio Instead of passing in the page for ->locked_page, make it hold a locked_folio and then update the users of async_chunk to act accordingly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:15 +02:00
Josef Bacik	2609c9289f	btrfs: convert btrfs_run_delalloc_range() to take a folio Now that every function that btrfs_run_delalloc_range calls takes a folio, update it to take a folio and update the callers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:15 +02:00

1 2 3 4 5 ...

13594 Commits