linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-13 06:32:50 +00:00

Author	SHA1	Message	Date
Nikolay Borisov	b7d2083a36	btrfs: raid56: don't opencode swap() in __raid_recover_end_io Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:41 +02:00
Nikolay Borisov	8302586327	btrfs: raid56: use in_range where applicable While at it use the opportunity to simplify find_logical_bio_stripe by reducing the scope of 'stripe_start' variable and squash the sector-to-bytes conversion on one line. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:41 +02:00
Nikolay Borisov	bf28a605e6	btrfs: raid56: assign bio in while() when using bio_list_pop Unify the style in the file such that return value of bio_list_pop is assigned directly in the while loop. This is in line with the rest of the kernel. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:41 +02:00
Nikolay Borisov	f90ae76a5c	btrfs: raid56: remove redundant device check in rbio_add_io_page The merging logic is always executed if the current stripe's device is not missing. So there's no point in duplicating the check. Simply remove it, while at it reduce the scope of the 'last_end' variable. If the current stripe's device is missing we fail the stripe early on. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:41 +02:00
Nikolay Borisov	608769a4e4	btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers Since btrfs_bio always contains the extra space for the tgtdev_map and raid_maps it's pointless to make the assignment iff specific conditions are met. Instead, always assign the pointers to their correct value at allocation time. To accommodate this change also move code a bit in __btrfs_map_block so that btrfs_bio::stripes array is always initialized before the raid_map, subsequently move the call to sort_parity_stripes in the 'if' building the raid_map, retaining the old behavior. To better understand the change, there are 2 aspects to this: 1. The original code is harder to grasp because the calculations for initializing raid_map/tgtdev pointers are apart from the initial allocation of memory. Having them predicated on 2 separate checks doesn't help that either... So moving the initialisation into alloc_btrfs_bio puts everything together. 2. tgtdev/raid_maps are now always initialized despite sometimes they might be equal, i.e. __btrfs_map_block_for_discard calls alloc_btrfs_bio with tgtdev = 0, but their usage should be predicated on external checks, i.e. just because those pointers are non-null doesn't mean they are valid per se. And actually while taking another look at __btrfs_map_block I saw a discrepancy: Original code initialised tgtdev_map if the following check is true: if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) However, further down tgtdev_map is only used if the following check is true: if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op)) i.e. the additional need_full_stripe(op) predicate is there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more details from mail discussion ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:41 +02:00
Nikolay Borisov	3092c68fc5	btrfs: sysfs: add bdi link to the fsid directory Since BTRFS uses a private bdi it makes sense to create a link to this bdi under /sys/fs/btrfs/<UUID>/bdi. This allows size of read ahead to be controlled. Without this patch it's not possible to uniquely identify which bdi pertains to which btrfs filesystem in the case of multiple btrfs filesystems. It's fine to simply call sysfs_remove_link without checking if the link indeed has been created. The call path sysfs_remove_link kernfs_remove_by_name kernfs_remove_by_name_ns will simply return -ENOENT in case it doesn't exist. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:41 +02:00
Nikolay Borisov	5a9472fe7f	btrfs: increment corrupt device counter during compressed read If a compressed read fails due to checksum error only a line is printed to dmesg, device corrupt counter is not modified. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:40 +02:00
Nikolay Borisov	26056eab4b	btrfs: remove needless ASSERT check of orig_bio in end_compressed_bio_read compressed_bio::orig_bio is always set in btrfs_submit_compressed_read before any bio submission is performed. Since that function is always called with a valid bio it renders the ASSERT unnecessary. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:40 +02:00
Nikolay Borisov	814723e0a5	btrfs: increment device corruption error in case of checksum error Now that btrfs_io_bio has access to btrfs_device we can safely increment the device corruption counter on error. There is one notable exception: repair bios for raid. Those don't go through the normal submit_stripe_bio callpath but through raid56_parity_recover, so repair bios won't have their device set. Scrub increments the corruption counter for checksum mismatch as well but does not call this function. Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:40 +02:00
Nikolay Borisov	3eee86c8fd	btrfs: don't check for btrfs_device::bdev in btrfs_end_bio btrfs_map_bio ensures that all submitted bios to devices have valid btrfs_device::bdev so this check can be removed from btrfs_end_bio. This check was added in June 2012 `597a60fade` ("Btrfs: don't count I/O statistic read errors for missing devices") but then in October of the same year another commit `de1ee92ac3` ("Btrfs: recheck bio against block device when we map the bio") started checking for the presence of btrfs_device::bdev before actually issuing the bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:40 +02:00
Nikolay Borisov	c31efbdf23	btrfs: record btrfs_device directly in btrfs_io_bio Instead of recording stripe_index and using that to access correct btrfs_device from btrfs_bio::stripes record the btrfs_device in btrfs_io_bio. This will enable endio handlers to increment device error counters on checksum errors. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:40 +02:00
Nikolay Borisov	3526302f26	btrfs: streamline btrfs_get_io_failure_record logic Make the function directly return a pointer to a failure record and adjust callers to handle it. Also refactor the logic inside so that the case which allocates the failure record for the first time is not handled in an 'if' arm, saving us a level of indentation. Finally make the function static as it's not used outside of extent_io.c . Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:39 +02:00
Nikolay Borisov	2279a27053	btrfs: make get_state_failrec return failrec directly Only failure that get_state_failrec can get is if there is no failure for the given address. There is no reason why the function should return a status code and use a separate parameter for returning the actual failure rec (if one is found). Simplify it by making the return type a pointer and return ERR_PTR value in case of errors. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:39 +02:00
David Sterba	b90a4ab6ba	btrfs: remove deprecated mount option subvolrootid The option subvolrootid used to be a workaround for mounting subvolumes and ineffective since `5e2a4b25da` ("btrfs: deprecate subvolrootid mount option"). We have subvol= that works and we don't need to keep the cruft, let's remove it. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:39 +02:00
David Sterba	d801e7a355	btrfs: remove deprecated mount option alloc_start The mount option alloc_start has no effect since `0d0c71b317` ("btrfs: obsolete and remove mount option alloc_start") which has details why it's been deprecated. We can remove it. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:39 +02:00
Filipe Manana	a93e01682e	btrfs: remove no longer needed use of log_writers for the log root tree When syncing the log, we used to update the log root tree without holding either the log_mutex of the subvolume root or the log_mutex of log root tree. We used to have two critical sections delimited by the log_mutex of the log root tree, so in the first one we incremented the log_writers of the log root tree and on the second one we decremented it and waited for the log_writers counter to go down to zero. This was because the update of the log root tree happened between the two critical sections. The use of two critical sections allowed a little bit more parallelism and required the use of the log_writers counter, necessary to make sure we didn't miss any log root tree update when we have multiple tasks trying to sync the log in parallel. However after commit `06989c799f` ("Btrfs: fix race updating log root item during fsync") the log root tree update was moved into a critical section delimited by the subvolume's log_mutex. Later another commit moved the log tree update from that critical section into the second critical section delimited by the log_mutex of the log root tree. Both commits addressed different bugs. The end result is that the first critical section delimited by the log_mutex of the log root tree became pointless, since there's nothing done between it and the second critical section, we just have an unlock of the log_mutex followed by a lock operation. This means we can merge both critical sections, as the first one does almost nothing now, and we can stop using the log_writers counter of the log root tree, which was incremented in the first critical section and decremented in the second critical section, used to make sure no one in the second critical section started writeback of the log root tree before some other task updated it.
So just remove the mutex_unlock() followed by mutex_lock() of the log root tree, as well as the use of the log_writers counter for the log root tree. This patch is part of a series that has the following patches: 1/4 btrfs: only commit the delayed inode when doing a full fsync 2/4 btrfs: only commit delayed items at fsync if we are logging a directory 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log 4/4 btrfs: remove no longer needed use of log_writers for the log root tree After the entire patchset applied I saw about 12% decrease on max latency reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of ram, using kvm and using a raw NVMe device directly (no intermediary fs on the host). The test was invoked like the following: mkfs.btrfs -f /dev/sdk mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk dbench -D /mnt/sdk -t 300 8 umount /mnt/sdk CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:39 +02:00
Filipe Manana	28a9579561	btrfs: stop incremening log_batch for the log root tree when syncing log We are incrementing the log_batch atomic counter of the root log tree but we never use that counter, it's used only for the log trees of subvolume roots. We started doing it when we moved the log_batch and log_write counters from the global, per fs, btrfs_fs_info structure, into the btrfs_root structure in commit `7237f18336` ("Btrfs: fix tree logs parallel sync"). So just stop doing it for the log root tree and add a comment over the field declaration to inform that it's used only for log trees of subvolume roots. This patch is part of a series that has the following patches: 1/4 btrfs: only commit the delayed inode when doing a full fsync 2/4 btrfs: only commit delayed items at fsync if we are logging a directory 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log 4/4 btrfs: remove no longer needed use of log_writers for the log root tree After the entire patchset applied I saw about 12% decrease on max latency reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of ram, using kvm and using a raw NVMe device directly (no intermediary fs on the host). The test was invoked like the following: mkfs.btrfs -f /dev/sdk mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk dbench -D /mnt/sdk -t 300 8 umount /mnt/sdk CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:39 +02:00
Filipe Manana	5aa7d1a7f4	btrfs: only commit delayed items at fsync if we are logging a directory When logging an inode we are committing its delayed items if either the inode is a directory or if it is a new inode, created in the current transaction. We need to do it for directories, since new directory indexes are stored as delayed items of the inode and when logging a directory we need to be able to access all indexes from the fs/subvolume tree in order to figure out which index ranges need to be logged. However for new inodes that are not directories, we do not need to do it because the only type of delayed item they can have is the inode item, and we are guaranteed to always log an up to date version of the inode item: 1) for a full fsync we do it by committing the delayed inode and then copying the item from the fs/subvolume tree with copy_inode_items_to_log(); 2) for a fast fsync we always log the inode item based on the contents of the in-memory struct btrfs_inode. We guarantee this is always done since commit `e4545de5b0` ("Btrfs: fix fsync data loss after append write"). So stop running delayed items for new inodes that are not directories, since that forces committing the delayed inode into the fs/subvolume tree, wasting time and adding contention to the tree when a full fsync is not required. We will only do it in case a fast fsync is needed. This patch is part of a series that has the following patches: 1/4 btrfs: only commit the delayed inode when doing a full fsync 2/4 btrfs: only commit delayed items at fsync if we are logging a directory 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log 4/4 btrfs: remove no longer needed use of log_writers for the log root tree After the entire patchset applied I saw about 12% decrease on max latency reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of ram, using kvm and using a raw NVMe device directly (no intermediary fs on the host).
The test was invoked like the following: mkfs.btrfs -f /dev/sdk mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk dbench -D /mnt/sdk -t 300 8 umount /mnt/sdk CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:38 +02:00
Filipe Manana	8c8648dd1f	btrfs: only commit the delayed inode when doing a full fsync Commit `2c2c452b0c` ("Btrfs: fix fsync when extend references are added to an inode") forced a commit of the delayed inode when logging an inode in order to ensure we would end up logging the inode item during a full fsync. By committing the delayed inode, we updated the inode item in the fs/subvolume tree and then later when copying items from leafs modified in the current transaction into the log tree (with copy_inode_items_to_log()) we ended up copying the inode item from the fs/subvolume tree into the log tree. Logging an up to date version of the inode item is required to make sure at log replay time we get the link count fixup triggered among other things (replay xattr deletes, etc). The test case generic/040 from fstests exercises the bug which that commit fixed. However for a fast fsync we don't need to commit the delayed inode because we always log an up to date version of the inode item based on the struct btrfs_inode we have in-memory. We started doing this for fast fsyncs since commit `e4545de5b0` ("Btrfs: fix fsync data loss after append write"). So just stop committing the delayed inode if we are doing a fast fsync, we are only wasting time and adding contention on fs/subvolume tree. This patch is part of a series that has the following patches: 1/4 btrfs: only commit the delayed inode when doing a full fsync 2/4 btrfs: only commit delayed items at fsync if we are logging a directory 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log 4/4 btrfs: remove no longer needed use of log_writers for the log root tree After the entire patchset applied I saw about 12% decrease on max latency reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of ram, using kvm and using a raw NVMe device directly (no intermediary fs on the host). 
The test was invoked like the following: mkfs.btrfs -f /dev/sdk mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk dbench -D /mnt/sdk -t 300 8 umount /mnt/sdk CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:38 +02:00
Qu Wenruo	2dfb1e43f5	btrfs: preallocate anon block device at first phase of snapshot creation [BUG] When the anonymous block device pool is exhausted, subvolume/snapshot creation fails with EMFILE (Too many files open). This has been reported by a user. The allocation happens in the second phase during transaction commit where the only way out is to abort the transaction: BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] When the global anonymous block device pool is exhausted, the following call chain will fail, and lead to transaction abort: btrfs_ioctl_snap_create_v2() |- btrfs_ioctl_snap_create_transid() |- btrfs_mksubvol() |- btrfs_commit_transaction() |- create_pending_snapshot() |- btrfs_get_fs_root() |- btrfs_init_fs_root() |- get_anon_bdev() [FIX] Although we can't enlarge the anonymous block device pool, at least we can preallocate anon_dev for subvolume/snapshot in the first phase, outside of transaction context and exactly at the moment the user calls the creation ioctl.
Reported-by: Greed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:38 +02:00
Qu Wenruo	082b6c970f	btrfs: free anon block device right after subvolume deletion [BUG] When a lot of subvolumes are created, there is a user report about transaction aborted caused by slow anonymous block device reclaim: BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] The anonymous device pool is shared and its size is 1M. It's possible to hit that limit if the subvolume deletion is not fast enough and the subvolumes to be cleaned keep the ids allocated. [WORKAROUND] We can't avoid the anon device pool exhaustion but we can shorten the time the id is attached to the subvolume root once the subvolume becomes invisible to the user. Reported-by: Greed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:38 +02:00
Qu Wenruo	851fd730a7	btrfs: don't allocate anonymous block device for user invisible roots [BUG] When a lot of subvolumes are created, there is a user report about transaction aborted: BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] The error is EMFILE (Too many files open) and comes from the anonymous block device allocation. The ids are in a shared pool of size 1<<20. The ids are assigned to live subvolumes, ie. the root structure exists in memory (eg. after creation or after the root appears in some path). The pool could be exhausted if the numbers are not reclaimed fast enough, after subvolume deletion or if other system component uses the anon block devices. [WORKAROUND] Since it's not possible to completely solve the problem, we can only minimize the time the id is allocated to a subvolume root. Firstly, we can reduce the use of anon_dev by trees that are not subvolume roots, like data reloc tree. This patch will do extra check on root objectid, to skip roots that don't need anon_dev. Currently it's only data reloc tree and orphan roots. 
Reported-by: Greed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:38 +02:00
Qu Wenruo	49e5fb4621	btrfs: qgroup: export qgroups in sysfs This patch will add the following sysfs interface: /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags Which is also available in output of "btrfs qgroup show". /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc The last 3 rsv related members are not visible to users, but can be very useful to debug qgroup limit related bugs. Also, to avoid '/' used in <qgroup_id>, the separator between qgroup level and qgroup id is changed to '_'. The interface is not hidden behind 'debug' as we want this interface to be included into production build and to provide another way to read the qgroup information besides the ioctls. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:37 +02:00
Qu Wenruo	06f67c4707	btrfs: use __u16 for the return value of btrfs_qgroup_level() The qgroup level is limited to u16, so no need to use u64 for it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:37 +02:00
Nikolay Borisov	cfdd459215	btrfs: make btrfs_qgroup_check_reserved_leak take btrfs_inode vfs_inode is used only for the inode number; everything else requires btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use btrfs_ino ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:37 +02:00
Nikolay Borisov	d90944141b	btrfs: make btrfs_set_inode_last_trans take btrfs_inode Instead of making multiple calls to BTRFS_I simply take btrfs_inode as an input parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:37 +02:00
Nikolay Borisov	056d9beca3	btrfs: make prealloc_file_extent_cluster take btrfs_inode The vfs inode is only used for a pair of inode_lock/unlock calls; all other uses call for btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:37 +02:00
Nikolay Borisov	65d87f7918	btrfs: remove BTRFS_I calls in btrfs_writepage_fixup_worker All of its children functions use btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:36 +02:00
Nikolay Borisov	e5b7231e20	btrfs: make btrfs_delalloc_reserve_space take btrfs_inode All of its children take btrfs_inode so bubble up this requirement to btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I internally. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:36 +02:00
Nikolay Borisov	36ea6f3e93	btrfs: make btrfs_check_data_free_space take btrfs_inode Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:36 +02:00
Nikolay Borisov	86d52921a2	btrfs: make btrfs_delalloc_release_space take btrfs_inode It needs btrfs_inode so take it as a parameter directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:36 +02:00
Nikolay Borisov	25ce28caaa	btrfs: make btrfs_free_reserved_data_space take btrfs_inode It only uses btrfs_inode internally so take it as a parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:36 +02:00
Nikolay Borisov	9db5d510ac	btrfs: make btrfs_free_reserved_data_space_noquota take btrfs_fs_info No point in taking an inode only to get btrfs_fs_info from it, instead take btrfs_fs_info directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:35 +02:00
Nikolay Borisov	7661a3e033	btrfs: make btrfs_qgroup_reserve_data take btrfs_inode There's only a single use of vfs_inode in a tracepoint so let's take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:35 +02:00
Nikolay Borisov	088545f6e4	btrfs: make btrfs_dirty_pages take btrfs_inode There is a single use of the generic vfs_inode so let's take btrfs_inode as a parameter and remove couple of redundant BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:35 +02:00
Nikolay Borisov	c2566f2289	btrfs: make btrfs_set_extent_delalloc take btrfs_inode Preparation to make btrfs_dirty_pages take btrfs_inode as parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:35 +02:00
Nikolay Borisov	cd4c0bf942	btrfs: make writepage_delalloc take btrfs_inode Only find_lock_delalloc_range uses vfs_inode so let's take the btrfs_inode as a parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:35 +02:00
Nikolay Borisov	d4580fe25d	btrfs: make __extent_writepage_io take btrfs_inode It has only a single use for a generic vfs inode vs 3 for btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:34 +02:00
Nikolay Borisov	9fc6f911a0	btrfs: make btrfs_new_extent_direct take btrfs_inode This function really needs a btrfs_inode and not a generic vfs one. Take it as a parameter and get rid of superfluous BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:34 +02:00
Nikolay Borisov	64f54188ea	btrfs: make btrfs_create_dio_extent take btrfs_inode Take btrfs_inode directly and stop using superfluous BTRFS_I calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:34 +02:00
Nikolay Borisov	c1e095202c	btrfs: make btrfs_add_ordered_extent_dio take btrfs_inode Simply forwards its argument so let's get rid of one extra BTRFS_I by taking btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:34 +02:00
Nikolay Borisov	98456b9c46	btrfs: make btrfs_run_delalloc_range take btrfs_inode All children now take btrfs_inode so convert it to taking it as a parameter as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:34 +02:00
Nikolay Borisov	0c4942258c	btrfs: make need_force_cow take btrfs_inode Gets rid of superfluous BTRFS_I() calls and prepares for converting btrfs_run_delalloc_range to using btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:34 +02:00
Nikolay Borisov	808a129232	btrfs: make inode_need_compress take btrfs_inode Simply gets rid of superfluous BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:33 +02:00
Nikolay Borisov	99c88dc71c	btrfs: make inode_can_compress take btrfs_inode Gets rid of superfluous BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:33 +02:00
Nikolay Borisov	64e1db566d	btrfs: make btrfs_cleanup_ordered_extents take btrfs_inode Preparation to converting btrfs_run_delalloc_range to using btrfs_inode without BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:33 +02:00
Nikolay Borisov	b672b5c156	btrfs: make __endio_write_update_ordered take btrfs_inode It really wants btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:33 +02:00
Nikolay Borisov	7095821ee1	btrfs: make btrfs_dec_test_first_ordered_pending take btrfs_inode It doesn't really need vfs_inode but btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:33 +02:00
Nikolay Borisov	751b64318d	btrfs: make cow_file_range_async take btrfs_inode It only uses vfs inode for assigning it to the async_chunk function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:32 +02:00
Nikolay Borisov	968322c8c6	btrfs: make run_delalloc_nocow take btrfs_inode It only really uses btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:32 +02:00
Nikolay Borisov	8ba96f3dd6	btrfs: make fallback_to_cow take btrfs_inode It really wants btrfs_inode and is preparation for converting run_delalloc_nocow to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:32 +02:00
Nikolay Borisov	c553f94df4	btrfs: make insert_reserved_file_extent take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:32 +02:00
Nikolay Borisov	72b7d15bf1	btrfs: make btrfs_qgroup_release_data take btrfs_inode It just forwards its argument to __btrfs_qgroup_release_data. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:32 +02:00
Nikolay Borisov	a0ff10dcc4	btrfs: make submit_compressed_extents take btrfs_inode All but 3 uses require vfs_inode so convert the logic to have btrfs_inode be the main inode struct. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	c7ee1819dc	btrfs: make btrfs_submit_compressed_write take btrfs_inode Majority of its uses are for btrfs_inode so take it as an argument directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	4cc612090b	btrfs: make btrfs_add_ordered_extent_compress take btrfs_inode It simply forwards its inode argument to __btrfs_add_ordered_extent which already takes btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	6e26c44223	btrfs: make cow_file_range take btrfs_inode All its children functions take btrfs_inode so convert it to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	e7fbf60453	btrfs: make btrfs_add_ordered_extent take btrfs_inode Preparation to converting its callers to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	a0349401c1	btrfs: make cow_file_range_inline take btrfs_inode It has only 2 uses for the vfs_inode - insert_inline_extent and i_size_read. On the flipside it will allow converting its callers to btrfs_inode, so convert it to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	8b8a979f1f	btrfs: make btrfs_qgroup_free_data take btrfs_inode It passes btrfs_inode to its callee so change the interface. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	8769af96cf	btrfs: make __btrfs_qgroup_release_data take btrfs_inode It uses vfs_inode only for a tracepoint so convert its interface to take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	df2cfd131f	btrfs: make qgroup_free_reserved_data take btrfs_inode It only uses btrfs_inode so can just as easily take it as an argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
David Sterba	3502a8c0dc	btrfs: allow use of global block reserve for balance item deletion On a filesystem with exhausted metadata, but still enough to start balance, it's possible to hit this error: [324402.053842] BTRFS info (device loop0): 1 enospc errors during balance [324402.060769] BTRFS info (device loop0): balance: ended with status: -28 [324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left It fails inside reset_balance_state and turns the filesystem to read-only, which is unnecessary and should be fixed too, but the problem is caused by lack of space when the balance item is deleted. This is a one-time operation and from the same rank as unlink that is allowed to use the global block reserve. So do the same for the balance item. Status of the filesystem (100GiB) just after the balance fails: $ btrfs fi df mnt Data, single: total=80.01GiB, used=38.58GiB System, single: total=4.00MiB, used=16.00KiB Metadata, single: total=19.99GiB, used=19.48GiB GlobalReserve, single: total=512.00MiB, used=50.11MiB CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:29 +02:00
Qu Wenruo	38d37aa9c3	btrfs: refactor btrfs_check_can_nocow() into two variants The function btrfs_check_can_nocow() now has two completely different call patterns. For nowait variant, callers don't need to do any cleanup. While for wait variant, callers need to release the lock if they can do nocow write. This is somehow confusing, and is already a problem for the exported btrfs_check_can_nocow(). So this patch will separate the different patterns into different functions. For nowait variant, the function will be called check_nocow_nolock(). For wait variant, the function pair will be btrfs_check_nocow_lock() btrfs_check_nocow_unlock(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
Qu Wenruo	e4ecaf90bc	btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent() These two functions have extra conditions that their callers need to meet, and some not-that-common parameters used for return value. So adding some comments may save reviewers some time. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
Qu Wenruo	6d4572a9d7	btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation [BUG] When the data space is exhausted, even if the inode has NOCOW attribute, we will still refuse to truncate unaligned range due to ENOSPC. The following script can reproduce it pretty easily: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev -b 1G mount -o nospace_cache $dev $mnt touch $mnt/foobar chattr +C $mnt/foobar xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null sync xfs_io -c "fpunch 0 2k" $mnt/foobar umount $mnt Currently this will fail at the fpunch part. [CAUSE] Because btrfs_truncate_block() always reserves space without checking the NOCOW attribute. Since the writeback path follows NOCOW bit, we only need to bother the space reservation code in btrfs_truncate_block(). [FIX] Make btrfs_truncate_block() follow btrfs_buffered_write() to try to reserve data space first, and fall back to NOCOW check only when we don't have enough space. Such always-try-reserve is an optimization introduced in btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call. This patch will export check_can_nocow() as btrfs_check_can_nocow(), and use it in btrfs_truncate_block() to fix the problem. Reported-by: Martin Doucha <martin.doucha@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
David Sterba	b547a88ea5	btrfs: start deprecation of mount option inode_cache Estimated time of removal of the functionality is 5.11, the option will be still parsed but will have no effect. Reasons for deprecation and removal: - very poor naming choice of the mount option, it's supposed to cache and reuse the inode _numbers_, but it sounds like some generic cache for inodes - the only known usecase where this option would make sense is on a 32bit architecture where inode numbers in one subvolume would be exhausted due to 32bit inode::i_ino - the cache is stored on disk, consumes space, needs to be loaded and written back - new inode number allocation is slower due to lookups into the cache (compared to a simple increment which is the default) - uses the free-space-cache code that is going to be deprecated as well in the future Known problems: - since 2011, returning EEXIST when there's not enough space in a page to store all checksums, see commit `4b9465cb9e` ("Btrfs: add mount -o inode_cache") Remaining issues: - if the option was enabled, new inodes created, the option disabled again, the cache is still stored on the devices and there's currently no way to remove it Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
David Sterba	a2570ef330	btrfs: remove unused btrfs_root::defrag_trans_start Last touched in 2013 by commit `de78b51a28` ("btrfs: remove cache only arguments from defrag path") that was the only code that used the value. Now it's only set but never used for anything, so we can remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
David Sterba	bab16e21e8	btrfs: don't use UAPI types for fiemap callback The fiemap callback is not part of UAPI interface and the prototypes don't have the __u64 types either. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Denis Efremov	5af9d6ef3f	btrfs: tests: remove if duplicate in __check_free_space_extents() num_extents is already checked in the next if condition and can be safely removed. Signed-off-by: Denis Efremov <efremov@linux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Johannes Thumshirn	923eb52365	btrfs: use free_root_extent_buffer to free root In btrfs_put_root() we're freeing a btrfs_root's 'node' and 'commit_root' extent buffers manually via kfree(), while we're using free_root_extent_buffers() in the free_root_pointers() function above. free_root_extent_buffers() also NULLs the pointers after freeing, which mitigates potential double frees. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	4e9d0d0109	btrfs: use for loop in prealloc_file_extent_cluster This function iterates all extents in the extent cluster, make this intention obvious by using a for loop. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	214e61d07e	btrfs: perform data management operations outside of inode lock btrfs_alloc_data_chunk_ondemand and btrfs_free_reserved_data_space_noquota don't really use the guts of the inodes being passed to them. This implies it's not required to call them under extent lock. Move code around in prealloc_file_extent_cluster to do the heavy, data alloc/free operations outside of the lock. This also makes the 'out' label unnecessary, so remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	c171edd5c8	btrfs: remove hole check in prealloc_file_extent_cluster Extents in the extent cluster are guaranteed to be contiguous as such the hole check inside the loop can never trigger. In fact this check was never functional since it was added in `18513091af` ("btrfs: update btrfs_space_info's bytes_may_use timely") which came after the commit introducing clustered/contiguous extents `0257bb82d2` ("Btrfs: relocate file extents in clusters"). Let's just remove it as it adds noise to the source. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	906c448c3d	btrfs: make __btrfs_drop_extents take btrfs_inode It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the interface with the non __ prefixed version. It also makes converting its callers to btrfs_inode easier. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	bd242a08a6	btrfs: make btrfs_csum_one_bio take btrfs_inode Will enable converting btrfs_submit_compressed_write to btrfs_inode more easily. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	ad7ff17b65	btrfs: make extent_clear_unlock_delalloc take btrfs_inode It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode interface will allow seamless conversion of its callers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	4b67c11dd1	btrfs: make create_io_em take btrfs_inode It really wants a btrfs_inode and will allow submit_compressed_extents to be completely converted to btrfs_inode in follow up patches. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	7bfa953501	btrfs: make btrfs_reloc_clone_csums take btrfs_inode It really wants btrfs_inode and not a vfs inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	c350437269	btrfs: make btrfs_lookup_ordered_extent take btrfs_inode It doesn't use the generic vfs inode for anything, so use btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Nikolay Borisov	43c69849ae	btrfs: make get_extent_allocation_hint take btrfs_inode It doesn't use the vfs inode for anything, can just as easily take btrfs_inode. Follow up patches will convert callers as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Nikolay Borisov	da69fea9f7	btrfs: make __btrfs_add_ordered_extent take struct btrfs_inode This is an internal btrfs function that really needs the vfs_inode only for igrab and a tracepoint. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Filipe Manana	3ef64143a7	btrfs: remove no longer used trans_list member of struct btrfs_ordered_extent The 'trans_list' member of an ordered extent was used to keep track of the ordered extents for which a transaction commit had to wait. These were ordered extents that were started and logged by an fsync. However we don't do that anymore and before we stopped doing it we changed the approach to wait for the ordered extents in commit `161c3549b4` ("Btrfs: change how we wait for pending ordered extents"), which stopped using that list and therefore the 'trans_list' member is not used anymore since that commit. So just remove it since it's doing nothing and making each ordered extent structure waste memory (2 pointers). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Filipe Manana	cd8d39f4ae	btrfs: remove no longer used log_list member of struct btrfs_ordered_extent The 'log_list' member of an ordered extent was used to keep track of which ordered extents we needed to wait after logging metadata, but is not used anymore since commit `5636cf7d6d` ("btrfs: remove the logged extents infrastructure"), as we now always wait on ordered extent completion before logging metadata. So just remove it since it's doing nothing and making each ordered extent structure waste more memory (2 pointers). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
David Sterba	ce6ef5abe6	btrfs: add little-endian optimized key helpers The CPU and on-disk keys are mapped to two different structures because of the endianness. There's an intermediate buffer used to do the conversion, but this is not necessary when CPU and on-disk endianness match. Add optimized versions of helpers that take disk_key and use the buffer directly for CPU keys or drop the intermediate buffer and conversion. This saves a lot of stack space across many functions and removes about 6K of generated binary code: text data bss dec hex filename 1090439 17468 14912 1122819 112203 pre/btrfs.ko 1084613 17456 14912 1116981 110b35 post/btrfs.ko Delta: -5826 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	5958253cf6	btrfs: qgroup: catch reserved space leaks at unmount time Before this patch, qgroup completely relies on per-inode extent io tree to detect reserved data space leak. However previous bug has already shown how release page before btrfs_finish_ordered_io() could lead to leak, and since it's QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be detected by per-inode extent io tree. So this patch adds another (and hopefully the final) safety net to catch qgroup data reserved space leak. At least the new safety net catches all the leaks during development, so it should be pretty useful in the real world. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	7dbeaad0af	btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak [BUG] The following simple workload from fsstress can lead to qgroup reserved data space leak: 0/0: creat f0 x:0 0 0 0/0: creat add id=0,parent=-1 0/1: write f0[259 1 0 0 0 0] [600030,27288] 0 0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat() 0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0 This would cause btrfs qgroup to leak 20480 bytes for data reserved space. If btrfs qgroup limit is enabled, such leak can lead to unexpected early EDQUOT and unusable space. [CAUSE] When doing direct IO, kernel will try to writeback existing buffered page cache, then invalidate them: generic_file_direct_write() |- filemap_write_and_wait_range(); |- invalidate_inode_pages2_range(); However for btrfs, the bi_end_io hook doesn't finish all its heavy work right after bio ends. In fact, it delays its work further: submit_extent_page(end_io_func=end_bio_extent_writepage); end_bio_extent_writepage() |- btrfs_writepage_endio_finish_ordered() |- btrfs_init_work(finish_ordered_fn); <<< Work queue execution >>> finish_ordered_fn() |- btrfs_finish_ordered_io(); |- Clear qgroup bits This means, when filemap_write_and_wait_range() returns, btrfs_finish_ordered_io() is not guaranteed to be executed, thus the qgroup bits for related range are not cleared. Now into how the leak happens, this will only focus on the overlapping part of buffered and direct IO part. 1. After buffered write The inode had the following range with QGROUP_RESERVED bit: 596K 616K |///////////////| Qgroup reserved data space: 20K 2. Writeback part for range [596K, 616K) Write back finished, but btrfs_finish_ordered_io() not get called yet. So we still have: 596K 616K |///////////////| Qgroup reserved data space: 20K 3. Pages for range [596K, 616K) get released This will clear all qgroup bits, but don't update the reserved data space. 
So we have: 596K 616K | | Qgroup reserved data space: 20K That number doesn't match the qgroup bit range anymore. 4. Dio prepare space for range [596K, 700K) Qgroup reserved data space for that range, we got: 596K 616K 700K |///////////////|///////////////////////| Qgroup reserved data space: 20K + 104K = 124K 5. btrfs_finish_ordered_range() gets executed for range [596K, 616K) Qgroup free reserved space for that range, we got: 596K 616K 700K | |///////////////////////| We need to free that range of reserved space. Qgroup reserved data space: 124K - 20K = 104K 6. btrfs_finish_ordered_range() gets executed for range [596K, 700K) However qgroup bit for range [596K, 616K) is already cleared in previous step, so we only free 84K for qgroup reserved space. 596K 616K 700K | | | We need to free that range of reserved space. Qgroup reserved data space: 104K - 84K = 20K Now there is no way to release that 20K unless disabling qgroup or unmounting the fs. [FIX] This patch will change the timing of btrfs_qgroup_release/free_data() call. Here it uses buffered COW write as an example. 
The new timing | The old timing ----------------------------------------+--------------------------------------- btrfs_buffered_write() | btrfs_buffered_write() |- btrfs_qgroup_reserve_data() | |- btrfs_qgroup_reserve_data() | btrfs_run_delalloc_range() | btrfs_run_delalloc_range() |- btrfs_add_ordered_extent() | |- btrfs_qgroup_release_data() | The reserved is passed into | btrfs_ordered_extent structure | | btrfs_finish_ordered_io() | btrfs_finish_ordered_io() |- The reserved space is passed to | |- btrfs_qgroup_release_data() btrfs_qgroup_record | The reserved space is passed | to btrfs_qgroup_record | btrfs_qgroup_account_extents() | btrfs_qgroup_account_extents() |- btrfs_qgroup_free_refroot() | |- btrfs_qgroup_free_refroot() The point of such change is to ensure, when ordered extents are submitted, the qgroup reserved space is already released, to keep the timing aligned with file_write_and_wait_range(). So that qgroup data reserved space is all bound to btrfs_ordered_extent and solve the timing mismatch. Fixes: `f695fdcef8` ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space") Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	a7f8b1c2ac	btrfs: file: reserve qgroup space after the hole punch range is locked The incoming qgroup reserved space timing will move the data reservation to ordered extent completely. However, btrfs_punch_hole_lock_range() will call btrfs_invalidate_page(), which will clear QGROUP_RESERVED bit for the range. In current stage it's OK, but if we're making ordered extents handle the reserved space, then btrfs_punch_hole_lock_range() can clear the QGROUP_RESERVED bit before we submit ordered extent, leading to qgroup reserved space leakage. So here change the timing to reserve data space after btrfs_punch_hole_lock_range(). The new timing is fine for either current code or the new code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	9729f10a60	btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent() This is to prepare for the incoming timing change of qgroup reserved data space and ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	203f44c519	btrfs: inode: refactor the parameters of insert_reserved_file_extent() Function insert_reserved_file_extent() takes a long list of parameters, which are all for btrfs_file_extent_item, even including two reserved members, encryption and other_encoding. This makes the parameter list unnecessarily long for a function which only gets called twice. This patch will refactor the parameter list, by using btrfs_file_extent_item as parameter directly to hugely reduce the number of parameters. Also, there are only two callers, one in btrfs_finish_ordered_io() which inserts file extent for ordered extent, and one in __btrfs_prealloc_file_range(). These two call sites have completely different context, where ordered extent can be compressed, but will always be regular extent, while the preallocated one is never going to be compressed and always has PREALLOC type. So use two small wrappers for these two different call sites to improve readability. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	100aa5d9f9	btrfs: scrub: clean up temporary page variables in scrub_checksum_tree_block Add proper variable for the scrub page and use it instead of repeatedly dereferencing the other structures. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	521e102227	btrfs: scrub: simplify tree block checksum calculation Use a simpler iteration over tree block pages, the same as what csum_tree_block does: first page always exists, loop over the rest. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	d41ebef200	btrfs: scrub: clean up temporary page variables in scrub_checksum_data Add proper variable for the scrub page and use it instead of repeatedly dereferencing the other structures. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	771aba0d12	btrfs: scrub: simplify data block checksum calculation We have sectorsize same as PAGE_SIZE, the checksum can be calculated in one go. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	c746054109	btrfs: scrub: clean up temporary page variables in scrub_checksum_super Add proper variable for the scrub page and use it instead of repeatedly dereferencing the other structures. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	74710cf1fb	btrfs: scrub: remove temporary csum array in scrub_checksum_super The page contents with the checksum is available during the entire function so we don't need to make a copy. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
David Sterba	83cf6d5eae	btrfs: scrub: simplify superblock checksum calculation BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported architectures, so we can calculate the checksum in one go. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
David Sterba	b04852520e	btrfs: scrub: unify naming of page address variables As the page mapping has been removed, rename the variables to 'kaddr' that we use everywhere else. The type is changed to 'char *' so pointer arithmetic works without casts. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
David Sterba	a8b3a89074	btrfs: scrub: remove kmap/kunmap of pages All pages that scrub uses in the scrub_block::pagev array are allocated with GFP_KERNEL and never part of any mapping, so kmap is not necessary, we only need to know the page address. In scrub_write_page_to_dev_replace we don't even need to call flush_dcache_page because of the same reason as above. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
Qu Wenruo	74ef00185e	btrfs: introduce "rescue=" mount option This patch introduces a new "rescue=" mount option group for all mount options for data recovery. Different rescue sub options are separated by ':'. E.g. "ro,rescue=nologreplay:usebackuproot". The original plan was to use ';', but ';' needs to be escaped/quoted, or it will be interpreted by bash, similar to '|'. And obviously, the user can specify rescue options one by one like: "ro,rescue=nologreplay,rescue=usebackuproot". The following mount options are converted to "rescue=", old mount options are deprecated but still available for compatibility purpose: - usebackuproot Now it's "rescue=usebackuproot" - nologreplay Now it's "rescue=nologreplay" Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
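
The "rescue=" syntax described in the last commit above (sub-options joined by ':', the option repeatable across commas, and the old standalone spellings still accepted) can be illustrated with a small userspace sketch. This is not the kernel's parser: the flag bits `SKETCH_NOLOGREPLAY` and `SKETCH_USEBACKUPROOT` and the helpers `rescue_flag()` and `parse_rescue_options()` are hypothetical names invented for this example, and every other mount option is simply ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for this sketch; not the kernel's mount flags. */
#define SKETCH_NOLOGREPLAY   (1u << 0)
#define SKETCH_USEBACKUPROOT (1u << 1)

/* Map one sub-option (pointer + length) to a flag bit, or 0 if unknown. */
static unsigned int rescue_flag(const char *s, size_t len)
{
	if (len == strlen("nologreplay") && strncmp(s, "nologreplay", len) == 0)
		return SKETCH_NOLOGREPLAY;
	if (len == strlen("usebackuproot") && strncmp(s, "usebackuproot", len) == 0)
		return SKETCH_USEBACKUPROOT;
	return 0;
}

/*
 * Walk a comma-separated option string and collect rescue flags.
 * "rescue=a:b" is split on ':'; the deprecated standalone spellings
 * ("nologreplay", "usebackuproot") set the same bits. Options this
 * sketch does not know about (e.g. "ro") are skipped.
 * Returns 0 on success, -1 on an unknown or empty rescue sub-option.
 */
static int parse_rescue_options(const char *opts, unsigned int *flags)
{
	const size_t prefix_len = strlen("rescue=");
	const char *p = opts;

	*flags = 0;
	while (*p) {
		size_t olen = strcspn(p, ",");   /* length of this option */

		if (olen > prefix_len && strncmp(p, "rescue=", prefix_len) == 0) {
			const char *q = p + prefix_len;
			const char *end = p + olen;

			while (q < end) {
				size_t slen = strcspn(q, ":,");
				unsigned int f = rescue_flag(q, slen);

				if (!f)
					return -1;
				*flags |= f;
				q += slen;
				if (q < end && *q == ':')
					q++;             /* skip the ':' separator */
			}
		} else {
			/* Deprecated standalone form, or an unrelated option. */
			*flags |= rescue_flag(p, olen);
		}

		p += olen;
		if (*p == ',')
			p++;                     /* skip the ',' separator */
	}
	return 0;
}
```

Under these assumptions, "ro,rescue=nologreplay:usebackuproot" and "ro,rescue=nologreplay,rescue=usebackuproot" yield the same flag set, which matches the equivalence the commit message describes. Choosing ':' as the inner separator keeps the whole option string free of characters the shell would interpret.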

9169 Commits