linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-29 15:41:36 +00:00

Author	SHA1	Message	Date
Jakub Kicinski	accc3b4a57	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-29 14:30:51 -07:00
Tetsuo Handa	cbddcc4fa3	btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for commit `bc877d285c` ("btrfs: Deduplicate extent_buffer init code") missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must not be moved to after clean_tree_block() because clean_tree_block() is calling btrfs_header_generation() since commit `55c69072d6` ("Btrfs: Fix extent_buffer usage when nodesize != leafsize"). Since memzero_extent_buffer() will reset "struct btrfs_header" part, we can't move btrfs_set_header_generation() to before memzero_extent_buffer(). Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block(). Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1] Reported-by: syzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com> Fixes: `bc877d285c` ("btrfs: Deduplicate extent_buffer init code") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	db21370bff	btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	b54bb86556	btrfs: avoid pointless extent map tree search when flushing delalloc When flushing delalloc, in COW mode at cow_file_range(), before entering the loop that allocates extents and creates ordered extents, we do a call to btrfs_drop_extent_map_range() for the whole range. This is pointless because in the loop we call create_io_em(), which will also call btrfs_drop_extent_map_range() before inserting the new extent map. So remove that call at cow_file_range() not only because it is not needed, but also because it will make the btrfs_drop_extent_map_range() calls made from create_io_em() waste time searching the extent map tree, and that tree can be large for files with many extents. It also makes us waste time at btrfs_drop_extent_map_range() allocating and freeing the split extent maps for nothing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	6c05813ebb	btrfs: remove unnecessary next extent map search At __tree_search(), and its single caller __lookup_extent_mapping(), there is no point in finding the next extent map that starts after the search offset if we were able to find the previous extent map that ends before our search offset, because __lookup_extent_mapping() ignores the next acceptable extent map if we were able to find the previous one. So just return immediately if we were able to find the previous extent map, therefore avoiding wasting time iterating the tree looking for the next extent map which will not be used by __lookup_extent_mapping(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	08f088dd63	btrfs: remove unnecessary NULL pointer checks when searching extent maps The previous and next pointer arguments passed to __tree_search() are never NULL as the only caller of this function, __lookup_extent_mapping(), always passes the address of two on stack pointers. So remove the NULL checks and add assertions to verify the pointers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	74333c7d87	btrfs: assert tree is locked when clearing extent map from logging When calling clear_em_logging() we should have a write lock on the extent map tree, as we will try to merge the extent map with the previous and next ones in the tree. So assert that we have a write lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	2e0cdaa028	btrfs: remove unnecessary extent map initializations When allocating an extent map, we use kmem_cache_zalloc() which guarantees the returned memory is initialized to zeroes, therefore it's pointless to initialize the generation and flags of the extent map to zero again. Remove those initializations, as they are pointless and slightly increase the object text size. Before removing them: $ size fs/btrfs/extent_map.o text data bss dec hex filename 9241 274 24 9539 2543 fs/btrfs/extent_map.o After removing them: $ size fs/btrfs/extent_map.o text data bss dec hex filename 9209 274 24 9507 2523 fs/btrfs/extent_map.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	ad5d6e9148	btrfs: remove the refcount warning/check at free_extent_map() At free_extent_map(), it's pointless to have a WARN_ON() to check if the refcount of the extent map is zero. Such check is already done by the refcount_t module and refcount_dec_and_test(), which loudly complains if we try to decrement a reference count that is currently 0. The WARN_ON() dates back to the time when used a regular atomic_t type for the reference counter, before we switched to the refcount_t type. The main goal of the refcount_t type/module is precisely to catch such types of bugs and loudly complain if they happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	a1ba4c080b	btrfs: add helper to replace extent map range with a new extent map We have several places that need to drop all the extent maps in a given file range and then add a new extent map for that range. Currently they call btrfs_drop_extent_map_range() to delete all extent maps in the range and then keep trying to add the new extent map in a loop that keeps retrying while the insertion of the new extent map fails with -EEXIST. So instead of repeating this logic, add a helper to extent_map.c that does these steps and name it btrfs_replace_extent_map_range(). Also add a comment about why the retry loop is necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	9c9d1b4f74	btrfs: move open coded extent map tree deletion out of inode eviction Move the loop that removes all the extent maps from the inode's extent map tree during inode eviction out of inode.c and into extent_map.c, to btrfs_drop_extent_map_range(). Anything manipulating extent maps or the extent map tree should be in extent_map.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	99ba0c8150	btrfs: use cond_resched_rwlock_write() during inode eviction At evict_inode_truncate_pages(), instead of manually checking if rescheduling is needed, then unlock the extent map tree, reschedule and then write lock again the tree, use the helper cond_resched_rwlock_write() which does all that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	f3109e33bb	btrfs: use extent_map_end() at btrfs_drop_extent_map_range() Instead of open coding the end offset calculation of an extent map, use the helper extent_map_end() and cache its result in a local variable, since it's used several times. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	4c0c8cfc84	btrfs: move btrfs_drop_extent_cache() to extent_map.c The function btrfs_drop_extent_cache() doesn't really belong at file.c because what it does is drop a range of extent maps for a file range. It directly allocates and manipulates extent maps, by dropping, splitting and replacing them in an extent map tree, so it should be located at extent_map.c, where all manipulations of an extent map tree and its extent maps are supposed to be done. So move it out of file.c and into extent_map.c. Additionally do the following changes: 1) Rename it into btrfs_drop_extent_map_range(), as this makes it more clear about what it does. The term "cache" is a bit confusing as it's not widely used, "extent maps" or "extent mapping" is much more common; 2) Change its 'skip_pinned' argument from int to bool; 3) Turn several of its local variables from int to bool, since they are used as booleans; 4) Move the declaration of some variables out of the function's main scope and into the scopes where they are used; 5) Remove pointless assignment of false to 'modified' early in the while loop, as later that variable is set and it's not used before that second assignment; 6) Remove checks for NULL before calling free_extent_map(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	cef7820d6a	btrfs: fix missed extent on fsync after dropping extent maps When dropping extent maps for a range, through btrfs_drop_extent_cache(), if we find an extent map that starts before our target range and/or ends before the target range, and we are not able to allocate extent maps for splitting that extent map, then we don't fail and simply remove the entire extent map from the inode's extent map tree. This is generally fine, because in case anyone needs to access the extent map, it can just load it again later from the respective file extent item(s) in the subvolume btree. However, if that extent map is new and is in the list of modified extents, then a fast fsync will miss the parts of the extent that were outside our range (that needed to be split), therefore not logging them. Fix that by marking the inode for a full fsync. This issue was introduced after removing BUG_ON()s triggered when the split extent map allocations failed, done by commit `7014cdb493` ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and the fast fsync path already existed but was very recent. Also, in the case where we could allocate extent maps for the split operations but then fail to add a split extent map to the tree, mark the inode for a full fsync as well. This is not supposed to ever fail, and we assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is not set), it's the correct thing to do to make sure a fast fsync will not miss a new extent. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Jeff Layton	3050dfa63e	btrfs: remove stale prototype of btrfs_write_inode This function no longer exists, was removed in `3c4276936f` ("Btrfs: fix btrfs_write_inode vs delayed iput deadlock"). Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:29 +02:00
Stefan Roesch	926078b21d	btrfs: enable nowait async buffered writes Enable nowait async buffered writes in btrfs_do_write_iter() and btrfs_file_open(). In this version encoded buffered writes have the optimization not enabled. Encoded writes are enabled by using an ioctl. io_uring currently does not support ioctls. This might be enabled in the future. Performance results: For fio the following results have been obtained with a queue depth of 1 and 4k block size (runtime 600 secs): sequential writes: without patch with patch libaio psync iops: 55k 134k 117K 148K bw: 221MB/s 538MB/s 469MB/s 592MB/s clat: 15286ns 82ns 994ns 6340ns For an io depth of 1, the new patch improves throughput by over two times (compared to the existing behavior, where buffered writes are processed by an io-worker process) and also the latency is considerably reduced. To achieve the same or better performance with the existing code an io depth of 4 is required. Increasing the iodepth further does not lead to improvements. The tests have been run like this: ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... Testing: This patch has been tested with xfstests, fsx, fio. xfstests shows no new diffs compared to running without the patch series. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:29 +02:00
Stefan Roesch	c922b016f3	btrfs: assert nowait mode is not used for some btree search functions Adds nowait asserts to btree search functions which are not used by buffered IO and direct IO paths. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:29 +02:00
Stefan Roesch	965f47aeb5	btrfs: make btrfs_buffered_write nowait compatible We need to avoid unconditionally calling balance_dirty_pages_ratelimited as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags with the BDP_ASYNC in case the buffered write is nowait, returning EAGAIN eventually. It also moves the function after the again label. This can cause the function to be called a bit later, but this should have no impact in the real world. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Stefan Roesch	304e45acdb	btrfs: plumb NOWAIT through the write path We have everywhere setup for nowait, plumb NOWAIT through the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Stefan Roesch	2fcab928cc	btrfs: make lock_and_cleanup_extent_if_need nowait compatible Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the nowait parameter is specified we try to lock the extent in nowait mode. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Stefan Roesch	fc22600012	btrfs: make prepare_pages nowait compatible Add nowait parameter to the prepare_pages function. In case nowait is specified for an async buffered write request, do a nowait allocation or return -EAGAIN. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	80f9d24130	btrfs: make btrfs_check_nocow_lock nowait compatible Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add a nowait flag to btrfs_check_nocow_lock so it can be used by the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	d2c7a19f5c	btrfs: add btrfs_try_lock_ordered_range For IOCB_NOWAIT we're going to want to use try lock on the extent lock, and simply bail if there's an ordered extent in the range because the only choice there is to wait for the ordered extent to complete. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	1daedb1d6b	btrfs: add the ability to use NO_FLUSH for data reservations In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH data reservations, so plumb this through the delalloc reservation system. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	26ce911446	btrfs: make can_nocow_extent nowait compatible If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:26 +02:00
Baokun Li	f9c1f24860	ext4: fix null-ptr-deref in ext4_write_info I caught a null-ptr-deref bug as follows: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] CPU: 1 PID: 1589 Comm: umount Not tainted 5.10.0-02219-dirty #339 RIP: 0010:ext4_write_info+0x53/0x1b0 [...] Call Trace: dquot_writeback_dquots+0x341/0x9a0 ext4_sync_fs+0x19e/0x800 __sync_filesystem+0x83/0x100 sync_filesystem+0x89/0xf0 generic_shutdown_super+0x79/0x3e0 kill_block_super+0xa1/0x110 deactivate_locked_super+0xac/0x130 deactivate_super+0xb6/0xd0 cleanup_mnt+0x289/0x400 __cleanup_mnt+0x16/0x20 task_work_run+0x11c/0x1c0 exit_to_user_mode_prepare+0x203/0x210 syscall_exit_to_user_mode+0x5b/0x3a0 do_syscall_64+0x59/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ================================================================== Above issue may happen as follows: ------------------------------------- exit_to_user_mode_prepare task_work_run __cleanup_mnt cleanup_mnt deactivate_super deactivate_locked_super kill_block_super generic_shutdown_super shrink_dcache_for_umount dentry = sb->s_root sb->s_root = NULL <--- Here set NULL sync_filesystem __sync_filesystem sb->s_op->sync_fs > ext4_sync_fs dquot_writeback_dquots sb->dq_op->write_info > ext4_write_info ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2) d_inode(sb->s_root) s_root->d_inode <--- Null pointer dereference To solve this problem, we use ext4_journal_start_sb directly to avoid s_root being used. Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220805123947.565152-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-29 10:39:04 -04:00
Josh Triplett	426d15ad11	ext4: don't run ext4lazyinit for read-only filesystems On a read-only filesystem, we won't invoke the block allocator, so we don't need to prefetch the block bitmaps. This avoids starting and running the ext4lazyinit thread at all on a system with no read-write ext4 filesystems (for instance, a container VM with read-only filesystems underneath an overlayfs). Fixes: `21175ca434` ("ext4: make prefetch_block_bitmaps default") Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/48b41da1498fcac3287e2e06b660680646c1c050.1659323972.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-29 10:38:57 -04:00
Yang Xu	2d544ec923	ext4: remove deprecated noacl/nouser_xattr options These two options should have been removed since 3.5, but none notices it. Recently, I and Darrick found this. Also, have some discussion for this[1][2][3]. So now, let's remove them. Link: https://lore.kernel.org/linux-ext4/6258F7BB.8010104@fujitsu.com/T/#u[1] Link: https://lore.kernel.org/linux-ext4/20220602110421.ymoug3rwfspmryqg@fedora/T/#t[2] Link: https://lore.kernel.org/linux-ext4/08e2ca4c8f6344bdcd76d75b821116c6147fd57a.camel@kernel.org/T/#t[3] Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1658977369-2478-1-git-send-email-xuyang2018.jy@fujitsu.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-29 10:38:57 -04:00
Jan Kara	4bb26f2885	ext4: avoid crash when inline data creation follows DIO write When inode is created and written to using direct IO, there is nothing to clear the EXT4_STATE_MAY_INLINE_DATA flag. Thus when inode gets truncated later to say 1 byte and written using normal write, we will try to store the data as inline data. This confuses the code later because the inode now has both normal block and inline data allocated and the confusion manifests for example as: kernel BUG at fs/ext4/inode.c:2721! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 359 Comm: repro Not tainted 5.19.0-rc8-00001-g31ba1e3b8305-dirty #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:ext4_writepages+0x363d/0x3660 RSP: 0018:ffffc90000ccf260 EFLAGS: 00010293 RAX: ffffffff81e1abcd RBX: 0000008000000000 RCX: ffff88810842a180 RDX: 0000000000000000 RSI: 0000008000000000 RDI: 0000000000000000 RBP: ffffc90000ccf650 R08: ffffffff81e17d58 R09: ffffed10222c680b R10: dfffe910222c680c R11: 1ffff110222c680a R12: ffff888111634128 R13: ffffc90000ccf880 R14: 0000008410000000 R15: 0000000000000001 FS: 00007f72635d2640(0000) GS:ffff88811b000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000565243379180 CR3: 000000010aa74000 CR4: 0000000000150eb0 Call Trace: <TASK> do_writepages+0x397/0x640 filemap_fdatawrite_wbc+0x151/0x1b0 file_write_and_wait_range+0x1c9/0x2b0 ext4_sync_file+0x19e/0xa00 vfs_fsync_range+0x17b/0x190 ext4_buffered_write_iter+0x488/0x530 ext4_file_write_iter+0x449/0x1b90 vfs_write+0xbcd/0xf40 ksys_write+0x198/0x2c0 __x64_sys_write+0x7b/0x90 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fix the problem by clearing EXT4_STATE_MAY_INLINE_DATA when we are doing direct IO write to a file. Cc: stable@kernel.org Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reported-by: syzbot+bd13648a53ed6933ca49@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=a1e89d09bbbcbd5c4cb45db230ee28c822953984 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Tested-by: Tadeusz Struk<tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220727155753.13969-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-29 10:38:41 -04:00
Zhihao Cheng	191249f708	quota: Add more checking after reading from quota file It would be better to do more sanity checking (eg. dqdh_entries, block no.) for the content read from quota file, which can prevent corrupting the quota file. Link: https://lore.kernel.org/r/20220923134555.2623931-4-chengzhihao1@huawei.com Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2022-09-29 15:37:30 +02:00
Zhihao Cheng	3fc61e0e96	quota: Replace all block number checking with helper function Cleanup all block checking places, replace them with helper function do_check_range(). Link: https://lore.kernel.org/r/20220923134555.2623931-3-chengzhihao1@huawei.com Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2022-09-29 15:37:28 +02:00
Zhihao Cheng	6c8ea8b8cd	quota: Check next/prev free block number after reading from quota file Following process: Init: v2_read_file_info: <3> dqi_free_blk 0 dqi_free_entry 5 dqi_blks 6 Step 1. chown bin f_a -> dquot_acquire -> v2_write_dquot: qtree_write_dquot do_insert_tree find_free_dqentry get_free_dqblk write_blk(info->dqi_blocks) // info->dqi_blocks = 6, failure. The content in physical block (corresponding to blk 6) is random. Step 2. chown root f_a -> dquot_transfer -> dqput_all -> dqput -> ext4_release_dquot -> v2_release_dquot -> qtree_delete_dquot: dquot_release remove_tree free_dqentry put_free_dqblk(6) info->dqi_free_blk = blk // info->dqi_free_blk = 6 Step 3. drop cache (buffer head for block 6 is released) Step 4. chown bin f_b -> dquot_acquire -> commit_dqblk -> v2_write_dquot: qtree_write_dquot do_insert_tree find_free_dqentry get_free_dqblk dh = (struct qt_disk_dqdbheader *)buf blk = info->dqi_free_blk // 6 ret = read_blk(info, blk, buf) // The content of buf is random info->dqi_free_blk = le32_to_cpu(dh->dqdh_next_free) // random blk Step 5. chown bin f_c -> notify_change -> ext4_setattr -> dquot_transfer: dquot = dqget -> acquire_dquot -> ext4_acquire_dquot -> dquot_acquire -> commit_dqblk -> v2_write_dquot -> dq_insert_tree: do_insert_tree find_free_dqentry get_free_dqblk blk = info->dqi_free_blk // If blk < 0 and blk is not an error code, it will be returned as dquot transfer_to[USRQUOTA] = dquot // A random negative value __dquot_transfer(transfer_to) dquot_add_inodes(transfer_to[cnt]) spin_lock(&dquot->dq_dqb_lock) // page fault , which will lead to kernel page fault: Quota error (device sda): qtree_write_dquot: Error -8000 occurred while creating quota BUG: unable to handle page fault for address: ffffffffffffe120 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 5974 Comm: chown Not tainted 6.0.0-rc1-00004 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:_raw_spin_lock+0x3a/0x90 Call Trace: dquot_add_inodes+0x28/0x270 __dquot_transfer+0x377/0x840 dquot_transfer+0xde/0x540 ext4_setattr+0x405/0x14d0 notify_change+0x68e/0x9f0 chown_common+0x300/0x430 __x64_sys_fchownat+0x29/0x40 In order to avoid accessing invalid quota memory address, this patch adds block number checking of next/prev free block read from quota file. Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216372 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220923134555.2623931-2-chengzhihao1@huawei.com Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2022-09-29 15:37:25 +02:00
Thomas Gleixner	e3f12f0602	printk: Declare log_wait properly kernel/printk/printk.c:365:1: warning: symbol 'log_wait' was not declared. Should it be static? Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-3-john.ogness@linutronix.de	2022-09-29 15:20:29 +02:00
Al Viro	06bbaa6dc5	[coredump] don't use __kernel_write() on kmap_local_page() passing kmap_local_page() result to __kernel_write() is unsafe - random ->write_iter() might (and 9p one does) get unhappy when passed ITER_KVEC with pointer that came from kmap_local_page(). Fix by providing a variant of __kernel_write() that takes an iov_iter from caller (__kernel_write() becomes a trivial wrapper) and adding dump_emit_page() that parallels dump_emit(), except that instead of __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source. Fixes: `3159ed5779` "fs/coredump: use kmap_local_page()" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-09-28 14:28:40 -04:00
Eric Whitney	d412df530f	ext4: minor defrag code improvements Modify the error returns for two file types that can't be defragged to more clearly communicate those restrictions to a caller. When the defrag code is applied to swap files, return -ETXTBSY, and when applied to quota files, return -EOPNOTSUPP. Move an extent tree search whose results are only occasionally required to the site always requiring them for improved efficiency. Address a few typos. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20220722163910.268564-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-27 16:48:57 -04:00
Jerry Lee 李修賢	df3cb754d1	ext4: continue to expand file system when the target size doesn't reach When expanding a file system from (16TiB-2MiB) to 18TiB, the operation exits early which leads to result inconsistency between resize2fs and Ext4 kernel driver. === before === ○ → resize2fs /dev/mapper/thin resize2fs 1.45.5 (07-Jan-2020) Filesystem at /dev/mapper/thin is mounted on /mnt/test; on-line resizing required old_desc_blocks = 2048, new_desc_blocks = 2304 The filesystem on /dev/mapper/thin is now 4831837696 (4k) blocks long. [ 865.186308] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none. [ 912.091502] dm-4: detected capacity change from 34359738368 to 38654705664 [ 970.030550] dm-5: detected capacity change from 34359734272 to 38654701568 [ 1000.012751] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks [ 1000.012878] EXT4-fs (dm-5): resized filesystem to 4294967296 === after === [ 129.104898] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none. [ 143.773630] dm-4: detected capacity change from 34359738368 to 38654705664 [ 198.203246] dm-5: detected capacity change from 34359734272 to 38654701568 [ 207.918603] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks [ 207.918754] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks [ 207.918758] EXT4-fs (dm-5): Converting file system to meta_bg [ 207.918790] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks [ 221.454050] EXT4-fs (dm-5): resized to 4658298880 blocks [ 227.634613] EXT4-fs (dm-5): resized filesystem to 4831837696 Signed-off-by: Jerry Lee <jerrylee@qnap.com> Link: https://lore.kernel.org/r/PU1PR04MB22635E739BD21150DC182AC6A18C9@PU1PR04MB2263.apcprd04.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-27 16:40:30 -04:00
Eric W. Biederman	987f20a9dc	a.out: Remove the a.out implementation In commit `19e8b701e2` ("a.out: Stop building a.out/osf1 support on alpha and m68k") the last users of a.out were disabled. As nothing has turned up to cause this change to be reverted, let's remove the code implementing a.out support as well. There may be userspace users of the uapi bits left so the uapi headers have been left untouched. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Arnd Bergmann <arnd@arndb.de> # arm defconfigs Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/871qrx3hq3.fsf@email.froward.int.ebiederm.org	2022-09-27 07:11:02 -07:00
Gao Xiang	312fe643ad	erofs: clean up erofs_iget() isdir indicated REQ_META\|REQ_PRIO which no longer works now. Get rid of isdir entirely. Link: https://lore.kernel.org/r/20220927063607.54832-2-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-27 17:27:45 +08:00
Gao Xiang	53a7f9961c	erofs: clean up unnecessary code and comments Some conditional macros and comments are useless. Link: https://lore.kernel.org/r/20220927063607.54832-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-27 17:27:25 +08:00
Yue Hu	31da107fdb	erofs: fold in z_erofs_reload_indexes() The name of this function looks not very accurate compared to it's implementation and it's only a wrapper to erofs_read_metabuf(). So, let's fold it directly instead. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220927032518.25266-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-27 14:39:31 +08:00
xu xin	cb4df4cae4	ksm: count allocated ksm rmap_items for each process Patch series "ksm: count allocated rmap_items and update documentation", v5. KSM can save memory by merging identical pages, but also can consume additional memory, because it needs to generate rmap_items to save each scanned page's brief rmap information. To determine how beneficial the ksm-policy (like madvise), they are using brings, so we add a new interface /proc/<pid>/ksm_stat for each process The value "ksm_rmap_items" in it indicates the total allocated ksm rmap_items of this process. The detailed description can be seen in the following patches' commit message. This patch (of 2): KSM can save memory by merging identical pages, but also can consume additional memory, because it needs to generate rmap_items to save each scanned page's brief rmap information. Some of these pages may be merged, but some may not be abled to be merged after being checked several times, which are unprofitable memory consumed. The information about whether KSM save memory or consume memory in system-wide range can be determined by the comprehensive calculation of pages_sharing, pages_shared, pages_unshared and pages_volatile. A simple approximate calculation: profit =~ pages_sharing * sizeof(page) - (all_rmap_items) * sizeof(rmap_item); where all_rmap_items equals to the sum of pages_sharing, pages_shared, pages_unshared and pages_volatile. But we cannot calculate this kind of ksm profit inner single-process wide because the information of ksm rmap_item's number of a process is lacked. For user applications, if this kind of information could be obtained, it helps upper users know how beneficial the ksm-policy (like madvise) they are using brings, and then optimize their app code. For example, one application madvise 1000 pages as MERGEABLE, while only a few pages are really merged, then it's not cost-efficient. So we add a new interface /proc/<pid>/ksm_stat for each process in which the value of ksm_rmap_itmes is only shown now and so more values can be added in future. So similarly, we can calculate the ksm profit approximately for a single process by: profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items * sizeof(rmap_item); where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and ksm_rmap_items is shown in /proc/<pid>/ksm_stat. Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:29 -07:00
Liam R. Howlett	69dbe6daf1	userfaultfd: use maple tree iterator to iterate VMAs Don't use the mm_struct linked list or the vma->vm_next in prep for removal. Link: https://lkml.kernel.org/r/20220906194824.2110408-45-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:21 -07:00
Matthew Wilcox (Oracle)	c4c84f0628	fs/proc/task_mmu: stop using linked list and highest_vm_end Remove references to mm_struct linked list and highest_vm_end for when they are removed Link: https://lkml.kernel.org/r/20220906194824.2110408-44-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:21 -07:00
Liam R. Howlett	5f14b9246e	fs/proc/base: use the vma iterators in place of linked list Use the vma iterator instead of a for loop across the linked list. The link list of vmas will be removed in this patch set. Link: https://lkml.kernel.org/r/20220906194824.2110408-43-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:21 -07:00
Matthew Wilcox (Oracle)	19066e5868	exec: use VMA iterator instead of linked list Remove a use of the vm_next list by doing the initial lookup with the VMA iterator and then using it to find the next entry. Link: https://lkml.kernel.org/r/20220906194824.2110408-42-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:21 -07:00
Matthew Wilcox (Oracle)	182ea1d717	coredump: remove vma linked list walk Use the Maple Tree iterator instead. This is too complicated for the VMA iterator to handle, so let's open-code it for now. If this turns out to be a common pattern, we can migrate it to common code. Link: https://lkml.kernel.org/r/20220906194824.2110408-41-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:20 -07:00
Liam R. Howlett	7964cf8caa	mm: remove vmacache By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:18 -07:00
Matthew Wilcox (Oracle)	0c563f1480	proc: remove VMA rbtree use from nommu These users of the rbtree should probably have been walks of the linked list, but convert them to use walks of the maple tree. Link: https://lkml.kernel.org/r/20220906194824.2110408-17-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:16 -07:00
Yu Zhao	bd74fdaea1	mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[8, 10]% Ops/sec KB/sec patch1-7: 1147696.57 44640.29 patch1-8: 1245274.91 48435.66 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset patch1-8 49.44% lzo1x_1_do_compress (real work) 6.19% page_vma_mapped_walk (overhead) 5.97% _raw_spin_unlock_irq 3.13% get_pfn_folio 2.85% ptep_clear_flush 2.42% __zram_bvec_write 2.08% do_raw_spin_lock 1.92% memmove 1.44% alloc_zspage 1.36% memset Configurations: no change Thanks to the following developers for their efforts [3]. kernel test robot <lkp@intel.com> [1] https://lwn.net/Articles/23732/ [2] https://llvm.org/docs/ScudoHardenedAllocator.html [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:09 -07:00
Yu Zhao	ec1c86b25f	mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. [1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:09 -07:00
Peter Xu	0d206b5d2e	mm/swap: add swp_offset_pfn() to fetch PFN from swap entry We've got a bunch of special swap entries that stores PFN inside the swap offset fields. To fetch the PFN, normally the user just calls swp_offset() assuming that'll be the PFN. Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the max possible length of a PFN on the host, meanwhile doing proper check with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry(). One reason to do so is we never tried to sanitize whether swap offset can really fit for storing PFN. At the meantime, this patch also prepares us with the future possibility to store more information inside the swp offset field, so assuming "swp_offset(entry)" to be the PFN will not stand any more very soon. Replace many of the swp_offset() callers to use swp_offset_pfn() where proper. Note that many of the existing users are not candidates for the replacement, e.g.: (1) When the swap entry is not a pfn swap entry at all, or, (2) when we wanna keep the whole swp_offset but only change the swp type. For the latter, it can happen when fork() triggered on a write-migration swap entry pte, we may want to only change the migration type from write->read but keep the rest, so it's not "fetching PFN" but "changing swap type only". They're left aside so that when there're more information within the swp offset they'll be carried over naturally in those cases. Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what the new swp_offset_pfn() is about. Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:05 -07:00
Linus Torvalds	3800a713b6	26 hotfixes. 8 are for issues which were introduced during this -rc cycle, 18 are for earlier issues, and are cc:stable. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYzH+NgAKCRDdBJ7gKXxA ju4AAQDrFWErVp+ra5P66SSbiFmm8NAW1awt4nHwAPcihNf3yQD/eQcB3w2q0Dm1 9HjsyEVkTYIeaJSAbCraDnMwUdWTIgY= =p5+0 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull last (?) hotfixes from Andrew Morton: "26 hotfixes. 8 are for issues which were introduced during this -rc cycle, 18 are for earlier issues, and are cc:stable" * tag 'mm-hotfixes-stable-2022-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits) x86/uaccess: avoid check_object_size() in copy_from_user_nmi() mm/page_isolation: fix isolate_single_pageblock() isolation behavior mm,hwpoison: check mm when killing accessing process mm/hugetlb: correct demote page offset logic mm: prevent page_frag_alloc() from corrupting the memory mm: bring back update_mmu_cache() to finish_fault() frontswap: don't call ->init if no ops are registered mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() mm: fix madivse_pageout mishandling on non-LRU page powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush mm: gup: fix the fast GUP race against THP collapse mm: fix dereferencing possible ERR_PTR vmscan: check folio_test_private(), not folio_get_private() mm: fix VM_BUG_ON in __delete_from_swap_cache() tools: fix compilation after gfp_types.h split mm/damon/dbgfs: fix memory leak when using debugfs_lookup() mm/migrate_device.c: copy pte dirty bit to page mm/migrate_device.c: add missing flush_cache_page() mm/migrate_device.c: flush TLB while holding PTL x86/mm: disable instrumentations of mm/pgprot.c ...	2022-09-26 13:23:15 -07:00
Andrew Morton	6d751329e7	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-09-26 13:13:15 -07:00
Linus Torvalds	3a710532cf	Fix an potential unitialzied variable bug; this was a fixup that I had forgotten to apply before the last pull request for ext4. My bad. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmMx4H8ACgkQ8vlZVpUN gaPSMwgAij3HAydTOXKBpiMW426I+GhhQCeKzud2uPs74gB3ikfh0YizJe520dpH ffFAnItLAGChcO+vRpYSJumFKQJpCu+bb/6lOJ2oVBhoASPBlNNxVmSsyqcNMmaI HGmvPgwcAhzeaaAFzkhzcpeEdygldUGClekRwvjLqYyj6i9EfwHkPBVzMn/9sSNk Pasp3AhEH5c2IhR9JnH7VvNDCIVBnIqapKsiyysBs1cjEvs5wCLAm3PhgKrjGERR RUvirl5jbTAAxRxdyiBppVSsbj10lLD+knKSxdnoJ+X7jiZp7ktEcCI+DVdwt5Cd NmfwlVJxQ91lq90FBsJyB6lzU5e69g== =GEOG -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull missed ext4 fix from Ted Ts'o: "Fix an potential unitialzied variable bug; this was a fixup that I had forgotten to apply before the last pull request for ext4. My bad" * tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fixup possible uninitialized variable access in ext4_mb_choose_next_group_cr1()	2022-09-26 13:10:11 -07:00
Yury Norov	70a1cb106d	lib/bitmap: don't call __bitmap_weight() in kernel code __bitmap_weight() is not to be used directly in the kernel code because it's a helper for bitmap_weight(). Switch everything to bitmap_weight(). Signed-off-by: Yury Norov <yury.norov@gmail.com>	2022-09-26 12:19:12 -07:00
Jeff Layton	895ddf5ed4	nfsd: extra checks when freeing delegation stateids We've had some reports of problems in the refcounting for delegation stateids that we've yet to track down. Add some extra checks to ensure that we've removed the object from various lists before freeing it. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2127067 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:50:58 -04:00
Jeff Layton	b95239ca49	nfsd: make nfsd4_run_cb a bool return function queue_work can return false and not queue anything, if the work is already queued. If that happens in the case of a CB_RECALL, we'll have taken an extra reference to the stid that will never be put. Ensure we throw a warning in that case. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:50:57 -04:00
Jeff Layton	25fbe1fca1	nfsd: fix comments about spinlock handling with delegations Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:23:55 -04:00
Jeff Layton	4d01416ab4	nfsd: only fill out return pointer on success in nfsd4_lookup_stateid In the case of a revoked delegation, we still fill out the pointer even when returning an error, which is bad form. Only overwrite the pointer on success. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:23:44 -04:00
Dai Ngo	019805fea9	NFSD: fix use-after-free on source server when doing inter-server copy Use-after-free occurred when the laundromat tried to free expired cpntf_state entry on the s2s_cp_stateids list after inter-server copy completed. The sc_cp_list that the expired copy state was inserted on was already freed. When COPY completes, the Linux client normally sends LOCKU(lock_state x), FREE_STATEID(lock_state x) and CLOSE(open_state y) to the source server. The nfs4_put_stid call from nfsd4_free_stateid cleans up the copy state from the s2s_cp_stateids list before freeing the lock state's stid. However, sometimes the CLOSE was sent before the FREE_STATEID request. When this happens, the nfsd4_close_open_stateid call from nfsd4_close frees all lock states on its st_locks list without cleaning up the copy state on the sc_cp_list list. When the time the FREE_STATEID arrives the server returns BAD_STATEID since the lock state was freed. This causes the use-after-free error to occur when the laundromat tries to free the expired cpntf_state. This patch adds a call to nfs4_free_cpntf_statelist in nfsd4_close_open_stateid to clean up the copy state before calling free_ol_stateid_reaplist to free the lock state's stid on the reaplist. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:21:46 -04:00
Chuck Lever	76ce4dcec0	NFSD: Cap rsize_bop result based on send buffer size Since before the git era, NFSD has conserved the number of pages held by each nfsd thread by combining the RPC receive and send buffers into a single array of pages. This works because there are no cases where an operation needs a large RPC Call message and a large RPC Reply at the same time. Once an RPC Call has been received, svc_process() updates svc_rqst::rq_res to describe the part of rq_pages that can be used for constructing the Reply. This means that the send buffer (rq_res) shrinks when the received RPC record containing the RPC Call is large. Add an NFSv4 helper that computes the size of the send buffer. It replaces svc_max_payload() in spots where svc_max_payload() returns a value that might be larger than the remaining send buffer space. Callers who need to know the transport's actual maximum payload size will continue to use svc_max_payload(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:50 -04:00
Chuck Lever	781fde1a2b	NFSD: Rename the fields in copy_stateid_t Code maintenance: The name of the copy_stateid_t::sc_count field collides with the sc_count field in struct nfs4_stid, making the latter difficult to grep for when auditing stateid reference counting. No behavior change expected. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:50 -04:00
ChenXiaoSong	1342f9dd3f	nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_file_cache_stats_fops Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code. Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:50 -04:00
ChenXiaoSong	64776611a0	nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_reply_cache_stats_fops Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code. nfsd_net is converted from seq_file->file instead of seq_file->private in nfsd_reply_cache_stats_show(). Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> [ cel: reduce line length ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:50 -04:00
ChenXiaoSong	1d7f6b302b	nfsd: use DEFINE_SHOW_ATTRIBUTE to define client_info_fops Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code. inode is converted from seq_file->file instead of seq_file->private in client_info_show(). Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:49 -04:00
ChenXiaoSong	9beeaab8e0	nfsd: use DEFINE_SHOW_ATTRIBUTE to define export_features_fops and supported_enctypes_fops Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code. Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> [ cel: reduce line length ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:49 -04:00
ChenXiaoSong	0cfb0c4228	nfsd: use DEFINE_PROC_SHOW_ATTRIBUTE to define nfsd_proc_ops Use DEFINE_PROC_SHOW_ATTRIBUTE helper macro to simplify the code. Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:49 -04:00
Chuck Lever	9f553e61bd	NFSD: Pack struct nfsd4_compoundres Remove a couple of 4-byte holes on platforms with 64-bit pointers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:49 -04:00
Chuck Lever	77e378cf2a	NFSD: Remove unused nfsd4_compoundargs::cachetype field This field was added by commit `1091006c5e` ("nfsd: turn on reply cache for NFSv4") but was never put to use. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:48 -04:00
Chuck Lever	6604148cf9	NFSD: Remove "inline" directives on op_rsize_bop helpers These helpers are always invoked indirectly, so the compiler can't inline these anyway. While we're updating the synopses of these helpers, defensively convert their parameters to const pointers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:48 -04:00
Chuck Lever	9993a66317	NFSD: Clean up nfs4svc_encode_compoundres() In today's Linux NFS server implementation, the NFS dispatcher initializes each XDR result stream, and the NFSv4 .pc_func and .pc_encode methods all use xdr_stream-based encoding. This keeps rq_res.len automatically updated. There is no longer a need for the WARN_ON_ONCE() check in nfs4svc_encode_compoundres(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:48 -04:00
Chuck Lever	d4da5baa53	NFSD: Clean up WRITE arg decoders xdr_stream_subsegment() already returns a boolean value. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:47 -04:00
Chuck Lever	c3d2a04f05	NFSD: Use xdr_inline_decode() to decode NFSv3 symlinks Replace the check for buffer over/underflow with a helper that is commonly used for this purpose. The helper also sets xdr->nwords correctly after successfully linearizing the symlink argument into the stream's scratch buffer. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:47 -04:00
Chuck Lever	98124f5bd6	NFSD: Refactor common code out of dirlist helpers The dust has settled a bit and it's become obvious what code is totally common between nfsd_init_dirlist_pages() and nfsd3_init_dirlist_pages(). Move that common code to SUNRPC. The new helper brackets the existing xdr_init_decode_pages() API. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:47 -04:00
Chuck Lever	3fdc546462	NFSD: Reduce amount of struct nfsd4_compoundargs that needs clearing Have SunRPC clear everything except for the iops array. Then have each NFSv4 XDR decoder clear it's own argument before decoding. Now individual operations may have a large argument struct while not penalizing the vast majority of operations with a small struct. And, clearing the argument structure occurs as the argument fields are initialized, enabling the CPU to do write combining on that memory. In some cases, clearing is not even necessary because all of the fields in the argument structure are initialized by the decoder. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:42 -04:00
Chuck Lever	103cc1fafe	SUNRPC: Parametrize how much of argsize should be zeroed Currently, SUNRPC clears the whole of .pc_argsize before processing each incoming RPC transaction. Add an extra parameter to struct svc_procedure to enable upper layers to reduce the amount of each operation's argument structure that is zeroed by SUNRPC. The size of struct nfsd4_compoundargs, in particular, is a lot to clear on each incoming RPC Call. A subsequent patch will cut this down to something closer to what NFSv2 and NFSv3 uses. This patch should cause no behavior changes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:42 -04:00
Dai Ngo	7746b32f46	NFSD: add shrinker to reap courtesy clients on low memory condition Add courtesy_client_reaper to react to low memory condition triggered by the system memory shrinker. The delayed_work for the courtesy_client_reaper is scheduled on the shrinker's count callback using the laundry_wq. The shrinker's scan callback is not used for expiring the courtesy clients due to potential deadlocks. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:41 -04:00
Dai Ngo	3a4ea23d86	NFSD: keep track of the number of courtesy clients in the system Add counter nfs4_courtesy_client_count to nfsd_net to keep track of the number of courtesy clients in the system. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:41 -04:00
Anna Schumaker	06981d5606	NFSD: Return nfserr_serverfault if splice_ok but buf->pages have data This was discussed with Chuck as part of this patch set. Returning nfserr_resource was decided to not be the best error message here, and he suggested changing to nfserr_serverfault instead. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Link: https://lore.kernel.org/linux-nfs/20220907195259.926736-1-anna@kernel.org/T/#t Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:41 -04:00
Chuck Lever	5f5f8b6d65	NFSD: Make nfsd4_remove() wait before returning NFS4ERR_DELAY nfsd_unlink() can kick off a CB_RECALL (via vfs_unlink() -> leases_conflict()) if a delegation is present. Before returning NFS4ERR_DELAY, give the client holding that delegation a chance to return it and then retry the nfsd_unlink() again, once. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=354 Tested-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:41 -04:00
Chuck Lever	68c522afd0	NFSD: Make nfsd4_rename() wait before returning NFS4ERR_DELAY nfsd_rename() can kick off a CB_RECALL (via vfs_rename() -> leases_conflict()) if a delegation is present. Before returning NFS4ERR_DELAY, give the client holding that delegation a chance to return it and then retry the nfsd_rename() again, once. This version of the patch handles renaming an existing file, but does not deal with renaming onto an existing file. That case will still always trigger an NFS4ERR_DELAY. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=354 Tested-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:40 -04:00
Chuck Lever	34b91dda71	NFSD: Make nfsd4_setattr() wait before returning NFS4ERR_DELAY nfsd_setattr() can kick off a CB_RECALL (via notify_change() -> break_lease()) if a delegation is present. Before returning NFS4ERR_DELAY, give the client holding that delegation a chance to return it and then retry the nfsd_setattr() again, once. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=354 Tested-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:36 -04:00
Chuck Lever	c0aa1913db	NFSD: Refactor nfsd_setattr() Move code that will be retried (in a subsequent patch) into a helper function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:32 -04:00
Chuck Lever	c035362eb9	NFSD: Add a mechanism to wait for a DELEGRETURN Subsequent patches will use this mechanism to wake up an operation that is waiting for a client to return a delegation. The new tracepoint records whether the wait timed out or was properly awoken by the expected DELEGRETURN: nfsd-1155 [002] 83799.493199: nfsd_delegret_wakeup: xid=0x14b7d6ef fh_hash=0xf6826792 (timed out) Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:32 -04:00
Chuck Lever	1035d65446	NFSD: Add tracepoints to report NFSv4 callback completions Wireshark has always been lousy about dissecting NFSv4 callbacks, especially NFSv4.0 backchannel requests. Add tracepoints so we can surgically capture these events in the trace log. Tracepoints are time-stamped and ordered so that we can now observe the timing relationship between a CB_RECALL Reply and the client's DELEGRETURN Call. Example: nfsd-1153 [002] 211.986391: nfsd_cb_recall: addr=192.168.1.67:45767 client 62ea82e4:fee7492a stateid 00000003:00000001 nfsd-1153 [002] 212.095634: nfsd_compound: xid=0x0000002c opcnt=2 nfsd-1153 [002] 212.095647: nfsd_compound_status: op=1/2 OP_PUTFH status=0 nfsd-1153 [002] 212.095658: nfsd_file_put: hash=0xf72 inode=0xffff9291148c7410 ref=3 flags=HASHED\|REFERENCED may=READ file=0xffff929103b3ea00 nfsd-1153 [002] 212.095661: nfsd_compound_status: op=2/2 OP_DELEGRETURN status=0 kworker/u25:8-148 [002] 212.096713: nfsd_cb_recall_done: client 62ea82e4:fee7492a stateid 00000003:00000001 status=0 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:32 -04:00
Chuck Lever	de29cf7e6c	NFSD: Trace NFSv4 COMPOUND tags The Linux NFSv4 client implementation does not use COMPOUND tags, but the Solaris and MacOS implementations do, and so does pynfs. Record these eye-catchers in the server's trace buffer to annotate client requests while troubleshooting. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:31 -04:00
Chuck Lever	948755efc9	NFSD: Replace dprintk() call site in fh_verify() Record permission errors in the trace log. Note that the new trace event is conditional, so it will only record non-zero return values from nfsd_permission(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-09-26 14:02:31 -04:00
Gaosheng Cui	18224dc58d	nfsd: remove nfsd4_prepare_cb_recall() declaration nfsd4_prepare_cb_recall() has been removed since commit `0162ac2b97` ("nfsd: introduce nfsd4_callback_ops"), so remove it. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:31 -04:00
Jeff Layton	6106d9119b	nfsd: clean up mounted_on_fileid handling We only need the inode number for this, not a full rack of attributes. Rename this function make it take a pointer to a u64 instead of struct kstat, and change it to just request STATX_INO. Signed-off-by: Jeff Layton <jlayton@kernel.org> [ cel: renamed get_mounted_on_ino() ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:30 -04:00
Chuck Lever	7518a3dc5e	NFSD: Fix handling of oversized NFSv4 COMPOUND requests If an NFS server returns NFS4ERR_RESOURCE on the first operation in an NFSv4 COMPOUND, there's no way for a client to know where the problem is and then simplify the compound to make forward progress. So instead, make NFSD process as many operations in an oversized COMPOUND as it can and then return NFS4ERR_RESOURCE on the first operation it did not process. pynfs NFSv4.0 COMP6 exercises this case, but checks only for the COMPOUND status code, not whether the server has processed any of the operations. pynfs NFSv4.1 SEQ6 and SEQ7 exercise the NFSv4.1 case, which detects too many operations per COMPOUND by checking against the limits negotiated when the session was created. Suggested-by: Bruce Fields <bfields@fieldses.org> Fixes: `0078117c6d` ("nfsd: return RESOURCE not GARBAGE_ARGS on too many ops") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:30 -04:00
NeilBrown	9558f9304c	NFSD: drop fname and flen args from nfsd_create_locked() nfsd_create_locked() does not use the "fname" and "flen" arguments, so drop them from declaration and all callers. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:30 -04:00
Chuck Lever	fa6be9cc6e	NFSD: Protect against send buffer overflow in NFSv3 READ Since before the git era, NFSD has conserved the number of pages held by each nfsd thread by combining the RPC receive and send buffers into a single array of pages. This works because there are no cases where an operation needs a large RPC Call message and a large RPC Reply at the same time. Once an RPC Call has been received, svc_process() updates svc_rqst::rq_res to describe the part of rq_pages that can be used for constructing the Reply. This means that the send buffer (rq_res) shrinks when the received RPC record containing the RPC Call is large. A client can force this shrinkage on TCP by sending a correctly- formed RPC Call header contained in an RPC record that is excessively large. The full maximum payload size cannot be constructed in that case. Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:30 -04:00
Chuck Lever	401bc1f908	NFSD: Protect against send buffer overflow in NFSv2 READ Since before the git era, NFSD has conserved the number of pages held by each nfsd thread by combining the RPC receive and send buffers into a single array of pages. This works because there are no cases where an operation needs a large RPC Call message and a large RPC Reply at the same time. Once an RPC Call has been received, svc_process() updates svc_rqst::rq_res to describe the part of rq_pages that can be used for constructing the Reply. This means that the send buffer (rq_res) shrinks when the received RPC record containing the RPC Call is large. A client can force this shrinkage on TCP by sending a correctly- formed RPC Call header contained in an RPC record that is excessively large. The full maximum payload size cannot be constructed in that case. Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:29 -04:00
Chuck Lever	640f87c190	NFSD: Protect against send buffer overflow in NFSv3 READDIR Since before the git era, NFSD has conserved the number of pages held by each nfsd thread by combining the RPC receive and send buffers into a single array of pages. This works because there are no cases where an operation needs a large RPC Call message and a large RPC Reply message at the same time. Once an RPC Call has been received, svc_process() updates svc_rqst::rq_res to describe the part of rq_pages that can be used for constructing the Reply. This means that the send buffer (rq_res) shrinks when the received RPC record containing the RPC Call is large. A client can force this shrinkage on TCP by sending a correctly- formed RPC Call header contained in an RPC record that is excessively large. The full maximum payload size cannot be constructed in that case. Thanks to Aleksi Illikainen and Kari Hulkko for uncovering this issue. Reported-by: Ben Ronallo <Benjamin.Ronallo@synopsys.com> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:26 -04:00
Chuck Lever	00b4492686	NFSD: Protect against send buffer overflow in NFSv2 READDIR Restore the previous limit on the @count argument to prevent a buffer overflow attack. Fixes: `53b1119a6e` ("NFSD: Fix READDIR buffer overflow") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:26 -04:00
Chuck Lever	80e591ce63	NFSD: Increase NFSD_MAX_OPS_PER_COMPOUND When attempting an NFSv4 mount, a Solaris NFSv4 client builds a single large COMPOUND that chains a series of LOOKUPs to get to the pseudo filesystem root directory that is to be mounted. The Linux NFS server's current maximum of 16 operations per NFSv4 COMPOUND is not large enough to ensure that this works for paths that are more than a few components deep. Since NFSD_MAX_OPS_PER_COMPOUND is mostly a sanity check, and most NFSv4 COMPOUNDS are between 3 and 6 operations (thus they do not trigger any re-allocation of the operation array on the server), increasing this maximum should result in little to no impact. The ops array can get large now, so allocate it via vmalloc() to help ensure memory fragmentation won't cause an allocation failure. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216383 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:25 -04:00
Christophe JAILLET	30a30fcc3f	nfsd: Propagate some error code returned by memdup_user() Propagate the error code returned by memdup_user() instead of a hard coded -EFAULT. Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:22 -04:00
Christophe JAILLET	d44899b8bb	nfsd: Avoid some useless tests memdup_user() can't return NULL, so there is no point for checking for it. Simplify some tests accordingly. Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:21 -04:00
Christophe JAILLET	fd1ef88049	nfsd: Fix a memory leak in an error handling path If this memdup_user() call fails, the memory allocated in a previous call a few lines above should be freed. Otherwise it leaks. Fixes: `6ee95d1c89` ("nfsd: add support for upcall version 2") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:21 -04:00
Jinpeng Cui	4ab3442ca3	NFSD: remove redundant variable status Return value directly from fh_verify() do_open_permission() exp_pseudoroot() instead of getting value from redundant variable status. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:21 -04:00
Olga Kornievskaia	754035ff79	NFSD enforce filehandle check for source file in COPY If the passed in filehandle for the source file in the COPY operation is not a regular file, the server MUST return NFS4ERR_WRONG_TYPE. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:20 -04:00
Wolfram Sang	97f8e62572	lockd: move from strlcpy with unused retval to strscpy Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:20 -04:00
Wolfram Sang	72f78ae00a	NFSD: move from strlcpy with unused retval to strscpy Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-09-26 14:02:20 -04:00
Jan Kara	a078dff870	ext4: fixup possible uninitialized variable access in ext4_mb_choose_next_group_cr1() Variable 'grp' may be left uninitialized if there's no group with suitable average fragment size (or larger). Fix the problem by initializing it earlier. Link: https://lore.kernel.org/r/20220922091542.pkhedytey7wzp5fi@quack3 Fixes: `83e80a6e35` ("ext4: use buckets for cr 1 block scan instead of rbtree") Cc: stable@kernel.org Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-26 13:21:05 -04:00
Gao Xiang	5c2a64252c	erofs: introduce partial-referenced pclusters Due to deduplication for compressed data, pclusters can be partially referenced with their prefixes. Together with the user-space implementation, it enables EROFS variable-length global compressed data deduplication with rolling hash. Link: https://lore.kernel.org/r/20220923014915.4362-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-26 23:55:43 +08:00
Yue Hu	b15b2e307c	erofs: support on-disk compressed fragments data Introduce on-disk compressed fragments data feature. This approach adds a new field called `h_fragmentoff' in the per-file compression header to indicate the fragment offset of each tail pcluster or the whole file in the special packed inode. Similar to ztailpacking, it will also find and record the 'headlcn' of the tail pcluster when initializing per-inode zmap for making follow-on requests more easy. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/YzHKxcFTlHGgXeH9@B-P7TQMD6M-0146.local Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-26 23:55:39 +08:00
Alexander Aring	3b7610302a	fs: dlm: fix possible use after free if tracing This patch fixes a possible use after free if tracing for the specific event is enabled. To avoid the use after free we introduce a out_put label like all other user lock specific requests and safe in a boolean to do a put or not which depends on the execution path of dlm_user_request(). Cc: stable@vger.kernel.org Fixes: `7a3de7324c` ("fs: dlm: trace user space callbacks") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2022-09-26 09:58:07 -05:00
Jan Kara	e7c7fbb9a8	ext2: Use kvmalloc() for group descriptor array Array of group descriptor block buffers can get rather large. In theory in can reach 1MB for perfectly valid filesystem and even more for maliciously crafted ones. Use kvmalloc() to allocate the array to avoid straining memory allocator with large order allocations unnecessarily. Reported-by: syzbot+0f2f7e65a3007d39539f@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz>	2022-09-26 14:59:52 +02:00
Jan Kara	d766f2d1e3	ext2: Add sanity checks for group and filesystem size Add sanity check that filesystem size does not exceed the underlying device size and that group size is big enough so that metadata can fit into it. This avoid trying to mount some crafted filesystems with extremely large group counts. Reported-by: syzbot+0f2f7e65a3007d39539f@syzkaller.appspotmail.com Reported-by: kernel test robot <oliver.sang@intel.com> # Test fixup CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2022-09-26 14:59:47 +02:00
Josef Bacik	857bc13f85	btrfs: implement a nowait option for tree searches For NOWAIT IOCBs we'll need a way to tell search to not wait on locks or anything. Accomplish this by adding a path->nowait flag that will use trylocks and skip reading of metadata, returning -EAGAIN in either of these cases. For now we only need this for reads, so only the read side is handled. Add an ASSERT() to catch anybody trying to use this for writes so they know they'll have to implement the write side. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:46:42 +02:00
Qu Wenruo	d7f67ac9a9	btrfs: relax block-group-tree feature dependency checks [BUG] When one user did a wrong attempt to clear block group tree, which can not be done through mount option, by using "-o clear_cache,space_cache=v2", it will cause the following error on a fs with block-group-tree feature: BTRFS info (device dm-1): force clearing of disk cache BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): clearing free space tree BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) BTRFS error (device dm-1): block-group-tree feature requires fres-space-tree and no-holes BTRFS error (device dm-1): super block corruption detected before writing it to disk BTRFS: error (device dm-1) in write_all_supers:4318: errno=-117 Filesystem corrupted (unexpected superblock corruption detected) BTRFS warning (device dm-1: state E): Skipping commit of aborted transaction. [CAUSE] Although the dependency for block-group-tree feature is just an artificial one (to reduce test matrix), we put the dependency check into btrfs_validate_super(). This is too strict, and during space cache clearing, we will have a window where free space tree is cleared, and we need to commit the super block. In that window, we had block group tree without v2 cache, and triggered the artificial dependency check. This is not necessary at all, especially for such a soft dependency. [FIX] Introduce a new helper, btrfs_check_features(), to do all the runtime limitation checks, including: - Unsupported incompat flags check - Unsupported compat RO flags check - Setting missing incompat flags - Artificial feature dependency checks Currently only block group tree will rely on this. - Subpage runtime check for v1 cache With this helper, we can move quite some checks from open_ctree()/btrfs_remount() into it, and just call it after btrfs_parse_options(). Now "-o clear_cache,space_cache=v2" will not trigger the above error anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit messages ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:07 +02:00
Qu Wenruo	5467abba1c	btrfs: move end_io_func argument to btrfs_bio_ctrl structure For function submit_extent_page() and alloc_new_bio(), we have an argument @end_io_func to indicate the end io function. But that function never change inside any call site of them, thus no need to pass the pointer around everywhere. There is a better match for the lifespan of all the call sites, as we have btrfs_bio_ctrl structure, thus we can put the endio function pointer there, and grab the pointer every time we allocate a new bio. Also add extra ASSERT()s to make sure every call site of submit_extent_page() and alloc_new_bio() has properly set the pointer inside btrfs_bio_ctrl. This removes one argument from the already long argument list of submit_extent_page(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:07 +02:00
Qu Wenruo	209ecde55c	btrfs: switch page and disk_bytenr argument position for submit_extent_page() Normally we put (page, pg_len, pg_offset) arguments together, just like what __bio_add_page() does. But in submit_extent_page(), what we got is, (page, disk_bytenr, pg_len, pg_offset), which sometimes can be confusing. Change the order to (disk_bytenr, page, pg_len, pg_offset) to make it to follow the common schema. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:07 +02:00
Qu Wenruo	814b6f9158	btrfs: update the comment for submit_extent_page() Since commit `390ed29b81` ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier"), we are using bio_ctrl structure to replace some of arguments of submit_extent_page(). But unfortunately that commit didn't update the comment for submit_extent_page(), thus some arguments are stale like: - bio_ret - mirror_num Those are all contained in bio_ctrl now. - prev_bio_flags We no longer use this flag to determine if we can merge bios. Update the comment for submit_extent_page() to keep it up-to-date. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:07 +02:00
Josef Bacik	d692173944	btrfs: add struct declarations in dev-replace.h dev-replace.h just has function prototypes for device replace, however if you happen to include it in the wrong order you'll get compile errors because of different structures not being defined. Since these are just pointer args to functions we can declare them at the top in order to reduce the pain of using the header. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:07 +02:00
Josef Bacik	9b9b885465	btrfs: use a runtime flag to indicate an inode is a free space inode We always check the root of an inode as well as it's inode number to determine if it's a free space inode. This is problematic as the helper is in a header file where it doesn't have the fs_info definition. To avoid this and make the check a little cleaner simply add a flag to the runtime_flags to indicate that the inode is a free space inode, set that when we create the inode, and then change the helper to check for this flag. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:07 +02:00
Josef Bacik	e256927b88	btrfs: open code and remove btrfs_insert_inode_hash helper This exists to insert the btree_inode in the super blocks inode hash table. Since it's only used for the btree inode move the code to where we use it in disk-io.c and remove the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	ee8ba05cbb	btrfs: open code and remove btrfs_inode_sectorsize helper This is defined in btrfs_inode.h, and dereferences btrfs_root and btrfs_fs_info, both of which aren't defined in btrfs_inode.h. Additionally, in many places we already have root or fs_info, so this helper often makes the code harder to read. So delete the helper and simply open code it in the few places that we use it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	2b6433c7f6	btrfs: move btrfs_ordered_sum_size into file-item.c This is defined in ordered-data.h, but is only used in file-item.c. Move this to file-item.c as it doesn't need to be global. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	d9d88fde56	btrfs: move the fs_info related helpers closer to fs_info in ctree.h This is purely cosmetic, to make it straightforward to copy and paste the definition and helpers from ctree.h into fs.h. These are helpers that act directly on the fs_info, and were scattered throughout ctree.h. Move them directly below the fs_info definition to make it easier to move them later. This includes the exclop prototypes, which shares an enum that's used in struct btrfs_fs_info as well. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	f119553fd3	btrfs: move btrfs_csum_ptr to inode.c This helper is only used in inode.c, move it locally to that file instead of defining it in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	0e75f0054a	btrfs: move fs_info forward declarations to the top of ctree.h In order to make it more straightforward to move the fs_info struct and it's related structures, move the struct declarations to the top of ctree.h. This will make it easier to clean up after the fact. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	2103da3b0e	btrfs: move btrfs_swapfile_pin into volumes.h This isn't a great spot for this, but one of the swapfile helper functions is in volumes.c, so move the struct to volumes.h. In the future when we have better separation of code there will be a more natural spot for this. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	c2e79e865b	btrfs: move btrfs_pinned_by_swapfile prototype into volumes.h This is defined in volumes.c, move the prototype into volumes.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	43712116f8	btrfs: move btrfs_init_async_reclaim_work prototype to space-info.h The code for this helper is in space-info.c, move the prototype to space-info.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	c29abab4f9	btrfs: move btrfs_full_stripe_locks_tree into block-group.h This is actually embedded in struct btrfs_block_group, so move this definition to block-group.h, and then open-code the init of the tree where we init the rest of the block group instead of using a helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	16708a8898	btrfs: move btrfs_caching_type to block-group.h This is a block group related definition, move it into block-group.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Christoph Hellwig	bd86a532b2	btrfs: stop tracking failed reads in the I/O tree There is a separate I/O failure tree to track the fail reads, so remove the extra EXTENT_DAMAGED bit in the I/O tree as it's set but never used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	23408d8196	btrfs: remove is_data_inode() checks in extent-io-tree.c We're only initializing extent_io_tree's with a private data if we're a normal inode, so we don't need this extra check. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	efb0645bd9	btrfs: don't init io tree with private data for non-inodes We only use this for normal inodes, so don't set it if we're not a normal inode. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	bd015294af	btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	b71fb16b2f	btrfs: don't clear CTL bits when trying to release extent state When trying to release the extent states due to memory pressure we'll set all the bits except LOCKED, NODATASUM, and DELALLOC_NEW. This includes some of the CTL bits, which isn't really a problem but isn't correct either. Exclude the CTL bits from this clearing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	71528e9e16	btrfs: get rid of extent_io_tree::dirty_bytes This was used as an optimization for count_range_bits(EXTENT_DIRTY), which was used by the failed record code. However this was removed in this series by patch "btrfs: convert the io_failure_tree to a plain rb_tree" which was the last user of this optimization. Remove the ->dirty_bytes as nobody cares anymore. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	4374d03d21	btrfs: remove extent_io_tree::track_uptodate Since commit 78361f64ff42 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path") we no longer check ->track_uptodate, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	570eb97bac	btrfs: unify the lock/unlock extent variants We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	291bbb1e7e	btrfs: drop extent_changeset from set_extent_bit The only places that set extent_changeset is set_record_extent_bits, everywhere else sets it to NULL. Drop this argument from set_extent_bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	994bcd1eae	btrfs: remove failed_start argument from set_extent_bit This is only used for internal locking related helpers, everybody else just passes in NULL. I've changed set_extent_bit to __set_extent_bit and made it static, removed failed_start from set_extent_bit and have it call __set_extent_bit with a NULL failed_start, and I've moved some code down below the now static __set_extent_bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	dbbf49928f	btrfs: remove the wake argument from clear_extent_bits This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	c07d1004c5	btrfs: drop exclusive_bits from set_extent_bit This is only ever set if we have EXTENT_LOCKED set, so simply push this into the function itself and remove the function argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	d6f65c27f5	btrfs: move extent io tree unrelated prototypes to their appropriate header These prototypes have nothing to do with the extent_io_tree helpers, move them to their appropriate header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	e63b81aef2	btrfs: use next_state/prev_state in merge_state We use rb_next/rb_prev and then get the entry for the adjacent items in an extent io tree. We have helpers for this, so convert merge_state to use next_state/prev_state and simplify the code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	43b068cad5	btrfs: make tree_search_prev_next return extent_state's Instead of doing the rb_entry again once we return from this function, simply return the actual states themselves, and then clean up the only user of this helper to handle states instead of nodes. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	e349fd3bfb	btrfs: make tree_search_for_insert return extent_state We use this to search for an extent state, or return the nodes we need to insert a new extent state. This means we have the following pattern node = tree_search_for_insert(); if (!node) { /* alloc and insert. */ goto again; } state = rb_entry(node, struct extent_state, rb_node); we don't use the node for anything else. Making tree_search_for_insert() return the extent_state means we can drop the rb_node and clean this up by eliminating the rb_entry. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	aa852dabf9	btrfs: make tree_search return struct extent_state We have a consistent pattern of n = tree_search(); if (!n) goto out; state = rb_entry(n, struct extent_state, rb_node); while (state) { /* do something. / } which is a bit redundant. If we make tree_search return the state we can simply have state = tree_search(); while (state) { / do something. */ } which cleans up the code quite a bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	ccaeff9290	btrfs: use next_state instead of rb_next where we can We can simplify a lot of these functions where we have to cycle through extent_state's by simply using next_state() instead of rb_next(). In many spots this allows us to do things like while (state) { /* whatever */ state = next_state(state); } instead of while (1) { state = rb_entry(n, struct extent_state, rb_node); n = rb_next(n); if (!n) break; } Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	071d19f513	btrfs: remove struct tree_entry in extent-io-tree.c This existed when we overloaded the tree manipulation functions for both the extent_io_tree and the extent buffer tree. However the extent buffers are now stored in a radix tree, so we no longer need this abstraction. Remove struct tree_entry and use extent_state directly instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	a4055213bf	btrfs: unexport all the temporary exports for extent-io-tree.c Now that we've moved everything we can unexport all the temporary exports, move the random helpers, and mark everything as static again. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Josef Bacik	d8038a1f46	btrfs: unexport btrfs_debug_check_extent_io_range We no longer need to export this as all users are in extent-io-tree.c, remove it from the header and put it into extent-io-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	e3974c6694	btrfs: move core extent_io_tree functions to extent-io-tree.c This is still huge, but unfortunately I cannot make it smaller without renaming tree_search() and changing all the callers to use the new name, then moving those chunks and then changing the name back. This feels like too much churn for code movement, so I've limited this to only things that called tree_search(). With this patch all of the extent_io_tree code is now in extent-io-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	3883001838	btrfs: move a few exported extent_io_tree helpers to extent-io-tree.c These are the last few helpers that do not rely on tree_search() and who's other helpers are exported and in extent-io-tree.c already. Move these across now in order to make the core move smaller. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	04eba89323	btrfs: temporarily export and then move extent state helpers In order to avoid moving all of the related code at once temporarily export all of the extent state related helpers. Then move these helpers into extent-io-tree.c. We will clean up the exports and make them static in followup patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	91af24e484	btrfs: temporarily export and move core extent_io_tree tree functions A lot of the various internals of extent_io_tree call these two functions for insert or searching the rb tree for entries, so temporarily export them and then move them to extent-io-tree.c. We can't move tree_search() without renaming it, and I don't want to introduce a bunch of churn just to do that, so move these functions first and then we can move a few big functions and then the remaining users of tree_search(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	6962541e96	btrfs: move btrfs_debug_check_extent_io_range into extent-io-tree.c This helper is used by a lot of the core extent_io_tree helpers, so temporarily export it and move it into extent-io-tree.c in order to make it straightforward to migrate the helpers in batches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	ec39e39bbf	btrfs: export wait_extent_bit This is used by the subpage code in addition to lock_extent_bits, so export it so we can move it out of extent_io.c Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	a66318872c	btrfs: move simple extent bit helpers out of extent_io.c These are just variants and wrappers around the actual work horses of the extent state. Extract these out of extent_io.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	ad79532957	btrfs: convert BUG_ON(EXTENT_BIT_LOCKED) checks to ASSERT's We only call these functions from the qgroup code which doesn't call with EXTENT_BIT_LOCKED. These are BUG_ON()'s that exist to keep us developers from using these functions with EXTENT_BIT_LOCKED, so convert them to ASSERT()'s. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	83cf709a89	btrfs: move extent state init and alloc functions to their own file Start cleaning up extent_io.c by moving the extent state code out of it. This patch starts with the extent state allocation code and the extent_io_tree init code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	c45379a20f	btrfs: temporarily export alloc_extent_state helpers We're going to move this code in stages, but while we're doing that we need to export these helpers so we can more easily move the code into the new file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:03 +02:00
Josef Bacik	a40246e8af	btrfs: separate out the eb and extent state leak helpers Currently we have the add/del functions generic so that we can use them for both extent buffers and extent states. We want to separate this code however, so separate these helpers into per-object helpers in anticipation of the split. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Josef Bacik	a62a3bd954	btrfs: separate out the extent state and extent buffer init code In order to help separate the extent buffer from the extent io tree code we need to break up the init functions. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Josef Bacik	cdca85b092	btrfs: use find_first_extent_bit in btrfs_clean_io_failure Currently we're using find_first_extent_bit_state to check if our state contains the given failrec range, however this is more of an internal extent_io_tree helper, and is technically unsafe to use because we're accessing the state outside of the extent_io_tree lock. Instead use the normal helper find_first_extent_bit which returns the range of the extent state we find in find_first_extent_bit_state and use that to do our sanity checking. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Josef Bacik	87c11705cc	btrfs: convert the io_failure_tree to a plain rb_tree We still have this oddity of stashing the io_failure_record in the extent state for the io_failure_tree, which is leftover from when we used to stuff private pointers in extent_io_trees. However this doesn't make a lot of sense for the io failure records, we can simply use a normal rb_tree for this. This will allow us to further simplify the extent_io_tree code by removing the io_failure_rec pointer from the extent state. Convert the io_failure_tree to an rb tree + spinlock in the inode, and then use our rb tree simple helpers to insert and find failed records. This greatly cleans up this code and makes it easier to separate out the extent_io_tree code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Josef Bacik	a206174805	btrfs: unexport internal failrec functions These are internally used functions and are not used outside of extent_io.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Josef Bacik	0d0a762c41	btrfs: rename clean_io_failure and remove extraneous args This is exported, so rename it to btrfs_clean_io_failure. Additionally we are passing in the io tree's and such from the inode, so instead of doing all that simply pass in the inode itself and get all the components we need directly inside of btrfs_clean_io_failure. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
David Sterba	748f553c3c	btrfs: add KCSAN annotations for unlocked access to block_rsv->full KCSAN reports that there's unlocked access mixed with locked access, which is technically correct but is not a bug. To avoid false alerts at least from KCSAN, add annotation and use a wrapper whenever ->full is accessed for read outside of lock. It is used as a fast check and only advisory. In the worst case the block reserve is found !full and becomes full in the meantime, but properly handled. Depending on the value of ->full, btrfs_block_rsv_release decides where to return the reservation, and block_rsv_release_bytes handles a NULL pointer for block_rsv and if it's not NULL then it double checks the full status under a lock. Link: https://lore.kernel.org/linux-btrfs/CAAwBoOJDjei5Hnem155N_cJwiEkVwJYvgN-tQrwWbZQGhFU=cA@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/YvHU/vsXd7uz5V6j@hungrycats.org Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Filipe Manana	b0b47a3859	btrfs: remove useless used space increment during space reservation At space-info.c:__reserve_bytes(), we increment the 'used' variable, but then we don't use the variable anymore, making the increment pointless. The increment became useless with commit `2e294c6049` ("btrfs: simplify the logic in need_preemptive_flushing"), so just remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Christoph Hellwig	650c8a9c7d	btrfs: zoned: refactor device checks in btrfs_check_zoned_mode btrfs_check_zoned_mode is really hard to follow, mostly due to the fact that a lot of the checks use duplicate conditions after support for zone emulation for conventional devices on file systems with the ZONED flag was added. Fix this by factoring out the check for host managed devices for !ZONED file systems into a separate helper and then simplifying the rest of the code. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Christophe JAILLET	03ad25310f	btrfs: qgroup: fix a typo in a comment Add a missing 'r'. s/qgoup/qgroup/ . Codespell does not catch that for some reason. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:02 +02:00
Gaosheng Cui	6ea1a5264b	btrfs: remove btrfs_bit_radix_cachep declaration btrfs_bit_radix_cachep has been removed since commit `45c06543af` ("Btrfs: remove unused btrfs_bit_radix slab"), so remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Qu Wenruo	011b46c304	btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction() Btrfs qgroup has a long history of bringing performance penalty in btrfs_commit_transaction(). Although we tried our best to migrate such impact, there is still an unsolved call site, btrfs_drop_snapshot(). This function will find the highest shared tree block and modify its extent ownership to do a subvolume/snapshot dropping. Such change will affect the whole subtree, and cause tons of qgroup dirty extents and stall btrfs_commit_transaction(). To avoid such problem, here we introduce a new sysfs interface, /sys/fs/btrfs/<uuid>/qgroups/drop_subptree_threshold, to determine at whether and at which level we should skip qgroup accounting for subtree dropping. The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go through qgroup accounting, to ensure qgroup numbers are kept as consistent as possible. While for performance sensitive cases, add a way to change the values to more reasonable values like 3, to make any subtree, which is at or higher than level 3, to mark qgroup inconsistent and skip the accounting. The cost is obvious, the qgroup number is no longer consistent, but at least performance is more reasonable, and users have the control. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Qu Wenruo	e15e9f43c7	btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting The new flag will make btrfs qgroup skip all its time consuming qgroup accounting. The lifespan is the same as BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN, only get cleared after a new rescan. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Qu Wenruo	e562a8bdf6	btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN Introduce a new runtime flag, BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN, which will inform qgroup rescan to cancel its work asynchronously. This is to address the window when an operation makes qgroup numbers inconsistent (like qgroup inheriting) while a qgroup rescan is running. In that case, qgroup inconsistent flag will be cleared when qgroup rescan finishes. But we changed the ownership of some extents, which means the rescan is already meaningless, and the qgroup inconsistent flag should not be cleared. With the new flag, each time we set INCONSISTENT flag, we also set this new flag to inform any running qgroup rescan to exit immediately, and leaving the INCONSISTENT flag there. The new runtime flag can only be cleared when a new rescan is started. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Qu Wenruo	e71564c043	btrfs: introduce BTRFS_QGROUP_STATUS_FLAGS_MASK for later expansion Currently we only have 3 qgroup flags: - BTRFS_QGROUP_STATUS_FLAG_ON - BTRFS_QGROUP_STATUS_FLAG_RESCAN - BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT These flags match the on-disk flags used in btrfs_qgroup_status. But we're going to introduce extra runtime flags which will not reach disks. So here we introduce a new mask, BTRFS_QGROUP_STATUS_FLAGS_MASK, to make sure only those flags can reach disks. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Qu Wenruo	ed2e35d85d	btrfs: sysfs: introduce global qgroup attribute group Although we already have info kobject for each qgroup, we don't have global qgroup info attributes to show things like enabled or inconsistent status flags. Add this qgroups attribute groups, and the first member is qgroup_flags, which is a read-only attribute to show human readable qgroup flags. The path is: /sys/fs/btrfs/<uuid>/qgroups/enabled /sys/fs/btrfs/<uuid>/qgroups/inconsistent The output is simple, just 1 or 0. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Filipe Manana	ac3c0d36a2	btrfs: make fiemap more efficient and accurate reporting extent sharedness The current fiemap implementation does not scale very well with the number of extents a file has. This is both because the main algorithm to find out the extents has a high algorithmic complexity and because for each extent we have to check if it's shared. This second part, checking if an extent is shared, is significantly improved by the two previous patches in this patchset, while the first part is improved by this specific patch. Every now and then we get reports from users mentioning fiemap is too slow or even unusable for files with a very large number of extents, such as the two recent reports referred to by the Link tags at the bottom of this change log. To understand why the part of finding which extents a file has is very inefficient, consider the example of doing a full ranged fiemap against a file that has over 100K extents (normal for example for a file with more than 10G of data and using compression, which limits the extent size to 128K). When we enter fiemap at extent_fiemap(), the following happens: 1) Before entering the main loop, we call get_extent_skip_holes() to get the first extent map. This leads us to btrfs_get_extent_fiemap(), which in turn calls btrfs_get_extent(), to find the first extent map that covers the file range [0, LLONG_MAX). btrfs_get_extent() will first search the inode's extent map tree, to see if we have an extent map there that covers the range. If it does not find one, then it will search the inode's subvolume b+tree for a fitting file extent item. After finding the file extent item, it will allocate an extent map, fill it in with information extracted from the file extent item, and add it to the inode's extent map tree (which requires a search for insertion in the tree). 2) Then we enter the main loop at extent_fiemap(), emit the details of the extent, and call again get_extent_skip_holes(), with a start offset matching the end of the extent map we previously processed. We end up at btrfs_get_extent() again, will search the extent map tree and then search the subvolume b+tree for a file extent item if we could not find an extent map in the extent tree. We allocate an extent map, fill it in with the details in the file extent item, and then insert it into the extent map tree (yet another search in this tree). 3) The second step is repeated over and over, until we have processed the whole file range. Each iteration ends at btrfs_get_extent(), which does a red black tree search on the extent map tree, then searches the subvolume b+tree, allocates an extent map and then does another search in the extent map tree in order to insert the extent map. In the best scenario we have all the extent maps already in the extent tree, and so for each extent we do a single search on a red black tree, so we have a complexity of O(n log n). In the worst scenario we don't have any extent map already loaded in the extent map tree, or have very few already there. In this case the complexity is much higher since we do: - A red black tree search on the extent map tree, which has O(log n) complexity, initially very fast since the tree is empty or very small, but as we end up allocating extent maps and adding them to the tree when we don't find them there, each subsequent search on the tree gets slower, since it's getting bigger and bigger after each iteration. - A search on the subvolume b+tree, also O(log n) complexity, but it has items for all inodes in the subvolume, not just items for our inode. Plus on a filesystem with concurrent operations on other inodes, we can block doing the search due to lock contention on b+tree nodes/leaves. - Allocate an extent map - this can block, and can also fail if we are under serious memory pressure. - Do another search on the extent maps red black tree, with the goal of inserting the extent map we just allocated. Again, after every iteration this tree is getting bigger by 1 element, so after many iterations the searches are slower and slower. - We will not need the allocated extent map anymore, so it's pointless to add it to the extent map tree. It's just wasting time and memory. In short we end up searching the extent map tree multiple times, on a tree that is growing bigger and bigger after each iteration. And besides that we visit the same leaf of the subvolume b+tree many times, since a leaf with the default size of 16K can easily have more than 200 file extent items. This is very inefficient overall. This patch changes the algorithm to instead iterate over the subvolume b+tree, visiting each leaf only once, and only searching in the extent map tree for file ranges that have holes or prealloc extents, in order to figure out if we have delalloc there. It will never allocate an extent map and add it to the extent map tree. This is very similar to what was previously done for the lseek's hole and data seeking features. Also, the current implementation relying on extent maps for figuring out which extents we have is not correct. This is because extent maps can be merged even if they represent different extents - we do this to minimize memory utilization and keep extent map trees smaller. For example if we have two extents that are contiguous on disk, once we load the two extent maps, they get merged into a single one - however if only one of the extents is shared, we end up reporting both as shared or both as not shared, which is incorrect. This reproducer triggers that bug: $ cat fiemap-bug.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Create a file with two 256K extents. # Since there is no other write activity, they will be contiguous, # and their extent maps merged, despite having two distinct extents. xfs_io -f -c "pwrite -S 0xab 0 256K" \ -c "fsync" \ -c "pwrite -S 0xcd 256K 256K" \ -c "fsync" \ $MNT/foo # Now clone only the second extent into another file. xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar # Filefrag will report a single 512K extent, and say it's not shared. echo filefrag -v $MNT/foo umount $MNT Running the reproducer: $ ./fiemap-bug.sh wrote 262144/262144 bytes at offset 0 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec) wrote 262144/262144 bytes at offset 262144 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec) linked 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec) Filesystem type is: 9123683e File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 3328.. 3455: 128: last,eof /mnt/sdj/foo: 1 extent found We end up reporting that we have a single 512K that is not shared, however we have two 256K extents, and the second one is shared. Changing the reproducer to clone instead the first extent into file 'bar', makes us report a single 512K extent that is shared, which is algo incorrect since we have two 256K extents and only the first one is shared. This patch is part of a larger patchset that is comprised of the following patches: btrfs: allow hole and data seeking to be interruptible btrfs: make hole and data seeking a lot more efficient btrfs: remove check for impossible block start for an extent map at fiemap btrfs: remove zero length check when entering fiemap btrfs: properly flush delalloc when entering fiemap btrfs: allow fiemap to be interruptible btrfs: rename btrfs_check_shared() to a more descriptive name btrfs: speedup checking for extent sharedness during fiemap btrfs: skip unnecessary extent buffer sharedness checks during fiemap btrfs: make fiemap more efficient and accurate reporting extent sharedness The patchset was tested on a machine running a non-debug kernel (Debian's default config) and compared the tests below on a branch without the patchset versus the same branch with the whole patchset applied. The following test for a large compressed file without holes: $ cat fiemap-perf-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 40G gives 327680 128K file extents (due to compression). xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before patchset: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 3597 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 2107 milliseconds (metadata cached) After patchset: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1214 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 684 milliseconds (metadata cached) That's a speedup of about 3x for both cases (no metadata cached and all metadata cached). The test provided by Pavel (first Link tag at the bottom), which uses files with a large number of holes, was also used to measure the gains, and it consists on a small C program and a shell script to invoke it. The C program is the following: $ cat pavels-test.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <linux/fiemap.h> #define FILE_INTERVAL (1<<13) /* 8Kb / long long interval(struct timeval t1, struct timeval t2) { long long val = 0; val += (t2.tv_usec - t1.tv_usec); val += (t2.tv_sec - t1.tv_sec) 1000 * 1000; return val; } int main(int argc, char *argv) { struct fiemap fiemap = {}; struct timeval t1, t2; char data = 'a'; struct stat st; int fd, off, file_size = FILE_INTERVAL; if (argc != 3 && argc != 2) { printf("usage: %s <path> [size]\n", argv[0]); return 1; } if (argc == 3) file_size = atoi(argv[2]); if (file_size < FILE_INTERVAL) file_size = FILE_INTERVAL; file_size -= file_size % FILE_INTERVAL; fd = open(argv[1], O_RDWR \| O_CREAT \| O_TRUNC, 0644); if (fd < 0) { perror("open"); return 1; } for (off = 0; off < file_size; off += FILE_INTERVAL) { if (pwrite(fd, &data, 1, off) != 1) { perror("pwrite"); close(fd); return 1; } } if (ftruncate(fd, file_size)) { perror("ftruncate"); close(fd); return 1; } if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; } printf("size: %ld\n", st.st_size); printf("actual size: %ld\n", st.st_blocks 512); fiemap.fm_length = FIEMAP_MAX_OFFSET; gettimeofday(&t1, NULL); if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) { perror("fiemap"); close(fd); return 1; } gettimeofday(&t2, NULL); printf("fiemap: fm_mapped_extents = %d\n", fiemap.fm_mapped_extents); printf("time = %lld us\n", interval(t1, t2)); close(fd); return 0; } $ gcc -o pavels_test pavels_test.c And the wrapper shell script: $ cat fiemap-pavels-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f -O no-holes $DEV mount $DEV $MNT echo echo "********* 256M *******" echo ./pavels-test $MNT/testfile $((1 << 28)) echo ./pavels-test $MNT/testfile $((1 << 28)) echo echo "******* 512M *******" echo ./pavels-test $MNT/testfile $((1 << 29)) echo ./pavels-test $MNT/testfile $((1 << 29)) echo echo "******* 1G *******" echo ./pavels-test $MNT/testfile $((1 << 30)) echo ./pavels-test $MNT/testfile $((1 << 30)) umount $MNT Running his reproducer before applying the patchset: ******* 256M ******* size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = `4003133` us size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = `4895330` us ******* 512M ******* size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 30123675 us size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 33450934 us ******* 1G ******* size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 224924074 us size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 217239242 us Running it after applying the patchset: ******* 256M ******* size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 29475 us size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 29307 us ******* 512M ******* size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 58996 us size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 59115 us ******* 1G ********* size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 116251 time = 124141 us size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 119387 us The speedup is massive, both on the first fiemap call and on the second one as well, as his test creates files with many holes and small extents (every extent follows a hole and precedes another hole). For the 256M file we go from 4 seconds down to 29 milliseconds in the first run, and then from 4.9 seconds down to 29 milliseconds again in the second run, a speedup of 138x and 169x, respectively. For the 512M file we go from 30.1 seconds down to 59 milliseconds in the first run, and then from 33.5 seconds down to 59 milliseconds again in the second run, a speedup of 510x and 568x, respectively. For the 1G file, we go from 225 seconds down to 124 milliseconds in the first run, and then from 217 seconds down to 119 milliseconds in the second run, a speedup of 1815x and 1824x, respectively. Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Filipe Manana	b8f164e3e6	btrfs: skip unnecessary extent buffer sharedness checks during fiemap During fiemap, for each file extent we find, we must check if it's shared or not. The sharedness check starts by verifying if the extent is directly shared (its refcount in the extent tree is > 1), and if it is not directly shared, then we will check if every node in the subvolume b+tree leading from the root to the leaf that has the file extent item (in reverse order), is shared (through snapshots). However this second step is not needed if our extent was created in a transaction more recent than the last transaction where a snapshot of the inode's root happened, because it can't be shared indirectly (through shared subtrees) without a snapshot created in a more recent transaction. So grab the generation of the extent from the extent map and pass it to btrfs_is_data_extent_shared(), which will skip this second phase when the generation is more recent than the root's last snapshot value. Note that we skip this optimization if the extent map is the result of merging 2 or more extent maps, because in this case its generation is the maximum of the generations of all merged extent maps. The fact the we use extent maps and they can be merged despite the underlying extents being distinct (different file extent items in the subvolume b+tree and different extent items in the extent b+tree), can result in some bugs when reporting shared extents. But this is a problem of the current implementation of fiemap relying on extent maps. One example where we get incorrect results is: $ cat fiemap-bug.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Create a file with two 256K extents. # Since there is no other write activity, they will be contiguous, # and their extent maps merged, despite having two distinct extents. xfs_io -f -c "pwrite -S 0xab 0 256K" \ -c "fsync" \ -c "pwrite -S 0xcd 256K 256K" \ -c "fsync" \ $MNT/foo # Now clone only the second extent into another file. xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar # Filefrag will report a single 512K extent, and say it's not shared. echo filefrag -v $MNT/foo umount $MNT Running the reproducer: $ ./fiemap-bug.sh wrote 262144/262144 bytes at offset 0 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec) wrote 262144/262144 bytes at offset 262144 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec) linked 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec) Filesystem type is: 9123683e File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 3328.. 3455: 128: last,eof /mnt/sdj/foo: 1 extent found We end up reporting that we have a single 512K that is not shared, however we have two 256K extents, and the second one is shared. Changing the reproducer to clone instead the first extent into file 'bar', makes us report a single 512K extent that is shared, which is algo incorrect since we have two 256K extents and only the first one is shared. This is z problem that existed before this change, and remains after this change, as it can't be easily fixed. The next patch in the series reworks fiemap to primarily use file extent items instead of extent maps (except for checking for delalloc ranges), with the goal of improving its scalability and performance, but it also ends up fixing this particular bug caused by extent map merging. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Filipe Manana	12a824dc67	btrfs: speedup checking for extent sharedness during fiemap One of the most expensive tasks performed during fiemap is to check if an extent is shared. This task has two major steps: 1) Check if the data extent is shared. This implies checking the extent item in the extent tree, checking delayed references, etc. If we find the data extent is directly shared, we terminate immediately; 2) If the data extent is not directly shared (its extent item has a refcount of 1), then it may be shared if we have snapshots that share subtrees of the inode's subvolume b+tree. So we check if the leaf containing the file extent item is shared, then its parent node, then the parent node of the parent node, etc, until we reach the root node or we find one of them is shared - in which case we stop immediately. During fiemap we process the extents of a file from left to right, from file offset 0 to EOF. This means that we iterate b+tree leaves from left to right, and has the implication that we keep repeating that second step above several times for the same b+tree path of the inode's subvolume b+tree. For example, if we have two file extent items in leaf X, and the path to leaf X is A -> B -> C -> X, then when we try to determine if the data extent referenced by the first extent item is shared, we check if the data extent is shared - if it's not, then we check if leaf X is shared, if not, then we check if node C is shared, if not, then check if node B is shared, if not than check if node A is shared. When we move to the next file extent item, after determining the data extent is not shared, we repeat the checks for X, C, B and A - doing all the expensive searches in the extent tree, delayed refs, etc. If we have thousands of tile extents, then we keep repeating the sharedness checks for the same paths over and over. On a file that has no shared extents or only a small portion, it's easy to see that this scales terribly with the number of extents in the file and the sizes of the extent and subvolume b+trees. This change eliminates the repeated sharedness check on extent buffers by caching the results of the last path used. The results can be used as long as no snapshots were created since they were cached (for not shared extent buffers) or no roots were dropped since they were cached (for shared extent buffers). This greatly reduces the time spent by fiemap for files with thousands of extents and/or large extent and subvolume b+trees. Example performance test: $ cat fiemap-perf-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 40G gives 327680 128K file extents (due to compression). xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before this patch: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 3597 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 2107 milliseconds (metadata cached) After this patch: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1646 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 698 milliseconds (metadata cached) That's about 2.2x faster when no metadata is cached, and about 3x faster when all metadata is cached. On a real filesystem with many other files, data, directories, etc, the b+trees will be 2 or 3 levels higher, therefore this optimization will have a higher impact. Several reports of a slow fiemap show up often, the two Link tags below refer to two recent reports of such slowness. This patch, together with the next ones in the series, is meant to address that. Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Filipe Manana	8eedaddaab	btrfs: rename btrfs_check_shared() to a more descriptive name The function btrfs_check_shared() is supposed to be used to check if a data extent is shared, but its name is too generic, may easily cause confusion in the sense that it may be used for metadata extents. So rename it to btrfs_is_data_extent_shared(), which will also make it less confusing after the next change that adds a backref lookup cache for the b+tree nodes that lead to the leaf that contains the file extent item that points to the target data extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +02:00
Filipe Manana	09fbc1c8e7	btrfs: allow fiemap to be interruptible Doing fiemap on a file with a very large number of extents can take a very long time, and we have reports of it being too slow (two recent examples in the Link tags below), so make it interruptible. Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Filipe Manana	33a86cfa17	btrfs: properly flush delalloc when entering fiemap If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc should be flushed and writeback complete. We call the generic helper fiemap_prep() which does a filemap_write_and_wait() in case that flag is given, however that is not enough if we have compression. Because a single filemap_fdatawrite_range() only starts compression (in an async thread) and therefore returns before the compression is done and writeback is started. So make btrfs_fiemap(), actually wait for all writeback to start and complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback on the whole possible file range, from 0 to LLONG_MAX, because that is what the generic code at fiemap_prep() does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Filipe Manana	9a42bbaeff	btrfs: remove zero length check when entering fiemap There's no point to check for a 0 length at extent_fiemap(), as before calling it, we called fiemap_prep() at btrfs_fiemap(), which already checks for a zero length and returns the same -EINVAL error. So remove the pointless check. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Filipe Manana	f12eec9a26	btrfs: remove check for impossible block start for an extent map at fiemap During fiemap we are testing if an extent map has a block start with a value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map, and never was according to git history. So remove that useless check. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Filipe Manana	b6e833567e	btrfs: make hole and data seeking a lot more efficient The current implementation of hole and data seeking for llseek does not scale well in regards to the number of extents and the distance between the start offset and the next hole or extent. This is due to a very high algorithmic complexity. Often we also get reports of btrfs' hole and data seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link tag at the bottom). In order to better understand it, lets consider the case where the start offset is 0, we are seeking for a hole and the file size is 16G. Between file offset 0 and the first hole in the file there are 100K extents - this is common for large files, specially if we have compression enabled, since the maximum extent size is limited to 128K. The steps take by the main loop of the current algorithm are the following: 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which calls btrfs_get_extent(). This will first lookup for an extent map in the inode's extent map tree (a red black tree). If the extent map is not loaded in memory, then it will do a lookup for the corresponding file extent item in the subvolume's b+tree, create an extent map based on the contents of the file extent item and then add the extent map to the extent map tree of the inode; 2) The second iteration calls btrfs_get_extent_fiemap() again, this time with a start offset matching the end offset of the previous extent. Again, btrfs_get_extent() will first search the extent map tree, and if it doesn't find an extent map there, it will again search in the b+tree of the subvolume for a matching file extent item, build an extent map based on the file extent item, and add the extent map to to the extent map tree of the inode; 3) This repeats over and over until we find the first hole (when seeking for holes) or until we find the first extent (when seeking for data). If there no extent maps loaded in memory for each iteration, then on each iteration we do 1 extent map tree search, 1 b+tree search, plus 1 more extent map tree traversal to insert an extent map - plus we allocate memory for the extent map. On each iteration we are growing the size of the extent map tree, making each future search slower, and also visiting the same b+tree leaves over and over again - taking into account with the default leaf size of 16K we can fit more than 200 file extent items in a leaf - so we can visit the same b+tree leaf 200+ times, on each visit walking down a path from the root to the leaf. So it's easy to see that what we have now doesn't scale well. Also, it loads an extent map for every file extent item into memory, which is not efficient - we should add extents maps only when doing IO (writing or reading file data). This change implements a new algorithm which scales much better, and works like this: 1) We iterate over the subvolume's b+tree, visiting each leaf that has file extent items once and only once; 2) For any file extent items found, that don't represent holes or prealloc extents, it will not search the extent map tree - there's no need at all for that - an extent map is just an in-memory representation of a file extent item; 3) When a hole is found, or a prealloc extent, it will check if there's delalloc for its range. For this it will search for EXTENT_DELALLOC bits in the inode's io tree and check the extent map tree - this is for accounting for unflushed delalloc and for flushed delalloc (the period between running delalloc and ordered extent completion), respectively. This is similar to what the current implementation does when it finds a hole or prealloc extent, but without creating extent maps and adding them to the extent map tree in case they are not loaded in memory; 4) It never allocates extent maps, or adds extent maps to the inode's extent map tree. This not only saves memory and time (from the tree insertions and allocations), but also eliminates the possibility of -ENOMEM due to allocating too many extent maps. Part of this new code will also be used later for fiemap (which also suffers similar scalability problems). The following test example can be used to quickly measure the efficiency before and after this patch: $ cat test-seek-hole.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 16G file -> 131073 compressed extents. xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar # Leave a 1M hole at file offset 15G. xfs_io -c "fpunch 15G 1M" $MNT/foobar # Unmount and mount again, so that we can test when there's no # metadata cached in memory. umount $MNT mount -o compress=lzo $DEV $MNT # Test seeking for hole from offset 0 (hole is at offset 15G). start=$(date +%s%N) xfs_io -c "seek -h 0" $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Took $dur milliseconds to seek first hole (metadata not cached)" echo start=$(date +%s%N) xfs_io -c "seek -h 0" $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Took $dur milliseconds to seek first hole (metadata cached)" echo umount $MNT Before this change: $ ./test-seek-hole.sh (...) Whence Result HOLE 16106127360 Took 176 milliseconds to seek first hole (metadata not cached) Whence Result HOLE 16106127360 Took 17 milliseconds to seek first hole (metadata cached) After this change: $ ./test-seek-hole.sh (...) Whence Result HOLE 16106127360 Took 43 milliseconds to seek first hole (metadata not cached) Whence Result HOLE 16106127360 Took 13 milliseconds to seek first hole (metadata cached) That's about 4x faster when no metadata is cached and about 30% faster when all metadata is cached. In practice the differences may often be significantly higher, either due to a higher number of extents in a file or because the subvolume's b+tree is much bigger than in this example, where we only have one file. Link: https://lwn.net/Articles/718805/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Filipe Manana	aed0ca180b	btrfs: allow hole and data seeking to be interruptible Doing hole or data seeking on a file with a very large number of extents can take a long time, and we have reports of it being too slow (such as at LSFMM from 2017, see the Link below). So make it interruptible. Link: https://lwn.net/Articles/718805/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
zhang songyi	bd64f6221a	btrfs: remove the unnecessary result variables Return the sysfs_emit() and iterate_object_props() directly instead of using unnecessary variables. Reported-by: Zeal Robot <zealci@zte.com.cn> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Qu Wenruo	1c56ab9919	btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2 The problem of long mount time caused by block group item search is already known for some time, and the solution of block group tree has been proposed. There is really no need to bound this feature into extent tree v2, just introduce compat RO flag, BLOCK_GROUP_TREE, to correctly solve the problem. All the code handling block group root is already in the upstream kernel, thus this patch really only needs to introduce the new compat RO flag. This patch introduces one extra artificial limitation on block group tree feature, that free space cache v2 and no-holes feature must be enabled to use this new compat RO feature. This artificial requirement is mostly to reduce the test combinations, and can be a guideline for future features, to mostly rely on the latest default features. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Qu Wenruo	14033b08a0	btrfs: don't save block group root into super block The extent tree v2 needs a new root for storing all block group items, the whole feature hasn't been finished yet so we can afford to do some changes. My initial proposal years ago just added a new tree rootid, and load it from tree root, just like what we did for quota/free space tree/uuid/extent roots. But the extent tree v2 patches introduced a completely new way to store block group tree root into super block which is arguably wasteful. Currently there are only 3 trees stored in super blocks, and they all have their valid reasons: - Chunk root Needed for bootstrap. - Tree root Really the entry point for all trees. - Log root This is special as log root has to be updated out of existing transaction mechanism. There is not even any reason to put block group root into super blocks, the block group tree is updated at the same time as the old extent tree, no need for extra bootstrap/out-of-transaction update. So just move block group root from super block into tree root. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Qu Wenruo	81d5d61454	btrfs: enhance unsupported compat RO flags handling Currently there are two corner cases not handling compat RO flags correctly: - Remount We can still mount the fs RO with compat RO flags, then remount it RW. We should not allow any write into a fs with unsupported RO flags. - Still try to search block group items In fact, behavior/on-disk format change to extent tree should not need a full incompat flag. And since we can ensure fs with unsupported RO flags never got any writes (with above case fixed), then we can even skip block group items search at mount time. This patch will enhance the unsupported RO compat flags by: - Reject read-write remount if there are unsupported RO compat flags - Go dummy block group items directly for unsupported RO compat flags In fact, only changes to chunk/subvolume/root/csum trees should go incompat flags. The latter part should allow future change to extent tree to be compat RO flags. Thus this patch also needs to be backported to all stable trees. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:00 +02:00
Qu Wenruo	8e327b9c0d	btrfs: dump all space infos if we abort transaction due to ENOSPC We have hit some transaction abort due to -ENOSPC internally. Normally we should always reserve enough space for metadata for every transaction, thus hitting -ENOSPC should really indicate some cases we didn't expect. But unfortunately current error reporting will only give a kernel warning and stack trace, not really helpful to debug what's causing the problem. And mount option debug_enospc can only help when user can reproduce the problem, but under most cases, such transaction abort by -ENOSPC is really hard to reproduce. So this patch will dump all space infos (data, metadata, system) when we abort the first transaction with -ENOSPC. This should at least provide some clue to us. The example of a dump would look like this: BTRFS: Transaction aborted (error -28) WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs] <call trace skipped> ---[ end trace 0000000000000000 ]--- BTRFS info (device dm-1: state A): dumping space info: BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0 BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016 BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0 BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0 BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232 BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728 BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left BTRFS info (device dm-1: state EA): forced readonly Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Qu Wenruo	25a860c409	btrfs: output human readable space info flag For btrfs_space_info, its flags has only 4 possible values: - BTRFS_BLOCK_GROUP_SYSTEM - BTRFS_BLOCK_GROUP_METADATA \| BTRFS_BLOCK_GROUP_DATA - BTRFS_BLOCK_GROUP_METADATA - BTRFS_BLOCK_GROUP_DATA Make the output more human readable, now it looks like: BTRFS info (device dm-1: state A): space_info METADATA has 251494400 free, is not full Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Qu Wenruo	a05d3c9153	btrfs: check superblock to ensure the fs was not modified at thaw time [BACKGROUND] There is an incident report that, one user hibernated the system, with one btrfs on removable device still mounted. Then by some incident, the btrfs got mounted and modified by another system/OS, then back to the hibernated system. After resuming from the hibernation, new write happened into the victim btrfs. Now the fs is completely broken, since the underlying btrfs is no longer the same one before the hibernation, and the user lost their data due to various transid mismatch. [REPRODUCER] We can emulate the situation using the following small script: truncate -s 1G $dev mkfs.btrfs -f $dev mount $dev $mnt fsstress -w -d $mnt -n 500 sync xfs_freeze -f $mnt cp $dev $dev.backup # There is no way to mount the same cloned fs on the same system, # as the conflicting fsid will be rejected by btrfs. # Thus here we have to wipe the fs using a different btrfs. mkfs.btrfs -f $dev.backup dd if=$dev.backup of=$dev bs=1M xfs_freeze -u $mnt fsstress -w -d $mnt -n 20 umount $mnt btrfs check $dev The final fsck will fail due to some tree blocks has incorrect fsid. This is enough to emulate the problem hit by the unfortunate user. [ENHANCEMENT] Although such case should not be that common, it can still happen from time to time. From the view of btrfs, we can detect any unexpected super block change, and if there is any unexpected change, we just mark the fs read-only, and thaw the fs. By this we can limit the damage to minimal, and I hope no one would lose their data by this anymore. Suggested-by: Goffredo Baroncelli <kreijack@libero.it> Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	928ff3beb8	btrfs: stop allocation a btrfs_io_context for simple I/O The I/O context structure is only used to pass the btrfs_device to the end I/O handler for I/Os that go to a single device. Stop allocating the I/O context for these cases by passing the optional btrfs_io_stripe argument to __btrfs_map_block to query the mapping information and then using a fast path submission and I/O completion handler. As the old btrfs_io_context based I/O submission path is only used for mirrored writes, rename the functions to make that clear and stop setting the btrfs_bio device and mirror_num field that is only used for reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	03793cbbc8	btrfs: add fast path for single device io in __btrfs_map_block There is no need for most of the btrfs_io_context when doing I/O to a single device. To support such I/O without the extra btrfs_io_context allocation, turn the mirror_num argument into a pointer so that it can be used to output the selected mirror number, and add an optional argument that points to a btrfs_io_stripe structure, which will be filled with a single extent if provided by the caller. In that case the btrfs_io_context allocation can be skipped as all information for the single device I/O is provided in the mirror_num argument and the on-stack btrfs_io_stripe. A caller that makes use of this new argument will be added in the next commit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	28793b194e	btrfs: decide bio cloning inside submit_stripe_bio Remove the orig_bio argument as it can be derived from the bioc, and the clone argument as it can be calculated from bioc and dev_nr. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	32747c4455	btrfs: factor out low-level bio setup from submit_stripe_bio Split out a low-level btrfs_submit_dev_bio helper that just submits the bio without any cloning decisions or setting up the end I/O handler for future reuse by a different caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	917f32a235	btrfs: give struct btrfs_bio a real end_io handler Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io handler and bi_private pointer of the embedded struct bio are both used to handle the completion of the high-level btrfs_bio and for the I/O completion for the low-level device that the embedded bio ends up being sent to. To support this bi_end_io and bi_private are saved into the btrfs_io_context structure and then restored after the bio sent to the underlying device has completed the actual I/O. Untangle this by adding an end I/O handler and private data to struct btrfs_bio for the high-level btrfs_bio based completions, and leave the actual bio bi_end_io handler and bi_private pointer entirely to the low-level device I/O. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	f1c2937976	btrfs: properly abstract the parity raid bio handling The parity raid write/recover functionality is currently not very well abstracted from the bio submission and completion handling in volumes.c: - the raid56 code directly completes the original btrfs_bio fed into btrfs_submit_bio instead of dispatching back to volumes.c - the raid56 code consumes the bioc and bio_counter references taken by volumes.c, which also leads to special casing of the calls from the scrub code into the raid56 code To fix this up supply a bi_end_io handler that calls back into the volumes.c machinery, which then puts the bioc, decrements the bio_counter and completes the original bio, and updates the scrub code to also take ownership of the bioc and bio_counter in all cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	c3a62baf21	btrfs: use chained bios when cloning The stripes_pending in the btrfs_io_context counts number of inflight low-level bios for an upper btrfs_bio. For reads this is generally one as reads are never cloned, while for writes we can trivially use the bio remaining mechanisms that is used for chained bios. To be able to make use of that mechanism, split out a separate trivial end_io handler for the cloned bios that does a minimal amount of error tracking and which then calls bio_endio on the original bio to transfer control to that, with the remaining counter making sure it is completed last. This then allows to merge btrfs_end_bioc into the original bio bi_end_io handler. To make this all work all error handling needs to happen through the bi_end_io handler, which requires a small amount of reshuffling in submit_stripe_bio so that the bio is cloned already by the time the suitability of the device is checked. This reduces the size of the btrfs_io_context and prepares splitting the btrfs_bio at the stripe boundary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:59 +02:00
Christoph Hellwig	2bbc72f14f	btrfs: don't take a bio_counter reference for cloned bios Stop grabbing an extra bio_counter reference for each clone bio in a mirrored write and instead just release the one original reference in btrfs_end_bioc once all the bios for a single btrfs_bio have completed instead of at the end of btrfs_submit_bio once all bios have been submitted. This means the reference is now carried by the "upper" btrfs_bio only instead of each lower bio. Also remove the now unused btrfs_bio_counter_inc_noblocked helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Christoph Hellwig	6b42f5e343	btrfs: pass the operation to btrfs_bio_alloc Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset set does. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Christoph Hellwig	d45cfb883b	btrfs: move btrfs_bio allocation to volumes.c volumes.c is the place that implements the storage layer using the btrfs_bio structure, so move the bio_set and allocation helpers there as well. To make up for the new initialization boilerplate, merge the two init/exit helpers in extent_io.c into a single one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Christoph Hellwig	1e408af31b	btrfs: don't create integrity bioset for btrfs_bioset btrfs never uses bio integrity data itself, so don't allocate the integrity pools for btrfs_bioset. This patch is a revert of the commit `b208c2f7ce` ("btrfs: Fix crash due to not allocating integrity data for a set"). The integrity data pool is not needed, the bio-integrity code now handles allocating the integrity payload without that. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Josef Bacik	fc80f7aca5	btrfs: remove use btrfs_remove_free_space_cache instead of variant We are calling __btrfs_remove_free_space_cache everywhere to cleanup the block group free space, however we can just use btrfs_remove_free_space_cache and pass in the block group in all of these places. Then we can remove __btrfs_remove_free_space_cache and rename __btrfs_remove_free_space_cache_locked to __btrfs_remove_free_space_cache. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Josef Bacik	8a1ae2781d	btrfs: call __btrfs_remove_free_space_cache_locked on cache load failure Now that lockdep is staying enabled through our entire CI runs I started seeing the following stack in generic/475 ------------[ cut here ]------------ WARNING: CPU: 1 PID: 2171864 at fs/btrfs/discard.c:604 btrfs_discard_update_discardable+0x98/0xb0 CPU: 1 PID: 2171864 Comm: kworker/u4:0 Not tainted 5.19.0-rc8+ #789 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:btrfs_discard_update_discardable+0x98/0xb0 RSP: 0018:ffffb857c2f7bad0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8c85c605c200 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffff86807c5b RDI: ffffffff868a831e RBP: ffff8c85c4c54000 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8c85c66932f0 R11: 0000000000000001 R12: ffff8c85c3899010 R13: ffff8c85d5be4f40 R14: ffff8c85c4c54000 R15: ffff8c86114bfa80 FS: 0000000000000000(0000) GS:ffff8c863bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2e7f168160 CR3: 000000010289a004 CR4: 0000000000370ee0 Call Trace: __btrfs_remove_free_space_cache+0x27/0x30 load_free_space_cache+0xad2/0xaf0 caching_thread+0x40b/0x650 ? lock_release+0x137/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_is_held_type+0xe2/0x140 process_one_work+0x271/0x590 ? process_one_work+0x590/0x590 worker_thread+0x52/0x3b0 ? process_one_work+0x590/0x590 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 This is the code ctl = block_group->free_space_ctl; discard_ctl = &block_group->fs_info->discard_ctl; lockdep_assert_held(&ctl->tree_lock); We have a temporary free space ctl for loading the free space cache in order to avoid having allocations happening while we're loading the cache. When we hit an error we free it all up, however this also calls btrfs_discard_update_discardable, which requires block_group->free_space_ctl->tree_lock to be held. However this is our temporary ctl so this lock isn't held. Fix this by calling __btrfs_remove_free_space_cache_locked instead so that we only clean up the entries and do not mess with the discardable stats. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Filipe Manana	331cd94614	btrfs: fix race between quota enable and quota rescan ioctl When enabling quotas, at btrfs_quota_enable(), after committing the transaction, we change fs_info->quota_root to point to the quota root we created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try to start the qgroup rescan worker, first by initializing it with a call to qgroup_rescan_init() - however if that fails we end up freeing the quota root but we leave fs_info->quota_root still pointing to it, this can later result in a use-after-free somewhere else. We have previously set the flags BTRFS_FS_QUOTA_ENABLED and BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at btrfs_quota_enable(), which is possible if someone already called the quota rescan ioctl, and therefore started the rescan worker. So fix this by ignoring an -EINPROGRESS and asserting we can't get any other error. Reported-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Maciej S. Szmigiero	dbecac2663	btrfs: don't print information about space cache or tree every remount btrfs currently prints information about space cache or free space tree being in use on every remount, regardless whether such remount actually enabled or disabled one of these features. This is actually unnecessary since providing remount options changing the state of these features will explicitly print the appropriate notice. Let's instead print such unconditional information just on an initial mount to avoid filling the kernel log when, for example, laptop-mode-tools remount the fs on some events. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Filipe Manana	1fdbd03d3d	btrfs: simplify error handling at btrfs_del_root_ref() At btrfs_del_root_ref() we are using two return variables, named 'ret' and 'err'. This makes it harder to follow and easier to return the wrong value in case an error happens - the previous patch in the series, which has the subject "btrfs: fix silent failure when deleting root reference", fixed a bug due to confusion created by these two variables. So change the function to use a single variable for tracking the return value of the function, using only 'ret', which is consistent with most of the codebase. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
Omar Sandoval	48ff70830b	btrfs: get rid of block group caching progress logic struct btrfs_caching_ctl::progress and struct btrfs_block_group::last_byte_to_unpin were previously needed to ensure that unpin_extent_range() didn't return a range to the free space cache before the caching thread had a chance to cache that range. However, the commit "btrfs: fix space cache corruption and potential double allocations" made it so that we always synchronously cache the block group at the time that we pin the extent, so this machinery is no longer necessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:58 +02:00
BingJing Chang	9ed0a72e5b	btrfs: send: fix failures when processing inodes with no links There is a bug causing send failures when processing an orphan directory with no links. In commit `46b2f4590a` ("Btrfs: fix send failure when root has deleted files still open")', the orphan inode issue was addressed. The send operation fails with a ENOENT error because of any attempts to generate a path for the inode with a link count of zero. Therefore, in that patch, sctx->ignore_cur_inode was introduced to be set if the current inode has a link count of zero for bypassing some unnecessary steps. And a helper function btrfs_unlink_all_paths() was introduced and called to clean up old paths found in the parent snapshot. However, not only regular files but also directories can be orphan inodes. So if the send operation meets an orphan directory, it will issue a wrong unlink command for that directory now. Soon the receive operation fails with a EISDIR error. Besides, the send operation also fails with a ENOENT error later when it tries to generate a path of it. Similar example but making an orphan dir for an incremental send: $ btrfs subvolume create vol $ mkdir vol/dir $ touch vol/dir/foo $ btrfs subvolume snapshot -r vol snap1 $ btrfs subvolume snapshot -r vol snap2 # Turn the second snapshot to RW mode and delete the whole dir while # holding an open file descriptor on it. $ btrfs property set snap2 ro false $ exec 73<snap2/dir $ rm -rf snap2/dir # Set the second snapshot back to RO mode and do an incremental send. $ btrfs property set snap2 ro true $ mkdir receive_dir $ btrfs send snap2 -p snap1 \| btrfs receive receive_dir/ At subvol snap2 At snapshot snap2 ERROR: send ioctl failed with -2: No such file or directory ERROR: unlink dir failed. Is a directory Actually, orphan inodes are more common use cases in cascading backups. (Please see the illustration below.) In a cascading backup, a user wants to replicate a couple of snapshots from Machine A to Machine B and from Machine B to Machine C. Machine B doesn't take any RO snapshots for sending. All a receiver does is create an RW snapshot of its parent snapshot, apply the send stream and turn it into RO mode at the end. Even if all paths of some inodes are deleted in applying the send stream, these inodes would not be deleted and become orphans after changing the subvolume from RW to RO. Moreover, orphan inodes can occur not only in send snapshots but also in parent snapshots because Machine B may do a batch replication of a couple of snapshots. An illustration for cascading backups: Machine A (snapshot {1..n}) --> Machine B --> Machine C The idea to solve the problem is to delete all the items of orphan inodes before using these snapshots for sending. I used to think that the reasonable timing for doing that is during the ioctl of changing the subvolume from RW to RO because it sounds good that we will not modify the fs tree of a RO snapshot anymore. However, attempting to do the orphan cleanup in the ioctl would be pointless. Because if someone is holding an open file descriptor on the inode, the reference count of the inode will never drop to 0. Then iput() cannot trigger eviction, which finally deletes all the items of it. So we try to extend the original patch to handle orphans in send/parent snapshots. Here are several cases that need to be considered: Case 1: BTRFS_COMPARE_TREE_NEW \| send snapshot \| action -------------------------------- nlink \| 0 \| ignore In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result, it means that a new inode is found in the send snapshot and it doesn't appear in the parent snapshot. Since this inode has a link count of zero (It's an orphan and there're no paths for it.), we can leverage sctx->ignore_cur_inode in the original patch to prevent it from being created. Case 2: BTRFS_COMPARE_TREE_DELETED \| parent snapshot \| action ---------------------------------- nlink \| 0 \| as usual In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison result, it means that the inode only appears in the parent snapshot. As usual, the send operation will try to delete all its paths. However, this inode has a link count of zero, so no paths of it will be found. No deletion operations will be issued. We don't need to change any logic. Case 3: BTRFS_COMPARE_TREE_CHANGED \| \| parent snapshot \| send snapshot \| action ----------------------------------------------------------------------- subcase 1 \| nlink \| 0 \| 0 \| ignore subcase 2 \| nlink \| >0 \| 0 \| new_gen(deletion) subcase 3 \| nlink \| 0 \| >0 \| new_gen(creation) In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result, it means that the inode appears in both snapshots. Here are 3 subcases. First, when the inode has link counts of zero in both snapshots. Since there are no paths for this inode in (source/destination) parent snapshots and we don't care about whether there is also an orphan inode in destination or not, we can set sctx->ignore_cur_inode on to prevent it from being created. For the second and the third subcases, if there are paths in one snapshot and there're no paths in the other snapshot for this inode. We can treat this inode as a new generation. We can also leverage the logic handling a new generation of an inode with small adjustments. Then it will delete all old paths and create a new inode with new attributes and paths only when there's a positive link count in the send snapshot. In subcase 2, the send operation only needs to delete all old paths as in the parent snapshot. But it may require more operations for a directory to remove its old paths. If a not-empty directory is going to be deleted (because it has a link count of zero in the send snapshot) but there are files/directories with bigger inode numbers under it, the send operation will need to rename it to its orphan name first. After processing and deleting the last item under this directory, the send operation will check this directory, aka the parent directory of the last item, again and issue a rmdir operation to remove it finally. Therefore, we also need to treat inodes with a link count of zero as if they didn't exist in get_cur_inode_state(), which is used in process_recorded_refs(). By doing this, when checking a directory with orphan names after the last item under it has been deleted, the send operation now can properly issue a rmdir operation. Otherwise, without doing this, the orphan directory with an orphan name would be kept here at the end due to the existing inode with a link count of zero being found. In subcase 3, as in case 2, no old paths would be found, so no deletion operations will be issued. The send operation will only create a new one for that inode. Note that subcase 3 is not common. That's because it's easy to reduce the hard links of an inode, but once all valid paths are removed, there are no valid paths for creating other hard links. The only way to do that is trying to send an older snapshot after a newer snapshot has been sent. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
BingJing Chang	7e93f6dc11	btrfs: send: refactor arguments of get_inode_info() Refactor get_inode_info() to populate all wanted fields on an output structure. Besides, also introduce a helper function called get_inode_gen(), which is commonly used. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Ethan Lien	52b029f427	btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path After we copied data to page cache in buffered I/O, we 1. Insert a EXTENT_UPTODATE state into inode's io_tree, by endio_readpage_release_extent(), set_extent_delalloc() or set_extent_defrag(). 2. Set page uptodate before we unlock the page. But the only place we check io_tree's EXTENT_UPTODATE state is in btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE. For example, when performing a buffered random read: fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \ --filesize=32G --size=4G --bs=4k --name=job \ --filename=/mnt/file --name=job Then check how many extent_state in io_tree: cat /proc/slabinfo \| grep btrfs_extent_state \| awk '{print $2}' w/o this patch, we got 640567 btrfs_extent_state. w/ this patch, we got 204 btrfs_extent_state. Maintaining such a big tree brings overhead since every I/O needs to insert EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in every insert or remove, we need to lock io_tree, do tree search, alloc or dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep io_tree in a minimal size and reduce overhead when performing buffered I/O. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	7059c65831	btrfs: simplify adding and replacing references during log replay During log replay, when adding/replacing inode references, there are two special cases that have special code for them: 1) When we have an inode with two or more hardlinks in the same directory, therefore two or more names encoded in the same inode reference item, and one of the hard links gets renamed to the old name of another hard link - that is, the index number for a name changes. This was added in commit `0d836392ca` ("Btrfs: fix mount failure after fsync due to hard link recreation"), and is covered by test case generic/502 from fstests; 2) When we have several inodes that got renamed to an old name of some other inode, in a cascading style. The code to deal with this special case was added in commit `6b5fc433a7` ("Btrfs: fix fsync after succession of renames of different files"), and is covered by test cases generic/526 and generic/527 from fstests. Both cases can be deal with by making sure __add_inode_ref() is always called by add_inode_ref() for every name encoded in the inode reference item, and not just for the first name that has a conflict. With such change we no longer need that special casing for the two cases mentioned before. So do those changes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
David Sterba	fb731430be	btrfs: sysfs: show discard stats and tunables in non-debug build When discard=async was introduced there were also sysfs knobs and stats for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults have been set and so far seem to satisfy all users on a range of workloads. As there are not only tunables (like iops or kbps) but also stats tracking amount of discardable bytes, that should be available when the async discard is on (otherwise it's not). The stats are moved from the per-fs debug directory, so it's under /sys/fs/btrfs/FSID/discard - discard_bitmap_bytes - amount of discarded bytes from data tracked as bitmaps - discard_extent_bytes - dtto but as extents - discard_bytes_saved - - discardable_bytes - amount of bytes that can be discarded - discardable_extents - number of extents to be discarded - iops_limit - tunable limit of number of discard IOs to be issued - kbps_limit - tunable limit of kilobytes per second issued as discard IO - max_discard_size - tunable limit for size of one IO discard request Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	30b80f3ce0	btrfs: use delayed items when logging a directory When logging a directory we start by flushing all its delayed items. That results in adding dir index items to the subvolume btree, for new dentries, and removing dir index items from the subvolume btree for any dentries that were deleted. This makes it straightforward to log a directory simply by iterating over all the modified subvolume btree leaves, especially when we used to log both dir index keys and dir item keys (before commit `339d035424` ("btrfs: only copy dir index keys when logging a directory") and when we used to copy old dir index entries for leaves modified in the current transaction (before commit `732d591a5d` ("btrfs: stop copying old dir items when logging a directory")). From an efficiency point of view this has a couple of drawbacks: 1) Adds extra latency, due to copying delayed items to the subvolume btree and deleting dir index items from the btree. Further if there are other tasks accessing the btree, which is common (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks, etc, finishing an ordered extent, etc), lock contention can cause further delays, both to the task logging a directory and to the other tasks accessing the btree; 2) More time spent overall flushing delayed items, if after logging the directory further changes are done to the directory in the same transaction. For example, if we add 10 dentries to a directory, fsync it, add more 10 dentries, fsync it again, then add more 10 dentries and fsync it again, then we end up inserting 3 batches of 10 items to the subvolume btree. With the changes from this patch, we flush all the delayed items to the btree only once - a single batch of 30 items, and outside the logging code (transaction commit or when delayed items are flushed asynchronously). This change simply skips the flushing of delayed items every time we log a directory. Instead we copy the delayed insertion items directly to the log tree and delete delayed deletion items directly from the log tree. Therefore avoiding changing first the subvolume btree and then scanning it for new items to copy from it to the log tree and detecting deletions by observing gaps in consecutive dir index keys in subvolume btree leaves. Running the following tests on a non-debug kernel (Debian's default kernel config), on a box with a NVMe device, a 12 cores Intel CPU and 64G of ram, produced the results below. The results compare a branch without this patch and all the other patches it depends on versus the same branch with the patchset applied. The patchset is comprised of the following patches: btrfs: don't drop dir index range items when logging a directory btrfs: remove the root argument from log_new_dir_dentries() btrfs: update stale comment for log_new_dir_dentries() btrfs: free list element sooner at log_new_dir_dentries() btrfs: avoid memory allocation at log_new_dir_dentries() for common case btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() btrfs: store index number instead of key in struct btrfs_delayed_item btrfs: remove unused logic when looking up delayed items btrfs: shrink the size of struct btrfs_delayed_item btrfs: search for last logged dir index if it's not cached in the inode btrfs: move need_log_inode() to above log_conflicting_inodes() btrfs: move log_new_dir_dentries() above btrfs_log_inode() btrfs: log conflicting inodes without holding log mutex of the initial inode btrfs: skip logging parent dir when conflicting inode is not a dir btrfs: use delayed items when logging a directory Custom test script for testing time spent at btrfs_log_inode(): #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 # Total number of files to create in the test directory. NUM_FILES=10000 # Fsync after creating or renaming N files. FSYNC_AFTER=100 umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT TEST_DIR=$MNT/testdir mkdir $TEST_DIR echo "Creating files..." for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $TEST_DIR/file_$i if (( ($i % $FSYNC_AFTER) == 0 )); then xfs_io -c "fsync" $TEST_DIR fi done sync echo "Renaming files..." for ((i = 1; i <= $NUM_FILES; i++)); do mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed if (( ($i % $FSYNC_AFTER) == 0 )); then xfs_io -c "fsync" $TEST_DIR fi done umount $MNT And using the following bpftrace script to capture the total time that is spent at btrfs_log_inode(): #!/usr/bin/bpftrace k:btrfs_log_inode { @start_log_inode[tid] = nsecs; } kr:btrfs_log_inode /@start_log_inode[tid]/ { $dur = (nsecs - @start_log_inode[tid]) / 1000; @btrfs_log_inode_total_time = sum($dur); delete(@start_log_inode[tid]); } END { clear(@start_log_inode); } Result before applying patchset: @btrfs_log_inode_total_time: 622642 Result after applying patchset: @btrfs_log_inode_total_time: 354134 (-43.1% time spent) The following dbench script was also used for testing: #!/bin/bash NUM_JOBS=$(nproc --all) DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" echo "performance" \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS umount $MNT Before patchset: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 3322265 0.034 21.032 Close 2440562 0.002 0.994 Rename 140664 1.150 269.633 Unlink 670796 1.093 269.678 Deltree 96 5.481 15.510 Mkdir 48 0.004 0.052 Qpathinfo 3010924 0.014 8.127 Qfileinfo 528055 0.001 0.518 Qfsinfo 552113 0.003 0.372 Sfileinfo 270575 0.005 0.688 Find 1164176 0.052 13.931 WriteX 1658537 0.019 5.918 ReadX 5207412 0.003 1.034 LockX 10818 0.003 0.079 UnlockX 10818 0.002 0.313 Flush 232811 1.027 269.735 Throughput 869.867 MB/sec (sync dirs) 12 clients 12 procs max_latency=269.741 ms After patchset: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 4152738 0.029 20.863 Close 3050770 0.002 1.119 Rename 175829 0.871 211.741 Unlink 838447 0.845 211.724 Deltree 120 4.798 14.162 Mkdir 60 0.003 0.005 Qpathinfo `3763807` 0.011 4.673 Qfileinfo 660111 0.001 0.400 Qfsinfo 690141 0.003 0.429 Sfileinfo 338260 0.005 0.725 Find 1455273 0.046 6.787 WriteX 2073307 0.017 5.690 ReadX 6509193 0.003 1.171 LockX 13522 0.003 0.077 UnlockX 13522 0.002 0.125 Flush 291044 0.811 211.631 Throughput 1089.27 MB/sec (sync dirs) 12 clients 12 procs max_latency=211.750 ms (+25.2% throughput, -21.5% max latency) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	5557a069f3	btrfs: skip logging parent dir when conflicting inode is not a dir When we find a conflicting inode (an inode that had the same name and parent directory as the inode we are logging now) that was deleted in the current transaction, we always end up logging its parent directory. This is to deal with the case where the conflicting inode corresponds to a deleted subvolume/snapshot or a directory that had subvolumes/snapshots (or some subdirectory inside it had subvolumes/snapshots, etc), because we can't deal with dropping subvolumes/snapshots during log replay. So if we log the parent directory, and if we are dealing with these special cases, then we fallback to a transaction commit when logging the parent, because its last_unlink_trans will match the current transaction (which gets set and propagated when a subvolume/snapshot is deleted). This change skips the logging of the parent directory when the conflicting inode is not a directory (or a subvolume/snapshot). This is ok because in this case logging the current inode is enough to trigger an unlink of the conflicting inode during log replay. So for a case like this: $ mkdir /mnt/dir $ echo -n "first foo data" > /mnt/dir/foo $ sync $ rm -f /mnt/dir/foo $ echo -n "second foo data" > /mnt/dir/foo $ xfs_io -c "fsync" /mnt/dir/foo We avoid logging parent directory "dir" when logging the new file "foo". In other cases it avoids falling back to a transaction commit, when the parent directory has a last_unlink_trans value that matches the current transaction, due to moving a file from it to some other directory. This is a case that happens frequently with dbench for example, where a new file that has the name/parent of another file that was deleted in the current transaction, is fsynced. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	e09d94c9e4	btrfs: log conflicting inodes without holding log mutex of the initial inode When logging an inode, if we detect the inode has a reference that conflicts with some other inode that got renamed, we log that other inode while holding the log mutex of the current inode. We then find out if there are other inodes that conflict with the first conflicting inode, and log them while under the log mutex of the original inode. This is fine because the recursion can only happen once. For the upcoming work where we directly log delayed items without flushing them first to the subvolume tree, this recursion adds a lot of complexity and it's hard to keep lockdep happy about it. So collect a list of conflicting inodes and then log the inodes after unlocking the log mutex of the inode we started with. Also limit the maximum number of conflict inodes we log to 10, to avoid spending too much time logging (and maybe allocating too many list elements too), as typically we don't have more than 1 or 2 conflicting inodes - if we go over the limit, simply fallback to a transaction commit. It is possible to have a very long list of conflicting inodes to be intentionally created by a user if he/she creates a very long succession of renames like this: (...) rename E to F rename D to E rename C to D rename B to C rename A to B touch A (create a new file named A) fsync A If that happened for a sequence of hundreds or thousands of renames, it could massively slow down the logging and cause other secondary effects like for example blocking other fsync operations and transaction commits for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first). However such cases are very uncommon to happen in practice, nevertheless it's better to be prepared for them and avoid chaos. Such long sequence of conflicting inodes could be created before this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	f6d86dbeba	btrfs: move log_new_dir_dentries() above btrfs_log_inode() The static function log_new_dir_dentries() is currently defined below btrfs_log_inode(), but in an upcoming patch a new function is introduced that is called by btrfs_log_inode() and this new function needs to call log_new_dir_dentries(). So move log_new_dir_dentries() to a location between btrfs_log_inode() and need_log_inode() (the later is called by log_new_dir_dentries()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	a375102426	btrfs: move need_log_inode() to above log_conflicting_inodes() The static function need_log_inode() is defined below btrfs_log_inode() and log_conflicting_inodes(), but in the next patches in the series we will need to call need_log_inode() in a couple new functions that will be used by btrfs_log_inode(). So move its definition to a location above log_conflicting_inodes(). Also make its arguments 'const', since they are not supposed to be modified. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:57 +02:00
Filipe Manana	193df62457	btrfs: search for last logged dir index if it's not cached in the inode The key offset of the last dir index item that was logged is stored in the inode's last_dir_index_offset field. However that field is not persisted in the inode item or elsewhere, so if the inode gets evicted and reloaded, it gets a value of (u64)-1, so that when we are logging dir index items we check if they were logged before, to avoid attempts to insert duplicated keys and fallback to a transaction commit. Improve on this by searching for the last dir index that was logged when we start logging a directory if the inode's last_dir_index_offset is not set (has a value of (u64)-1) and it was logged before. This avoids checking if each dir index item we find was already logged before, and simplifies the logging of dir index items (process_dir_items_leaf()). This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	4c469798ee	btrfs: shrink the size of struct btrfs_delayed_item Currently struct btrfs_delayed_item has a base size of 96 bytes, but its size can be decreased by doing the following 2 tweaks: 1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K, so the data_len can never be larger than that, and in fact it is always much smaller than that. The max length for a dentry's name is ensured at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref and btrfs_dir_item we use a u16 to store the name's length; 2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it can only have 2 values. After this there's also no longer the need to BUG_ON() before using 'ins_or_del' in several places. Also rename the field from 'ins_or_del' to 'type', which is more clear. These two tweaks decrease the size of struct btrfs_delayed_item from 96 bytes down to 88 bytes. A previous patch already reduced the size of this structure by 16 bytes, but an upcoming change will increase its size by 16 bytes (adding a struct list_head element). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	4cbf37f504	btrfs: remove unused logic when looking up delayed items All callers pass NULL to the 'prev' and 'next' arguments of the function __btrfs_lookup_delayed_item(), so remove these arguments. Also, remove the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making btrfs_delete_delayed_insertion_item() directly call __btrfs_lookup_delayed_item(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	96d89923fa	btrfs: store index number instead of key in struct btrfs_delayed_item All delayed items are for dir index keys, so there's really no point of having an embedded struct btrfs_key in struct btrfs_delayed_item, which makes the structure use more space than necessary (and adds a hole of 7 bytes). So replace the key field with an index number (u64), which reduces the size of struct btrfs_delayed_item from 112 bytes down to 96 bytes. Some upcoming work will increase the structure size by 16 bytes, so this change compensates for that future size increase. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	df4928818b	btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() The root argument of btrfs_delayed_item_reserve_metadata() is used only to get the fs_info object, but we already have a transaction handle, which we can use to get the fs_info. So remove the root argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	009d9bea49	btrfs: avoid memory allocation at log_new_dir_dentries() for common case At log_new_dir_dentries() we always start by allocating a list element for the starting inode and then do a while loop with the condition being a list emptiness check. This however is not needed, we can avoid allocating this initial list element and then just check for the list emptiness at the end of the loop's body. So just do that to save one memory allocation from the kmalloc-32 slab. This allows for not doing any memory allocation when we don't have any subdirectory to log, which is a very common case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	4008481343	btrfs: free list element sooner at log_new_dir_dentries() At log_new_dir_dentries(), there's no need to keep the current list element allocated while processing the leaves with directory items for the current directory, and while logging other inodes. Plus in case we find a subdirectory, we also end up allocating a new list element while the current one is still allocated, temporarily using more memory than necessary. So free the current list element early on, before processing leaves. Also make the removal and release of all list elements in case of an error more simple by eliminating the label and goto, adding an explicit loop to release all list elements in case an error happens. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	b96c552b99	btrfs: update stale comment for log_new_dir_dentries() The comment refers to the function log_dir_items() in order to check why the inodes of new directory entries need to be logged, but the relevant comments are no longer at log_dir_items(), they were moved to the function process_dir_items_leaf() in commit `eb10d85ee7` ("btrfs: factor out the copying loop of dir items from log_dir_items()"). So update it with the current function name. Also remove references with i_mutex to "VFS lock", since the inode lock is no longer a mutex since 2016 (it's now a rw semaphore). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	8786a6d740	btrfs: remove the root argument from log_new_dir_dentries() There's no point in passing a root argument to log_new_dir_dentries() because it always corresponds to the root of the given inode. So remove it and extract the root from the given inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Filipe Manana	04fc7d5123	btrfs: don't drop dir index range items when logging a directory When logging a directory that was previously logged in the current transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key type). This is because we will process all leaves in the subvolume's tree that were changed in the current transaction and then add range items for covering new dir index items and deleted dir index items, which could cover now a larger range than before. We used to fail if we tried to insert a range item key that already exists, so we dropped all range items to avoid failing. However nowadays, since commit `750ee45490` ("btrfs: fix assertion failure when logging directory key range item"), we simply update any range item that already exists, increasing its range's last dir index if needed. Since the range covered by a range item can never decrease, due to the fact that dir index values come from a monotonically increasing counter and are never reused, we can stop dropping all range items before we start logging a directory. By not dropping the items we can avoid having occasional tree rebalance operations. This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:56 +02:00
Qu Wenruo	786672e9e1	btrfs: scrub: use larger block size for data extent scrub [PROBLEM] The existing scrub code for data extents always limit the block size to sectorsize. This causes quite some extra scrub_block being allocated: (there is a data extent at logical bytenr 298844160, length 64KiB) alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1 alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1 alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1 alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1 alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1 alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1 alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1 alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1 alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1 alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1 alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1 alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1 alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1 alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1 alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1 alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1 ... scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1 scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1 scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1 scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1 scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1 scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1 scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1 scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1 scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1 scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1 scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1 scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1 scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1 scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1 scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1 scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1 This behavior will waste a lot of memory, especially after we have moved quite some members from scrub_sector to scrub_block. [FIX] To reduce the allocation of scrub_block, and to reduce memory usage, use BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data extents. This results only one scrub_block to be allocated for above data extent: alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1 scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1 This would greatly reduce the memory usage (even it's just transient) for larger data extents scrub. For above example, the memory usage would be: Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector)) 16 * (408 + 96) = 8065 New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector) 408 + 16 * 96 = 1944 A good reduction of 75.9%. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	8686c40e67	btrfs: scrub: move logical/physical/dev/mirror_num from scrub_sector to scrub_block Currently we store the following members in scrub_sector: - logical - physical - physical_for_dev_replace - dev - mirror_num However the current scrub code has ensured that scrub_blocks never cross stripe boundary. This is caused by the entry functions (scrub_simple_mirror, scrub_simple_stripe), thus every scrub_block will not cross stripe boundary. Thus this makes it possible to move those members into scrub_block other than putting them into scrub_sector. This should save quite some memory, as a scrub_block can be as large as 64 sectors, even for metadata it's 16 sectors byte default. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	eb2fad3005	btrfs: scrub: remove scrub_sector::page and use scrub_block::pages instead Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases, it will allocate one page for each scrub_sector, which can cause extra unnecessary memory usage. Utilize scrub_block::pages[] instead of allocating page for each scrub_sector, this allows us to integrate larger extents while using less memory. For example, if our page size is 64K, sectorsize is 4K, and we got an 32K sized extent. We will only allocate one page for scrub_block, and all 8 scrub sectors will point to that page. To do that properly, here we introduce several small helpers: - scrub_page_get_logical() Get the logical bytenr of a page. We store the logical bytenr of the page range into page::private. But for 32bit systems, their (void *) is not large enough to contain a u64, so in that case we will need to allocate extra memory for it. For 64bit systems, we can use page::private directly. - scrub_block_get_logical() Just get the logical bytenr of the first page. - scrub_sector_get_page() Return the page which the scrub_sector points to. - scrub_sector_get_page_offset() Return the offset inside the page which the scrub_sector points to. - scrub_sector_get_kaddr() Return the address which the scrub_sector points to. Just a wrapper using scrub_sector_get_page() and scrub_sector_get_page_offset() - bio_add_scrub_sector() Please note that, even with this patch, we're still allocating one page for one sector for data extents. This is because in scrub_extent() we split the data extent using sectorsize. The memory usage reduction will need extra work to make scrub to work like data read to only use the correct sector(s). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	f3e01e0e3c	btrfs: scrub: introduce scrub_block::pages for more efficient memory usage for subpage [BACKGROUND] Currently for scrub, we allocate one page for one sector, this is fine for PAGE_SIZE == sectorsize support, but can waste extra memory for subpage support. [CODE CHANGE] Make scrub_block contain all the pages, so if we're scrubbing an extent sized 64K, and our page size is also 64K, we only need to allocate one page. [LIFESPAN CHANGE] Since now scrub_sector no longer holds a page, but is using scrub_block::pages[] instead, we have to ensure scrub_block has a longer lifespan for write bio. The lifespan for read bio is already large enough. Now scrub_block will only be released after the write bio finished. [COMING NEXT] Currently we only added scrub_block::pages[] for this purpose, but scrub_sector is still utilizing the old scrub_sector::page. The switch will happen in the next patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	5dd3d8e468	btrfs: scrub: factor out allocation and initialization of scrub_sector into helper The allocation and initialization is shared by 3 call sites, and we're going to change the initialization of some members in the upcoming patches. So factor out the allocation and initialization of scrub_sector into a helper, alloc_scrub_sector(), which will do the following work: - Allocate the memory for scrub_sector - Allocate a page for scrub_sector::page - Initialize scrub_sector::refs to 1 - Attach the allocated scrub_sector to scrub_block The attachment is bidirectional, which means scrub_block::sectorv[] will be updated and scrub_sector::sblock will also be updated. - Update scrub_block::sector_count and do extra sanity check on it Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	15b88f6d24	btrfs: scrub: factor out initialization of scrub_block into helper Although there are only two callers, we are going to add some members for scrub_block in the incoming patches. Factoring out the initialization code will make later expansion easier. One thing to note is, even scrub_handle_errored_block() doesn't utilize scrub_block::refs, we still use alloc_scrub_block() to initialize sblock::ref, allowing us to use scrub_block_put() to do cleanup. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	1dfa500511	btrfs: scrub: use pointer array to replace sblocks_for_recheck In function scrub_handle_errored_block(), we use @sblocks_for_recheck pointer to hold one scrub_block for each mirror, and uses kcalloc() to allocate an array. But this one pointer for an array is not readable due to the member offsets done by addition and not []. Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS], this will slightly increase the stack memory usage. Since function scrub_handle_errored_block() won't get iterative calls, this extra cost would completely be acceptable. And since we're here, also set sblock->refs and use scrub_block_put() to clean them up, as later we will add extra members in scrub_block, which needs scrub_block_put() to clean them up. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Boris Burkov	38622010a6	btrfs: send: add support for fs-verity Preserve the fs-verity status of a btrfs file across send/recv. There is no facility for installing the Merkle tree contents directly on the receiving filesystem, so we package up the parameters used to enable verity found in the verity descriptor. This gives the receive side enough information to properly enable verity again. Note that this means that receive will have to re-compute the whole Merkle tree, similar to how compression worked before encoded_write. Since the file becomes read-only after verity is enabled, it is important that verity is added to the send stream after any file writes. Therefore, when we process a verity item, merely note that it happened, then actually create the command in the send stream during 'finish_inode_if_needed'. This also creates V3 of the send stream format, without any format changes besides adding the new commands and attributes. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Uros Bizjak	e5677f0560	btrfs: use atomic_try_cmpxchg in free_extent_buffer Use `atomic_try_cmpxchg(ptr, &old, new)` instead of `atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This has two benefits: - The x86 cmpxchg instruction returns success in the ZF flag, so this change saves a compare after cmpxchg, as well as a related move instruction in the front of cmpxchg. - atomic_try_cmpxchg implicitly assigns the *ptr value to &old when cmpxchg fails, enabling further code simplifications. This patch has no functional change. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	fc65bb5318	btrfs: scrub: remove impossible sanity checks There are several sanity checks which are no longer possible to trigger inside btrfs_scrub_dev(). Since we have mount time check against super block nodesize/sectorsize, and our fixed macro is hardcoded to handle even the worst combination. Thus those sanity checks are no longer needed, can be easily removed. But this patch still uses some ASSERT()s as a safe net just in case we change some features in the future to trigger those impossible combinations. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Josef Bacik	527c490f44	btrfs: delete btrfs_wait_space_cache_v1_finished We used to use this in a few spots, but now we only use it directly inside of block-group.c, so remove the helper and just open code where we were using it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Josef Bacik	588a486835	btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIR Before when this was modifying the bit field we had to protect it with the bg->lock, however now we're using bit helpers so we can stop using the bg->lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	7b9c293b05	btrfs: remove BLOCK_GROUP_FLAG_HAS_CACHING_CTL This is used mostly to determine if we need to look at the caching ctl list and clean up any references to this block group. However we never clear this flag, specifically because we need to know if we have to remove a caching ctl we have for this block group still. This is in the remove block group path which isn't a fast path, so the optimization doesn't really matter, simplify this logic and remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	50c31eaa4c	btrfs: simplify block group traversal in btrfs_put_block_group_cache We're breaking out and re-searching for the next block group while evicting any of the block group cache inodes. This is not needed, the block groups aren't disappearing here, we can simply loop through the block groups like normal and iput any inode that we find. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	9283b9e09a	btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY We use this during device replace for zoned devices, we were simply taking the lock because it was in a bit field and we needed the lock to be safe with other modifications in the bitfield. With the bit helpers we no longer require that locking. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	3349b57fd4	btrfs: convert block group bit field to use bit helpers We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	723de71d41	btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info We previously had the pattern of btrfs_update_space_info(all, the, bg, fields, &space_info); link_block_group(bg); bg->space_info = space_info; Now that we're passing the bg into btrfs_add_bg_to_space_info we can do the linking in that function, transforming this to simply btrfs_add_bg_to_space_info(fs_info, bg); and put the link_block_group() and bg->space_info assignment directly in btrfs_add_bg_to_space_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	9d4b0a129a	btrfs: simplify arguments of btrfs_update_space_info and rename This function has grown a bunch of new arguments, and it just boils down to passing in all the block group fields as arguments. Simplify this by passing in the block group itself and updating the space_info fields based on the block group fields directly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	2f12741f81	btrfs: use btrfs_fs_closing for background bg work For both unused bg deletion and async balance work we'll happily run if the fs is closing. However I want to move these to their own worker thread, and they can be long running jobs, so add a check to see if we're closing and simply bail. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Omar Sandoval	d1f68ba069	btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
David Sterba	7f298f224e	btrfs: sysfs: use sysfs_streq for string matching We have own string matching helper that duplicates what sysfs_streq does, with a slight difference that it skips initial whitespace. So far this is used for the drive allocation policy. The initial whitespace of written sysfs values should be rather discouraged and we should use a standard helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Qu Wenruo	f9eab5f0bb	btrfs: scrub: try to fix super block errors [BUG] The following script shows that, although scrub can detect super block errors, it never tries to fix it: mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2 xfs_io -c "pwrite 67108864 4k" $dev2 mount $dev1 $mnt btrfs scrub start -B $dev2 btrfs scrub start -Br $dev2 umount $mnt The first scrub reports the super error correctly: scrub done for f3289218-abd3-41ac-a630-202f766c0859 Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 But the second read-only scrub still reports the same super error: Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 [CAUSE] The comments already shows that super block can be easily fixed by committing a transaction: /* * If we find an error in a super block, we just report it. * They will get written with the next transaction commit * anyway */ But the truth is, such assumption is not always true, and since scrub should try to repair every error it found (except for read-only scrub), we should really actively commit a transaction to fix this. [FIX] Just commit a transaction if we found any super block errors, after everything else is done. We cannot do this just after scrub_supers(), as btrfs_commit_transaction() will try to pause and wait for the running scrub, thus we can not call it with scrub_lock hold. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Qu Wenruo	e69bf81c9a	btrfs: scrub: properly report super block errors in system log [PROBLEM] Unlike data/metadata corruption, if scrub detected some error in the super block, the only error message is from the updated device status: BTRFS info (device dm-1): scrub: started on devid 2 BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0 This is not helpful at all. [CAUSE] Unlike data/metadata error reporting, there is no visible report in kernel dmesg to report supper block errors. In fact, return value of scrub_checksum_super() is intentionally skipped, thus scrub_handle_errored_block() will never be called for super blocks. [FIX] Make super block errors to output an error message, now the full dmesg would looks like this: BTRFS info (device dm-1): scrub: started on devid 2 BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864 BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0 BTRFS info (device dm-1): scrub: started on devid 2 This fix involves: - Move the super_errors reporting to scrub_handle_errored_block() This allows the device status message to show after the super block error message. But now we no longer distinguish super block corruption and generation mismatch, now all counted as corruption. - Properly check the return value from scrub_checksum_super() - Add extra super block error reporting for scrub_print_warning(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Alexander Zhu	b0c582233a	btrfs: fix alignment of VMA for memory mapped files on THP With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for read-only mmapped files, such as shared libraries. However, the kernel makes no attempt to actually align those mappings on 2MB boundaries, which makes it impossible to use those THPs most of the time. This issue applies to general file mapping THP as well as existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using thp_get_unmapped_area for the unmapped_area function in btrfs, which is what ext2, ext4, fuse, and xfs all use. Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix alignment of VMA for memory mapped files on THP") as btrfs does not support DAX. However, commit `1854bc6e24` ("mm/readahead: Align file mappings for non-DAX") removed the DAX requirement. We should now be able to call thp_get_unmapped_area() for btrfs. The problem can be seen in /proc/PID/smaps where THPeligible is set to 0 on mappings to eligible shared object files as shown below. Before this patch: 7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856 /usr/lib64/libcrypto.so.1.1.1k Size: 2768 kB THPeligible: 0 VmFlags: rd ex mr mw me With this patch the library is mapped at a 2MB aligned address: fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856 /usr/lib64/libcrypto.so.1.1.1k Size: 2768 kB THPeligible: 1 VmFlags: rd ex mr mw me This fixes the alignment of VMAs for any mmap of a file that has the rd and ex permissions and size >= 2MB. The VMA alignment and THPeligible field for anonymous memory is handled separately and is thus not effected by this change. CC: stable@vger.kernel.org # 5.18+ Signed-off-by: Alexander Zhu <alexlzhu@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	5f4403e10f	btrfs: add lockdep annotations for the ordered extents wait event This wait event is very similar to the pending ordered wait event in the sense that it occurs in a different context than the condition signaling for the event. The signaling occurs in btrfs_remove_ordered_extent() while the wait event is implemented in btrfs_start_ordered_extent() in fs/btrfs/ordered-data.c However, in this case a thread must not acquire the lockdep map for the ordered extents wait event when the ordered extent is related to a free space inode. That is because lockdep creates dependencies between locks acquired both in execution paths related to normal inodes and paths related to free space inodes, thus leading to false positives. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	9d7464c87b	btrfs: change the lockdep class of free space inode's invalidate_lock Reinitialize the class of the lockdep map for struct inode's mapping->invalidate_lock in load_free_space_cache() function in fs/btrfs/free-space-cache.c. This will prevent lockdep from producing false positives related to execution paths that make use of free space inodes and paths that make use of normal inodes. Specifically, with this change lockdep will create separate lock dependencies that include the invalidate_lock, in the case that free space inodes are used and in the case that normal inodes are used. The lockdep class for this lock was first initialized in inode_init_always() in fs/inode.c. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	8b53779eaa	btrfs: add lockdep annotations for pending_ordered wait event In contrast to the num_writers and num_extwriters wait events, the condition for the pending ordered wait event is signaled in a different context from the wait event itself. The condition signaling occurs in btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait event is implemented in btrfs_commit_transaction() in fs/btrfs/transaction.c Thus the thread signaling the condition has to acquire the lockdep map as a reader at the start of btrfs_remove_ordered_extent() and release it after it has signaled the condition. In this case some dependencies might be left out due to the placement of the annotation, but it is better than no annotation at all. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	3e738c531a	btrfs: add lockdep annotations for transaction states wait events Add lockdep annotations for the transaction states that have wait events; 1) TRANS_STATE_COMMIT_START 2) TRANS_STATE_UNBLOCKED 3) TRANS_STATE_SUPER_COMMITTED 4) TRANS_STATE_COMPLETED The new macros introduced here to annotate the transaction states wait events have the same effect as the generic lockdep annotation macros. With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START the transaction thread has to acquire the lockdep maps for the transaction states as reader after the lockdep map for num_writers is released so that lockdep does not complain. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	5a9ba6709f	btrfs: add lockdep annotations for num_extwriters wait event Similarly to the num_writers wait event in fs/btrfs/transaction.c add a lockdep annotation for the num_extwriters wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	e1489b4fe6	btrfs: add lockdep annotations for num_writers wait event Annotate the num_writers wait event in fs/btrfs/transaction.c with lockdep in order to catch deadlocks involving this wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	ab9a323f9a	btrfs: add macros for annotating wait events with lockdep Introduce four macros that are used to annotate wait events in btrfs code with lockdep; 1) the btrfs_lockdep_init_map 2) the btrfs_lockdep_acquire, 3) the btrfs_lockdep_release 4) the btrfs_might_wait_for_event macros. The btrfs_lockdep_init_map macro is used to initialize a lockdep map. The btrfs_lockdep_<acquire,release> macros are used by threads to take the lockdep map as readers (shared lock) and release it, respectively. The btrfs_might_wait_for_event macro is used by threads to take the lockdep map as writers (exclusive lock) and release it. In general, the lockdep annotation for wait events work as follows: The condition for a wait event can be modified and signaled at the same time by multiple threads. These threads hold the lockdep map as readers when they enter a context in which blocking would prevent signaling the condition. Frequently, this occurs when a thread violates a condition (lockdep map acquire), before restoring it and signaling it at a later point (lockdep map release). The threads that block on the wait event take the lockdep map as writers (exclusive lock). These threads have to block until all the threads that hold the lockdep map as readers signal the condition for the wait event and release the lockdep map. The lockdep annotation is used to warn about potential deadlock scenarios that involve the threads that modify and signal the wait event condition and threads that block on the wait event. A simple example is illustrated below: Without lockdep: TA TB cond = false lock(A) wait_event(w, cond) unlock(A) lock(A) cond = true signal(w) unlock(A) With lockdep: TA TB rwsem_acquire_read(lockdep_map) cond = false lock(A) rwsem_acquire(lockdep_map) rwsem_release(lockdep_map) wait_event(w, cond) unlock(A) lock(A) cond = true signal(w) unlock(A) rwsem_release(lockdep_map) In the second case, with the lockdep annotation, lockdep would warn about an ABBA deadlock, while the first case would just deadlock at some point. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:52 +02:00
Qu Wenruo	62cd9d4474	btrfs: dump extra info if one free space cache has more bitmaps than it should There is an internal report on hitting the following ASSERT() in recalculate_thresholds(): ASSERT(ctl->total_bitmaps <= max_bitmaps); Above @max_bitmaps is calculated using the following variables: - bytes_per_bg 8 * 4096 * 4096 (128M) for x86_64/x86. - block_group->length The length of the block group. @max_bitmaps is the rounded up value of block_group->length / 128M. Normally one free space cache should not have more bitmaps than above value, but when it happens the ASSERT() can be triggered if CONFIG_BTRFS_ASSERT is also enabled. But the ASSERT() itself won't provide enough info to know which is going wrong. Is the bg too small thus it only allows one bitmap? Or is there something else wrong? So although I haven't found extra reports or crash dump to do further investigation, add the extra info to make it more helpful to debug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:52 +02:00
Gaosheng Cui	7a80bf902d	fanotify: Remove obsoleted fanotify_event_has_path() All uses of fanotify_event_has_path() have been removed since commit `9c61f3b560` ("fanotify: break up fanotify_alloc_event()"), now it is useless, so remove it. Link: https://lore.kernel.org/r/20220926023018.1505270-1-cuigaosheng1@huawei.com Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2022-09-26 11:28:40 +02:00
Ronnie Sahlberg	bb44c31cdc	cifs: destage dirty pages before re-reading them for cache=none This is the opposite case of kernel bugzilla 216301. If we mmap a file using cache=none and then proceed to update the mmapped area these updates are not reflected in a later pread() of that part of the file. To fix this we must first destage any dirty pages in the range before we allow the pread() to proceed. Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2022-09-25 17:31:28 -05:00
Enzo Matsumiya	09a1f9a168	cifs: return correct error in ->calc_signature() If an error happens while getting the key or session in the ->calc_signature implementations, 0 (success) is returned. Fix it by returning a proper error code. Since it seems to be highly unlikely to happen wrap the rc check in unlikely() too. Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Fixes: `32811d242f` ("cifs: Start using per session key for smb2/3 for signature generation") Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>	2022-09-25 17:01:50 -05:00
Jiangshan Yi	c3b6eed31f	cifs: misc: fix spelling typo in comment Fix spelling typo in comment. Reported-by: k2ci <kernel-bot@kylinos.cn> Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>	2022-09-25 17:01:50 -05:00
Linus Torvalds	5e049663f6	Ext4 regression and bug fixes: - Performance regression fix from 5.18 on a Rasberry Pi - Fix extent parsing bug which triggers a BUG_ON when a (corrupted) extent tree has has a non-root node when zero entries. - Fix a livelock where in the right (wrong) circumstances a large number of nfsd threads can try to write to a nearly full file system, and retry for hours(!) -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmMvsNEACgkQ8vlZVpUN gaMgsQf/b/JDCFXmki1/MLMvMP5LkX2rxTkq8P3lsZMP1yVOncSoir57jFvBWR6L h+So+k+ATfDIh3IeEf9S08deivRj6Se6BUwkewKwS8tPdmWUFpXr2TCGr4MTbkss 4TxBOoarC0RsOrxHdbgsZDnhn8FtR58AS1lAeW/cOur1QcKxXXTz1baDKiTqB7ru LmXaFc15U+wxvVkijTHA1/RgnMd96gR9ilj7NP/UKQCYe+CloYrJDyjASNBmk2xP ZQiaHiMKBBfsLpBaqCgbkDAfEwzcBYGs2LDiiJ+wlmOHerE0pEJRNlREV3/Xt39O KwULSjZUlMMnVtHKn3IfWtkpmWZ6cg== =2uNr -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Regression and bug fixes: - Performance regression fix from 5.18 on a Rasberry Pi - Fix extent parsing bug which triggers a BUG_ON when a (corrupted) extent tree has has a non-root node when zero entries. - Fix a livelock where in the right (wrong) circumstances a large number of nfsd threads can try to write to a nearly full file system, and retry for hours(!)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: limit the number of retries after discarding preallocations blocks ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0 ext4: use buckets for cr 1 block scan instead of rbtree ext4: use locality group preallocation for small closed files ext4: make directory inode spreading reflect flexbg size ext4: avoid unnecessary spreading of allocations among groups ext4: make mballoc try target group first even with mb_optimize_scan	2022-09-25 09:03:31 -07:00
Linus Torvalds	4207d59567	dax-and-nvdimm-fixes-v6.0-final - Fix a infinite loop bug in fsdax - Fix memory-type detection for devdax (EINJ regression) - Small cleanups -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYy+skgAKCRDfioYZHlFs Zxw3AQCVDbuMh2wBUSJP4G4XF+ZHM+FntpUmawcmUFsEjt0fGQEA9NCjOoj7jLlP BrI1OtbskTLAMzJeRI3qY3irV4iHsAI= =28T9 -----END PGP SIGNATURE----- Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull NVDIMM and DAX fixes from Dan Williams: "A recently discovered one-line fix for devdax that further addresses a v5.5 regression, and (a bit embarrassing) a small batch of fixes that have been sitting in my fixes tree for weeks. The older fixes have soaked in linux-next during that time and address an fsdax infinite loop and some other minor fixups. - Fix a infinite loop bug in fsdax - Fix memory-type detection for devdax (EINJ regression) - Small cleanups" * tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: devdax: Fix soft-reservation memory description fsdax: Fix infinite loop in dax_iomap_rw() nvdimm/namespace: drop nested variable in create_namespace_pmem() ndtest: Cleanup all of blk namespace specific code pmem: fix a name collision	2022-09-25 08:53:52 -07:00
Dan Williams	b3bbcc5d1d	Merge branch 'for-6.0/dax' into libnvdimm-fixes Pick up another "Soft Reservation" fix for v6.0-final on top of some straggling nvdimm fixes that missed v5.19.	2022-09-24 18:14:12 -07:00
ChenXiaoSong	19029f3f47	debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code. Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Link: https://lore.kernel.org/r/20220923102554.2443452-1-chenxiaosong2@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-24 15:00:48 +02:00
Brian Norris	b8de524ce4	debugfs: Only clobber mode/uid/gid on remount if asked Users may have explicitly configured their debugfs permissions; we shouldn't overwrite those just because a second mount appeared. Only clobber if the options were provided at mount time. Existing behavior: ## Pre-existing status: debugfs is 0755. # chmod 755 /sys/kernel/debug/ # stat -c '%A' /sys/kernel/debug/ drwxr-xr-x ## New mount sets kernel-default permissions: # mount -t debugfs none /mnt/foo # stat -c '%A' /mnt/foo drwx------ ## Unexpected: the original mount changed permissions: # stat -c '%A' /sys/kernel/debug drwx------ New behavior: ## Pre-existing status: debugfs is 0755. # chmod 755 /sys/kernel/debug/ # stat -c '%A' /sys/kernel/debug/ drwxr-xr-x ## New mount inherits existing permissions: # mount -t debugfs none /mnt/foo # stat -c '%A' /mnt/foo drwxr-xr-x ## Expected: old mount is unchanged: # stat -c '%A' /sys/kernel/debug drwxr-xr-x Full test cases are being submitted to LTP. Signed-off-by: Brian Norris <briannorris@chromium.org> Link: https://lore.kernel.org/r/20220912163042.v3.1.Icbd40fce59f55ad74b80e5d435ea233579348a78@changeid Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-24 14:01:37 +02:00
Christian A. Ehrhardt	4abc996528	kernfs: fix use-after-free in __kernfs_remove Syzkaller managed to trigger concurrent calls to kernfs_remove_by_name_ns() for the same file resulting in a KASAN detected use-after-free. The race occurs when the root node is freed during kernfs_drain(). To prevent this acquire an additional reference for the root of the tree that is removed before calling __kernfs_remove(). Found by syzkaller with the following reproducer (slab_nomerge is required): syz_mount_image$ext4(0x0, &(0x7f0000000100)='./file0\x00', 0x100000, 0x0, 0x0, 0x0, 0x0) r0 = openat(0xffffffffffffff9c, &(0x7f0000000080)='/proc/self/exe\x00', 0x0, 0x0) close(r0) pipe2(&(0x7f0000000140)={0xffffffffffffffff, <r1=>0xffffffffffffffff}, 0x800) mount$9p_fd(0x0, &(0x7f0000000040)='./file0\x00', &(0x7f00000000c0), 0x408, &(0x7f0000000280)={'trans=fd,', {'rfdno', 0x3d, r0}, 0x2c, {'wfdno', 0x3d, r1}, 0x2c, {[{@cache_loose}, {@mmap}, {@loose}, {@loose}, {@mmap}], [{@mask={'mask', 0x3d, '^MAY_EXEC'}}, {@fsmagic={'fsmagic', 0x3d, 0x10001}}, {@dont_hash}]}}) Sample report: ================================================================== BUG: KASAN: use-after-free in kernfs_type include/linux/kernfs.h:335 [inline] BUG: KASAN: use-after-free in kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline] BUG: KASAN: use-after-free in __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369 Read of size 2 at addr ffff8880088807f0 by task syz-executor.2/857 CPU: 0 PID: 857 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x6e/0x91 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x5e/0x5e5 mm/kasan/report.c:433 kasan_report+0xa3/0x130 mm/kasan/report.c:495 kernfs_type include/linux/kernfs.h:335 [inline] kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline] __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369 __kernfs_remove fs/kernfs/dir.c:1356 [inline] kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589 sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335 p9_client_create+0xd4d/0x1190 net/9p/client.c:993 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610 vfs_get_tree+0x85/0x2e0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x675/0x1d00 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x282/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f725f983aed Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f725f0f7028 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007f725faa3f80 RCX: 00007f725f983aed RDX: 00000000200000c0 RSI: 0000000020000040 RDI: 0000000000000000 RBP: 00007f725f9f419c R08: 0000000020000280 R09: 0000000000000000 R10: 0000000000000408 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000006 R14: 00007f725faa3f80 R15: 00007f725f0d7000 </TASK> Allocated by task 855: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:45 [inline] set_alloc_info mm/kasan/common.c:437 [inline] __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:470 kasan_slab_alloc include/linux/kasan.h:224 [inline] slab_post_alloc_hook mm/slab.h:727 [inline] slab_alloc_node mm/slub.c:3243 [inline] slab_alloc mm/slub.c:3251 [inline] __kmem_cache_alloc_lru mm/slub.c:3258 [inline] kmem_cache_alloc+0xbf/0x200 mm/slub.c:3268 kmem_cache_zalloc include/linux/slab.h:723 [inline] __kernfs_new_node+0xd4/0x680 fs/kernfs/dir.c:593 kernfs_new_node fs/kernfs/dir.c:655 [inline] kernfs_create_dir_ns+0x9c/0x220 fs/kernfs/dir.c:1010 sysfs_create_dir_ns+0x127/0x290 fs/sysfs/dir.c:59 create_dir lib/kobject.c:63 [inline] kobject_add_internal+0x24a/0x8d0 lib/kobject.c:223 kobject_add_varg lib/kobject.c:358 [inline] kobject_init_and_add+0x101/0x160 lib/kobject.c:441 sysfs_slab_add+0x156/0x1e0 mm/slub.c:5954 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335 p9_client_create+0xd4d/0x1190 net/9p/client.c:993 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610 vfs_get_tree+0x85/0x2e0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x675/0x1d00 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x282/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 857: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:45 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:367 [inline] ____kasan_slab_free mm/kasan/common.c:329 [inline] __kasan_slab_free+0x108/0x190 mm/kasan/common.c:375 kasan_slab_free include/linux/kasan.h:200 [inline] slab_free_hook mm/slub.c:1754 [inline] slab_free_freelist_hook mm/slub.c:1780 [inline] slab_free mm/slub.c:3534 [inline] kmem_cache_free+0x9c/0x340 mm/slub.c:3551 kernfs_put.part.0+0x2b2/0x520 fs/kernfs/dir.c:547 kernfs_put+0x42/0x50 fs/kernfs/dir.c:521 __kernfs_remove.part.0+0x72d/0x960 fs/kernfs/dir.c:1407 __kernfs_remove fs/kernfs/dir.c:1356 [inline] kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589 sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335 p9_client_create+0xd4d/0x1190 net/9p/client.c:993 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610 vfs_get_tree+0x85/0x2e0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x675/0x1d00 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x282/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff888008880780 which belongs to the cache kernfs_node_cache of size 128 The buggy address is located 112 bytes inside of 128-byte region [ffff888008880780, ffff888008880800) The buggy address belongs to the physical page: page:00000000732833f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8880 flags: 0x100000000000200(slab\|node=0\|zone=1) raw: 0100000000000200 0000000000000000 dead000000000122 ffff888001147280 raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888008880680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb ffff888008880700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888008880780: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888008880800: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb ffff888008880880: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ================================================================== Acked-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@kernel.org> # -rc3 Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Link: https://lore.kernel.org/r/20220913121723.691454-1-lk@c--e.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-24 13:52:27 +02:00
Greg Kroah-Hartman	ec9c88070d	Merge `1707c39ae3` ("Merge tag 'driver-core-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core") driver-core-next This merges the driver core changes in 6.0-rc7 into driver-core-next as they are needed here as well for testing. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-24 13:32:01 +02:00
Miklos Szeredi	7d37539037	fuse: implement ->tmpfile() This is basically equivalent to the FUSE_CREATE operation which creates and opens a regular file. Add a new FUSE_TMPFILE operation, otherwise just reuse the protocol and the code for FUSE_CREATE. Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Miklos Szeredi	863f144f12	vfs: open inside ->tmpfile() This is in preparation for adding tmpfile support to fuse, which requires that the tmpfile creation and opening are done as a single operation. Replace the 'struct dentry ' argument of i_op->tmpfile with 'struct file '. Call finish_open_simple() as the last thing in ->tmpfile() instances (may be omitted in the error case). Change d_tmpfile() argument to 'struct file *' as well to make callers more readable. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Miklos Szeredi	9751b33865	vfs: move open right after ->tmpfile() Create a helper finish_open_simple() that opens the file with the original dentry. Handle the error case here as well to simplify callers. Call this helper right after ->tmpfile() is called. Next patch will change the tmpfile API and move this call into tmpfile instances. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Miklos Szeredi	3e9d4c5935	vfs: make vfs_tmpfile() static No callers outside of fs/namei.c anymore. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Miklos Szeredi	2b1a77461f	ovl: use vfs_tmpfile_open() helper If tmpfile is used for copy up, then use this helper to create the tmpfile and open it at the same time. This will later allow filesystems such as fuse to do this operation atomically. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Miklos Szeredi	24a81759b6	cachefiles: use vfs_tmpfile_open() helper Use the vfs_tmpfile_open() helper instead of doing tmpfile creation and opening separately. The only minor difference is that previously no permission checking was done, while vfs_tmpfile_open() will call may_open() with zero access mask (i.e. no access is checked). Even if this would make a difference with callers caps (don't see how it could, even in the LSM codepaths) cachfiles raises caps before performing the tmpfile creation, so this extra permission check will not result in any regression. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Miklos Szeredi	08d7a6fb7e	cachefiles: only pass inode to *mark_inode_inuse() helpers The only reason to pass dentry was because of a pr_notice() text. Move that to the two callers where it makes sense and add a WARN_ON() to the third. file_inode(file) is never NULL on an opened file. Remove check in cachefiles_unmark_inode_in_use(). Do not open code cachefiles_do_unmark_inode_in_use() in cachefiles_put_directory(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 06:59:59 +02:00
Miklos Szeredi	38017d4444	cachefiles: tmpfile error handling cleanup Separate the error labels from the success path and use 'ret' to store the error value before jumping to the error label. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 06:59:59 +02:00
Al Viro	19ee5345f2	hugetlbfs: cleanup mknod and tmpfile Duplicate the few lines that are shared between hugetlbfs_mknod() and hugetlbfs_tmpfile(). This is a prerequisite for sanely changing the signature of ->tmpfile(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 06:59:59 +02:00
Miklos Szeredi	22873deac9	vfs: add vfs_tmpfile_open() helper This helper unifies tmpfile creation with opening. Existing vfs_tmpfile() callers outside of fs/namei.c will be converted to using this helper. There are two such callers: cachefile and overlayfs. The cachefiles code currently uses the open_with_fake_path() helper to open the tmpfile, presumably to disable accounting of the open file. Overlayfs uses tmpfile for copy_up, which means these struct file instances will be short lived, hence it doesn't really matter if they are accounted or not. Disable accounting in this helper too, which should be okay for both callers. Add MAY_OPEN permission checking for consistency. Like for create(2) read/write permissions are not checked. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 06:59:59 +02:00
Yue Hu	fdffc091e6	erofs: support interlaced uncompressed data for compressed files Currently, uncompressed data is all handled in the shifted way, which means we have to shift the whole on-disk plain pcluster to get the logical data. However, since we are also using in-place I/O for uncompressed data, data copy will be reduced a lot if pcluster is recorded in the interlaced way as illustrated below: _______________________________________________________________ \| \| \| \|_ tail part \|_ head part _\| \|<- blk0 ->\| .. \|<- blkn-2 ->\|<- blkn-1 ->\| The logical data then becomes: ________________________________________________________ \|_ head part _\|_ blk0 _\| .. \|_ blkn-2 _\|_ tail part _\| In addition, non-4k plain pclusters are also survived by the interlaced way, which can be used for non-4k lclusters as well. However, it's almost impossible to de-duplicate uncompressed data in the interlaced way, therefore shifted uncompressed data is still useful. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/8369112678604fdf4ef796626d59b1fdd0745a53.1663898962.git.huyue2@coolpad.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-23 10:55:56 +08:00
Jingbo Xu	1ae9470c3e	erofs: clean up .read_folio() and .readahead() in fscache mode The implementation of these two functions in fscache mode is almost the same. Extract the same part as a generic helper to remove the code duplication. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20220922062414.20437-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-09-23 09:52:42 +08:00
Jakub Kicinski	0140a7168f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/freescale/fec.h `7b15515fc1` ("Revert "fec: Restart PPS after link state change"") `40c79ce13b` ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c `c297561bc9` ("pinctrl: ocelot: Fix interrupt controller") `181f604b33` ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile `bbb774d921` ("net: Add tests for bonding and team address list management") `152e8ec776` ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c `5440428b3d` ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") `45dfa45f52` ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 13:02:10 -07:00
Theodore Ts'o	80fa46d6b9	ext4: limit the number of retries after discarding preallocations blocks This patch avoids threads live-locking for hours when a large number threads are competing over the last few free extents as they blocks getting added and removed from preallocation pools. From our bug reporter: A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed. When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this. The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read. The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2022-09-22 10:51:19 -04:00
Luís Henriques	29a5b8a137	ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0 When walking through an inode extents, the ext4_ext_binsearch_idx() function assumes that the extent header has been previously validated. However, there are no checks that verify that the number of entries (eh->eh_entries) is non-zero when depth is > 0. And this will lead to problems because the EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this: [ 135.245946] ------------[ cut here ]------------ [ 135.247579] kernel BUG at fs/ext4/extents.c:2258! [ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP [ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4 [ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0 [ 135.256475] Code: [ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246 [ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023 [ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c [ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c [ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024 [ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000 [ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 [ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0 [ 135.277952] Call Trace: [ 135.278635] <TASK> [ 135.279247] ? preempt_count_add+0x6d/0xa0 [ 135.280358] ? percpu_counter_add_batch+0x55/0xb0 [ 135.281612] ? _raw_read_unlock+0x18/0x30 [ 135.282704] ext4_map_blocks+0x294/0x5a0 [ 135.283745] ? xa_load+0x6f/0xa0 [ 135.284562] ext4_mpage_readpages+0x3d6/0x770 [ 135.285646] read_pages+0x67/0x1d0 [ 135.286492] ? folio_add_lru+0x51/0x80 [ 135.287441] page_cache_ra_unbounded+0x124/0x170 [ 135.288510] filemap_get_pages+0x23d/0x5a0 [ 135.289457] ? path_openat+0xa72/0xdd0 [ 135.290332] filemap_read+0xbf/0x300 [ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40 [ 135.292192] new_sync_read+0x103/0x170 [ 135.293014] vfs_read+0x15d/0x180 [ 135.293745] ksys_read+0xa1/0xe0 [ 135.294461] do_syscall_64+0x3c/0x80 [ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0 This patch simply adds an extra check in __ext4_ext_check(), verifying that eh_entries is not 0 when eh_depth is > 0. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941 Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283 Cc: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-22 10:50:54 -04:00
Christoph Hellwig	0e91fc1e0f	fscrypt: work on block_devices instead of request_queues request_queues are a block layer implementation detail that should not leak into file systems. Change the fscrypt inline crypto code to retrieve block devices instead of request_queues from the file system. As part of that, clean up the interaction with multi-device file systems by returning both the number of devices and the actual device array in a single method call. Signed-off-by: Christoph Hellwig <hch@lst.de> [ebiggers: bug fixes and minor tweaks] Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220901193208.138056-4-ebiggers@kernel.org	2022-09-21 20:33:06 -07:00
Eric Biggers	22e9947a4b	fscrypt: stop holding extra request_queue references Now that the fscrypt_master_key lifetime has been reworked to not be subject to the quirks of the keyrings subsystem, blk_crypto_evict_key() no longer gets called after the filesystem has already been unmounted. Therefore, there is no longer any need to hold extra references to the filesystem's request_queue(s). (And these references didn't always do their intended job anyway, as pinning a request_queue doesn't necessarily pin the corresponding blk_crypto_profile.) Stop taking these extra references. Instead, just pass the super_block to fscrypt_destroy_inline_crypt_key(), and use it to get the list of block devices the key needs to be evicted from. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220901193208.138056-3-ebiggers@kernel.org	2022-09-21 20:33:06 -07:00
Eric Biggers	d7e7b9af10	fscrypt: stop using keyrings subsystem for fscrypt_master_key The approach of fs/crypto/ internally managing the fscrypt_master_key structs as the payloads of "struct key" objects contained in a "struct key" keyring has outlived its usefulness. The original idea was to simplify the code by reusing code from the keyrings subsystem. However, several issues have arisen that can't easily be resolved: - When a master key struct is destroyed, blk_crypto_evict_key() must be called on any per-mode keys embedded in it. (This started being the case when inline encryption support was added.) Yet, the keyrings subsystem can arbitrarily delay the destruction of keys, even past the time the filesystem was unmounted. Therefore, currently there is no easy way to call blk_crypto_evict_key() when a master key is destroyed. Currently, this is worked around by holding an extra reference to the filesystem's request_queue(s). But it was overlooked that the request_queue reference is not guaranteed to pin the corresponding blk_crypto_profile too; for device-mapper devices that support inline crypto, it doesn't. This can cause a use-after-free. - When the last inode that was using an incompletely-removed master key is evicted, the master key removal is completed by removing the key struct from the keyring. Currently this is done via key_invalidate(). Yet, key_invalidate() takes the key semaphore. This can deadlock when called from the shrinker, since in fscrypt_ioctl_add_key(), memory is allocated with GFP_KERNEL under the same semaphore. - More generally, the fact that the keyrings subsystem can arbitrarily delay the destruction of keys (via garbage collection delay, or via random processes getting temporary key references) is undesirable, as it means we can't strictly guarantee that all secrets are ever wiped. - Doing the master key lookups via the keyrings subsystem results in the key_permission LSM hook being called. fscrypt doesn't want this, as all access control for encrypted files is designed to happen via the files themselves, like any other files. The workaround which SELinux users are using is to change their SELinux policy to grant key search access to all domains. This works, but it is an odd extra step that shouldn't really have to be done. The fix for all these issues is to change the implementation to what I should have done originally: don't use the keyrings subsystem to keep track of the filesystem's fscrypt_master_key structs. Instead, just store them in a regular kernel data structure, and rework the reference counting, locking, and lifetime accordingly. Retain support for RCU-mode key lookups by using a hash table. Replace fscrypt_sb_free() with fscrypt_sb_delete(), which releases the keys synchronously and runs a bit earlier during unmount, so that block devices are still available. A side effect of this patch is that neither the master keys themselves nor the filesystem keyrings will be listed in /proc/keys anymore. ("Master key users" and the master key users keyrings will still be listed.) However, this was mostly an implementation detail, and it was intended just for debugging purposes. I don't know of anyone using it. This patch does not change how "master key users" (->mk_users) works; that still uses the keyrings subsystem. That is still needed for key quotas, and changing that isn't necessary to solve the issues listed above. If we decide to change that too, it would be a separate patch. I've marked this as fixing the original commit that added the fscrypt keyring, but as noted above the most important issue that this patch fixes wasn't introduced until the addition of inline encryption support. Fixes: `22d94f493b` ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220901193208.138056-2-ebiggers@kernel.org	2022-09-21 20:33:06 -07:00
Jan Kara	83e80a6e35	ext4: use buckets for cr 1 block scan instead of rbtree Using rbtree for sorting groups by average fragment size is relatively expensive (needs rbtree update on every block freeing or allocation) and leads to wide spreading of allocations because selection of block group is very sentitive both to changes in free space and amount of blocks allocated. Furthermore selecting group with the best matching average fragment size is not necessary anyway, even more so because the variability of fragment sizes within a group is likely large so average is not telling much. We just need a group with large enough average fragment size so that we have high probability of finding large enough free extent and we don't want average fragment size to be too big so that we are likely to find free extent only somewhat larger than what we need. So instead of maintaing rbtree of groups sorted by fragment size keep bins (lists) or groups where average fragment size is in the interval [2^i, 2^(i+1)). This structure requires less updates on block allocation / freeing, generally avoids chaotic spreading of allocations into block groups, and still is able to quickly (even faster that the rbtree) provide a block group which is likely to have a suitably sized free space extent. This patch reduces number of block groups used when untarring archive with medium sized files (size somewhat above 64k which is default mballoc limit for avoiding locality group preallocation) to about half and thus improves write speeds for eMMC flash significantly. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:12:03 -04:00
Jan Kara	a9f2a2931d	ext4: use locality group preallocation for small closed files Curently we don't use any preallocation when a file is already closed when allocating blocks (from writeback code when converting delayed allocation). However for small files, using locality group preallocation is actually desirable as that is not specific to a particular file. Rather it is a method to pack small files together to reduce fragmentation and for that the fact the file is closed is actually even stronger hint the file would benefit from packing. So change the logic to allow locality group preallocation in this case. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:12:00 -04:00
Jan Kara	613c5a8589	ext4: make directory inode spreading reflect flexbg size Currently the Orlov inode allocator searches for free inodes for a directory only in flex block groups with at most inodes_per_group/16 more directory inodes than average per flex block group. However with growing size of flex block group this becomes unnecessarily strict. Scale allowed difference from average directory count per flex block group with flex block group size as we do with other metrics. Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:11:55 -04:00
Jan Kara	1940265ede	ext4: avoid unnecessary spreading of allocations among groups mb_set_largest_free_order() updates lists containing groups with largest chunk of free space of given order. The way it updates it leads to always moving the group to the tail of the list. Thus allocations looking for free space of given order effectively end up cycling through all groups (and due to initialization in last to first order). This spreads allocations among block groups which reduces performance for rotating disks or low-end flash media. Change mb_set_largest_free_order() to only update lists if the order of the largest free chunk in the group changed. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:11:41 -04:00
Jan Kara	4fca50d440	ext4: make mballoc try target group first even with mb_optimize_scan One of the side-effects of mb_optimize_scan was that the optimized functions to select next group to try were called even before we tried the goal group. As a result we no longer allocate files close to corresponding inodes as well as we don't try to expand currently allocated extent in the same group. This results in reaim regression with workfile.disk workload of upto 8% with many clients on my test machine: baseline mb_optimize_scan Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%) Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%* Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%* Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%* Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%* Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%) Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%) Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%* Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%* Also this is causing huge regression (time increased by a factor of 5 or so) when untarring archive with lots of small files on some eMMC storage cards. Fix the problem by making sure we try goal group first. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/ Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:11:34 -04:00
Dylan Yudaken	9f0deaa12d	eventfd: guard wake_up in eventfd fs calls as well Guard wakeups that the user can trigger, and that may end up triggering a call back into eventfd_signal. This is in addition to the current approach that only guards in eventfd_signal. Rename in_eventfd_signal -> in_eventfd at the same time to reflect this. Without this there would be a deadlock in the following code using libaio: int main() { struct io_context ctx = NULL; struct iocb iocb; struct iocb iocbs[] = { &iocb }; int evfd; uint64_t val = 1; evfd = eventfd(0, EFD_CLOEXEC); assert(!io_setup(2, &ctx)); io_prep_poll(&iocb, evfd, POLLIN); io_set_eventfd(&iocb, evfd); assert(1 == io_submit(ctx, 1, iocbs)); write(evfd, &val, 8); } Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220816135959.1490641-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 10:30:42 -06:00
Linus Torvalds	48062bb264	Description for this pull request: - fix integer overflow on large partition. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmMqhUQWHGxpbmtpbmpl b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCL8tD/9IPXYUS0Xntv0scUSVRMg/jO1I m/5c5hwDlggMa2oqaX1E9MfD1cXygFb0w33Am6CE+UWwQ02UvmLbOdL8AM9Nm6pd KBL2EfNULb1Ak7oIkat0yTGpHw0vFjBP9/9Nq3Q13JQmpCos2GuDbjtpWuRbVVAI u7zScdQCkgrugEBjLSar+HhGAOhq42inmSBoKTckQGFC0sUfWP8FUFM2PMMSD7TB ucjCAYOUT+9aRDAVjazGQc7+BYnF8EWukiwyb7yYelm1/RxdBpc7Or+/vzDXP3Yw 9F4bDaaFnVUX9l6n6iL/LlorMqOsNdrp3t8nnSctSuHqpfFEsDZFTcOciDM7Y3ZO 1sE4F8QK628G66huO85lkXowUklProyKGNeKe7XWX4+uwC+AB5FzuOX0Iw9jXy/a fvbkTMaeru0RJwlGDG6d5aaNZEhlJUxDf0Jg9Cw8VIwiBoSdNVIcyjq7U6rDm2fH +4kUxozXT8c88jSLRoF7MmdpX86OKmU5lAlsYRmUXcxIfDqv9eOhYKkl+C5sjHaB i2b1rocDLy6jWzspP8DWYnlwuytbcKLyxjIsHi99fvJcWj+iuFMnsy6qD3ksSasi mdYyud1GQDPKBNvPMrXYeqFjZbEFTp7il2SrFruqkFtBqRvkxMJQPS6Vv8Mn9am+ 3g8VKzhaHdjcUtka8g== =5Iy0 -----END PGP SIGNATURE----- Merge tag 'exfat-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: - fix integer overflow on large partitions * tag 'exfat-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix overflow for large capacity partition	2022-09-21 08:40:57 -07:00
Christian Brauner	38e316398e	xattr: always us is_posix_acl_xattr() helper The is_posix_acl_xattr() helper was added in `0c5fd887d2` ("acl: move idmapped mount fixup into vfs_{g,s}etxattr()") to remove the open-coded checks for POSIX ACLs. We missed to update two locations. Switch them to use the helper. Cc: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-09-21 12:01:29 +02:00
Zhihao Cheng	a0c5156573	ubifs: Fix AA deadlock when setting xattr for encrypted file Following process: vfs_setxattr(host) ubifs_xattr_set down_write(host_ui->xattr_sem) <- lock first time create_xattr ubifs_new_inode(host) fscrypt_prepare_new_inode(host) fscrypt_policy_to_inherit(host) if (IS_ENCRYPTED(inode)) fscrypt_require_key(host) fscrypt_get_encryption_info(host) ubifs_xattr_get(host) down_read(host_ui->xattr_sem) <- AA deadlock , which may trigger an AA deadlock problem: [ 102.620871] INFO: task setfattr:1599 blocked for more than 10 seconds. [ 102.625298] Not tainted 5.19.0-rc7-00001-gb666b6823ce0-dirty #711 [ 102.628732] task:setfattr state:D stack: 0 pid: 1599 [ 102.628749] Call Trace: [ 102.628753] <TASK> [ 102.628776] __schedule+0x482/0x1060 [ 102.629964] schedule+0x92/0x1a0 [ 102.629976] rwsem_down_read_slowpath+0x287/0x8c0 [ 102.629996] down_read+0x84/0x170 [ 102.630585] ubifs_xattr_get+0xd1/0x370 [ubifs] [ 102.630730] ubifs_crypt_get_context+0x1f/0x30 [ubifs] [ 102.630791] fscrypt_get_encryption_info+0x7d/0x1c0 [ 102.630810] fscrypt_policy_to_inherit+0x56/0xc0 [ 102.630817] fscrypt_prepare_new_inode+0x35/0x160 [ 102.630830] ubifs_new_inode+0xcc/0x4b0 [ubifs] [ 102.630873] ubifs_xattr_set+0x591/0x9f0 [ubifs] [ 102.630961] xattr_set+0x8c/0x3e0 [ubifs] [ 102.631003] __vfs_setxattr+0x71/0xc0 [ 102.631026] vfs_setxattr+0x105/0x270 [ 102.631034] do_setxattr+0x6d/0x110 [ 102.631041] setxattr+0xa0/0xd0 [ 102.631087] __x64_sys_setxattr+0x2f/0x40 Fetch a reproducer in [Link]. Just like ext4 does, which skips encrypting for inode with EXT4_EA_INODE_FL flag. Stop encypting xattr inode for ubifs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216260 Fixes: `f4e3634a3b` ("ubifs: Fix races between xattr_{set\|get} ...") Fixes: `d475a50745` ("ubifs: Add skeleton for fscrypto") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2022-09-21 11:32:58 +02:00
ZhaoLong Wang	713346ca1d	ubifs: Fix UBIFS ro fail due to truncate in the encrypted directory The ubifs_compress() function does not compress the data When the data length is short than 128 bytes or the compressed data length is not ideal.It cause that the compressed length of the truncated data in the truncate_data_node() function may be greater than the length of the raw data read from the flash. The above two lengths are transferred to the ubifs_encrypt() function as parameters. This may lead to assertion fails and then the file system becomes read-only. This patch use the actual length of the data in the memory as the input parameter for assert comparison, which avoids the problem. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216213 Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2022-09-21 11:32:38 +02:00

... 4 5 6 7 8 ...

78669 Commits