linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-02 00:51:44 +00:00

Author	SHA1	Message	Date
Linus Torvalds	815409a12c	overlayfs update for 5.15 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2 aF0NomiXtJccE+S9+byjVCyqSzQJGQQ= =6L2Y -----END PGP SIGNATURE----- Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Copy up immutable/append/sync/noatime attributes (Amir Goldstein) - Improve performance by enabling RCU lookup. - Misc fixes and improvements The reason this touches so many files is that the ->get_acl() method now gets a "bool rcu" argument. The ->get_acl() API was updated based on comments from Al and Linus: Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/ * tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: enable RCU'd ->get_acl() vfs: add rcu argument to ->get_acl() callback ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup() ovl: use kvalloc in xattr copy-up ovl: update ctime when changing fileattr ovl: skip checking lower file's i_writecount on truncate ovl: relax lookup error on mismatch origin ftype ovl: do not set overlay.opaque for new directories ovl: add ovl_allow_offline_changes() helper ovl: disable decoding null uuid with redirect_dir ovl: consistent behavior for immutable/append-only inodes ovl: copy up sync/noatime fileattr flags ovl: pass ovl_fs to ovl_check_setxattr() fs: add generic helper for filling statx attribute flags	2021-09-02 09:21:27 -07:00
Linus Torvalds	0ee7c3e25d	New code for 5.15: - Simplify the bio_end_page usage in the buffered IO code. - Support reading inline data at nonzero offsets for erofs. - Fix some typos and bad grammar. - Convert kmap_atomic usage in the inline data read path. - Add some extra inline data input checking. - Fix a memory corruption bug stemming from iomap_swapfile_activate trying to activate more pages than mm was expecting. - Pass errnos through the page writeback code so that writeback errors are reported correctly instead of being munged to EIO. - Replace iomap_apply with a open-coded iterator loops to reduce the number of indirect calls by a third to a half. - Refactor the fsdax code to use iomap iterators instead of the open-coded iomap_apply code that it had before. - Format file range iomap tracepoint data in hexadecimal and standardize the names used in the pretty-print string. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEnwC0ACgkQ+H93GTRK tOtVOQ//Zu9ul2ZmPARMV8xyAfopnLpmggREOthFbPkDZ3z3ZgRpPxlbAvWEEKnj VDNLFNj204rDojuxP/YSdxgiawLod7dYfXIwwft8R8oI7MdgVQhpvimUi5bkz/Od X5pmFDe84INfFvEztOgC+sPk1RI/ToQLgrcIffWMWfF2iyVkNVMCD5MMe6LoH1la 9GbVCfPx6Y2Nffaa8EuAEgaCo7FMPc81bvQG4qpeqXyX8qql/r5n4YENhkn3n4qa zI4F2lgqwbelFkamZOYNDjtLt13lb7Ze0PoFOpmTZUqlyybqhRxDvJ+OxZn8W6zH 20pxWx/RCXhCp/sS6DRcYyf7WKoIfdGDkxed7aSuhJ+VKKtBtsjMoy7dh5IY5RJa 8L1DMat6xtea8Glx04SF7Vib0n/An9oHOTzLEWxsUlRaPhW68uVpKgXuGLTAf+dc ztJhlQ9pLX0D2NmgGlkXN8d4F1XEH2BgyIrtF6UNtMbyIlCREHM9HELJs6JzKl6U a4ivJXyaq8o/hlXr8IMWUOTVubS0i+hgvvQjnVJmcSTJxhH10mPPJLnNsGX6heD9 SlnnXRbD03iqsbMJP/R431VKooryOSKBc86IEECkuMz3RUfw75DGAnLtETnT1rsA 71rSVf5NaCGZ2hV4du6jv53TS7yrPpqkxJHyDWD1WP4xGPbO1XA= =iVns -----END PGP SIGNATURE----- Merge tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull iomap updates from Darrick Wong: "The most notable externally visible change for this cycle is the addition of support for reads to inline tail fragments of files, which was requested by the erofs developers; and a correction for a kernel memory corruption bug if the sysadmin tries to activate a swapfile with more pages than the swapfile header suggests. We also now report writeback completion errors to the file mapping correctly, instead of munging all errors into EIO. Internally, the bulk of the changes are Christoph's patchset to reduce the indirect function call count by a third to a half by converting iomap iteration from a loop pattern to a generator/consumer pattern. As an added bonus, fsdax no longer open-codes iomap apply loops. Summary: - Simplify the bio_end_page usage in the buffered IO code. - Support reading inline data at nonzero offsets for erofs. - Fix some typos and bad grammar. - Convert kmap_atomic usage in the inline data read path. - Add some extra inline data input checking. - Fix a memory corruption bug stemming from iomap_swapfile_activate trying to activate more pages than mm was expecting. - Pass errnos through the page writeback code so that writeback errors are reported correctly instead of being munged to EIO. - Replace iomap_apply with a open-coded iterator loops to reduce the number of indirect calls by a third to a half. - Refactor the fsdax code to use iomap iterators instead of the open-coded iomap_apply code that it had before. - Format file range iomap tracepoint data in hexadecimal and standardize the names used in the pretty-print string" * tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits) iomap: standardize tracepoint formatting and storage mm/swap: consider max pages in iomap_swapfile_add_extent iomap: move loop control code to iter.c iomap: constify iomap_iter_srcmap fsdax: switch the fault handlers to use iomap_iter fsdax: factor out a dax_fault_actor() helper fsdax: factor out helpers to simplify the dax fault code iomap: rework unshare flag iomap: pass an iomap_iter to various buffered I/O helpers iomap: remove iomap_apply fsdax: switch dax_iomap_rw to use iomap_iter iomap: switch iomap_swapfile_activate to use iomap_iter iomap: switch iomap_seek_data to use iomap_iter iomap: switch iomap_seek_hole to use iomap_iter iomap: switch iomap_bmap to use iomap_iter iomap: switch iomap_fiemap to use iomap_iter iomap: switch __iomap_dio_rw to use iomap_iter iomap: switch iomap_page_mkwrite to use iomap_iter iomap: switch iomap_zero_range to use iomap_iter iomap: switch iomap_file_unshare to use iomap_iter ...	2021-08-31 11:13:35 -07:00
Linus Torvalds	87045e6546	for-5.15-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U /bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7 ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU= =ALO9 -----END PGP SIGNATURE----- Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The highlights of this round are integrations with fs-verity and idmapped mounts, the rest is usual mix of minor improvements, speedups and cleanups. There are some patches outside of btrfs, namely updating some VFS interfaces, all straightforward and acked. Features: - fs-verity support, using standard ioctls, backward compatible with read-only limitation on inodes with previously enabled fs-verity - idmapped mount support - make mount with rescue=ibadroots more tolerant to partially damaged trees - allow raid0 on a single device and raid10 on two devices, degenerate cases but might be useful as an intermediate step during conversion to other profiles - zoned mode block group auto reclaim can be disabled via sysfs knob Performance improvements: - continue readahead of node siblings even if target node is in memory, could speed up full send (on sample test +11%) - batching of delayed items can speed up creating many files - fsync/tree-log speedups - avoid unnecessary work (gains +2% throughput, -2% run time on sample load) - reduced lock contention on renames (on dbench +4% throughput, up to -30% latency) Fixes: - various zoned mode fixes - preemptive flushing threshold tuning, avoid excessive work on almost full filesystems Core: - continued subpage support, preparation for implementing remaining features like compression and defragmentation; with some limitations, write is now enabled on 64K page systems with 4K sectors, still considered experimental - no readahead on compressed reads - inline extents disabled - disabled raid56 profile conversion and mount - improved flushing logic, fixing early ENOSPC on some workloads - inode flags have been internally split to read-only and read-write incompat bit parts, used by fs-verity - new tree items for fs-verity - descriptor item - Merkle tree item - inode operations extended to be namespace-aware - cleanups and refactoring Generic code changes: - fs: new export filemap_fdatawrite_wbc - fs: removed sync_inode - block: bio_trim argument type fixups - vfs: add namespace-aware lookup" * tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits) btrfs: reset replace target device to allocation state on close btrfs: zoned: fix ordered extent boundary calculation btrfs: do not do preemptive flushing if the majority is global rsv btrfs: reduce the preemptive flushing threshold to 90% btrfs: tree-log: check btrfs_lookup_data_extent return value btrfs: avoid unnecessarily logging directories that had no changes btrfs: allow idmapped mount btrfs: handle ACLs on idmapped mounts btrfs: allow idmapped INO_LOOKUP_USER ioctl btrfs: allow idmapped SUBVOL_SETFLAGS ioctl btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids btrfs: allow idmapped SNAP_DESTROY ioctls btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls btrfs: check whether fsgid/fsuid are mapped during subvolume creation btrfs: allow idmapped permission inode op btrfs: allow idmapped setattr inode op btrfs: allow idmapped tmpfile inode op btrfs: allow idmapped symlink inode op btrfs: allow idmapped mkdir inode op ...	2021-08-31 09:41:22 -07:00
Linus Torvalds	9b49ceb854	for-5.14-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEmQy4ACgkQxWXV+ddt WDuCRRAAmuO+6Zsl5MSq0hBnpec/VBN6lTi9VPt184BjW1IWsqwR1Ax8dVQEKgCm gzkGYEuVq2L5p+/ugWKKftAbmUU85Jf3AIsv81SCJQosRkxVXAdbrZOv00yUZy6/ 5YOdO+9u61otvtO6LcZz9l+0LcpSmrBwEszluyIS+nArgQyZwX2aZTjcScDJvB9+ 1y7Eo6eIbqbcJOf4mLDIJh0bHaiA7HB6jYJkbsnz51wBU2ETATzNzAoyP5ReTPGc 1s0uxrpY37kHcUUTd6q8VLDTM6Ei4vF2zQm0jWcrw0K3hM6yPuH+GiEADoV/xsls 6pbtss1E81rHEQjcK8brf6CxbOak8/WXV0gRia/3avkFteVlax+NJxRdVhksuJln siGlQqASX3vYdNL0nG+U0ml1Y9C1ZXTXu4lGjS6rtT9oeV+YSccG2UjIT9LEtuON W/zE4bUMqCddcZFEPH5jNK+ChGS8mmfs+UFFR+W/JzMIO8Uji5/K44FZDFBo0Oc/ 3JEgk7ZV4D+8SblBPMxJx0fZqbE8ggKM+IN5CAyscINOWOxrmRiaFFRygRX0TLDB 2uts9owItW6zvaTRY6RclVeCvJ6ARQli4pv7YxZmH85hhtCbn515imWvLWw4+tSg QwrtDnPVMSJTdzFHvsmeE9lM6Vaw0ur70Ysyd29k/XJu3WwRdkM= =jN7s -----END PGP SIGNATURE----- Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix that I think qualifies for a late merge. It's a revert of a one-liner fix that meanwhile got backported to stable kernels and we got reports from users. The broken fix prevents creating compressed inline extents, which could be noticeable on space consumption. Technically it's a regression as the patch was merged in 5.14-rc1 but got propagated to several stable kernels and has higher exposure than a 'typical' development cycle bug" * tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: compression: don't try to compress if we don't have enough pages"	2021-08-26 11:05:11 -07:00
Qu Wenruo	4e9655763b	Revert "btrfs: compression: don't try to compress if we don't have enough pages" This reverts commit `f216562731`. [BUG] It's no longer possible to create compressed inline extent after commit `f216562731` ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: `f216562731` ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-25 15:08:19 +02:00
Desmond Cheong Zhi Xi	0d977e0eba	btrfs: reset replace target device to allocation state on close This crash was observed with a failed assertion on device close: BTRFS: Transaction aborted (error -28) WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388 R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0 Call Trace: flush_space+0x197/0x2f0 [btrfs] btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs] process_one_work+0x262/0x5e0 worker_thread+0x4c/0x320 ? process_one_work+0x5e0/0x5e0 kthread+0x144/0x170 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 irq event stamp: 19334989 hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400 hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400 softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574 softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140 ---[ end trace 45939e308e0dd3c7 ]--- BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left BTRFS info (device vdd): forced readonly BTRFS warning (device vdd): failed setting block group ro: -30 BTRFS info (device vdd): suspending dev_replace for unmount assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3431! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs] RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246 RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003 FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0 Call Trace: btrfs_close_one_device.cold+0x11/0x55 [btrfs] close_fs_devices+0x44/0xb0 [btrfs] btrfs_close_devices+0x48/0x160 [btrfs] generic_shutdown_super+0x69/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x2c/0xa0 cleanup_mnt+0x144/0x1b0 task_work_run+0x59/0xa0 exit_to_user_mode_loop+0xe7/0xf0 exit_to_user_mode_prepare+0xaf/0xf0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x4a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens when close_ctree is called while a dev_replace hasn't completed. In close_ctree, we suspend the dev_replace, but keep the replace target around so that we can resume the dev_replace procedure when we mount the root again. This is the call trace: close_ctree(): btrfs_dev_replace_suspend_for_unmount(); btrfs_close_devices(): btrfs_close_fs_devices(): btrfs_close_one_device(): ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)); However, since the replace target sticks around, there is a device with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the assertion in btrfs_close_one_device. To fix this, if we come across the replace target device when closing, we should properly reset it back to allocation state. This fix also ensures that if a non-target device has a corrupted state and has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still catch the error. Reported-by: David Sterba <dsterba@suse.com> Fixes: `b2a6166768` ("btrfs: fix rw device counting in __btrfs_free_extra_devids") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:57:18 +02:00
Naohiro Aota	939c7feb19	btrfs: zoned: fix ordered extent boundary calculation btrfs_lookup_ordered_extent() is supposed to query the offset in a file instead of the logical address. Pass the file offset from submit_extent_page() to calc_bio_boundaries(). Also, calc_bio_boundaries() relies on the bio's operation flag, so move the call site after setting it. Fixes: `390ed29b81` ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:16 +02:00
Josef Bacik	1146239794	btrfs: do not do preemptive flushing if the majority is global rsv A common characteristic of the bug report where preemptive flushing was going full tilt was the fact that the vast majority of the free metadata space was used up by the global reserve. The hard 90% threshold would cover the majority of these cases, but to be even smarter we should take into account how much of the outstanding reservations are covered by the global block reserve. If the global block reserve accounts for the vast majority of outstanding reservations, skip preemptive flushing, as it will likely just cause churn and pain. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185 Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:16 +02:00
Josef Bacik	93c60b17f2	btrfs: reduce the preemptive flushing threshold to 90% The preemptive flushing code was added in order to avoid needing to synchronously wait for ENOSPC flushing to recover space. Once we're almost full however we can essentially flush constantly. We were using 98% as a threshold to determine if we were simply full, however in practice this is a really high bar to hit. For example reports of systems running into this problem had around 94% usage and thus continued to flush. Fix this by lowering the threshold to 90%, which is a more sane value, especially for smaller file systems. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185 CC: stable@vger.kernel.org # 5.12+ Fixes: `576fa34830` ("btrfs: improve preemptive background space flushing") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Marcos Paulo de Souza	3736127a3a	btrfs: tree-log: check btrfs_lookup_data_extent return value Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return values bellow zero if an error happened. Function replay_one_extent currently checks if the search found something (0 returned) and increments the reference, and if not, it seems to evaluate as 'not found'. Fix the condition by checking if the value was bellow zero and return early. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Filipe Manana	8be2ba2e0e	btrfs: avoid unnecessarily logging directories that had no changes There are several cases where when logging an inode we need to log its parent directories or logging subdirectories when logging a directory. There are cases however where we end up logging a directory even if it was not changed in the current transaction, no dentries added or removed since the last transaction. While this is harmless from a functional point of view, it is a waste time as it brings no advantage. One example where this is triggered is the following: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ touch /mnt/A/foo $ ln /mnt/A/foo /mnt/B/bar $ ln /mnt/A/foo /mnt/C/baz $ sync $ rm -f /mnt/A/foo $ xfs_io -c "fsync" /mnt/B/bar This last fsync ends up logging directories A, B and C, however we only need to log directory A, as B and C were not changed since the last transaction commit. So fix this by changing need_log_inode(), to return false in case the given inode is a directory and has a ->last_trans value smaller than the current transaction's ID. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	5b9b26f5d0	btrfs: allow idmapped mount Now that we converted btrfs internally to account for idmapped mounts allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP flag. We only need to raise this flag on the btrfs_root_fs_type filesystem since btrfs_mount_root() is ultimately responsible for allocating the superblock and is called into from btrfs_mount() associated with btrfs_fs_type. The conversion of the btrfs inode operations was straightforward. Regarding btrfs specific ioctls that perform checks based on inode permissions only those have been allowed that are not filesystem wide operations and hence can be reasonably charged against a specific mount. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	4a8b34afa9	btrfs: handle ACLs on idmapped mounts Make the ACL code idmapped mount aware. The POSIX default and POSIX access ACLs are the only ACLs other than some specific xattrs that take DAC permissions into account. On an idmapped mount they need to be translated according to the mount's userns. The main change is done to __btrfs_set_acl() which is responsible for translating POSIX ACLs to their final on-disk representation. The btrfs_init_acl() helper does not need to take the idmapped mount into account since it is called in the context of file creation operations (mknod, create, mkdir, symlink, tmpfile) and is used for btrfs_init_inode_security() to copy POSIX default and POSIX access permissions from the parent directory. These ACLs need to be inherited unmodified from the parent directory. This is identical to what we do for ext4 and xfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	6623d9a0b0	btrfs: allow idmapped INO_LOOKUP_USER ioctl The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl and has the following restrictions. The main difference between the two is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER is scoped beneath the file descriptor passed with the ioctl. Specifically, INO_LOOKUP_USER must adhere to the following restrictions: - The caller must be privileged over each inode of each path component for the path they are trying to lookup. - The path for the subvolume the caller is trying to lookup must be reachable from the inode associated with the file descriptor passed with the ioctl. The second condition makes it possible to scope the lookup of the path to the mount identified by the file descriptor passed with the ioctl. This allows us to enable this ioctl on idmapped mounts. Specifically, this is possible because all child subvolumes of a parent subvolume are reachable when the parent subvolume is mounted. So if the user had access to open the parent subvolume or has been given the fd then they can lookup the path if they had access to it provided they were privileged over each path component. Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name of a subvolume even though they would otherwise be restricted from doing so via regular VFS-based lookup. So think about a parent subvolume with multiple child subvolumes. Someone could mount he parent subvolume and restrict access to the child subvolumes by overmounting them with empty directories. At this point the user can't traverse the child subvolumes and they can't open files in the child subvolumes. However, they can still learn the path of child subvolumes as long as they have access to the parent subvolume by using the INO_LOOKUP_USER ioctl. The underlying assumption here is that it's ok that the lookup ioctls can't really take mounts into account other than the original mount the fd belongs to during lookup. Since this assumption is baked into the original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	39e1674ff0	btrfs: allow idmapped SUBVOL_SETFLAGS ioctl Setting flags on subvolumes or snapshots are core features of btrfs. The SUBVOL_SETFLAGS ioctl is especially important as it allows to make subvolumes and snapshots read-only or read-write. Allow setting flags on btrfs subvolumes and snapshots on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	e4fed17a32	btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls The SET_RECEIVED_SUBVOL ioctls are used to set information about a received subvolume. Make it possible to set information about a received subvolume on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	aabb34e7a3	btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids So far we prevented the deletion of subvolumes and snapshots using subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag. This restriction is necessary on idmapped mounts as this allows filesystem wide subvolume and snapshot deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. Deletion by subvolume id works by looking for an alias of the parent of the subvolume or snapshot to be deleted. The parent alias can be anywhere in the filesystem. However, as long as the alias of the parent that is found is the same as the one identified by the file descriptor passed through the ioctl we can allow the deletion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	c4ed533bdc	btrfs: allow idmapped SNAP_DESTROY ioctls Destroying subvolumes and snapshots are important features of btrfs. Both operations are available to unprivileged users if the filesystem has been mounted with the "user_subvol_rm_allowed" mount option. Allow subvolume and snapshot deletion on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Subvolumes and snapshots can either be deleted by specifying their name or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set. This feature is blocked on idmapped mounts as this allows filesystem wide subvolume deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. This means that even the root or CAP_SYS_ADMIN capable user can't delete a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional. The root user is currently already subject to permission checks in btrfs_may_delete() including whether the inode's i_uid/i_gid of the directory the subvolume is located in have a mapping in the caller's idmapping. For this to fail isn't currently possible since a btrfs filesystem can't be mounted with a non-initial idmapping but it shows that even the root user would fail to delete a subvolume if the relevant inode isn't mapped in their idmapping. The idmapped mount case is the same in principle. This isn't a huge problem a root user wanting to delete arbitrary subvolumes can just always create another (even detached) mount without an idmapping attached. In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the subvolume to delete is directly located under inode referenced by the fd passed for the ioctl() in a follow-up commit. Here is an example where a btrfs subvolume is deleted through a subvolume mount that does not expose the subvolume to be delete but it can still be deleted by using the subvolume id: /* Compile the following program as "delete_by_spec". / #define _GNU_SOURCE #include <fcntl.h> #include <inttypes.h> #include <linux/btrfs.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> static int rm_subvolume_by_id(int fd, uint64_t subvolid) { struct btrfs_ioctl_vol_args_v2 args = {}; int ret; args.flags = BTRFS_SUBVOL_SPEC_BY_ID; args.subvolid = subvolid; ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args); if (ret < 0) return -1; return 0; } int main(int argc, char argv[]) { int subvolid = 0; if (argc < 3) exit(1); fprintf(stderr, "Opening %s\n", argv[1]); int fd = open(argv[1], O_CLOEXEC \| O_DIRECTORY); if (fd < 0) exit(2); subvolid = atoi(argv[2]); fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid); int ret = rm_subvolume_by_id(fd, subvolid); if (ret < 0) exit(3); exit(0); } #include <stdio.h>" #include <stdlib.h>" #include <linux/btrfs.h" truncate -s 10G btrfs.img mkfs.btrfs btrfs.img export LOOPDEV=$(sudo losetup -f --show btrfs.img) mount ${LOOPDEV} /mnt sudo chown $(id -u):$(id -g) /mnt btrfs subvolume create /mnt/A btrfs subvolume create /mnt/B/C # Get subvolume id via: sudo btrfs subvolume show /mnt/A # Save subvolid SUBVOLID=<nr> sudo umount /mnt sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt ./delete_by_spec /mnt ${SUBVOLID} With idmapped mounts this can potentially be used by users to delete subvolumes/snapshots they would otherwise not have access to as the idmapping would be applied to an inode that is not exposed in the mount of the subvolume. The fact that this is a filesystem wide operation suggests it might be a good idea to expose this under a separate ioctl that clearly indicates this. In essence, the file descriptor passed with the ioctl is merely used to identify the filesystem on which to operate when BTRFS_SUBVOL_SPEC_BY_ID is used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	4d4340c912	btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	5474bf400f	btrfs: check whether fsgid/fsuid are mapped during subvolume creation When a new subvolume is created btrfs currently doesn't check whether the fsgid/fsuid of the caller actually have a mapping in the user namespace attached to the filesystem. The VFS always checks this to make sure that the caller's fsgid/fsuid can be represented on-disk. This is most relevant for filesystems that can be mounted inside user namespaces but it is in general a good hardening measure to prevent unrepresentable gid/uid from being written to disk. Since we want to support idmapped mounts for btrfs ioctls to create subvolumes in follow-up patches this becomes important since we want to make sure the fsgid/fsuid of the caller as mapped according to the idmapped mount can be represented on-disk. Simply add the missing fsuidgid_has_mapping() line from the VFS may_create() version to btrfs_may_create(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	3bc71ba02c	btrfs: allow idmapped permission inode op Enable btrfs_permission() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	d4d0946461	btrfs: allow idmapped setattr inode op Enable btrfs_setattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	98b6ab5fc0	btrfs: allow idmapped tmpfile inode op Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	5a0521086e	btrfs: allow idmapped symlink inode op Enable btrfs_symlink() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	b0b3e44d34	btrfs: allow idmapped mkdir inode op Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	e93ca491d0	btrfs: allow idmapped create inode op Enable btrfs_create() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	72105277dc	btrfs: allow idmapped mknod inode op Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	c020d2eaf1	btrfs: allow idmapped getattr inode op Enable btrfs_getattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	ca07274c3d	btrfs: allow idmapped rename inode op Enable btrfs_rename() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	b3b6f5b922	btrfs: handle idmaps in btrfs_new_inode() Extend btrfs_new_inode() to take the idmapped mount into account when initializing a new inode. This is just a matter of passing down the mount's userns. The rest is taken care of in inode_init_owner(). This is a preliminary patch to make the individual btrfs inode operations idmapped mount aware. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Anand Jain	e7849e33cf	btrfs: sysfs: document structures and their associated files Sysfs file has grown big. It takes some time to locate the correct struct attribute to add new files. Create a table and map the struct attribute to its sysfs path. Also, fix the comment about the debug sysfs path. And add the comments to the attributes instead of attribute group, where sysfs file names are defined. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Qu Wenruo	e4571b8c5e	btrfs: fix NULL pointer dereference when deleting device by invalid id [BUG] It's easy to trigger NULL pointer dereference, just by removing a non-existing device id: # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \ /dev/test/scratch2 # mount /dev/test/scratch1 /mnt/btrfs # btrfs device remove 3 /mnt/btrfs Then we have the following kernel NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs] btrfs_ioctl+0x18bb/0x3190 [btrfs] ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 ? do_user_addr_fault+0x201/0x6a0 ? lock_release+0xd2/0x2d0 ? __x64_sys_ioctl+0x83/0xb0 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae [CAUSE] Commit `a27a94c2b0` ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") moves the "missing" device path check into btrfs_rm_device(). But btrfs_rm_device() itself can have case where it only receives @devid, with NULL as @device_path. In that case, calling strcmp() on NULL will trigger the NULL pointer dereference. Before that commit, we handle the "missing" case inside btrfs_find_device_by_devspec(), which will not check @device_path at all if @devid is provided, thus no way to trigger the bug. [FIX] Before calling strcmp(), also make sure @device_path is not NULL. Fixes: `a27a94c2b0` ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") CC: stable@vger.kernel.org # 5.4+ Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Naohiro Aota	63fb5879db	btrfs: zoned: add asserts on splitting extent_map We call split_zoned_em() on an extent_map on submitting a bio for it. Thus, we can assume the extent_map is PINNED, not LOGGING, and in the modified list. Add ASSERT()s to ensure the extent_maps after the split also has the proper flags set and are in the modified list. Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Naohiro Aota	0ae79c6fe7	btrfs: zoned: fix block group alloc_offset calculation alloc_offset is offset from the start of a block group and @offset is actually an address in logical space. Thus, we need to consider block_group->start when calculating them. Fixes: `011b41bffa` ("btrfs: zoned: advance allocation pointer after tree log node") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Naohiro Aota	ba86dd9fe6	btrfs: zoned: suppress reclaim error message on EAGAIN btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are running. The message can fail btrfs/187 and it's unnecessary because we anyway add it back to the reclaim list. btrfs_reclaim_bgs_work() `-> btrfs_relocate_chunk() `-> btrfs_relocate_block_group() `-> reloc_chunk_start() `-> if (fs_info->send_in_progress) `-> return -EAGAIN CC: stable@vger.kernel.org # 5.13+ Fixes: `18bb8bbf13` ("btrfs: zoned: automatically reclaim zones") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Johannes Thumshirn	77233c2d2e	btrfs: zoned: allow disabling of zone auto reclaim Automatically reclaiming dirty zones might not always be desired for all workloads, especially as there are currently still some rough edges with the relocation code on zoned filesystems. Allow disabling zone auto reclaim on a per filesystem basis by writing 0 as the threshold value. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Filipe Manana	1f29537302	btrfs: update comment at log_conflicting_inodes() A comment at log_conflicting_inodes() mentions that we check the inode's logged_trans field instead of using btrfs_inode_in_log() because the field last_log_commit is not updated when we log that an inode exists and the inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part about the full sync flag is not true anymore since commit `9acc8103ab` ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so update the comment to not mention that part anymore. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Filipe Manana	d135a53396	btrfs: remove no longer needed full sync flag check at inode_logged() Now that we are checking if the inode's logged_trans is 0 to detect the possibility of the inode having been evicted and reloaded, the test for the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at tree-log.c:inode_logged(). Its purpose was to detect the possibility of a previous eviction as well, since when an inode is loaded the full sync flag is always set on it (and only cleared after the inode is logged). So just remove the check and update the comment. The check for the inode's logged_trans being 0 was added recently by the patch with the subject "btrfs: eliminate some false positives when checking if inode was logged". Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Filipe Manana	1c167b87f4	btrfs: remove unnecessary NULL check for the new inode during rename exchange At the very end of btrfs_rename_exchange(), in case an error happened, we are checking if 'new_inode' is NULL, but that is not needed since during a rename exchange, unlike regular renames, 'new_inode' can never be NULL, and if it were, we would have a crashed much earlier when we dereference it multiple times. So remove the check because it is not necessary and because it is causing static checkers to emit a warning. I probably introduced the check by copy-pasting similar code from btrfs_rename(), where 'new_inode' can be NULL, in commit `86e8aa0e77` ("Btrfs: unpin logs if rename exchange operation fails"). Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	dce2815039	btrfs: allocate backref_ctx on stack in find_extent_clone Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx on stack. The size is reasonably small. sizeof(backref_ctx) = 48 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	c853a5783e	btrfs: allocate btrfs_ioctl_defrag_range_args on stack Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args, allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_defrag_range_args) = 48 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	0afb603afc	btrfs: allocate btrfs_ioctl_quota_rescan_args on stack Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args, allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_quota_rescan_args) = 64 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	98caf9531e	btrfs: allocate file_ra_state on stack in readahead_cache Instead of allocating file_ra_state using kmalloc, allocate on stack. sizeof(struct readahead) = 32 bytes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Marcos Paulo de Souza	0ff40a910f	btrfs: introduce btrfs_search_backwards function It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
David Sterba	ea3dc7d2d1	btrfs: print if fsverity support is built in when loading module As fsverity support depends on a config option, print that at module load time like we do for similar features. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	705242538f	btrfs: verity metadata orphan items Writing out the verity data is too large of an operation to do in a single transaction. If we are interrupted before we finish creating fsverity metadata for a file, or fail to clean up already created metadata after a failure, we could leak the verity items that we already committed. To address this issue, we use the orphan mechanism. When we start enabling verity on a file, we also add an orphan item for that inode. When we are finished, we delete the orphan. However, if we are interrupted midway, the orphan will be present at mount and we can cleanup the half-formed verity state. There is a possible race with a normal unlink operation: if unlink and verity run on the same file in parallel, it is possible for verity to succeed and delete the still legitimate orphan added by unlink. Then, if we are interrupted and mount in that state, we will never clean up the inode properly. This is also possible for a file created with O_TMPFILE. Check nlink==0 before deleting to avoid this race. A final thing to note is that this is a resurrection of using orphans to signal an operation besides "delete this inode". The old case was to signal the need to do a truncate. That case still technically applies for mounting very old file systems, so we need to take some care to not clobber it. To that end, we just have to be careful that verity orphan cleanup is a no-op for non-verity files. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	146054090b	btrfs: initial fsverity support Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	77eea05e78	btrfs: add ro compat flags to inodes Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Anand Jain	efc222f8d7	btrfs: simplify return values in btrfs_check_raid_min_devices Function btrfs_check_raid_min_devices() returns error code from the enum btrfs_err_code and it starts from 1. So there is no need to check if ret is > 0. So drop this check and also drop the local variable ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Qu Wenruo	7361b4ae03	btrfs: remove the dead comment in writepage_delalloc() When btrfs_run_delalloc_range() failed, we will error out. But there is a strange comment mentioning that btrfs_run_delalloc_range() could have returned value >0 to indicate the IO has already started. Commit `40f765805f` ("Btrfs: split up __extent_writepage to lower stack usage") introduced the comment, but unfortunately at that time, we were already using @page_started to indicate that case, and still return 0. Furthermore, even if that comment was right (which is not), we would return -EIO if the IO had already started. By all means the comment is incorrect, just remove the comment along with the dead check. Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to make sure we either return 0 or error, no positive return value. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00

1 2 3 4 5 ...

10226 Commits