mirror of
https://github.com/torvalds/linux.git
synced 2024-12-11 05:33:09 +00:00
8e6598a7b0
10597 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Linus Torvalds
|
3ee65c0f07 |
for-5.17-rc6-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmIk1isACgkQxWXV+ddt WDsAVQ//TvkKObQLL/BJ4TFSxr1ZLs83z4vTcss2W/MrMjGWUut1fhUTGlhkqgC6 RE03VBuUV983k09/Tn3Q0AHSXcMAmxEv/t1QweJNKiVv7YKT3Nj7VF3kHioFz9g/ gZ5q9FVbTXkrl4tgcwiQXbLJ1BLWBfXTAMatKgsIQBYsYg0ec3GGem/tx3OlvdNt 9My6EJhNo5X7vrTMjRUygDgHDhcAgp/gYMa2VmnPhK5qcPzmIYbt4CJGLQDwiiiB KSsXnsHCXKm/BRPgtgnMBH6O8YruaxUn0nEQMjntGx8RHbZrkdXk90PaK7pmWz1W KkbHTM98zclAOWUx6JmGw8mb9aZQo6aGpou2Pa98aBtHhvbhiKYS2W2OOnHbAshK 2bj6W2o89eYHKgX+5fICWHt7efUoWUh1KPC+TeaV8DKl8q0a9DC3KfIL/v7PZacA pIyyy4uyXh3finzI+Q+fW7QVKQhpcQKLuq5EVGCMEotlfsn+SJBselAdwUl9ChUp ALAiYn1T8W1Mrt8P2mxB29btGrdckHtpoWTgr++OAZaX4PABF3GAvIxXwmFg2aMK zfXKwTxjwKM42H3AWaLHttk4OA7FJhY9sgOproON/3Tn9cBSK2jiO0HSk1dBn/dL WQbOKh4Z+VDXi5niF8hmTANTNO0wS0JdiKZX86tYyhcCl0ZBr/w= =Bd5z -----END PGP SIGNATURE----- Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes for various problems that have user visible effects or seem to be urgent: - fix corruption when combining DIO and non-blocking io_uring over multiple extents (seen on MariaDB) - fix relocation crash due to premature return from commit - fix quota deadlock between rescan and qgroup removal - fix item data bounds checks in tree-checker (found on a fuzzed image) - fix fsync of prealloc extents after EOF - add missing run of delayed items after unlink during log replay - don't start relocation until snapshot drop is finished - fix reversed condition for subpage writers locking - fix warning on page error" * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fallback to blocking mode when doing async dio over multiple extents btrfs: add missing run of delayed items after unlink during log replay btrfs: qgroup: fix deadlock between rescan worker and remove qgroup btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() btrfs: do not start relocation until in progress drops are done btrfs: tree-checker: use u64 for item data end to avoid overflow btrfs: do not WARN_ON() if we have PageError set btrfs: fix lost prealloc extents beyond eof after full fsync btrfs: subpage: fix a wrong check on subpage->writers |
||
Filipe Manana
|
ca93e44bfb |
btrfs: fallback to blocking mode when doing async dio over multiple extents
Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit
|
||
Filipe Manana
|
4751dc9962 |
btrfs: add missing run of delayed items after unlink during log replay
During log replay, whenever we need to check if a name (dentry) exists in a directory we do searches on the subvolume tree for inode references or or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys as well, before kernel 5.17). However when during log replay we unlink a name, through btrfs_unlink_inode(), we may not delete inode references and dir index keys from a subvolume tree and instead just add the deletions to the delayed inode's delayed items, which will only be run when we commit the transaction used for log replay. This means that after an unlink operation during log replay, if we attempt to search for the same name during log replay, we will not see that the name was already deleted, since the deletion is recorded only on the delayed items. We run delayed items after every unlink operation during log replay, except at unlink_old_inode_refs() and at add_inode_ref(). This was due to an overlook, as delayed items should be run after evert unlink, for the reasons stated above. So fix those two cases. Fixes: |
||
Sidong Yang
|
d4aef1e122 |
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
The commit |
||
Omar Sandoval
|
5fd76bf31c |
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
We are seeing crashes similar to the following trace: [38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54 [38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206 [38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14 [38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360 [38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000 [38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800 [38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360 [38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000 [38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0 [38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [38.992528] Call Trace: [38.992854] <TASK> [38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs] [38.993941] btrfs_balance+0x78e/0xea0 [btrfs] [38.994801] ? vsnprintf+0x33c/0x520 [38.995368] ? __kmalloc_track_caller+0x351/0x440 [38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs] [38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs] [38.997867] ? mod_objcg_state+0xee/0x340 [38.998552] ? seq_release+0x24/0x30 [38.999184] ? proc_nr_files+0x30/0x30 [38.999654] ? call_rcu+0xc8/0x2f0 [39.000228] ? __x64_sys_ioctl+0x84/0xc0 [39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [39.001973] __x64_sys_ioctl+0x84/0xc0 [39.002566] do_syscall_64+0x3a/0x80 [39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae [39.003735] RIP: 0033:0x7f11c166959b [39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b [39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003 [39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0 [39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3 [39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001 [39.015040] </TASK> [39.015418] ---[ end trace 0000000000000000 ]--- [43.131559] ------------[ cut here ]------------ [43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717! [43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI [43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54 [43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000 [43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0 [43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [43.150559] Call Trace: [43.150904] <TASK> [43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs] [43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs] [43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] [43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs] [43.154475] ? __check_object_size+0x150/0x170 [43.155170] ? preempt_count_add+0x49/0xa0 [43.155753] ? __x64_sys_ioctl+0x84/0xc0 [43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [43.157456] __x64_sys_ioctl+0x84/0xc0 [43.157980] do_syscall_64+0x3a/0x80 [43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae [43.159231] RIP: 0033:0x7f7657f1e59b [43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b [43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003 [43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800 [43.166863] </TASK> [43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common [43.169552] ---[ end trace 0000000000000000 ]--- [43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000 [43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0 [43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 We first hit the WARN_ON(rc->block_group->pinned > 0) in btrfs_relocate_block_group() and then the BUG_ON(!cache) in unpin_extent_range(). This tells us that we are exiting relocation and removing the block group with bytes still pinned for that block group. This is supposed to be impossible: the last thing relocate_block_group() does is commit the transaction to get rid of pinned extents. Commit |
||
Josef Bacik
|
b4be6aefa7 |
btrfs: do not start relocation until in progress drops are done
We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Su Yue
|
a6ab66eb85 |
btrfs: tree-checker: use u64 for item data end to avoid overflow
User reported there is an array-index-out-of-bounds access while mounting the crafted image: [350.411942 ] loop0: detected capacity change from 0 to 262144 [350.427058 ] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (1044) [350.428564 ] BTRFS info (device loop0): disk space caching is enabled [350.428568 ] BTRFS info (device loop0): has skinny extents [350.429589 ] [350.429619 ] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1 [350.429636 ] index 1048096 is out of range for type 'page *[16]' [350.429650 ] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 [350.429652 ] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [350.429653 ] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs] [350.429772 ] Call Trace: [350.429774 ] <TASK> [350.429776 ] dump_stack_lvl+0x47/0x5c [350.429780 ] ubsan_epilogue+0x5/0x50 [350.429786 ] __ubsan_handle_out_of_bounds+0x66/0x70 [350.429791 ] btrfs_get_16+0xfd/0x120 [btrfs] [350.429832 ] check_leaf+0x754/0x1a40 [btrfs] [350.429874 ] ? filemap_read+0x34a/0x390 [350.429878 ] ? load_balance+0x175/0xfc0 [350.429881 ] validate_extent_buffer+0x244/0x310 [btrfs] [350.429911 ] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs] [350.429935 ] end_bio_extent_readpage+0x3af/0x850 [btrfs] [350.429969 ] ? newidle_balance+0x259/0x480 [350.429972 ] end_workqueue_fn+0x29/0x40 [btrfs] [350.429995 ] btrfs_work_helper+0x71/0x330 [btrfs] [350.430030 ] ? __schedule+0x2fb/0xa40 [350.430033 ] process_one_work+0x1f6/0x400 [350.430035 ] ? process_one_work+0x400/0x400 [350.430036 ] worker_thread+0x2d/0x3d0 [350.430037 ] ? process_one_work+0x400/0x400 [350.430038 ] kthread+0x165/0x190 [350.430041 ] ? set_kthread_struct+0x40/0x40 [350.430043 ] ret_from_fork+0x1f/0x30 [350.430047 ] </TASK> [350.430047 ] [350.430077 ] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2 btrfs check reports: corrupt leaf: root=3 block=20975616 physical=20975616 slot=1, unexpected item end, have 4294971193 expect 3897 The first slot item offset is 4293005033 and the size is 1966160. In check_leaf, we use btrfs_item_end() to check item boundary versus extent_buffer data size. However, return type of btrfs_item_end() is u32. (u32)(4293005033 + 1966160) == 3897, overflow happens and the result 3897 equals to leaf data size reasonably. Fix it by use u64 variable to store item data end in check_leaf() to avoid u32 overflow. This commit does solve the invalid memory access showed by the stack trace. However, its metadata profile is DUP and another copy of the leaf is fine. So the image can be mounted successfully. But when umount is called, the ASSERT btrfs_mark_buffer_dirty() will be triggered because the only node in extent tree has 0 item and invalid owner. It's solved by another commit "btrfs: check extent buffer owner against the owner rootid". Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299 Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
a50e1fcbc9 |
btrfs: do not WARN_ON() if we have PageError set
Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week
WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This was partially fixed by
|
||
Filipe Manana
|
d994788743 |
btrfs: fix lost prealloc extents beyond eof after full fsync
When doing a full fsync, if we have prealloc extents beyond (or at) eof, and the leaves that contain them were not modified in the current transaction, we end up not logging them. This results in losing those extents when we replay the log after a power failure, since the inode is truncated to the current value of the logged i_size. Just like for the fast fsync path, we need to always log all prealloc extents starting at or beyond i_size. The fast fsync case was fixed in commit |
||
Qu Wenruo
|
c992fa1fd5 |
btrfs: subpage: fix a wrong check on subpage->writers
[BUG]
When looping btrfs/074 with 64K page size and 4K sectorsize, there is a
low chance (1/50~1/100) to crash with the following ASSERT() triggered
in btrfs_subpage_start_writer():
ret = atomic_add_return(nbits, &subpage->writers);
ASSERT(ret == nbits); <<< This one <<<
[CAUSE]
With more debugging output on the parameters of
btrfs_subpage_start_writer(), it shows a very concerning error:
ret=29 nbits=13 start=393216 len=53248
For @nbits it's correct, but @ret which is the returned value from
atomic_add_return(), it's not only larger than nbits, but also larger
than max sectors per page value (for 64K page size and 4K sector size,
it's 16).
This indicates that some call sites are not properly decreasing the value.
And that's exactly the case, in btrfs_page_unlock_writer(), due to the
fact that we can have page locked either by lock_page() or
process_one_page(), we have to check if the subpage has any writer.
If no writers, it's locked by lock_page() and we only need to unlock it.
But unfortunately the check for the writers are completely opposite:
if (atomic_read(&subpage->writers))
/* No writers, locked by plain lock_page() */
return unlock_page(page);
We directly unlock the page if it has writers, which is the completely
opposite what we want.
Thankfully the affected call site is only limited to
extent_write_locked_range(), so it's mostly affecting compressed write.
[FIX]
Just fix the wrong check condition to fix the bug.
Fixes:
|
||
Linus Torvalds
|
c0419188b5 |
for-5.17-rc5-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmIY790ACgkQxWXV+ddt WDvKxA//ctgUNhKEPOfJlmmaKAVRgrE6FfDgfk6c2v/PrpPFH0U9+frishcsImxu XAObMCyPY7PfLDnk6I0Lmxm+8T56+NNGjbxq7/R1Uv0DJm75f51OJbr/H7NSjVfu g6IyPmIft7jmt7Vp9lPyYcPNDTFyG+XARdWYS3AFtAfr2MfXgjx9AALxFjaytbLi AevXP0qEkbLHv5npEG56pouhn44J/8GZKeUGM1crNNUDQoYpgreifZ2SHpLIfxP5 lvzrA1noaZSFS3Cth7NBPhHTFS2tiMb96bHFdF56A2EIq5vAXQF7w6IAUlvBEVoR 5XgWsxGfsv5FbdFmyrRIvOh6gGHwHw8BH5/ZRTRRVuRZAPKPY0oiJ9OJk5kIBCgo LiYksqRTOs0Zp/e5wn/8d/UGp2A6mujxwqw7gLcOZBzfhKw7QIha6BM64BfJxBni 3dakBDCWZ/X+Kje+WaM4Sev7JUIyDVoKWClHrvzoLeEzdIgruNguMnQ+3yOZBFiG 4YRTPUeafAj0OspJ0LLG01X4NJVmnQVAFoKuFOsGbUsxeCaQ9vF3/IGTlhgkwehf KjvE9nzl9DewpvRRd7AAirj5FncuwRw6KNci1gBBixxPaveBClCIuuyfx6lXPusK sIF3eb7xcqKYLh0iYPd2XMZInXbWXIGuJoVG/Gu1IYm1OXAFQ5A= =q/NS -----END PGP SIGNATURE----- Merge tag 'for-5.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "This is a hopefully last batch of fixes for defrag that got broken in 5.16, all stable material. The remaining reported problem is excessive IO with autodefrag due to various conditions in the defrag code not met or missing" * tag 'for-5.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: reduce extent threshold for autodefrag btrfs: autodefrag: only scan one inode once btrfs: defrag: don't use merged extent map for their generation check btrfs: defrag: bring back the old file extent search behavior btrfs: defrag: remove an ambiguous condition for rejection btrfs: defrag: don't defrag extents which are already at max capacity btrfs: defrag: don't try to merge regular extents with preallocated extents btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target btrfs: prevent copying too big compressed lzo segment |
||
Qu Wenruo
|
558732df21 |
btrfs: reduce extent threshold for autodefrag
There is a big gap between inode_should_defrag() and autodefrag extent size threshold. For inode_should_defrag() it has a flexible @small_write value. For compressed extent is 16K, and for non-compressed extent it's 64K. However for autodefrag extent size threshold, it's always fixed to the default value (256K). This means, the following write sequence will trigger autodefrag to defrag ranges which didn't trigger autodefrag: pwrite 0 8k sync pwrite 8k 128K sync The latter 128K write will also be considered as a defrag target (if other conditions are met). While only that 8K write is really triggering autodefrag. Such behavior can cause extra IO for autodefrag. Close the gap, by copying the @small_write value into inode_defrag, so that later autodefrag can use the same @small_write value which triggered autodefrag. With the existing transid value, this allows autodefrag really to scan the ranges which triggered autodefrag. Although this behavior change is mostly reducing the extent_thresh value for autodefrag, I believe in the future we should allow users to specify the autodefrag extent threshold through mount options, but that's an other problem to consider in the future. CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
26fbac2517 |
btrfs: autodefrag: only scan one inode once
Although we have btrfs_requeue_inode_defrag(), for autodefrag we are still just exhausting all inode_defrag items in the tree. This means, it doesn't make much difference to requeue an inode_defrag, other than scan the inode from the beginning till its end. Change the behaviour to always scan from offset 0 of an inode, and till the end. By this we get the following benefit: - Straight-forward code - No more re-queue related check - Fewer members in inode_defrag We still keep the same btrfs_get_fs_root() and btrfs_iget() check for each loop, and added extra should_auto_defrag() check per-loop. Note: the patch needs to be backported and is intentionally written to minimize the diff size, code will be cleaned up later. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
199257a78b |
btrfs: defrag: don't use merged extent map for their generation check
For extent maps, if they are not compressed extents and are adjacent by logical addresses and file offsets, they can be merged into one larger extent map. Such merged extent map will have the higher generation of all the original ones. But this brings a problem for autodefrag, as it relies on accurate extent_map::generation to determine if one extent should be defragged. For merged extent maps, their higher generation can mark some older extents to be defragged while the original extent map doesn't meet the minimal generation threshold. Thus this will cause extra IO. So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to indicate if the extent map is merged from one or more ems. And for autodefrag, if we find a merged extent map, and its generation meets the generation requirement, we just don't use this one, and go back to defrag_get_extent() to read extent maps from subvolume trees. This could cause more read IO, but should result less defrag data write, so in the long run it should be a win for autodefrag. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
d5633b0dee |
btrfs: defrag: bring back the old file extent search behavior
For defrag, we don't really want to use btrfs_get_extent() to iterate
all extent maps of an inode.
The reasons are:
- btrfs_get_extent() can merge extent maps
And the result em has the higher generation of the two, causing defrag
to mark unnecessary part of such merged large extent map.
This in fact can result extra IO for autodefrag in v5.16+ kernels.
However this patch is not going to completely solve the problem, as
one can still using read() to trigger extent map reading, and got
them merged.
The completely solution for the extent map merging generation problem
will come as an standalone fix.
- btrfs_get_extent() caches the extent map result
Normally it's fine, but for defrag the target range may not get
another read/write for a long long time.
Such cache would only increase the memory usage.
- btrfs_get_extent() doesn't skip older extent map
Unlike the old find_new_extent() which uses btrfs_search_forward() to
skip the older subtree, thus it will pick up unnecessary extent maps.
This patch will fix the regression by introducing defrag_get_extent() to
replace the btrfs_get_extent() call.
This helper will:
- Not cache the file extent we found
It will search the file extent and manually convert it to em.
- Use btrfs_search_forward() to skip entire ranges which is modified in
the past
This should reduce the IO for autodefrag.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes:
|
||
Qu Wenruo
|
550f133f69 |
btrfs: defrag: remove an ambiguous condition for rejection
From the very beginning of btrfs defrag, there is a check to reject extents which meet both conditions: - Physically adjacent We may want to defrag physically adjacent extents to reduce the number of extents or the size of subvolume tree. - Larger than 128K This may be there for compressed extents, but unfortunately 128K is exactly the max capacity for compressed extents. And the check is > 128K, thus it never rejects compressed extents. Furthermore, the compressed extent capacity bug is fixed by previous patch, there is no reason for that check anymore. The original check has a very small ranges to reject (the target extent size is > 128K, and default extent threshold is 256K), and for compressed extent it doesn't work at all. So it's better just to remove the rejection, and allow us to defrag physically adjacent extents. CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
979b25c300 |
btrfs: defrag: don't defrag extents which are already at max capacity
[BUG] For compressed extents, defrag ioctl will always try to defrag any compressed extents, wasting not only IO but also CPU time to compress/decompress: mkfs.btrfs -f $DEV mount -o compress $DEV $MNT xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar sync xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar sync echo "=== before ===" xfs_io -c "fiemap -v" $MNT/foobar btrfs filesystem defrag $MNT/foobar sync echo "=== after ===" xfs_io -c "fiemap -v" $MNT/foobar Then it shows the 2 128K extents just get COW for no extra benefit, with extra IO/CPU spent: === before === /mnt/btrfs/file1: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..255]: 26624..26879 256 0x8 1: [256..511]: 26632..26887 256 0x9 === after === /mnt/btrfs/file1: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..255]: 26640..26895 256 0x8 1: [256..511]: 26648..26903 256 0x9 This affects not only v5.16 (after the defrag rework), but also v5.15 (before the defrag rework). [CAUSE] From the very beginning, btrfs defrag never checks if one extent is already at its max capacity (128K for compressed extents, 128M otherwise). And the default extent size threshold is 256K, which is already beyond the compressed extent max size. This means, by default btrfs defrag ioctl will mark all compressed extent which is not adjacent to a hole/preallocated range for defrag. [FIX] Introduce a helper to grab the maximum extent size, and then in defrag_collect_targets() and defrag_check_next_extent(), reject extents which are already at their max capacity. Reported-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
7093f15291 |
btrfs: defrag: don't try to merge regular extents with preallocated extents
[BUG] With older kernels (before v5.16), btrfs will defrag preallocated extents. While with newer kernels (v5.16 and newer) btrfs will not defrag preallocated extents, but it will defrag the extent just before the preallocated extent, even it's just a single sector. This can be exposed by the following small script: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file xfs_io -c "fiemap -v" $mnt/file btrfs fi defrag $mnt/file sync xfs_io -c "fiemap -v" $mnt/file The output looks like this on older kernels: /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 26624..26631 8 0x0 1: [8..39]: 26632..26663 32 0x801 /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..39]: 26664..26703 40 0x1 Which defrags the single sector along with the preallocated extent, and replace them with an regular extent into a new location (caused by data COW). This wastes most of the data IO just for the preallocated range. On the other hand, v5.16 is slightly better: /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 26624..26631 8 0x0 1: [8..39]: 26632..26663 32 0x801 /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 26664..26671 8 0x0 1: [8..39]: 26632..26663 32 0x801 The preallocated range is not defragged, but the sector before it still gets defragged, which has no need for it. [CAUSE] One of the function reused by the old and new behavior is defrag_check_next_extent(), it will determine if we should defrag current extent by checking the next one. It only checks if the next extent is a hole or inlined, but it doesn't check if it's preallocated. On the other hand, out of the function, both old and new kernel will reject preallocated extents. Such inconsistent behavior causes above behavior. [FIX] - Also check if next extent is preallocated If so, don't defrag current extent. - Add comments for each branch why we reject the extent This will reduce the IO caused by defrag ioctl and autodefrag. CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
966d879baf |
btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
In the rework of btrfs_defrag_file(), we always call defrag_one_cluster() and increase the offset by cluster size, which is only 256K. But there are cases where we have a large extent (e.g. 128M) which doesn't need to be defragged at all. Before the refactor, we can directly skip the range, but now we have to scan that extent map again and again until the cluster moves after the non-target extent. Fix the problem by allow defrag_one_cluster() to increase btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if the last extent of the cluster is not a target. The test script looks like this: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt # As btrfs ioctl uses 32M as extent_threshold xfs_io -f -c "pwrite 0 64M" $mnt/file1 sync # Some fragemented range to defrag xfs_io -s -c "pwrite 65548k 4k" \ -c "pwrite 65544k 4k" \ -c "pwrite 65540k 4k" \ -c "pwrite 65536k 4k" \ $mnt/file1 sync echo "=== before ===" xfs_io -c "fiemap -v" $mnt/file1 echo "=== after ===" btrfs fi defrag $mnt/file1 sync xfs_io -c "fiemap -v" $mnt/file1 umount $mnt With extra ftrace put into defrag_one_cluster(), before the patch it would result tons of loops: (As defrag_one_cluster() is inlined, the function name is its caller) btrfs-126062 [005] ..... 4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144 btrfs-126062 [005] ..... 4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144 btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144 btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144 btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144 ... btrfs-126062 [005] ..... 4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144 But with this patch there will be just one loop, then directly to the end of the extent: btrfs-130471 [014] ..... 5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144 btrfs-130471 [014] ..... 5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384 CC: stable@vger.kernel.org # 5.16 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Dāvis Mosāns
|
741b23a970 |
btrfs: prevent copying too big compressed lzo segment
Compressed length can be corrupted to be a lot larger than memory we have allocated for buffer. This will cause memcpy in copy_compressed_segment to write outside of allocated memory. This mostly results in stuck read syscall but sometimes when using btrfs send can get #GP kernel: general protection fault, probably for non-canonical address 0x841551d5c1000: 0000 [#1] PREEMPT SMP NOPTI kernel: CPU: 17 PID: 264 Comm: kworker/u256:7 Tainted: P OE 5.17.0-rc2-1 #12 kernel: Workqueue: btrfs-endio btrfs_work_helper [btrfs] kernel: RIP: 0010:lzo_decompress_bio (./include/linux/fortify-string.h:225 fs/btrfs/lzo.c:322 fs/btrfs/lzo.c:394) btrfs Code starting with the faulting instruction =========================================== 0:* 48 8b 06 mov (%rsi),%rax <-- trapping instruction 3: 48 8d 79 08 lea 0x8(%rcx),%rdi 7: 48 83 e7 f8 and $0xfffffffffffffff8,%rdi b: 48 89 01 mov %rax,(%rcx) e: 44 89 f0 mov %r14d,%eax 11: 48 8b 54 06 f8 mov -0x8(%rsi,%rax,1),%rdx kernel: RSP: 0018:ffffb110812efd50 EFLAGS: 00010212 kernel: RAX: 0000000000001000 RBX: 000000009ca264c8 RCX: ffff98996e6d8ff8 kernel: RDX: 0000000000000064 RSI: 000841551d5c1000 RDI: ffffffff9500435d kernel: RBP: ffff989a3be856c0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000001000 R12: ffff98996e6d8000 kernel: R13: 0000000000000008 R14: 0000000000001000 R15: 000841551d5c1000 kernel: FS: 0000000000000000(0000) GS:ffff98a09d640000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00001e9f984d9ea8 CR3: 000000014971a000 CR4: 00000000003506e0 kernel: Call Trace: kernel: <TASK> kernel: end_compressed_bio_read (fs/btrfs/compression.c:104 fs/btrfs/compression.c:1363 fs/btrfs/compression.c:323) btrfs kernel: end_workqueue_fn (fs/btrfs/disk-io.c:1923) btrfs kernel: btrfs_work_helper (fs/btrfs/async-thread.c:326) btrfs kernel: process_one_work (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:212 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2312) kernel: worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2455) kernel: ? process_one_work (kernel/workqueue.c:2397) kernel: kthread (kernel/kthread.c:377) kernel: ? kthread_complete_and_exit (kernel/kthread.c:332) kernel: ret_from_fork (arch/x86/entry/entry_64.S:301) kernel: </TASK> CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Dāvis Mosāns <davispuh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Linus Torvalds
|
705d84a366 |
for-5.17-rc4-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmILuxMACgkQxWXV+ddt WDvhrA/9Hsyj2DdvvBVR3HudaER51RAJS6dtJCJdFZGWy2tEwtkxhIdbPn1nwJE7 mvZy2UN79JKwPAdX8inyJ68RCMtcifprkUMC2d7y2mVZcCG/a/iYGdDIVB/z4Pyx NneBBgwdG0V505i7/sm07epLUaNI1MwXc9wNAs00zSXw4eYjLq09fp0lfl74RBhv HvuYgawk2abY6hPbJnTu2MyyHEZI4oGH/fRurP48cvU/FbIm9en7aX3rEZ+T8yRW TkEdFF/60Wce3xkyN87Xqma6L0smypJ888yzpwsJtlFOTr7iI58HYqUfx27Q5VQc xy5fyuuplEb0ky4GBnscpsoutj5C241+4+YE4HGqf9ne5EYU1rzJATlEFUBk84hY YwjdordS/nTScyFVCBG9yiTL30KsQ2SQc1TzIt/rIJiYIJexJyppOJMFmxbuN9By WSrLB5/uN56dRe/A8LMGpuJdwTVrYr2SPXfSseAxCEONt5fppPnDaCGEgVKIdmHq sQXbs/LMGHQ1lq2JsPFD12p8kQJ5Redxy0KIzDwmeBAL3HlXwpFiMia0AhvKOzPj UFtU/KcOmtqWMMv3P0aHydmDmUid4c3612BtvbKOhIXTVzKodzcQhkyTw1ducAa1 GMkKIHCaPCzbsJwiogZGSBmIyDyMwitVvAybZIpRTR9i0xSA61A= =AqU+ -----END PGP SIGNATURE----- Merge tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - yield CPU more often when defragmenting a large file - skip defragmenting extents already under writeback - improve error message when send fails to write file data - get rid of warning when mounted with 'flushoncommit' * tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: in case of IO error log it btrfs: get rid of warning on transaction commit when using flushoncommit btrfs: defrag: don't try to defrag extents which are under writeback btrfs: don't hold CPU for too long when defragging a file |
||
Dāvis Mosāns
|
2e7be9db12 |
btrfs: send: in case of IO error log it
Currently if we get IO error while doing send then we abort without logging information about which file caused issue. So log it to help with debugging. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Dāvis Mosāns <davispuh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
a0f0cf8341 |
btrfs: get rid of warning on transaction commit when using flushoncommit
When using the flushoncommit mount option, during almost every transaction commit we trigger a warning from __writeback_inodes_sb_nr(): $ cat fs/fs-writeback.c: (...) static void __writeback_inodes_sb_nr(struct super_block *sb, ... { (...) WARN_ON(!rwsem_is_locked(&sb->s_umount)); (...) } (...) The trace produced in dmesg looks like the following: [947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3 [947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif [947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted 95.16.3-srb-asrock-00001-g36437ad63879 #186 [947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3 [947.502097] Code: 24 10 4c 89 44 24 18 c6 (...) [947.519760] RSP: 0018:ffffc90000777e10 EFLAGS: 00010246 [947.523818] RAX: 0000000000000000 RBX: 0000000000963300 RCX: 0000000000000000 [947.529765] RDX: 0000000000000000 RSI: 000000000000fa51 RDI: ffffc90000777e50 [947.535740] RBP: ffff888101628a90 R08: ffff888100955800 R09: ffff888100956000 [947.541701] R10: 0000000000000002 R11: 0000000000000001 R12: ffff888100963488 [947.547645] R13: ffff888100963000 R14: ffff888112fb7200 R15: ffff888100963460 [947.553621] FS: 0000000000000000(0000) GS:ffff88841fd40000(0000) knlGS:0000000000000000 [947.560537] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [947.565122] CR2: 0000000008be50c4 CR3: 000000000220c000 CR4: 00000000001006e0 [947.571072] Call Trace: [947.572354] <TASK> [947.573266] btrfs_commit_transaction+0x1f1/0x998 [947.576785] ? start_transaction+0x3ab/0x44e [947.579867] ? schedule_timeout+0x8a/0xdd [947.582716] transaction_kthread+0xe9/0x156 [947.585721] ? btrfs_cleanup_transaction.isra.0+0x407/0x407 [947.590104] kthread+0x131/0x139 [947.592168] ? set_kthread_struct+0x32/0x32 [947.595174] ret_from_fork+0x22/0x30 [947.597561] </TASK> [947.598553] ---[ end trace 644721052755541c ]--- This is because we started using writeback_inodes_sb() to flush delalloc when committing a transaction (when using -o flushoncommit), in order to avoid deadlocks with filesystem freeze operations. This change was made by commit |
||
Qu Wenruo
|
0d1ffa2228 |
btrfs: defrag: don't try to defrag extents which are under writeback
Once we start writeback (have called btrfs_run_delalloc_range()), we allocate an extent, create an extent map point to that extent, with a generation of (u64)-1, created the ordered extent and then clear the DELALLOC bit from the range in the inode's io tree. Such extent map can pass the first call of defrag_collect_targets(), as its generation is (u64)-1, meets any possible minimal generation check. And the range will not have DELALLOC bit, also passing the DELALLOC bit check. It will only be re-checked in the second call of defrag_collect_targets(), which will wait for writeback. But at that stage we have already spent our time waiting for some IO we may or may not want to defrag. Let's reject such extents early so we won't waste our time. CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
ea0eba69a2 |
btrfs: don't hold CPU for too long when defragging a file
There is a user report about "btrfs filesystem defrag" causing 120s timeout problem. For btrfs_defrag_file() it will iterate all file extents if called from defrag ioctl, thus it can take a long time. There is no reason not to release the CPU during such a long operation. Add cond_resched() after defragged one cluster. CC: stable@vger.kernel.org # 5.16 Link: https://lore.kernel.org/linux-btrfs/10e51417-2203-f0a4-2021-86c8511cc367@gmx.com Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Linus Torvalds
|
86286e486c |
for-5.17-rc2-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmH9eUcACgkQxWXV+ddt WDvCvQ//bANu7air/Og5r2Mn0ZYyrcQl+yDYE75UC/tzTZNNtP8guwGllwlcsA0v RQPiuFFtvjKMgKP6Eo1mVeUPkpX83VQkT+sqFRsFEFxazIXnSvEJ+iHVcuiZvgj1 VkTjdt7/mLb573zSA0MLhJqK1BBuFhUTCCHFHlCLoiYeekPAui1pbUC4LAE/+ksU veCn9YS+NGkDpIv/b9mcALVBe+XkZlmw1LON8ArEbpY4ToafRk0qZfhV7CvyRbSP Y1zLUScNLHGoR2WA1WhwlwuMePdhgX/8neGNiXsiw3WnmZhFoUVX7oUa6IWnKkKk dD+x5Z3Z2xBQGK8djyqxzUFJ2VAvz15xGIM452ofGa1BJFZgV9hjPA6Y4RFdWx63 4AZ6OJwhrXhgMtWBhRtM6mGeje56ljwaxku9qhe585z8H5V8ezUNwWVkjY0bLKsd iT3bUHEReoYRWuyszI1ZSm1DbyzNY2943ly97p/j8qKp4SHX39/QYAKmnuuHdIup TnTBJOh38rj4S8BfF873aKAo7EfLJcDbTcZ1ivbuX5FeByRuQB4F0c1RRi4usMLc DL5mhDhT71U1l/LF3IANQ4ieUfZbeFHd+dAVkYsGkYzzaWL8E03L582l/fqaVGsp RaVpiuKnh2cyDXUxob8IYT5mZ/saa96xBSK8VEqnwjNEQCzKEeU= =5MJl -----END PGP SIGNATURE----- Merge tag 'for-5.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes and error handling improvements: - fix deadlock between quota disable and qgroup rescan worker - fix use-after-free after failure to create a snapshot - skip warning on unmount after log cleanup failure - don't start transaction for scrub if the fs is mounted read-only - tree checker verifies item sizes" * tag 'for-5.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: skip reserved bytes warning on unmount after log cleanup failure btrfs: fix use of uninitialized variable at rm device ioctl btrfs: fix use-after-free after failure to create a snapshot btrfs: tree-checker: check item_size for dev_item btrfs: tree-checker: check item_size for inode_item btrfs: fix deadlock between quota disable and qgroup rescan worker btrfs: don't start transaction for scrub if the fs is mounted read-only |
||
Filipe Manana
|
40cdc50987 |
btrfs: skip reserved bytes warning on unmount after log cleanup failure
After the recent changes made by commit |
||
Tom Rix
|
37b4599547 |
btrfs: fix use of uninitialized variable at rm device ioctl
Clang static analysis reports this problem
ioctl.c:3333:8: warning: 3rd function call argument is an
uninitialized value
ret = exclop_start_or_cancel_reloc(fs_info,
cancel is only set in one branch of an if-check and is always used. So
initialize to false.
Fixes:
|
||
Filipe Manana
|
28b21c558a |
btrfs: fix use-after-free after failure to create a snapshot
At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and then attach it to the transaction's list of pending snapshots. After that we call btrfs_commit_transaction(), and if that returns an error we jump to 'fail' label, where we kfree() the pending snapshot structure. This can result in a later use-after-free of the pending snapshot: 1) We allocated the pending snapshot and added it to the transaction's list of pending snapshots; 2) We call btrfs_commit_transaction(), and it fails either at the first call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups(). In both cases, we don't abort the transaction and we release our transaction handle. We jump to the 'fail' label and free the pending snapshot structure. We return with the pending snapshot still in the transaction's list; 3) Another task commits the transaction. This time there's no error at all, and then during the transaction commit it accesses a pointer to the pending snapshot structure that the snapshot creation task has already freed, resulting in a user-after-free. This issue could actually be detected by smatch, which produced the following warning: fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list So fix this by not having the snapshot creation ioctl directly add the pending snapshot to the transaction's list. Instead add the pending snapshot to the transaction handle, and then at btrfs_commit_transaction() we add the snapshot to the list only when we can guarantee that any error returned after that point will result in a transaction abort, in which case the ioctl code can safely free the pending snapshot and no one can access it anymore. CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Su Yue
|
ea1d1ca402 |
btrfs: tree-checker: check item_size for dev_item
Check item size before accessing the device item to avoid out of bound access, similar to inode_item check. Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Su Yue
|
0c982944af |
btrfs: tree-checker: check item_size for inode_item
while mounting the crafted image, out-of-bounds access happens: [350.429619] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1 [350.429636] index 1048096 is out of range for type 'page *[16]' [350.429650] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 #1 [350.429652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [350.429653] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs] [350.429772] Call Trace: [350.429774] <TASK> [350.429776] dump_stack_lvl+0x47/0x5c [350.429780] ubsan_epilogue+0x5/0x50 [350.429786] __ubsan_handle_out_of_bounds+0x66/0x70 [350.429791] btrfs_get_16+0xfd/0x120 [btrfs] [350.429832] check_leaf+0x754/0x1a40 [btrfs] [350.429874] ? filemap_read+0x34a/0x390 [350.429878] ? load_balance+0x175/0xfc0 [350.429881] validate_extent_buffer+0x244/0x310 [btrfs] [350.429911] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs] [350.429935] end_bio_extent_readpage+0x3af/0x850 [btrfs] [350.429969] ? newidle_balance+0x259/0x480 [350.429972] end_workqueue_fn+0x29/0x40 [btrfs] [350.429995] btrfs_work_helper+0x71/0x330 [btrfs] [350.430030] ? __schedule+0x2fb/0xa40 [350.430033] process_one_work+0x1f6/0x400 [350.430035] ? process_one_work+0x400/0x400 [350.430036] worker_thread+0x2d/0x3d0 [350.430037] ? process_one_work+0x400/0x400 [350.430038] kthread+0x165/0x190 [350.430041] ? set_kthread_struct+0x40/0x40 [350.430043] ret_from_fork+0x1f/0x30 [350.430047] </TASK> [350.430077] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2 check_leaf() is checking the leaf: corrupt leaf: root=4 block=29396992 slot=1, bad key order, prev (16140901064495857664 1 0) current (1 204 12582912) leaf 29396992 items 6 free space 3565 generation 6 owner DEV_TREE leaf 29396992 flags 0x1(WRITTEN) backref revision 1 fs uuid a62e00e8-e94e-4200-8217-12444de93c2e chunk uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8 item 0 key (16140901064495857664 INODE_ITEM 0) itemoff 3955 itemsize 40 generation 0 transid 0 size 0 nbytes 17592186044416 block group 0 mode 52667 links 33 uid 0 gid 2104132511 rdev 94223634821136 sequence 100305 flags 0x2409000(none) atime 0.0 (1970-01-01 08:00:00) ctime 2973280098083405823.4294967295 (-269783007-01-01 21:37:03) mtime 18446744071572723616.4026825121 (1902-04-16 12:40:00) otime 9249929404488876031.4294967295 (622322949-04-16 04:25:58) item 1 key (1 DEV_EXTENT 12582912) itemoff 3907 itemsize 48 dev extent chunk_tree 3 chunk_objectid 256 chunk_offset 12582912 length 8388608 chunk_tree_uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8 The corrupted leaf of device tree has an inode item. The leaf passed checksum and others checks in validate_extent_buffer until check_leaf_item(). Because of the key type BTRFS_INODE_ITEM, check_inode_item() is called even we are in the device tree. Since the item offset + sizeof(struct btrfs_inode_item) > eb->len, out-of-bounds access is triggered. The item end vs leaf boundary check has been done before check_leaf_item(), so fix it by checking item size in check_inode_item() before access of the inode item in extent buffer. Other check functions except check_dev_item() in check_leaf_item() have their item size checks. The commit for check_dev_item() is followed. No regression observed during running fstests. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299 CC: stable@vger.kernel.org # 5.10+ CC: Wenqing Liu <wenqingliu0120@gmail.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Shin'ichiro Kawasaki
|
e804861bd4 |
btrfs: fix deadlock between quota disable and qgroup rescan worker
Quota disable ioctl starts a transaction before waiting for the qgroup rescan worker completes. However, this wait can be infinite and results in deadlock because of circular dependency among the quota disable ioctl, the qgroup rescan worker and the other task with transaction such as block group relocation task. The deadlock happens with the steps following: 1) Task A calls ioctl to disable quota. It starts a transaction and waits for qgroup rescan worker completes. 2) Task B such as block group relocation task starts a transaction and joins to the transaction that task A started. Then task B commits to the transaction. In this commit, task B waits for a commit by task A. 3) Task C as the qgroup rescan worker starts its job and starts a transaction. In this transaction start, task C waits for completion of the transaction that task A started and task B committed. This deadlock was found with fstests test case btrfs/115 and a zoned null_blk device. The test case enables and disables quota, and the block group reclaim was triggered during the quota disable by chance. The deadlock was also observed by running quota enable and disable in parallel with 'btrfs balance' command on regular null_blk devices. An example report of the deadlock: [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds. [372.479944] Not tainted 5.16.0-rc8 #7 [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [372.493898] task:kworker/u16:6 state:D stack: 0 pid: 103 ppid: 2 flags:0x00004000 [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs] [372.510782] Call Trace: [372.514092] <TASK> [372.521684] __schedule+0xb56/0x4850 [372.530104] ? io_schedule_timeout+0x190/0x190 [372.538842] ? lockdep_hardirqs_on+0x7e/0x100 [372.547092] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [372.555591] schedule+0xe0/0x270 [372.561894] btrfs_commit_transaction+0x18bb/0x2610 [btrfs] [372.570506] ? btrfs_apply_pending_changes+0x50/0x50 [btrfs] [372.578875] ? free_unref_page+0x3f2/0x650 [372.585484] ? finish_wait+0x270/0x270 [372.591594] ? release_extent_buffer+0x224/0x420 [btrfs] [372.599264] btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs] [372.607157] ? lock_release+0x3a9/0x6d0 [372.613054] ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs] [372.620960] ? do_raw_spin_lock+0x11e/0x250 [372.627137] ? rwlock_bug.part.0+0x90/0x90 [372.633215] ? lock_is_held_type+0xe4/0x140 [372.639404] btrfs_work_helper+0x1ae/0xa90 [btrfs] [372.646268] process_one_work+0x7e9/0x1320 [372.652321] ? lock_release+0x6d0/0x6d0 [372.658081] ? pwq_dec_nr_in_flight+0x230/0x230 [372.664513] ? rwlock_bug.part.0+0x90/0x90 [372.670529] worker_thread+0x59e/0xf90 [372.676172] ? process_one_work+0x1320/0x1320 [372.682440] kthread+0x3b9/0x490 [372.687550] ? _raw_spin_unlock_irq+0x24/0x50 [372.693811] ? set_kthread_struct+0x100/0x100 [372.700052] ret_from_fork+0x22/0x30 [372.705517] </TASK> [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds. [372.729827] Not tainted 5.16.0-rc8 #7 [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [372.767106] task:btrfs-transacti state:D stack: 0 pid: 2347 ppid: 2 flags:0x00004000 [372.787776] Call Trace: [372.801652] <TASK> [372.812961] __schedule+0xb56/0x4850 [372.830011] ? io_schedule_timeout+0x190/0x190 [372.852547] ? lockdep_hardirqs_on+0x7e/0x100 [372.871761] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [372.886792] schedule+0xe0/0x270 [372.901685] wait_current_trans+0x22c/0x310 [btrfs] [372.919743] ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs] [372.938923] ? finish_wait+0x270/0x270 [372.959085] ? join_transaction+0xc75/0xe30 [btrfs] [372.977706] start_transaction+0x938/0x10a0 [btrfs] [372.997168] transaction_kthread+0x19d/0x3c0 [btrfs] [373.013021] ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs] [373.031678] kthread+0x3b9/0x490 [373.047420] ? _raw_spin_unlock_irq+0x24/0x50 [373.064645] ? set_kthread_struct+0x100/0x100 [373.078571] ret_from_fork+0x22/0x30 [373.091197] </TASK> [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds. [373.114147] Not tainted 5.16.0-rc8 #7 [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [373.130393] task:btrfs state:D stack: 0 pid: 3145 ppid: 3141 flags:0x00004000 [373.140998] Call Trace: [373.145501] <TASK> [373.149654] __schedule+0xb56/0x4850 [373.155306] ? io_schedule_timeout+0x190/0x190 [373.161965] ? lockdep_hardirqs_on+0x7e/0x100 [373.168469] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [373.175468] schedule+0xe0/0x270 [373.180814] wait_for_commit+0x104/0x150 [btrfs] [373.187643] ? test_and_set_bit+0x20/0x20 [btrfs] [373.194772] ? kmem_cache_free+0x124/0x550 [373.201191] ? btrfs_put_transaction+0x69/0x3d0 [btrfs] [373.208738] ? finish_wait+0x270/0x270 [373.214704] ? __btrfs_end_transaction+0x347/0x7b0 [btrfs] [373.222342] btrfs_commit_transaction+0x44d/0x2610 [btrfs] [373.230233] ? join_transaction+0x255/0xe30 [btrfs] [373.237334] ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs] [373.245251] ? btrfs_apply_pending_changes+0x50/0x50 [btrfs] [373.253296] relocate_block_group+0x105/0xc20 [btrfs] [373.260533] ? mutex_lock_io_nested+0x1270/0x1270 [373.267516] ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs] [373.275155] ? merge_reloc_roots+0x710/0x710 [btrfs] [373.283602] ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs] [373.291934] ? kmem_cache_free+0x124/0x550 [373.298180] btrfs_relocate_block_group+0x35c/0x930 [btrfs] [373.306047] btrfs_relocate_chunk+0x85/0x210 [btrfs] [373.313229] btrfs_balance+0x12f4/0x2d20 [btrfs] [373.320227] ? lock_release+0x3a9/0x6d0 [373.326206] ? btrfs_relocate_chunk+0x210/0x210 [btrfs] [373.333591] ? lock_is_held_type+0xe4/0x140 [373.340031] ? rcu_read_lock_sched_held+0x3f/0x70 [373.346910] btrfs_ioctl_balance+0x548/0x700 [btrfs] [373.354207] btrfs_ioctl+0x7f2/0x71b0 [btrfs] [373.360774] ? lockdep_hardirqs_on_prepare+0x410/0x410 [373.367957] ? lockdep_hardirqs_on_prepare+0x410/0x410 [373.375327] ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs] [373.383841] ? find_held_lock+0x2c/0x110 [373.389993] ? lock_release+0x3a9/0x6d0 [373.395828] ? mntput_no_expire+0xf7/0xad0 [373.402083] ? lock_is_held_type+0xe4/0x140 [373.408249] ? vfs_fileattr_set+0x9f0/0x9f0 [373.414486] ? selinux_file_ioctl+0x349/0x4e0 [373.420938] ? trace_raw_output_lock+0xb4/0xe0 [373.427442] ? selinux_inode_getsecctx+0x80/0x80 [373.434224] ? lockdep_hardirqs_on+0x7e/0x100 [373.440660] ? force_qs_rnp+0x2a0/0x6b0 [373.446534] ? lock_is_held_type+0x9b/0x140 [373.452763] ? __blkcg_punt_bio_submit+0x1b0/0x1b0 [373.459732] ? security_file_ioctl+0x50/0x90 [373.466089] __x64_sys_ioctl+0x127/0x190 [373.472022] do_syscall_64+0x3b/0x90 [373.477513] entry_SYSCALL_64_after_hwframe+0x44/0xae [373.484823] RIP: 0033:0x7f8f4af7e2bb [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003 [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0 [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001 [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002 [373.546506] </TASK> [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds. [373.559383] Not tainted 5.16.0-rc8 #7 [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [373.575748] task:btrfs state:D stack: 0 pid: 3146 ppid: 2168 flags:0x00000000 [373.586314] Call Trace: [373.590846] <TASK> [373.595121] __schedule+0xb56/0x4850 [373.600901] ? __lock_acquire+0x23db/0x5030 [373.607176] ? io_schedule_timeout+0x190/0x190 [373.613954] schedule+0xe0/0x270 [373.619157] schedule_timeout+0x168/0x220 [373.625170] ? usleep_range_state+0x150/0x150 [373.631653] ? mark_held_locks+0x9e/0xe0 [373.637767] ? do_raw_spin_lock+0x11e/0x250 [373.643993] ? lockdep_hardirqs_on_prepare+0x17b/0x410 [373.651267] ? _raw_spin_unlock_irq+0x24/0x50 [373.657677] ? lockdep_hardirqs_on+0x7e/0x100 [373.664103] wait_for_completion+0x163/0x250 [373.670437] ? bit_wait_timeout+0x160/0x160 [373.676585] btrfs_quota_disable+0x176/0x9a0 [btrfs] [373.683979] ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs] [373.691340] ? down_write+0xd0/0x130 [373.696880] ? down_write_killable+0x150/0x150 [373.703352] btrfs_ioctl+0x3945/0x71b0 [btrfs] [373.710061] ? find_held_lock+0x2c/0x110 [373.716192] ? lock_release+0x3a9/0x6d0 [373.722047] ? __handle_mm_fault+0x23cd/0x3050 [373.728486] ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs] [373.737032] ? set_pte+0x6a/0x90 [373.742271] ? do_raw_spin_unlock+0x55/0x1f0 [373.748506] ? lock_is_held_type+0xe4/0x140 [373.754792] ? vfs_fileattr_set+0x9f0/0x9f0 [373.761083] ? selinux_file_ioctl+0x349/0x4e0 [373.767521] ? selinux_inode_getsecctx+0x80/0x80 [373.774247] ? __up_read+0x182/0x6e0 [373.780026] ? count_memcg_events.constprop.0+0x46/0x60 [373.787281] ? up_write+0x460/0x460 [373.792932] ? security_file_ioctl+0x50/0x90 [373.799232] __x64_sys_ioctl+0x127/0x190 [373.805237] do_syscall_64+0x3b/0x90 [373.810947] entry_SYSCALL_64_after_hwframe+0x44/0xae [373.818102] RIP: 0033:0x7f1383ea02bb [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003 [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000 [373.879838] </TASK> [373.884018] Showing all locks held in the system: [373.894250] 3 locks held by kworker/4:1/58: [373.900356] 1 lock held by khungtaskd/63: [373.906333] #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 [373.917307] 3 locks held by kworker/u16:6/103: [373.923938] #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320 [373.936555] #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320 [373.951109] #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs] [373.964027] 2 locks held by less/1803: [373.969982] #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80 [373.981295] #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060 [373.992969] 1 lock held by btrfs-transacti/2347: [373.999893] #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs] [374.015872] 3 locks held by btrfs/3145: [374.022298] #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs] [374.034456] #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs] [374.047646] #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs] [374.063295] 4 locks held by btrfs/3146: [374.069647] #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs] [374.081601] #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs] [374.094283] #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs] [374.106885] #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs] [374.126780] ============================================= To avoid the deadlock, wait for the qgroup rescan worker to complete before starting the transaction for the quota disable ioctl. Clear BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to request the worker to complete. On transaction start failure, set the BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag changes can be done safely since the function btrfs_quota_disable is not called concurrently because of fs_info->subvol_sem. Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid another qgroup rescan worker to start after the previous qgroup worker completed. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
2d192fc4c1 |
btrfs: don't start transaction for scrub if the fs is mounted read-only
[BUG]
The following super simple script would crash btrfs at unmount time, if
CONFIG_BTRFS_ASSERT() is set.
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" $mnt/file
umount $mnt
mount -r ro $dev $mnt
btrfs scrub start -Br $mnt
umount $mnt
This will trigger the following ASSERT() introduced by commit
|
||
Linus Torvalds
|
4897e722b5 |
\n
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmHz0QsACgkQnJ2qBz9k QNkN+AgA6XqWHKYyElfgJFt1UqaoNMz/Faz9H/+PKiBNSTf6/+67D+V7DFz6jJrv dDwHNzfDg9kR+pbAAPwhl2jfnQoUlsr019Hrqa5HpWlj5geVpbdunYUzS2WOkwqD /m+brcLgPdKb2uIysj6wOh9B7wa8V9ksl3EjQvvwaHaU0p1YLUqidVXucYvs8DUo bgXNaj9GmeysxnmU+aILotWuuXH2vOP4Q2Uk4qz3rN6xW9eEXtpQ4y7gWBp/GA8y Ia8FtFdQdvlSDOJYMdPOTBu5RB7gY9ElrapvVaWNYtCWI/jRv666nZsLaERYNhLN uUsG4MWjYbOqW5XqFDbSOwbDqvMh5Q== =mEFA -----END PGP SIGNATURE----- Merge tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "Fixes for userspace breakage caused by fsnotify changes ~3 years ago and one fanotify cleanup" * tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix fsnotify hooks in pseudo filesystems fsnotify: invalidate dcache before IN_DELETE event fanotify: remove variable set but not used |
||
Linus Torvalds
|
49d766f3a0 |
for-5.17-rc1-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmHwDeMACgkQxWXV+ddt WDtdMQ//QFqkIB34zW5N3uX1xBFht/G/bCPNdGiK5YerjJZj1f6Rmsytbb6qlWHg NlB/XEPeQaQVrSfF37svnvATgySPaqePsufrT2XYu3x2w8muPTl460wmzdMt5h47 rGB+ct4JdLBH4KJgqe2Bilrqg+FJmL3XT5k0aU3driy4Gb+bcDGeEyVmTWcnNRIg DzfUlNwTKhAhZDl8D3B9X2vV8TZDBtrRLquI94eYvooF3LYDL+kExLUW8WDmmAfy mjnANs8c+EtcVAzN7tW+O1UqdYYJ8Yo4ngk1nVVRdRvA2BDp9ixgWi/m/3jZ3JmJ jySV1zsZJB3ZGp/hIuEvtGY7jheDtbTnfgtI+vwjVdr208acs+XhfDckuOZBZIUY 7Zk+Qif/narxFAoAvkgkH5QDNSSReKqaHgzohfnzSQqrfO0bh6fw1FnBOm4iXT7C cXvReD4m36g46SdTsxnvttpXizXIFe4JPOkpRkBzxIQFaMTA4Is43W0lYC24Ppxj A0UVevh3HPhOYzABynuy0EnknZeylb6P+WpGG6Ge+sVrVquQiwR01n4HeoaJO3qe re46uUGwO8Q30blYY50HBSJp0bpcciPZRVMJaspcAT9KD0fJ1s/csd2lQyP4ewn6 A0zg6eabc0PD3LwdlHqp//jTNft/BL4RVZ2c3uM+mgXnGeekcoQ= =EysX -----END PGP SIGNATURE----- Merge tag 'for-5.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Several fixes for defragmentation that got broken in 5.16 after refactoring and added subpage support. The observed bugs are excessive IO or uninterruptible ioctl. All stable material" * tag 'for-5.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: update writeback index when starting defrag btrfs: add back missing dirty page rate limiting to defrag btrfs: fix deadlock when reserving space during defrag btrfs: defrag: properly update range->start for autodefrag btrfs: defrag: fix wrong number of defragged sectors btrfs: allow defrag to be interruptible btrfs: fix too long loop when defragging a 1 byte file |
||
Filipe Manana
|
27cdfde181 |
btrfs: update writeback index when starting defrag
When starting a defrag, we should update the writeback index of the inode's mapping in case it currently has a value beyond the start of the range we are defragging. This can help performance and often result in getting less extents after writeback - for e.g., if the current value of the writeback index sits somewhere in the middle of a range that gets dirty by the defrag, then after writeback we can get two smaller extents instead of a single, larger extent. We used to have this before the refactoring in 5.16, but it was removed without any reason to do so. Originally it was added in kernel 3.1, by commit |
||
Filipe Manana
|
3c9d31c715 |
btrfs: add back missing dirty page rate limiting to defrag
A defrag operation can dirty a lot of pages, specially if operating on
the entire file or a large file range. Any task dirtying pages should
periodically call balance_dirty_pages_ratelimited(), as stated in that
function's comments, otherwise they can leave too many dirty pages in
the system. This is what we did before the refactoring in 5.16, and
it should have remained, just like in the buffered write path and
relocation. So restore that behaviour.
Fixes:
|
||
Filipe Manana
|
0cb5950f3f |
btrfs: fix deadlock when reserving space during defrag
When defragging we can end up collecting a range for defrag that has
already pages under delalloc (dirty), as long as the respective extent
map for their range is not mapped to a hole, a prealloc extent or
the extent map is from an old generation.
Most of the time that is harmless from a functional perspective at
least, however it can result in a deadlock:
1) At defrag_collect_targets() we find an extent map that meets all
requirements but there's delalloc for the range it covers, and we add
its range to list of ranges to defrag;
2) The defrag_collect_targets() function is called at defrag_one_range(),
after it locked a range that overlaps the range of the extent map;
3) At defrag_one_range(), while the range is still locked, we call
defrag_one_locked_target() for the range associated to the extent
map we collected at step 1);
4) Then finally at defrag_one_locked_target() we do a call to
btrfs_delalloc_reserve_space(), which will reserve data and metadata
space. If the space reservations can not be satisfied right away, the
flusher might be kicked in and start flushing delalloc and wait for
the respective ordered extents to complete. If this happens we will
deadlock, because both flushing delalloc and finishing an ordered
extent, requires locking the range in the inode's io tree, which was
already locked at defrag_collect_targets().
So fix this by skipping extent maps for which there's already delalloc.
Fixes:
|
||
Amir Goldstein
|
a37d9a17f0 |
fsnotify: invalidate dcache before IN_DELETE event
Apparently, there are some applications that use IN_DELETE event as an invalidation mechanism and expect that if they try to open a file with the name reported with the delete event, that it should not contain the content of the deleted file. Commit |
||
Christoph Hellwig
|
0a4ee51818 |
mm: remove cleancache
Patch series "remove Xen tmem leftovers".
Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap. This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.
This patch (of 13):
The cleancache subsystem is unused since the removal of Xen tmem driver
in commit
|
||
Linus Torvalds
|
f4484d138b |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "55 patches. Subsystems affected by this patch series: percpu, procfs, sysctl, misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2, hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits) lib: remove redundant assignment to variable ret ubsan: remove CONFIG_UBSAN_OBJECT_SIZE kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB btrfs: use generic Kconfig option for 256kB page size limit arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB configs: introduce debug.config for CI-like setup delayacct: track delays from memory compact Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact delayacct: cleanup flags in struct task_delay_info and functions use it delayacct: fix incomplete disable operation when switch enable to disable delayacct: support swapin delay accounting for swapping without blkio panic: remove oops_id panic: use error_report_end tracepoint on warnings fs/adfs: remove unneeded variable make code cleaner FAT: use io_schedule_timeout() instead of congestion_wait() hfsplus: use struct_group_attr() for memcpy() region nilfs2: remove redundant pointer sbufs fs/binfmt_elf: use PT_LOAD p_align values for static PIE const_structs.checkpatch: add frequently used ops structs ... |
||
Nathan Chancellor
|
e900909599 |
btrfs: use generic Kconfig option for 256kB page size limit
Use the newly introduced CONFIG_PAGE_SIZE_LESS_THAN_256KB to describe
the dependency introduced by commit
|
||
Qu Wenruo
|
c080b4144b |
btrfs: defrag: properly update range->start for autodefrag
[BUG] After commit |
||
Qu Wenruo
|
484167da77 |
btrfs: defrag: fix wrong number of defragged sectors
[BUG] There are users using autodefrag mount option reporting obvious increase in IO: > If I compare the write average (in total, I don't have it per process) > when taking idle periods on the same machine: > Linux 5.16: > without autodefrag: ~ 10KiB/s > with autodefrag: between 1 and 2MiB/s. > > Linux 5.15: > with autodefrag:~ 10KiB/s (around the same as without > autodefrag on 5.16) [CAUSE] When autodefrag mount option is enabled, btrfs_defrag_file() will be called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many sectors we can defrag in one try. And then use the number of sectors defragged to determine if we need to re-defrag. But commit |
||
Filipe Manana
|
b767c2fc78 |
btrfs: allow defrag to be interruptible
During defrag, at btrfs_defrag_file(), we have this loop that iterates over a file range in steps no larger than 256K subranges. If the range is too long, there's no way to interrupt it. So make the loop check in each iteration if there's signal pending, and if there is, break and return -AGAIN to userspace. Before kernel 5.16, we used to allow defrag to be cancelled through a signal, but that was lost with commit |
||
Filipe Manana
|
6b34cd8e17 |
btrfs: fix too long loop when defragging a 1 byte file
When attempting to defrag a file with a single byte, we can end up in a
too long loop, which is nearly infinite because at btrfs_defrag_file()
we end up with the variable last_byte assigned with a value of
18446744073709551615 (which is (u64)-1). The problem comes from the fact
we end up doing:
last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
So if last_byte was assigned 0, which is i_size - 1, we underflow and
end up with the value 18446744073709551615.
This is trivial to reproduce and the following script triggers it:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo -n "X" > $MNT/foobar
btrfs filesystem defragment $MNT/foobar
umount $MNT
So fix this by not decrementing last_byte by 1 before doing the sector
size round up. Also, to make it easier to follow, make the round up right
after computing last_byte.
Reported-by: Anthony Ruhier <aruhier@mailbox.org>
Fixes:
|
||
Qu Wenruo
|
36c86a9e1b |
btrfs: output more debug messages for uncommitted transaction
Print extra information about how many dirty bytes an uncommitted has at the end of mount. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
c2f822635d |
btrfs: respect the max size in the header when activating swap file
If we extended the size of a swapfile after its header was created (by the mkswap utility) and then try to activate it, we will map the entire file when activating the swap file, instead of limiting to the max size defined in the swap file's header. Currently test case generic/643 from fstests fails because we do not respect that size limit defined in the swap file's header. So fix this by not mapping file ranges beyond the max size defined in the swap header. This is the same type of bug that iomap used to have, and was fixed in commit |
||
Yang Li
|
be8d1a2ab9 |
btrfs: fix argument list that the kdoc format and script verified
The warnings were found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/btrfs/extent_io.c:3210: warning: Function parameter or member 'bio_ctrl' not described in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio' description in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'prev_bio_flags' description in 'btrfs_bio_add_page' fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root' description in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1602: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_metadata_bytes' Note: this is fixing only the warnings regarding parameter list, the first line is not strictly conforming to the kdoc format as the btrfs codebase does not stick to that and keeps the first line more free form (because it's only for internal use). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
Su Yue
|
4a9e803e5b |
btrfs: remove unnecessary parameter type from compression_decompress_bio
btrfs_decompress_bio, the only caller of compression_decompress_bio gets type from @cb and passes it to compression_decompress_bio. However, compression_decompress_bio can get compression type directly from @cb. So remove the parameter and access it through @cb. No functional change. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |