Commit Graph

7482 Commits

Author SHA1 Message Date
Xiongfeng Wang
6670d4c2d9 btrfs: use correct string length in DEV_INFO ioctl
gcc-8 reports:

fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 1024 equals destination size [-Wstringop-truncation]

We need one less byte or call strlcpy() to make it a nul-terminated
string. This is done on the next line anyway, but we want to avoid the
warning.
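
As a rough illustration of the fix described above (a sketch, not the actual ioctl code): copying at most one byte less than the destination size and then writing the terminator keeps the result NUL-terminated, which should avoid the -Wstringop-truncation warning. The helper name copy_dev_path() and the constant PATH_BUF_SIZE are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size, chosen to match the 1024 in the warning. */
#define PATH_BUF_SIZE 1024

/*
 * Copy src into a fixed-size destination, always leaving room for the
 * terminating NUL. Returns the length of the resulting string.
 */
static size_t copy_dev_path(char dst[PATH_BUF_SIZE], const char *src)
{
	/* Bound is size - 1, so strncpy() never writes the last byte. */
	strncpy(dst, src, PATH_BUF_SIZE - 1);
	/* Terminate explicitly; strncpy() does not do so on truncation. */
	dst[PATH_BUF_SIZE - 1] = '\0';
	return strlen(dst);
}
```

A source string longer than the buffer is truncated to PATH_BUF_SIZE - 1 characters instead of being left unterminated.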

Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Anand Jain
6f794e3c5c btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.

[1]
 commit 319e4d0661
    btrfs: Enhance super validation check

Fixes: 319e4d0661 ("btrfs: Enhance super validation check")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Anand Jain
e2731e5588 btrfs: define SUPER_FLAG_METADUMP_V2
btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define it in the kernel so that we know it's already in use.
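
A minimal sketch of how such a definition can feed a supported-flags check. Only the bit position (1ULL << 34) comes from the message above; the EX_-prefixed names, the contents of the mask, and the helper are assumptions for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position taken from the commit message; the name is illustrative. */
#define EX_SUPER_FLAG_METADUMP_V2 (1ULL << 34)
/* Hypothetical second flag, only here so the mask has more than one bit. */
#define EX_SUPER_FLAG_OTHER       (1ULL << 32)

/* Hypothetical mask of every flag this example understands. */
#define EX_SUPER_FLAG_SUPP \
	(EX_SUPER_FLAG_METADUMP_V2 | EX_SUPER_FLAG_OTHER)

/* Return true when no bit outside the supported mask is set. */
static bool ex_super_flags_supported(uint64_t flags)
{
	return (flags & ~EX_SUPER_FLAG_SUPP) == 0;
}
```

With the bit defined and included in the mask, a superblock written with that flag passes the check, while any unknown bit is rejected.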

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
a6f93c71d4 Btrfs: avoid losing data raid profile when deleting a device
We've avoided losing the data raid profile when doing balance, but it
turns out that deleting a device could also result in the same
problem.

Say we have 3 disks, and they're created with '-d raid1' profile.

- We have chunk P (the only data chunk on the empty btrfs).

- Suppose that chunk P's two raid1 copies reside in disk A and disk B.

- Now, 'btrfs device remove disk B'
         btrfs_rm_device()
           -> btrfs_shrink_device()
              -> btrfs_relocate_chunk() # relocate any chunk on disk B
                                        # to other places.

- Chunk P will be removed and a new chunk will be created to hold
  that data, but as chunk P is the only one holding the raid1 profile,
  after it goes away, the new chunk will be created with the single
  profile, which is our default profile.

This fixes the problem by creating an empty data chunk before
relocating the data chunk.

Metadata/System chunks are supposed to have non-zero bytes all the time
so their raid profile is preserved.

Reported-by: James Alandt <James.Alandt@wdc.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
81fdf6382b Btrfs: fix space leak after fallocate and zero range operations
If we do a buffered write after a zero range operation that has an
unaligned (with the filesystem's sector size) end which also falls within
an unwritten (prealloc) extent that is currently beyond the inode's
i_size, and the zero range operation has the flag FALLOC_FL_KEEP_SIZE,
we end up leaking data and metadata space. This happens because when
zeroing a range we call btrfs_truncate_block(), which does delalloc
(loads the page and partially zeroes its content), and in the buffered
write path we only clear existing delalloc space reservation for the
range we are writing into if that range starts at an offset smaller than
the inode's i_size, which makes sense since we can not have delalloc
extents beyond the i_size, only unwritten extents are allowed.

Example reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ xfs_io -f -c "falloc -k 428K 4K" /mnt/foobar
 $ xfs_io -c "fzero -k 0 430K" /mnt/foobar
 $ xfs_io -c "pwrite -S 0xaa 428K 4K" /mnt/foobar
 $ umount /mnt

After the unmount we get the metadata and data space leaks reported in
dmesg/syslog:

 [95794.602253] ------------[ cut here ]------------
 [95794.603322] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9561 btrfs_destroy_inode+0x4e/0x206 [btrfs]
 [95794.605167] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.613000] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.614448] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.615972] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.617114] RIP: 0010:btrfs_destroy_inode+0x4e/0x206 [btrfs]
 [95794.618001] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202
 [95794.618721] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
 [95794.619645] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
 [95794.620711] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
 [95794.621932] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
 [95794.623124] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
 [95794.624188] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.625578] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.626522] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.627647] Call Trace:
 [95794.628128]  destroy_inode+0x3d/0x55
 [95794.628573]  evict+0x177/0x17e
 [95794.629010]  dispose_list+0x50/0x71
 [95794.629478]  evict_inodes+0x132/0x141
 [95794.630289]  generic_shutdown_super+0x3f/0x10b
 [95794.630864]  kill_anon_super+0x12/0x1c
 [95794.631383]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.631930]  deactivate_locked_super+0x30/0x68
 [95794.632539]  deactivate_super+0x36/0x39
 [95794.633200]  cleanup_mnt+0x49/0x67
 [95794.633818]  __cleanup_mnt+0x12/0x14
 [95794.634416]  task_work_run+0x82/0xa6
 [95794.634902]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.635525]  syscall_return_slowpath+0x18c/0x1af
 [95794.636122]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.636834] RIP: 0033:0x7fa678cb99a7
 [95794.637370] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.638672] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.639596] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.640703] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.641773] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.643150] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.644249] Code: ff 4c 8b a8 80 06 00 00 48 8b 87 c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 <0f> ff 83 bb 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00
 [95794.646929] ---[ end trace e95877675c6ec007 ]---
 [95794.647751] ------------[ cut here ]------------
 [95794.648509] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9562 btrfs_destroy_inode+0x59/0x206 [btrfs]
 [95794.649842] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.654659] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.655894] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.657546] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.658433] RIP: 0010:btrfs_destroy_inode+0x59/0x206 [btrfs]
 [95794.659279] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202
 [95794.660054] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
 [95794.660753] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
 [95794.661513] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
 [95794.662289] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
 [95794.663393] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
 [95794.664342] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.665673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.666593] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.667629] Call Trace:
 [95794.668065]  destroy_inode+0x3d/0x55
 [95794.668637]  evict+0x177/0x17e
 [95794.669179]  dispose_list+0x50/0x71
 [95794.669830]  evict_inodes+0x132/0x141
 [95794.670416]  generic_shutdown_super+0x3f/0x10b
 [95794.671103]  kill_anon_super+0x12/0x1c
 [95794.671786]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.672552]  deactivate_locked_super+0x30/0x68
 [95794.673393]  deactivate_super+0x36/0x39
 [95794.674107]  cleanup_mnt+0x49/0x67
 [95794.674706]  __cleanup_mnt+0x12/0x14
 [95794.675279]  task_work_run+0x82/0xa6
 [95794.675795]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.676507]  syscall_return_slowpath+0x18c/0x1af
 [95794.677275]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.678006] RIP: 0033:0x7fa678cb99a7
 [95794.678600] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.679739] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.680779] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.681837] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.682867] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.683891] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.684843] Code: c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 0f ff 83 bb 40 ff ff ff 00 74 02 <0f> ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff
 [95794.687156] ---[ end trace e95877675c6ec008 ]---
 [95794.687876] ------------[ cut here ]------------
 [95794.688579] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9565 btrfs_destroy_inode+0x7d/0x206 [btrfs]
 [95794.689735] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.695015] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.696396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.697956] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.698925] RIP: 0010:btrfs_destroy_inode+0x7d/0x206 [btrfs]
 [95794.699763] RSP: 0018:ffffc90001737d00 EFLAGS: 00010206
 [95794.700434] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
 [95794.701445] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
 [95794.702448] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
 [95794.703557] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
 [95794.704441] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
 [95794.705270] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.706341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.707001] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.708030] Call Trace:
 [95794.708466]  destroy_inode+0x3d/0x55
 [95794.709071]  evict+0x177/0x17e
 [95794.709497]  dispose_list+0x50/0x71
 [95794.709973]  evict_inodes+0x132/0x141
 [95794.710564]  generic_shutdown_super+0x3f/0x10b
 [95794.711200]  kill_anon_super+0x12/0x1c
 [95794.711633]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.712139]  deactivate_locked_super+0x30/0x68
 [95794.712608]  deactivate_super+0x36/0x39
 [95794.713093]  cleanup_mnt+0x49/0x67
 [95794.713514]  __cleanup_mnt+0x12/0x14
 [95794.713933]  task_work_run+0x82/0xa6
 [95794.714543]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.715247]  syscall_return_slowpath+0x18c/0x1af
 [95794.715952]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.716653] RIP: 0033:0x7fa678cb99a7
 [95794.721100] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.722052] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.722856] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.723698] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.724736] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.725928] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.726728] Code: 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff 00 74 02 0f ff 48 83 bb 30 ff ff ff 00 74 02 <0f> ff 48 83 bb 08 ff ff ff 00 74 02 0f ff 4d 85 e4 0f 84 52 01
 [95794.729203] ---[ end trace e95877675c6ec009 ]---
 [95794.841054] ------------[ cut here ]------------
 [95794.841829] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5831 btrfs_free_block_groups+0x235/0x36a [btrfs]
 [95794.843425] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.850658] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.852590] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.854752] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.855812] RIP: 0010:btrfs_free_block_groups+0x235/0x36a [btrfs]
 [95794.856811] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.857805] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001
 [95794.859014] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
 [95794.860270] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
 [95794.861525] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000
 [95794.862700] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100
 [95794.863810] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.865149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.866099] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.867198] Call Trace:
 [95794.867626]  close_ctree+0x1db/0x2b8 [btrfs]
 [95794.868188]  ? evict_inodes+0x132/0x141
 [95794.869037]  btrfs_put_super+0x15/0x17 [btrfs]
 [95794.870400]  generic_shutdown_super+0x6a/0x10b
 [95794.871262]  kill_anon_super+0x12/0x1c
 [95794.872046]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.872746]  deactivate_locked_super+0x30/0x68
 [95794.873687]  deactivate_super+0x36/0x39
 [95794.874639]  cleanup_mnt+0x49/0x67
 [95794.875504]  __cleanup_mnt+0x12/0x14
 [95794.876126]  task_work_run+0x82/0xa6
 [95794.876788]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.877777]  syscall_return_slowpath+0x18c/0x1af
 [95794.878381]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.878888] RIP: 0033:0x7fa678cb99a7
 [95794.879307] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.880204] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.881640] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.882690] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.883538] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.884562] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.885664] Code: 89 ef e8 07 ec 32 e1 e8 9d c0 ea e0 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 <0f> ff 48 83 bb 88 02 00 00 00 74 02 0f ff 48 83 bb d8 02 00 00
 [95794.887980] ---[ end trace e95877675c6ec00a ]---
 [95794.888739] ------------[ cut here ]------------
 [95794.889405] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5832 btrfs_free_block_groups+0x241/0x36a [btrfs]
 [95794.891020] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.897551] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.898509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.899685] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.900592] RIP: 0010:btrfs_free_block_groups+0x241/0x36a [btrfs]
 [95794.901387] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.902300] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001
 [95794.903260] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
 [95794.904332] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
 [95794.905300] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000
 [95794.906439] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100
 [95794.907459] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.908625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.909511] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.910630] Call Trace:
 [95794.911153]  close_ctree+0x1db/0x2b8 [btrfs]
 [95794.911837]  ? evict_inodes+0x132/0x141
 [95794.912344]  btrfs_put_super+0x15/0x17 [btrfs]
 [95794.912975]  generic_shutdown_super+0x6a/0x10b
 [95794.913788]  kill_anon_super+0x12/0x1c
 [95794.914424]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.915142]  deactivate_locked_super+0x30/0x68
 [95794.915831]  deactivate_super+0x36/0x39
 [95794.916433]  cleanup_mnt+0x49/0x67
 [95794.917045]  __cleanup_mnt+0x12/0x14
 [95794.917665]  task_work_run+0x82/0xa6
 [95794.918309]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.919021]  syscall_return_slowpath+0x18c/0x1af
 [95794.919722]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.920426] RIP: 0033:0x7fa678cb99a7
 [95794.921039] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.922303] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.923335] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.924364] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.925435] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.926533] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.927557] Code: 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 0f ff 48 83 bb 88 02 00 00 00 74 02 <0f> ff 48 83 bb d8 02 00 00 00 74 02 0f ff 48 83 bb e0 02 00 00
 [95794.930166] ---[ end trace e95877675c6ec00b ]---
 [95794.930961] ------------[ cut here ]------------
 [95794.931727] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.932729] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.938394] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.939842] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.941455] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.942336] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.943268] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.944127] RAX: ffff8802004fd0e8 RBX: ffff88006145c000 RCX: 0000000000000001
 [95794.945211] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
 [95794.946316] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
 [95794.947271] R10: ffffc90001737c80 R11: 00000000000337fd R12: ffff8802004fd0e8
 [95794.948219] R13: ffff88006145c0c0 R14: ffff88006145e598 R15: ffff88006145c100
 [95794.949193] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.950495] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.951338] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.952361] Call Trace:
 [95794.952811]  close_ctree+0x1db/0x2b8 [btrfs]
 [95794.953522]  ? evict_inodes+0x132/0x141
 [95794.954543]  btrfs_put_super+0x15/0x17 [btrfs]
 [95794.955231]  generic_shutdown_super+0x6a/0x10b
 [95794.955916]  kill_anon_super+0x12/0x1c
 [95794.956414]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.956953]  deactivate_locked_super+0x30/0x68
 [95794.957635]  deactivate_super+0x36/0x39
 [95794.958256]  cleanup_mnt+0x49/0x67
 [95794.958701]  __cleanup_mnt+0x12/0x14
 [95794.959181]  task_work_run+0x82/0xa6
 [95794.959635]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.960182]  syscall_return_slowpath+0x18c/0x1af
 [95794.960731]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.961438] RIP: 0033:0x7fa678cb99a7
 [95794.961990] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.963111] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.963975] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.964680] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.965763] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.966868] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.967800] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff
 [95794.970629] ---[ end trace e95877675c6ec00c ]---
 [95794.971451] BTRFS info (device sdi): space_info 1 has 7680000 free, is not full
 [95794.972351] BTRFS info (device sdi): space_info total=8388608, used=704512, pinned=0, reserved=0, may_use=4096, readonly=0
 [95794.973595] ------------[ cut here ]------------
 [95794.974353] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.980163] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.986461] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.987591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.988929] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.989922] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.990715] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.991431] RAX: ffff88020f6e70e8 RBX: ffff88006145c000 RCX: ffffffff8115a906
 [95794.992455] RDX: ffffffff8115a902 RSI: ffff880075aa0b40 RDI: ffff880075aa0b40
 [95794.993535] RBP: ffffc90001737d98 R08: 0000000000000020 R09: fffffffffffffff7
 [95794.994573] R10: 00000000ffffffc4 R11: ffff8800633b1bc0 R12: ffff88020f6e70e8
 [95794.996250] R13: 0000000000000038 R14: ffff88006145e598 R15: 0000000000000000
 [95794.997233] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.998592] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.999484] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95795.000542] Call Trace:
 [95795.001138]  close_ctree+0x1db/0x2b8 [btrfs]
 [95795.001885]  ? evict_inodes+0x132/0x141
 [95795.002407]  btrfs_put_super+0x15/0x17 [btrfs]
 [95795.003093]  generic_shutdown_super+0x6a/0x10b
 [95795.003720]  kill_anon_super+0x12/0x1c
 [95795.004353]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95795.005095]  deactivate_locked_super+0x30/0x68
 [95795.005716]  deactivate_super+0x36/0x39
 [95795.006388]  cleanup_mnt+0x49/0x67
 [95795.006939]  __cleanup_mnt+0x12/0x14
 [95795.007512]  task_work_run+0x82/0xa6
 [95795.008124]  prepare_exit_to_usermode+0xe1/0x10c
 [95795.008994]  syscall_return_slowpath+0x18c/0x1af
 [95795.009831]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95795.010610] RIP: 0033:0x7fa678cb99a7
 [95795.011193] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95795.012327] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95795.013432] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95795.014558] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95795.015577] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95795.016569] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95795.017662] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff
 [95795.020538] ---[ end trace e95877675c6ec00d ]---
 [95795.021259] BTRFS info (device sdi): space_info 4 has 1072775168 free, is not full
 [95795.022390] BTRFS info (device sdi): space_info total=1073741824, used=114688, pinned=0, reserved=0, may_use=786432, readonly=65536

Fix this by ensuring the zero range operation does not call
btrfs_truncate_block() if the corresponding extent is an unwritten one
(it's pointless anyway, since reading from an unwritten extent yields
zeroes).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
9f13ce743b Btrfs: fix missing inode i_size update after zero range operation
For a fallocate's zero range operation that targets a range with an end
that is not aligned to the sector size, we can end up not updating the
inode's i_size. This happens when the last page of the range maps to an
unwritten (prealloc) extent and before that last page we have either a
hole or a written extent. This is because in this scenario we relied
on a call to btrfs_prealloc_file_range() to update the inode's i_size,
however it can only update the i_size to the "down aligned" end of the
range.

Example:

 $ mkfs.btrfs -f /dev/sdc
 $ mount /dev/sdc /mnt
 $ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar
 $ xfs_io -c "falloc -k 428K 4K" /mnt/foobar
 $ xfs_io -c "fzero 0 430K" /mnt/foobar
 $ du --bytes /mnt/foobar
 438272	/mnt/foobar

The inode's i_size was left as 428KiB (438272 bytes) when it should have
been updated to 430KiB (440320 bytes).
Fix this by always updating the inode's i_size explicitly after zeroing
the range.
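
The byte counts in this example can be checked with a few lines of arithmetic. The sketch below assumes a 4096-byte sector size (the message does not state it) and uses hypothetical helpers, ex_round_down() and ex_range_end(); it is not the kernel's code.

```c
#include <stdint.h>

/* Assumed sector size; the commit message does not state it. */
#define EX_SECTORSIZE 4096ULL

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t ex_round_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/* End offset of the "fzero 0 430K" range, in bytes. */
static uint64_t ex_range_end(void)
{
	return 430ULL * 1024;
}
```

Under that assumption, ex_range_end() is 440320 bytes (430K), the expected i_size, while ex_round_down(ex_range_end(), EX_SECTORSIZE) is 438272 bytes (428K), the "down aligned" value that du reported.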

Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
94f450712a Btrfs: use cached state when dirtying pages during buffered write
During a buffered IO write, we can have an extent state that we got when
we locked the range (if the range starts at an offset lower than eof), so
always pass it to btrfs_dirty_pages() so that setting the delalloc bit
in the range does not need to do a full search in the inode's io tree,
saving time and reducing the amount of time we hold the io tree's lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
f27451f229 Btrfs: add support for fallocate's zero range operation
This implements support for the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
cc54ff626a Btrfs: do not merge rbios if their fail stripe index are not identical
Since the fail stripe index in an rbio is used to decide which
reconstruction algorithm is run, we cannot merge rbios if their
fail stripe indexes are different; otherwise, one of the two
reconstructions would fail.
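
A simplified sketch of the rule described above. The struct and its field names (fail_a, fail_b) are stand-ins invented for this example rather than the kernel's struct btrfs_raid_bio, and the real merge check considers more conditions than shown here.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for a raid bio. */
struct ex_rbio {
	int operation; /* kind of request, e.g. read-rebuild or write */
	int fail_a;    /* index of the first failed stripe, -1 if none */
	int fail_b;    /* index of the second failed stripe, -1 if none */
};

/* Two rbios may be merged only if they would run the same reconstruction. */
static bool ex_rbio_can_merge(const struct ex_rbio *last,
			      const struct ex_rbio *cur)
{
	if (last->operation != cur->operation)
		return false;
	/* Different failed stripes select different rebuild algorithms. */
	if (last->fail_a != cur->fail_a || last->fail_b != cur->fail_b)
		return false;
	return true;
}
```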

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
db34be19c4 Btrfs: remove redundant check in rbio_can_merge
Given the above
'
if (last->operation != cur->operation)
	return 0;
',
it's guaranteed that the two operations are the same.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
05a5c55dfc btrfs: minor style cleanups in btrfs_scan_one_device
Assign ret = -EINVAL where it is actually required.
Remove { } around single line if else code.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
c1f32b7c1f btrfs: simplify mutex unlocking code in btrfs_commit_transaction
No functional change; just rearrange the mutex_unlock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ edit subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
cadbc0a067 btrfs: rename btrfs_device::scrub_device to scrub_ctx
btrfs_device::scrub_device is not a device which is being scrubbed,
but it holds the scrub context, so rename to reflect the same. No
functional changes here.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
922ea8994a btrfS: collapse btrfs_handle_error() into __btrfs_handle_fs_error()
There is no consumer of btrfs_handle_error() other than
__btrfs_handle_fs_error(); furthermore, this function is quite small.
Merge it into its parent.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
61ecda6865 btrfs: remove check for BTRFS_FS_STATE_ERROR which we just set
__btrfs_handle_fs_error() sets BTRFS_FS_STATE_ERROR, and calls
btrfs_handle_error() so no need to check if the BTRFS_FS_STATE_ERROR
is set in btrfs_handle_error(). And there is no other user of
btrfs_handle_error() as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
8810f7517a Btrfs: make raid6 rebuild retry more
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriously, if the loss happens on some important internal btree
roots, it could refuse to mount.

This extends btrfs to do more retries and each retry fails only one
stripe.  Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.

The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
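
The retry strategy above can be sketched roughly as follows. This is a
simplified userspace model, not the btrfs implementation: rebuild_fn,
rebuild_with_retries() and the example callback are hypothetical names, and
the initial raid5-style attempt is left out so that only the "fail one more
stripe per retry" loop is shown.

```c
#include <stdbool.h>

/*
 * Hypothetical callback: try to rebuild while treating stripes 'faila' and
 * 'failb' as failed; returns true if the rebuilt data matches its checksum.
 */
typedef bool (*rebuild_fn)(int faila, int failb, void *ctx);

/*
 * Retry the rebuild of stripe 'target', additionally failing one other
 * stripe per attempt. Returns the number of attempts used, or -1 if every
 * candidate was tried without success.
 */
static int rebuild_with_retries(int num_stripes, int target,
				rebuild_fn try_rebuild, void *ctx)
{
	int attempts = 0;

	for (int other = 0; other < num_stripes; other++) {
		if (other == target)
			continue;
		attempts++;
		if (try_rebuild(target, other, ctx))
			return attempts;
	}
	return -1;
}

/* Example callback: succeeds only when the corrupted stripe is failed. */
static bool rebuild_if_corrupt_excluded(int faila, int failb, void *ctx)
{
	int corrupt = *(int *)ctx;

	(void)faila;
	return failb == corrupt;
}
```

In this model the number of attempts is bounded by the number of stripes,
which matches the worst case mentioned above.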

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
762221f095 Btrfs: fix scrub to repair raid6 corruption
The raid6 corruption is that,
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.

We've fixed normal reads to rebuild raid6 correctly with more retries
in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
scrub to do exactly the same rebuild process.

[1]: https://patchwork.kernel.org/patch/10091755/

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
6528b99d3d btrfs: factor btrfs_check_rw_degradable() to check given device
Update btrfs_check_rw_degradable() to check against the given device if
it's lost.

We can use this function to know if the volume is going to be in
degraded mode OR failed state, when the given device fails.  Which is
needed when we are handling the device failed state.

This is a preparatory patch and does not affect the flow as such.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ enhance comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
David Sterba
e43bbe5e16 btrfs: sink unlock_extent parameter gfp_flags
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
d810a4be1a btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
5bedc48a8f btrfs: drop unused parameters from mount_subvol
Recent patches reworking the mount path left some unused parameters. We
pass a vfsmount to mount_subvol, the flags and data (ie. mount options)
have been already applied and we will not need them.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
e215772cd2 btrfs: cleanup unnecessary string dup in btrfs_parse_options()
Long ago, commit edf24abe51 ("btrfs: sanity mount option parsing and
early mount code") split the btrfs_parse_options() into two parts
(btrfs_parse_early_options() and btrfs_parse_options()). As a result,
btrfs_parse_options no longer gets called twice and is the last one to
parse mount option string. Therefore there is no need to dup it.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Liu Bo
203e02d934 Btrfs: remove unused wait in btrfs_stripe_hash
In fact nobody is waiting on @wait's waitqueue, it can be safely
removed.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
36f7894f66 btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio
The bio is not referenced after it has been submitted and the endio is
going to consume the sole reference on successful submission. On error,
the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't
leak it either.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
ffc9c8dd7d btrfs: Remove redundant bio_get/bio_set pair from submit_one_bio
The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
ea057f6daf btrfs: Remove redundant bio_get/set from submit_dio_repair_bio
The bio that is passed is the newly created repair bio which already
has a reference count of 1, which is going to be consumed by the
endio routine on successful submission. On error the handler also
calls bio_put.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
32506af595 btrfs: Remove redundant bio_get/set calls in compressed read/write paths
bio_get/set is necessary only if the bio is going to be referenced
following submissions. In the code paths where such calls are made
we don't really need them since the bio is referenced only if
btrfs_map_bio returns an error. And this function can return an error
prior to submission only. So referencing the bio is safe. Furthermore
we do call bio_endio which will consume the last reference. So let's
remove the redundant calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
4271ecea64 btrfs: Improve btrfs_search_slot description
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
36243c9199 btrfs: heuristic: call get4bits directly
As it's a single instance and local to the file, we don't need to pass
it as an argument.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
7add17befc btrfs: heuristic: open code copy_call callback of radix sort
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
23ae8c63aa btrfs: heuristic: open code get_num callback of radix sort
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it and also make the array types explicit.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
78f6beacd0 btrfs: remove unused arg from parse_subvol_options()
Remove unused arg 'holder' from parse_subvol_options(), which has been
forgotten to be cleaned in the commit b99beb110e2d ("btrfs: split
parse_early_options() in two").

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
83085935cc btrfs: remove unused setup_root_args()
Since setup_root_args() is not used anymore, just remove it.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
d740760656 btrfs: split parse_early_options() in two
Now parse_early_options() is used by both btrfs_mount() and
btrfs_mount_root(). However, the former only needs subvol related part
and the latter needs the others.

Therefore extract the subvol related parts from parse_early_options() and
move it to new parse function (parse_subvol_options()).

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
Misono, Tomohiro
312c89fbca btrfs: cleanup btrfs_mount() using btrfs_mount_root()
Cleanup btrfs_mount() by using btrfs_mount_root(). This avoids getting
btrfs_mount() called twice in mount path.

Old btrfs_mount() will do:
0. VFS layer calls vfs_kern_mount() with registered file_system_type
   (for btrfs, btrfs_fs_type). btrfs_mount() is called on the way.
1. btrfs_parse_early_options() parses "subvolid=" mount option and set the
   value to subvol_objectid. Otherwise, subvol_objectid has the initial
   value of 0
2. check subvol_objectid is 5 or not. Assume this time id is not 5, then
   btrfs_mount() returns by calling mount_subvol()
3. In mount_subvol(), original mount options are modified to contain
   "subvolid=0" in setup_root_args(). Then, vfs_kern_mount() is called with
   btrfs_fs_type and new options
4. btrfs_mount() is called again
5. btrfs_parse_early_options() parses "subvolid=0" and set 5 (instead of 0)
   to subvol_objectid
6. check subvol_objectid is 5 or not. This time id is 5 and mount_subvol()
   is not called. btrfs_mount() finishes mounting a root
7. (in mount_subvol()) using the return value of vfs_kern_mount(), it
   calls mount_subtree()
8. return subvolume's dentry

Reusing the same file_system_type (and btrfs_mount()) for vfs_kern_mount()
is the cause of complication.

Instead, new btrfs_mount() will do:
1. parse subvol id related options for later use in mount_subvol()
2. mount device's root by calling vfs_kern_mount() with
   btrfs_root_fs_type, which is not registered to VFS by
   register_filesystem(). As a result, btrfs_mount_root() is called
3. return by calling mount_subvol()

The code of 2. is moved from the first part of mount_subvol().

The semantics of device holder changes from btrfs_fs_type to
btrfs_root_fs_type and has to be used in all contexts. Otherwise we'd
get wrong results when mount and dev scan would not check the same
thing. (this has been found independently and the fix is folded into this
patch)

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the btrfs_control_ioctl fixup, extend the comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
Misono, Tomohiro
72fa39f5c7 btrfs: add btrfs_mount_root() and new file_system_type
Add btrfs_mount_root() and new file_system_type for preparation of cleanup
of btrfs_mount(). Code path is not changed yet.

btrfs_mount_root() is almost the same as current btrfs_mount(), but doesn't
have subvolume related part.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
David Sterba
aab6e9edf0 btrfs: unify extent_page_data type passed as void
Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
David Sterba
935db8531f btrfs: sink writepage parameter to extent_write_cache_pages
The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
David Sterba
25b860e038 btrfs: sink flush_fn to extent_write_cache_pages
All callers pass the same value flush_write_bio.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
David Sterba
e2932ee08e btrfs: merge two flush_write_bio helpers
flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
Nikolay Borisov
a74b35ec87 btrfs: Rename bin_search -> btrfs_bin_search
Currently there are 2 functions doing binary search on btrfs nodes:
bin_search and btrfs_bin_search. The latter being a simple wrapper for
the former. So eliminate the wrapper and just rename bin_search to
btrfs_bin_search. No functional changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
Nikolay Borisov
0a9b0e5351 btrfs: sink extent_write_full_page tree argument
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov
5e3ee23648 btrfs: sink extent_write_locked_range tree parameter
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov
3e798068a8 btrfs: Remove pair of bio_get/put in btrfs_schedule_bio
This code was added in 492bb6deee ("Btrfs: Hold a reference on bios
during submit_bio, add some extra bio checks"). However, holding a
reference on a bio is necessary only if it's going to be referenced
after the submit_bio returns and the bio is completed. In this
particular instance this is not the case so there is no need to hold
an extra reference since we directly return.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov
9ea2c7c9da btrfs: Fix out of bounds access in btrfs_search_slot
When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittedly this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the
path being passed hasn't been initialised, if it has already been used to
query a btree then it could potentially have populated the nodes/slots arrays.

Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.
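
The shape of the fix can be illustrated with a small userspace sketch. The
names struct path_sketch, MAX_LEVEL_SKETCH and parent_node_sketch() are
hypothetical and only model the bounds check; they are not the actual
btrfs_search_slot() code.

```c
#include <stddef.h>

#define MAX_LEVEL_SKETCH 8	/* valid indexes are 0..7, as in the text */

/* Hypothetical, reduced model of a path with per-level nodes and slots. */
struct path_sketch {
	void *nodes[MAX_LEVEL_SKETCH];
	int slots[MAX_LEVEL_SKETCH];
};

/*
 * Return the parent node for 'level', or NULL when 'level' is already the
 * last valid index, so that nodes[level + 1] is never read out of bounds.
 */
static void *parent_node_sketch(const struct path_sketch *p, int level,
				int *parent_slot)
{
	if (level + 1 < MAX_LEVEL_SKETCH) {
		*parent_slot = p->slots[level + 1];
		return p->nodes[level + 1];
	}
	/* Top of the tree: there is no parent to pass to the CoW routine. */
	*parent_slot = 0;
	return NULL;
}
```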

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes-coverity-id: 711515
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Pravin Shedge
87c46ec700 btrfs: remove duplicate includes
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov
f3038ee3a3 btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker
This function was introduced by 247e743cbe ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in ENOMEM
situation, yet it's not handled, this could lead to inconsistent state.
So let's handle the failure by setting the mapping error bit.

Cc: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
David Sterba
0f628c632d btrfs: show options: use helper to convert compression type string
Use the helper, if the COMPRESS option is set, the result is always
defined and not empty.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
David Sterba
802a5c6958 btrfs: prop: use common helper for type to string conversion
Use the helper for conversion, keep the semantics.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
David Sterba
93370509c2 btrfs: SETFLAGS ioctl: use helper for compression type conversion
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
David Sterba
e128f9c3f7 btrfs: compression: add helper for type to string conversion
There are several places opencoding this conversion, add a helper now
that we have 3 compression algorithms.
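
A helper of this kind might look like the sketch below. The enum and the
function name are hypothetical stand-ins chosen for illustration, and the
three strings assume the zlib, lzo and zstd algorithms; the kernel helper
may differ in naming and in how it treats the "no compression" case.

```c
#include <stddef.h>

/* Hypothetical enum; the values are for illustration only. */
enum compress_type_sketch {
	COMPRESS_NONE_SKETCH,
	COMPRESS_ZLIB_SKETCH,
	COMPRESS_LZO_SKETCH,
	COMPRESS_ZSTD_SKETCH,
};

/* Map a compression type to its name, or NULL if there is none. */
static const char *compress_type2str_sketch(enum compress_type_sketch type)
{
	switch (type) {
	case COMPRESS_ZLIB_SKETCH:
		return "zlib";
	case COMPRESS_LZO_SKETCH:
		return "lzo";
	case COMPRESS_ZSTD_SKETCH:
		return "zstd";
	case COMPRESS_NONE_SKETCH:
		break;
	}
	return NULL;
}
```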

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov
bf8d32b9b3 btrfs: remove redundant check in btrfs_get_extent_fiemap
Before returning hole_em in btrfs_get_extent_fiemap we check if it's different
than null. However, by the time this null check is triggered we already know
hole_em is not null because it means it points to the em we found and it
has already been dereferenced.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Nikolay Borisov
5c9a702ed1 btrfs: Remove unused variable in btrfs_get_extent
trans was statically assigned to NULL and this never changed over the
course of btrfs_get_extent. So remove any code which checks whether
trans != NULL and just hardcode the fact trans is always NULL.

Resolves-coverity-id: 112806
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Arnd Bergmann
7cfad65297 btrfs: tree-checker: use %zu format string for size_t
The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:

fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} [-Werror=format=]
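
As a small userspace illustration of the format rule (not the tree-checker
code itself), a size_t value is printed portably with the z length modifier.
The function name format_size_sketch() is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a size_t with %zu, which is correct on 32-bit and 64-bit targets. */
static int format_size_sketch(char *buf, size_t buflen, size_t value)
{
	return snprintf(buf, buflen, "%zu", value);
}
```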

Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Liu Bo
b4ff5ad72e Btrfs: use struct completion in scrub_submit_raid56_bio_wait
This changes to use struct completion directly and removes 'struct
scrub_bio_ret' along with the code using it.

This struct is used to get the return value from bio, but the caller can
access bio to get the return value directly and is holding a reference
on it so it won't go away underneath us and can be removed safely.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Liu Bo
c9f540fa6f Btrfs: remove unused variable wait in lock_stripe_add
The defined wait is not used anywhere.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Timofey Titovets
e9679de3fd Btrfs: compress_file_range() change page dirty status once
We need to call extent_range_clear_dirty_for_io()
on the compression range to prevent applications from changing
page content while the pages are being compressed.

extent_range_clear_dirty_for_io() runs on each loop iteration, and
"(end - start)" can be much (up to 1024 times) bigger
than the compression range (BTRFS_MAX_UNCOMPRESSED).

The start pointer is advanced each time we manage to compress part of
the range. The end pointer does not change so we could redirty the
remaining parts repeatedly.

Fix that behaviour by calling extent_range_clear_dirty_for_io()
only once, the first time it happens.
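
The "only once" pattern can be modelled in userspace as below. This is a
sketch under assumptions: struct compress_stats_sketch and
compress_range_sketch() are hypothetical, a counter stands in for the call
to extent_range_clear_dirty_for_io(), and max_chunk must be non-zero.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the real work. */
struct compress_stats_sketch {
	int clear_dirty_calls;	/* models extent_range_clear_dirty_for_io() */
	int chunks;		/* number of chunks handed to compression */
};

/* Walk [start, end) in chunks of at most 'max_chunk' (must be > 0). */
static void compress_range_sketch(unsigned long start, unsigned long end,
				  unsigned long max_chunk,
				  struct compress_stats_sketch *st)
{
	bool dirty_cleared = false;

	while (start < end) {
		unsigned long len = end - start;

		if (len > max_chunk)
			len = max_chunk;
		if (!dirty_cleared) {
			/* Covers the whole remaining range, first pass only. */
			st->clear_dirty_calls++;
			dirty_cleared = true;
		}
		st->chunks++;	/* compress [start, start + len) here */
		start += len;
	}
}
```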

This is the safest but probably not the best behaviour. Previous
iterations of the patch tried to redirty only the range that we were not
able to compress. This has been refused by David for safety reasons, the
writeout callchain is complex and there could be some path that relies
on redirtying the entire unwritten range.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog, the history and safety concerns, add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Timofey Titovets
440c840cb4 Btrfs: compression heuristic: replace heap sort with radix sort
The slowest part of the heuristic for now is the kernel heap sort().
It can take up to 55% of the runtime sorting bucket items.

As sorting will be called on most data sets to get a correct
byte_core_set_size, the only way to speed up the heuristic is to
speed up the sort on the bucket.

Add a general radix_sort function.
Radix sort requires 2 buffers, one the full size of the input array
and one for storing counters (jump addresses).

That increases usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; I use 4 bits as the base for calculating,
to keep the counters array acceptably small (16 elements * 8 bytes).

This radix sort implementation has several points to adjust;
I added them to make radix sort generally usable in the kernel,
like heap sort, if needed.

Performance tested in userspace copy of heuristic code,
throughput:
    - average <-> random data: ~3500 MiB/s - heap  sort
    - average <-> random data: ~6000 MiB/s - radix sort
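
For readers unfamiliar with the technique, here is a minimal userspace
sketch of an LSD radix sort with a 4-bit base, a temporary buffer the size
of the input and 16 counters. It sorts plain uint32_t values in ascending
order; the function name and element type are assumptions made for this
example and do not reflect the kernel code, which sorts bucket items.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BITS 4
#define RADIX_SIZE (1u << RADIX_BITS)	/* 16 buckets */

/* Sort 'arr' in ascending order; 'tmp' must hold at least 'n' elements. */
static void radix_sort_u32_sketch(uint32_t *arr, uint32_t *tmp, size_t n)
{
	size_t counters[RADIX_SIZE];

	/* Least significant digit first, 4 bits per pass. */
	for (unsigned int shift = 0; shift < 32; shift += RADIX_BITS) {
		size_t pos = 0;

		memset(counters, 0, sizeof(counters));

		/* Count how many values fall into each 4-bit bucket. */
		for (size_t i = 0; i < n; i++)
			counters[(arr[i] >> shift) & (RADIX_SIZE - 1)]++;

		/* Turn counts into starting offsets ("jump addresses"). */
		for (size_t b = 0; b < RADIX_SIZE; b++) {
			size_t count = counters[b];

			counters[b] = pos;
			pos += count;
		}

		/* Stable scatter into tmp, then copy back for the next pass. */
		for (size_t i = 0; i < n; i++) {
			size_t b = (arr[i] >> shift) & (RADIX_SIZE - 1);

			tmp[counters[b]++] = arr[i];
		}
		memcpy(arr, tmp, n * sizeof(*arr));
	}
}
```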

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
[ coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
1c3063b6db btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT
Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
401e29c124 btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT
Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_REPLACE_TGT and use the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
e6e674bd4d btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
Currently device state is being managed by each individual int
variable such as struct btrfs_device::missing. Instead of that
declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use
the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
e12c96214d btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
Currently device state is being managed by each individual int
variable such as struct btrfs_device::in_fs_metadata. Instead of
that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use
the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
ebbede42d4 btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.
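
The idea of replacing several int members with one state word can be
sketched in userspace as follows. All names here are hypothetical, the bit
numbers are illustrative, and plain shifts are used where the kernel would
use its own bit operation helpers.

```c
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel's actual values may differ. */
enum dev_state_bit_sketch {
	DEV_STATE_WRITEABLE_SKETCH,
	DEV_STATE_IN_FS_METADATA_SKETCH,
	DEV_STATE_MISSING_SKETCH,
};

/* One word of state bits instead of one int member per flag. */
struct device_sketch {
	unsigned long dev_state;
};

static void dev_set_state_sketch(struct device_sketch *d,
				 enum dev_state_bit_sketch bit)
{
	d->dev_state |= 1UL << bit;
}

static void dev_clear_state_sketch(struct device_sketch *d,
				   enum dev_state_bit_sketch bit)
{
	d->dev_state &= ~(1UL << bit);
}

static bool dev_test_state_sketch(const struct device_sketch *d,
				  enum dev_state_bit_sketch bit)
{
	return (d->dev_state >> bit) & 1UL;
}
```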

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
3c958bd23b btrfs: add helper for device path or missing
This patch creates a helper function to get either the rcu device path
or missing.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ rename to btrfs_dev_name, switch to if/else ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Anand Jain
38b5f68e98 btrfs: drop btrfs_device::can_discard to query directly
We can query the bdev directly when needed at btrfs_discard_extent()
so drop btrfs_device::can_discard.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Colin Ian King
ccc8dc758d btrfs: make function update_share_count static
The function update_share_count is local to the source and does
not need to be in global scope, so make it static.

Cleans up sparse warning:
fs/btrfs/backref.c:219:6: warning: symbol 'update_share_count' was not
declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Nikolay Borisov
4a2d25cd93 btrfs: Remove redundant FLAG_VACANCY
Commit 9036c10208 ("Btrfs: update hole handling v2") added the
FLAG_VACANCY to denote holes, however there was already a consistent way
of flagging extents which represent a hole: ->block_start =
EXTENT_MAP_HOLE. Also, the only place where this flag is checked is
in the fiemap code, but the block_start value is checked there too, and
every other place in the filesystem detects holes by using the
block_start value. So remove the extra flag. This survived a full xfstest run.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Qu Wenruo
3f2dd7a0ce btrfs: extent-tree: Make btrfs_inode_rsv_refill function static
This function is no longer used outside of extent-tree.c.
Make it static.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
431e98226c btrfs: move some zstd work data from stack to workspace
* ZSTD_inBuffer in_buf
* ZSTD_outBuffer out_buf

are used in all functions to pass the compression parameters and the
local variables consume some space. We can move them to the workspace
and reduce the stack consumption:

zstd.c:zstd_decompress                        -24 (136 -> 112)
zstd.c:zstd_decompress_bio                    -24 (144 -> 120)
zstd.c:zstd_compress_pages                    -24 (264 -> 240)

Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
5302e08964 btrfs: reorder btrfs_transaction members for better packing
There are now 20 bytes of holes, we can reduce that to 4 by minor
changes. Moving 'aborted' to the status and flags is also more logical,
similar for num_dirty_bgs. The size goes from 432 to 416.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
165c8b022c btrfs: use narrower type for btrfs_transaction::num_dirty_bgs
The u64 is an overkill here, we could not possibly create that many
blockgroups in one transaction.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
1ca4bb63f6 btrfs: reorder btrfs_trans_handle members for better packing
Recent updates to the structure left some holes, reorder the types so
the packing is tight. The size goes from 112 to 104 on 64bit.
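
The effect of member ordering on structure size can be seen with a toy
example. The two structures below are hypothetical and unrelated to the real
btrfs_trans_handle layout; the exact sizes depend on the ABI, so the helper
only reports the difference rather than assuming fixed numbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Small members around a large one leave padding holes. */
struct handle_loose_sketch {
	uint8_t flag_a;
	uint64_t counter;
	uint8_t flag_b;
};

/* Same members, largest first, small ones grouped together. */
struct handle_tight_sketch {
	uint64_t counter;
	uint8_t flag_a;
	uint8_t flag_b;
};

/* Bytes saved by the reordering on the current target. */
static size_t packing_savings_sketch(void)
{
	return sizeof(struct handle_loose_sketch) -
	       sizeof(struct handle_tight_sketch);
}
```

A tool such as pahole reports the same kind of holes for real kernel
structures.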

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
b50fff816c btrfs: switch to refcount_t type for btrfs_trans_handle::use_count
The use_count is a reference counter, we can use the refcount_t type,
though we don't use the atomicity. This is not a performance critical
code and we could catch the underflows. The type is changed from long,
but the number of references will fit an int.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
2dbda74ed9 btrfs: remove unused member of btrfs_trans_handle
Last user was removed in a monster commit a22285a6a3
("Btrfs: Integrate metadata reservation with start_transaction") in
2010.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
7c2871a2f4 btrfs: switch btrfs_trans_handle::adding_csums to bool
The semantics of adding_csums matches bool, 'short' was most likely used
to save space in a698d0755a ("Btrfs: add a type field for the
transaction handle").

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Edmund Nadolski
bf46f52db9 btrfs: remove dead code from btrfs_get_extent
Due to new_inline logic, the create == 0 is always true at this
point in the code, so the create != 0 branch can be removed.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Edmund Nadolski
41a1eadad7 btrfs: btrfs_inode_log_parent should use defined inode_only values.
Replace hardcoded numeric argument values for inode_only with the
constants defined for that use.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba
71a635516c btrfs: switch to on-stack csum buffer in csum_tree_block
The maximum size of a checksum buffer is known, BTRFS_CSUM_SIZE, and we
don't have to allocate it dynamically. This code path is not used at all
as we have only the crc32c and use an on-stack buffer already.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Liu Bo
343e4fc1c6 Btrfs: set plug for fsync
Setting plug can merge adjacent IOs before dispatching IOs to the disk
driver.

Without plug, it'd not be a problem for single disk usecases, but for
multiple disks using raid profile, a large IO can be split to several
IOs of stripe length, and plug can be helpful to bring them together
for each disk so that we can save several disk accesses.

Moreover, fsync issues synchronous writes, so plug can really take
effect.
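
The merging idea can be sketched as follows. This is not the block layer's plug implementation (the kernel API is blk_start_plug() and blk_finish_plug()); struct io_req and merge_adjacent() are hypothetical names, and the input is assumed to be sorted by offset for a single disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request: a byte range on one disk. */
struct io_req {
	uint64_t offset;
	uint64_t len;
};

/*
 * Merge requests whose ranges touch end to start.
 * Assumes reqs[] is sorted by offset. Returns the new request count.
 */
size_t merge_adjacent(struct io_req *reqs, size_t n)
{
	size_t out = 0;

	assert(reqs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (out > 0 &&
		    reqs[out - 1].offset + reqs[out - 1].len == reqs[i].offset)
			reqs[out - 1].len += reqs[i].len; /* extend previous request */
		else
			reqs[out++] = reqs[i];            /* start a new request */
	}
	return out;
}
```

Three stripe-length writes that are contiguous on one disk collapse into a single request here, which is the saving the changelog describes.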

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
Anand Jain
0fb08bccbc btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device()
No functional changes, create btrfs_open_one_device() from
__btrfs_open_devices(). This is a preparatory work to add dynamic
device scan.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ minor whitespace fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
Anand Jain
9f050db43e btrfs: move check for device generation to the last
No functional changes. This helps to move the entire section into
a new function.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
Anand Jain
71f8a8d2c1 btrfs: set fs_devices->seed directly
This is in preparation to move a section of code in __btrfs_open_devices()
into a new function so that it can be reused. We set seeding if any of
the devices has the SB flag BTRFS_SUPER_FLAG_SEEDING, so do it in the
device list loop itself. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
Geert Uytterhoeven
02cfe779cc btrfs: ref-verify: Remove unused parameter from walk_up_tree() to kill warning
With gcc-4.1.2:

    fs/btrfs/ref-verify.c: In function ‘btrfs_build_ref_tree’:
    fs/btrfs/ref-verify.c:1017: warning: ‘root’ is used uninitialized in this function

The variable is indeed passed uninitialized, but it is never used by the
callee.  However, not all versions of gcc are smart enough to notice.

Hence remove the unused parameter from walk_up_tree() to silence the
compiler warning.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
6af49dbde9 btrfs: sink get_extent parameter to read_extent_buffer_pages
All callers pass btree_get_extent, which needs to be exported.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
4ef77695a0 btrfs: sink get_extent parameter to __do_contiguous_readpages
All callers pass btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
e4d17ef507 btrfs: sink get_extent parameter to __extent_readpages
All callers pass btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
0932584b66 btrfs: sink get_extent parameter to extent_readpages
There's only one caller that passes btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
e3350e16ea btrfs: sink get_extent parameter to get_extent_skip_holes
All callers pass btrfs_get_extent_fiemap and get_extent_skip_holes
itself is used only as a fiemap helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
2135fb9bb4 btrfs: sink get_extent parameter to extent_fiemap
All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
3c98c62f7a btrfs: drop get_extent from extent_page_data
Previous patches cleaned up all places where
extent_page_data::get_extent was set and it was btrfs_get_extent all the
time, so we can simply call that instead.

This also reduces size of extent_page_data by 8 bytes which has positive
effect on stack consumption on various functions on the write out path.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
deac642d7e btrfs: sink get_extent parameter to extent_write_full_page
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
916b929831 btrfs: sink get_extent parameter to extent_write_locked_range
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
433175992c btrfs: sink get_extent parameter to extent_writepages
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
Qu Wenruo
bae15d95e2 btrfs: Cleanup existing name_len checks
Since tree-checker has verified leaf when reading from disk, we don't
need the existing verify_dir_item() or btrfs_is_name_len_valid() checks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
Qu Wenruo
ad7b0368f3 btrfs: tree-checker: Add checker for dir item
Add checker for dir item, for key types DIR_ITEM, DIR_INDEX and
XATTR_ITEM.

This checker does comprehensive checks for:

1) dir_item header and its data size
   Against item boundary and maximum name/xattr length.
   This part is mostly the same as old verify_dir_item().

2) dir_type
   Against maximum file types, and against key type.
   Since XATTR key should only have FT_XATTR dir item, and normal dir
   item type should not have XATTR key.

   The check between key->type and dir_type is newly introduced by this
   patch.

3) name hash
   For XATTR and DIR_ITEM key, key->offset is name hash (crc32c).
   Check the hash of the name against the key to ensure it's correct.

   The name hash check is only found in btrfs-progs before this patch.
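
The name hash check in item 3 can be sketched in user space as below. This is not the kernel code: crc32c_update() is a plain bitwise CRC-32C (Castagnoli) with no initial or final inversion, name_hash_matches() is a hypothetical helper, and the seed is passed in by the caller because the exact seed btrfs uses should be taken from the kernel source rather than from this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Raw CRC-32C update, bit by bit, using the reflected polynomial. */
uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical check: does the hash stored in key->offset match the name? */
bool name_hash_matches(uint32_t seed, uint64_t key_offset,
		       const char *name, size_t name_len)
{
	assert(name != NULL);
	return key_offset == crc32c_update(seed, name, name_len);
}
```

A mismatch means either the key or the name in the leaf is corrupted, so the item can be rejected at read time.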

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
712e36c5f2 btrfs: use GFP_KERNEL in btrfs_alloc_inode
This callback is called directly from VFS, no locks are held at the
allocation time.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
f08dc36f78 btrfs: sink gfp parameter to clear_extent_uptodate
There's only one callsite with GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
ae0f162534 btrfs: sink gfp parameter to clear_extent_bit
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
66b0c887bb btrfs: prepare to drop gfp mask parameter from clear_extent_bit
Use __clear_extent_bit directly in case we want to pass unknown
gfp flags. Otherwise all clear_extent_bit callers use GFP_NOFS, so we
can sink them to the function and reduce argument count, at the cost
that __clear_extent_bit has to be exported.
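
A minimal sketch of the "sink the parameter" pattern, under stated assumptions: enum alloc_ctx stands in for gfp_t, and clear_range_bits_ctx(), clear_range_bits() and last_alloc_ctx() are hypothetical names, not the extent-io functions.

```c
#include <assert.h>

/* Hypothetical allocation contexts, standing in for gfp flags. */
enum alloc_ctx { CTX_NOFS = 1, CTX_KERNEL = 2 };

/* Remembers the context of the last call, for illustration only. */
static enum alloc_ctx last_ctx;

enum alloc_ctx last_alloc_ctx(void) { return last_ctx; }

/* Full variant: kept visible for the rare caller with special flags. */
int clear_range_bits_ctx(unsigned long *word, unsigned long bits,
			 enum alloc_ctx ctx)
{
	assert(word != 0);
	last_ctx = ctx;
	*word &= ~bits;
	return 0;
}

/* Common variant: every caller used the same flag, so it is sunk here. */
int clear_range_bits(unsigned long *word, unsigned long bits)
{
	return clear_range_bits_ctx(word, bits, CTX_NOFS);
}
```

The cost is the one named in the changelog: the full variant has to stay exported for callers that pass other flags.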

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
1538e6c52e btrfs: use non-RCU list traversal in write_all_supers callees
We take the fs_devices::device_list_mutex mutex in write_all_supers
which will prevent any add/del changes to the device list. Therefore we
don't need to use the RCU variant list_for_each_entry_rcu in any of the
called functions.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
d03262c75d btrfs: switch to RCU for device traversal in btrfs_ioctl_fs_info
We don't need to use the mutex as we do not modify the devices nor the
list itself and just read information about device counts.
Move copying fsid out of the protected section, not applicable to RCU
same as the rest of the retrieved information.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
c5593ca3c8 btrfs: switch to RCU for device traversal in btrfs_ioctl_dev_info
We don't need to use the mutex as we do not modify the devices nor the
list itself and just read some information:

does not change during device lifetime:
- devid
- uuid
- name (ie. the path)

may change in parallel to the ioctl call, but can lead only to reporting
inaccuracy:
- bytes_used
- total_bytes

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
08ffcae8c9 btrfs: simplify btrfs_close_bdev
Split the conditions a bit.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
9c6b1c4de1 btrfs: document device locking
Overview of the main locks protecting various device-related structures.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
5c4cf6c91d btrfs: simplify exit paths in btrfs_init_new_device
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
55de480346 btrfs: use free_device where opencoded
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
48dae9cf3f btrfs: introduce free_device helper
A helper to free a device and all its dynamically allocated members,
like the rcu_string name or flush_bio. This is going to replace all
open coded places.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
David Sterba
f06c5965ab btrfs: rename device free rcu helper to free_device_rcu
Make it clear that it is an RCU helper, we want to use the name
free_device for a wrapper freeing all device members.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Liu Bo
4c274bc67b Btrfs: document rules about bio async submit
These rules have been hidden in several if-else and are not
straightforward to follow, for example, dio submit hook's nocsum case
has a bug, i.e. doing async submit instead of sync submit, which has
been fixed recently.

This is documenting the rules for reference.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
057aac3e62 btrfs: Reduce scope of delayed_rsv->lock in may_commit_trans
After commit 996478ca9c ("btrfs: change how we decide to commit
transactions during flushing") there is no need to hold the delayed_rsv
during the percpu_counter_compare call since we get the byte's snapshot
earlier. So hold the lock only while reading delayed_rsv.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Liu Bo
f5c29bd9db Btrfs: add __init macro to btrfs init functions
Adding __init macro gives kernel a hint that this function is only used
during the initialization phase and its memory resources can be freed up
after.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Anand Jain
c74a0b0237 btrfs: rename btrfs_add_device to btrfs_add_dev_item
Function btrfs_add_device() is adding the device item so rename to
reflect that in the function. Similarly we have btrfs_rm_dev_item().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Qu Wenruo
33d85fda13 btrfs: Don't generate UUID for non-fs tree
btrfs_create_tree() will unconditionally generate UUID for any root.
So for quota tree and data reloc tree created by kernel, they will have
unique UUIDs.

However UUID in root item is only referred by UUID tree, which only
records UUID for fs trees.  This makes unique UUIDs for quota/data reloc
tree meaningless.

Leave the UUID as zero for non-fs tree, making btrfs-debug-tree output
less confusing.

Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Anand Jain
2c9973847f btrfs: move volume_mutex into the btrfs_rm_device()
A cleanup patch with no functional change. We hold volume_mutex before
calling btrfs_rm_device, so move it into the function itself.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
96b09dde92 btrfs: Use locked_end rather than open coding it
Right before we go into this loop locked_end is set to alloc_end - 1 and
is being used in nearby functions, no need to have exceptions. This just
makes the code consistent, no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
6b7d6e9334 btrfs: Move loop termination condition in while()
Fallocating a file in btrfs goes through several stages. The one before
actually inserting the fallocated extents is to create a qgroup
reservation, covering the desired range. To this end there is a loop in
btrfs_fallocate which checks to see if there are holes in the fallocated
range or !PREALLOC extents past EOF and if so create qgroup reservations
for them. Unfortunately, the main condition of the loop is buried right
at the end of its body rather than in the actual while statement which
makes it non-obvious. Fix this by moving the condition in the while
statement where it belongs. No functional changes.
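
The shape of the change can be sketched as below. The functions are hypothetical and only count fixed-size chunks; they are not btrfs_fallocate. Note the assumption that makes the two forms equivalent: the range is non-empty on entry (cur < end), so the body runs at least once in both versions.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the exit test sits at the bottom of an endless loop. */
size_t count_chunks_before(size_t cur, size_t end, size_t step)
{
	size_t chunks = 0;

	assert(step > 0 && cur < end);
	while (1) {
		chunks++;        /* handle [cur, cur + step) */
		cur += step;
		if (cur >= end)  /* easy to miss when reading the loop head */
			break;
	}
	return chunks;
}

/* After: the same condition is stated in the while statement. */
size_t count_chunks_after(size_t cur, size_t end, size_t step)
{
	size_t chunks = 0;

	assert(step > 0);
	while (cur < end) {
		chunks++;
		cur += step;
	}
	return chunks;
}
```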

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Liu Bo
47dba17171 Btrfs: remove rcu_barrier in btrfs_close_devices
It was introduced because btrfs used to do blkdev_put in a deferred
work, now that btrfs has blkdev_put in place, this rcu_barrier can be
removed.

modprobe -r btrfs will do btrfs_cleanup_fs_uuids(), where it cleans up
every %fs_devices on the list, but when we do btrfs_close_devices(), we
have replaced the devices on the list with dummy ones which only have
the same name and uuid, so modprobe -r btrfs will free those instead of
what we were using, this change won't cause a problem for it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied 2nd paragraph from mailinglist discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
8577787fac btrfs: Move checks from btrfs_wq_run_delayed_node to btrfs_balance_delayed_items
btrfs_balance_delayed_items is the sole caller of
btrfs_wq_run_delayed_node and already includes one of the checks whether
the delayed inodes should be run. On the other hand
btrfs_wq_run_delayed_node duplicates that check and performs an
additional one for wq congestion.

Let's remove the duplicate check and move the congestion one in
btrfs_balance_delayed_items, leaving btrfs_wq_run_delayed_node to only
care about setting up the wq run. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
617c54a88e btrfs: Make btrfs_async_run_delayed_root use a loop rather than multiple labels
Currently btrfs_async_run_delayed_root's implementation uses 3 goto
labels to mimic the functionality of a simple do {} while loop. Refactor
the function to use a do {} while construct, making intention clear and
code easier to follow. No functional changes.
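
A sketch of this kind of refactoring, with hypothetical functions that are not btrfs_async_run_delayed_root: both versions do the same work, but the second states the loop explicitly.

```c
#include <assert.h>

/* Before: labels and gotos emulate a do {} while loop. */
int drain_with_gotos(int pending, int threshold)
{
	int rounds = 0;

	assert(pending >= 0);
again:
	if (pending == 0)        /* nothing to process */
		goto out;
	pending--;               /* process one item */
	rounds++;
	if (pending > threshold) /* still above the threshold, keep going */
		goto again;
out:
	return rounds;
}

/* After: the same control flow as a do {} while construct. */
int drain_with_loop(int pending, int threshold)
{
	int rounds = 0;

	assert(pending >= 0);
	do {
		if (pending == 0)
			break;
		pending--;
		rounds++;
	} while (pending > threshold);

	return rounds;
}
```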

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
d3fac6ba7d btrfs: Remove redundant mirror_num arg
The following callpath is always invoked with mirror_num set to 0, so
let's remove it as an argument and directly pass 0 to __do_readpage. No
functional change.

extent_readpages
  __extent_readpages
    __do_contiguous_readpages
      __do_readpage

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
ac244ef1da btrfs: Remove unused function
Its sole callsite was removed in a previous patch so just nuke it for good.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Nikolay Borisov
4660c49f9b btrfs: Remove redundant memory barrier in dev stats
As per atomic_t.txt documentation :
 - RMW operations that have a return value are fully ordered;

atomic_xchg is one such operation so it already includes everything it
needs w.r.t. memory ordering. Remove the redundant barrier and add a
comment to be more explicit about that.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:10 +01:00
Nikolay Borisov
9deae96892 btrfs: Fix memory barriers usage with device stats counters
Commit addc3fa74e ("Btrfs: Fix the problem that the dirty flag of dev
stats is cleared") reworked the way device stats changes are tracked. A
new atomic dev_stats_ccnt counter was introduced which is incremented
every time any of the device stats counters are changed. This serves as
a flag whether there are any pending stats changes. However, this patch
only partially implemented the correct memory barriers necessary:

- It only ordered the stores to the counters but not the reads e.g.
  btrfs_run_dev_stats
- It completely omitted any comments documenting the intended design and
  how the memory barriers pair with each-other

This patch provides the necessary comments as well as adds a missing
smp_rmb in btrfs_run_dev_stats. Furthermore since dev_stats_ccnt is only
a snapshot at best there was no point in reading the counter twice -
once in btrfs_dev_stats_dirty and then again when assigning stats_cnt.
Just collapse both reads into 1.
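
The pairing can be sketched in user space with C11 atomics. This swaps in release/acquire ordering for the kernel's smp_mb__before_atomic() and smp_rmb(), so it illustrates the design rather than reproducing the patch; struct dev_stats, stats_inc() and stats_flush() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

struct dev_stats {
	atomic_int errors;     /* the statistic itself */
	atomic_int change_cnt; /* bumped after every statistic update */
};

/* Writer: update the statistic, then publish that something changed. */
void stats_inc(struct dev_stats *s)
{
	atomic_fetch_add_explicit(&s->errors, 1, memory_order_relaxed);
	/* Release: the update above is visible before the counter bump. */
	atomic_fetch_add_explicit(&s->change_cnt, 1, memory_order_release);
}

/*
 * Reader: read the change counter once. The acquire load pairs with the
 * release in stats_inc(). Returns the number of changes flushed and
 * stores a snapshot of the statistic in *errors_out when non-zero.
 */
int stats_flush(struct dev_stats *s, int *errors_out)
{
	int cnt = atomic_load_explicit(&s->change_cnt, memory_order_acquire);

	assert(errors_out != 0);
	if (cnt == 0)
		return 0;
	*errors_out = atomic_load_explicit(&s->errors, memory_order_relaxed);
	/* Subtract only what was observed; concurrent bumps stay pending. */
	atomic_fetch_sub_explicit(&s->change_cnt, cnt, memory_order_relaxed);
	return cnt;
}
```

Reading the counter a single time mirrors the last point of the changelog: the value is only a snapshot, so a second read adds nothing.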

Fixes: addc3fa74e ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:10 +01:00
Anand Jain
1cb34c8ecd btrfs: clean up btrfs_dev_stat_inc usage
btrfs_end_bio() is using btrfs_dev_stat_inc() and then
btrfs_dev_stat_print_on_error() separately; instead use
btrfs_dev_stat_inc_and_print() directly.

As of now there isn't any bio in btrfs which is - a non-empty write and
also the REQ_PREFLUSH flag is set. So in practice the condition

   if (bio->bi_opf & REQ_PREFLUSH)

is never true in btrfs_end_bio(), and so there won't be any redundant
error log by using btrfs_dev_stat_inc_and_print() separately one for
write and another for flush.

This consolidation will help to add the device critical error handles in
the function btrfs_dev_stat_inc_and_print() and which can be renamed as
needed.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:10 +01:00
Liu Bo
9f5316c17b Btrfs: free btrfs_device in place
It's pointless to defer it to a kthread helper as we're not under a
special context.

For reference, commit 1f78160ce1 ("Btrfs: using rcu lock in the reader
side of devices list") introduced RCU freeing for device structures.

Originally the blkdev_put was called from free_device and rcu_barrier had
to be called. This is no longer required, bdev and our device structures
are now freed separately.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:10 +01:00
Liu Bo
1805f2ca3f Btrfs: remove redundant btrfs_balance_delayed_items
In functions like btrfs_create(), we run both
btrfs_balance_delayed_items() and btrfs_btree_balance_dirty() after
the operation, but btrfs_btree_balance_dirty() is surely going to run
btrfs_balance_delayed_items().

This keeps only btrfs_btree_balance_dirty().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:10 +01:00
Masami Hiramatsu
663faf9f7b error-injection: Add injectable error types
Add injectable error types for each error-injectable function.

One motivation of error injection test is to find software flaws,
mistakes or mis-handlings of expectable errors. If we find such
flaws by the test, that is a program bug, so we need to fix it.

But if the tester injects the wrong error (e.g. just returns a success
code without processing anything), it causes unexpected behavior
even if the caller is correctly programmed to handle any errors.
That is not what we want to test by error injection.

To clarify what type of errors the caller must expect for each
injectable function, this introduces injectable error types:

 - EI_ETYPE_NULL : means the function will return NULL if it
		    fails. No ERR_PTR, just a NULL.
 - EI_ETYPE_ERRNO : means the function will return -ERRNO
		    if it fails.
 - EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO
		       (ERR_PTR) or NULL.

ALLOW_ERROR_INJECTION() macro is expanded to get one of
NULL, ERRNO, ERRNO_NULL to record the error type for
each function. e.g.

 ALLOW_ERROR_INJECTION(open_ctree, ERRNO)

These error types are shown in debugfs as below.

  ====
  / # cat /sys/kernel/debug/error_injection/list
  open_ctree [btrfs]	ERRNO
  io_ctl_init [btrfs]	ERRNO
  ====
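
What the three types mean for a tester can be sketched as a validity check on the value to inject. This is a user-space illustration only: the enum below mirrors the names from this changelog but not necessarily the kernel's definition, and ei_value_allowed() is a hypothetical helper. MAX_ERRNO is 4095 as in the kernel's err.h.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ERRNO 4095 /* largest errno the kernel encodes in a pointer */

/* The three error types named in the changelog. */
enum ei_etype { EI_ETYPE_NULL, EI_ETYPE_ERRNO, EI_ETYPE_ERRNO_NULL };

static bool is_errno(long v)
{
	return v < 0 && v >= -MAX_ERRNO;
}

/* Hypothetical helper: may value v be injected for a function of type t? */
bool ei_value_allowed(enum ei_etype t, long v)
{
	switch (t) {
	case EI_ETYPE_NULL:
		return v == 0;                /* NULL only, no ERR_PTR */
	case EI_ETYPE_ERRNO:
		return is_errno(v);           /* -ERRNO only */
	case EI_ETYPE_ERRNO_NULL:
		return v == 0 || is_errno(v); /* -ERRNO (ERR_PTR) or NULL */
	}
	assert(!"unknown error type");
	return false;
}
```

For an ERRNO function such as open_ctree, injecting 0 would be the "success without processing anything" case described above, and the check rejects it.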

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:38 -08:00
Masami Hiramatsu
540adea380 error-injection: Separate error-injection from kprobe
The error-injection framework is not limited to kprobes or bpf. Other
kernel subsystems can use it freely for checking safeness of
error-injection, e.g. livepatch, ftrace etc.
So this separates the error-injection framework from kprobes.

Some differences have been made:

- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
  ALLOW_ERROR_INJECTION() since it is not limited to BPF either.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
  feature. It is automatically enabled if the arch supports
  error injection feature for kprobe or ftrace etc.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:38 -08:00
David S. Miller
a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Ming Lei
c16a8ac3c0 btrfs: avoid accessing bvec table directly for a cloned bio
Commit 17347cec15f919901c90(Btrfs: change how we iterate bios in endio)
mentioned that for dio the submitted bio may be fast cloned, we
can't access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
a0b60d725e btrfs: avoid access to .bi_vcnt directly
BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
longer valid once we start enabling multipage bvecs.

Use bio_nr_pages() to do that instead.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
c45a8f2def fs: convert to bio_last_bvec_all()
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Linus Torvalds
89876f275e for-4.15-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpPux0ACgkQxWXV+ddt
 WDs/ORAAgRtjm+OWBb80eV1xJIHGRPRaL6E4OZc6SA7DEA+oCpkkVzOHQz3PV2a2
 cAsIUvp9azZd41gzBMw8mIe4AQKLZpud+vEM7QYRlbZFtp3EWmZ1Jht4bJRxC+w7
 NjBIEx4MX2KiUeRizmo3iWBVW+RoaRVW1xvFo/k5QchhO8U74SNYzxTGVxd8S/C0
 ZanuTowdm71uCJJHkoNWArAsou40QCJOYK19WilRkrf6SGsUqc1zKArRKe2KF4GH
 Wyf4Qyp2fm8RRKLOlc9NcsVbVqVg4kBmUXbJPCvltCs+JiyfhX9hahweoHHH8kmH
 u/jR3CItVqX+Ft1WAtSpgRzxO0uGu6aVkIql0VHV6wIbGnFoJd9XQ6RPnT/awlOw
 1jx8RLOZtVehF6pjyoSngLppqCw/sYpV8QhF32dEFGentO3Wd7CVKTcMOH498dbN
 paNzcNEfnTFLbUmViOTXl8AS8VX+3PU2Mgn8W8UxcFYksoIpV9P/LBDS3iIGYMtL
 pFFC9fYeipBDOPg2NV4QfCE9ZSqm35c2kAV/hb1nmPtPz4W+Ya5v2y9RSjAU80f4
 Y8ZyePg6pjwWOp1dW+TZF0NE8ExzSvgnXAQOdZkiy4Ztc6OwTVhlwRfW1xFy2Py+
 riR87A7/mDbiR9IXHgzFZi6WjjVMHDifBKeEpu91cF9JrwJqMBc=
 =WIOv
 -----END PGP SIGNATURE-----

Merge tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have two more fixes for 4.15, both aimed for stable.

  The leak fix is obvious, the second patch fixes a bug revealed by the
  refcount API, when it behaves differently than previous atomic_t and
  reports refs going from 0 to 1 in one case"

* tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
  btrfs: Fix flush bio leak
2018-01-05 13:02:46 -08:00
Chris Mason
ec35e48b28 btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
refcounts have a generic implementation and an asm optimized one.  The
generic version has extra debugging to make sure that once a refcount
goes to zero, refcount_inc won't increase it.

The btrfs delayed inode code wasn't expecting this, and we're tripping
over the warnings when the generic refcounts are used.  We ended up with
this race:

Process A                                         Process B
                                                  btrfs_get_delayed_node()
						  spin_lock(root->inode_lock)
						  radix_tree_lookup()
__btrfs_release_delayed_node()
refcount_dec_and_test(&delayed_node->refs)
our refcount is now zero
						  refcount_add(2) <---
						  warning here, refcount
                                                  unchanged

spin_lock(root->inode_lock)
radix_tree_delete()

With the generic refcounts, we actually warn again when process B above
tries to release his refcount because refcount_add() turned into a
no-op.

We saw this in production on older kernels without the asm optimized
refcounts.

The fix used here is to use refcount_inc_not_zero() to detect when the
object is in the middle of being freed and return NULL.  This is almost
always the right answer anyway, since we usually end up pitching the
delayed_node if it didn't have fresh data in it.

This also changes __btrfs_release_delayed_node() to remove the extra
check for zero refcounts before radix tree deletion.
btrfs_get_delayed_node() was the only path that was allowing refcounts
to go from zero to one.
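
The idea behind refcount_inc_not_zero() can be sketched in user space with C11 atomics. This is not the kernel's refcount_t implementation; struct node, node_get_not_zero() and node_put() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a delayed node. */
struct node {
	atomic_uint refs;
};

/* Take a reference only if the object is still live (count != 0). */
bool node_get_not_zero(struct node *n)
{
	unsigned int old = atomic_load(&n->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&n->refs, &old, old + 1))
			return true;
	}
	return false; /* last reference already dropped; do not use n */
}

/* Drop a reference; returns true when this was the last one. */
bool node_put(struct node *n)
{
	unsigned int old = atomic_fetch_sub(&n->refs, 1);

	assert(old != 0); /* an underflow would be a bug in the caller */
	return old == 1;
}
```

A lookup that finds the object in the tree but gets false from node_get_not_zero() treats it as absent, which corresponds to returning NULL in the fix described above.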

Fixes: 6de5f18e7b ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
CC: <stable@vger.kernel.org> # 4.12+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 18:00:14 +01:00
Nikolay Borisov
beed9263f4 btrfs: Fix flush bio leak
Commit e0ae999414 ("btrfs: preallocate device flush bio") reworked
the way the flush bio is allocated and used. Concretely it allocates
the bio in __alloc_device and then re-uses it multiple times with a
very simple endio routine that just calls complete() without consuming
a reference. Allocated bios by default come with a ref count of 1,
which is then consumed by the endio routine (or not, in which case they
should be bio_put by the caller). The way the implementation works now
is that the flush bio has a refcount of 2 and we only ever bio_put it
once, leaving it to hang indefinitely. Fix this by removing the extra
bio_get in __alloc_device.

Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 18:00:13 +01:00
Adam Borowski
91581e4c60 fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
This link is replicated in most filesystems' config stanzas.  Referring
to an archived version of that site is pointless as it mostly deals with
patches; user documentation is available elsewhere.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-01-01 12:45:37 -07:00
David S. Miller
59436c9ee1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2017-12-18

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow arbitrary function calls from one BPF function to another BPF function.
   As of today when writing BPF programs, __always_inline had to be used in
   the BPF C programs for all functions, unnecessarily causing LLVM to inflate
   code size. Handle this more naturally with support for BPF to BPF calls
   such that this __always_inline restriction can be overcome. As a result,
   it allows for better optimized code and finally enables to introduce core
   BPF libraries in the future that can be reused out of different projects.
   x86 and arm64 JIT support was added as well, from Alexei.

2) Add infrastructure for tagging functions as error injectable and allow for
   BPF to return arbitrary error values when BPF is attached via kprobes on
   those. This way of injecting errors generically eases testing and debugging
   without having to recompile or restart the kernel. Tags for opting-in for
   this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.

3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
   call for XDP programs. First part of this work adds handling of BPF
   capabilities included in the firmware, and the later patches add support
   to the nfp verifier part and JIT as well as some small optimizations,
   from Jakub.

4) The bpftool now also gets support for basic cgroup BPF operations such
   as attaching, detaching and listing current BPF programs. As a requirement
   for the attach part, bpftool can now also load object files through
   'bpftool prog load'. This reuses libbpf which we have in the kernel tree
   as well. bpftool-cgroup man page is added along with it, from Roman.

5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
   a single perf event") added support for attaching multiple BPF programs
   to a single perf event. Given they are configured through perf's ioctl()
   interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
   command in this work in order to return an array of one or multiple BPF
   prog ids that are currently attached, from Yonghong.

6) Various minor fixes and cleanups to the bpftool's Makefile as well
   as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
   itself or prior installed documentation related to it, from Quentin.

7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
   required for the test_dev_cgroup test case to run, from Naresh.

8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.

9) Fix libbpf's exit code from the Makefile when libelf was not found in
   the system, also from Jakub.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 10:51:06 -05:00
Josef Bacik
023f46c5b8 btrfs: allow us to inject errors at io_ctl_init
This was instrumental in reproducing a space cache bug.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 09:02:40 -08:00
Josef Bacik
8556e50994 btrfs: make open_ctree error injectable
This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 08:56:26 -08:00
Linus Torvalds
51090c5d6d for-4.15-rc3-tag

Merge tag 'for-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "This contains a few fixes (error handling, quota leak, FUA vs
  nobarrier mount option).

  There's one worth mentioning separately - a fix for an off-by-one that
  leads to overwriting the first byte of an adjacent page with 0, out of
  bounds of the memory allocated by an ioctl. This is under a privileged
  part of the ioctl, and can be triggered in some subvolume layouts"

* tag 'for-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
  Btrfs: disable FUA if mounted with nobarrier
  btrfs: fix missing error return in btrfs_drop_snapshot
  btrfs: handle errors while updating refcounts in update_ref_for_cow
  btrfs: Fix quota reservation leak on preallocated files
2017-12-10 08:30:04 -08:00
Nikolay Borisov
c8bcbfbd23 btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.

Implications:

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.

Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Before this patch the buffer was one byte larger, but then the -1 was
not added.
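
As an illustration only (this is not the kernel patch), the indexing rule
described above can be sketched in plain C. NAME_BUF_SIZE and copy_name()
are hypothetical names; the size 4080 is the value quoted in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the 4080-byte name buffer described above. */
#define NAME_BUF_SIZE 4080

/*
 * Copy src into a buffer of NAME_BUF_SIZE bytes and always nul-terminate.
 * Valid indexes are [0, NAME_BUF_SIZE - 1], so the terminator may be
 * written at index NAME_BUF_SIZE - 1 at most, never at NAME_BUF_SIZE.
 * Returns the number of bytes copied, not counting the terminator.
 */
static size_t copy_name(char *name, const char *src)
{
    size_t len;

    assert(name != NULL && src != NULL);
    len = strlen(src);
    if (len > NAME_BUF_SIZE - 1)
        len = NAME_BUF_SIZE - 1;    /* leave room for the terminator */
    memcpy(name, src, len);
    name[len] = '\0';               /* len <= NAME_BUF_SIZE - 1 */
    return len;
}
```

Clamping to the size minus one keeps the terminating byte inside the
buffer; using the size itself as the last index is the off-by-one.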

Fixes: ac8e9819d7 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added implications ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:35:15 +01:00
Omar Sandoval
1b9e619c5b Btrfs: disable FUA if mounted with nobarrier
I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.
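
A minimal sketch of the intended behaviour, assuming toy flag names
(TOY_REQ_SYNC, TOY_REQ_FUA) and a hypothetical helper super_write_flags();
none of these are the block layer's or btrfs's real identifiers.

```c
#include <assert.h>

/* Toy request flags; hypothetical names, not the block layer's. */
#define TOY_REQ_SYNC (1u << 0)
#define TOY_REQ_FUA  (1u << 1)

/*
 * Pick write flags for super block copy number copy_index.
 * Only the first copy asks for FUA, and only when barriers are enabled.
 */
static unsigned int super_write_flags(int copy_index, int nobarrier)
{
    unsigned int flags = TOY_REQ_SYNC;

    assert(copy_index >= 0);
    if (copy_index == 0 && !nobarrier)
        flags |= TOY_REQ_FUA;
    return flags;
}
```

With nobarrier set, no copy of the super block requests FUA, so a device
without native FUA support has nothing to translate into a flush.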

Fixes: 387125fc72 ("Btrfs: fix barrier flushes")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:34:45 +01:00
Jeff Mahoney
e19182c0ff btrfs: fix missing error return in btrfs_drop_snapshot
If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.
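
The err/ret mixup can be sketched as follows. This is a simplified
illustration, not the actual function: fake_del_root() is a hypothetical
stand-in for the failing call, and both variants are reduced to one step.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a cleanup step that may fail. */
static int fake_del_root(int fail)
{
    return fail ? -EIO : 0;
}

/* Buggy shape: the failure lands in ret, but err is what gets returned. */
static int drop_snapshot_buggy(int fail)
{
    int err = 0;
    int ret;

    ret = fake_del_root(fail);
    if (ret)
        goto out;       /* err is still 0 here, so the error is lost */
out:
    return err;
}

/* Fixed shape: copy the failure into the variable that is returned. */
static int drop_snapshot_fixed(int fail)
{
    int err = 0;
    int ret;

    ret = fake_del_root(fail);
    if (ret) {
        err = ret;
        goto out;
    }
out:
    return err;
}
```

Both variants notice the failure; only the second one reports it to the
caller.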

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:30:29 +01:00
Jeff Mahoney
692826b273 btrfs: handle errors while updating refcounts in update_ref_for_cow
Since commit fb235dc06f (btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans) the assumption that
btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
been false.  The qgroup operations call into btrfs_search_slot
and friends and can now return the full spectrum of error codes.

Fortunately, the fix here is easy since update_ref_for_cow failing
is already handled so we just need to bail early with the error
code.

Fixes: fb235dc06f (btrfs: qgroup: Move half of the qgroup accounting ...)
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:30:03 +01:00
Justin Maggard
b430b77512 btrfs: Fix quota reservation leak on preallocated files
Commit c6887cd111 ("Btrfs: don't do nocow check unless we have to")
changed the behavior of __btrfs_buffered_write() so that it first tries
to get a data space reservation, and then skips the relatively expensive
nocow check if the reservation succeeded.

If we have quotas enabled, the data space reservation also includes a
quota reservation.  But in the rewrite case, the space has already been
accounted for in qgroups.  So btrfs_check_data_free_space() increases
the quota reservation, but it never gets decreased when the data
actually gets written and overwrites the pre-existing data.  So we're
left with both the qgroup and qgroup reservation accounting for the same
space.

This commit adds the missing btrfs_qgroup_free_data() call in the case
of BTRFS_ORDERED_PREALLOC extents.

Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:28:12 +01:00
Linus Torvalds
26cd94744e for-4.15-rc2-tag

Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We've collected some fixes in since the pre-merge window freeze.

  There's technically only one regression fix for 4.15, but the rest
  seems important and candidates for stable.

   - fix missing flush bio puts in error cases (is serious, but rarely
     happens)

   - fix reporting stat::st_blocks for buffered append writes

   - fix space cache invalidation

   - fix out of bound memory access when setting zlib level

   - fix potential memory corruption when fsync fails in the middle

   - fix crash in integrity checker

   - incremental send fix, path mixup for certain unlink/rename
     combination

   - pass flags to writeback so compressed writes can be throttled
     properly

   - error handling fixes"

* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: incremental send, fix wrong unlink path after renaming file
  btrfs: tree-checker: Fix false panic for sanity test
  Btrfs: fix list_add corruption and soft lockups in fsync
  btrfs: Fix wild memory access in compression level parser
  btrfs: fix deadlock when writing out space cache
  btrfs: clear space cache inode generation always
  Btrfs: fix reported number of inode blocks after buffered append writes
  Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
  Btrfs: bail out gracefully rather than BUG_ON
  btrfs: dev_alloc_list is not protected by RCU, use normal list_del
  btrfs: add missing device::flush_bio puts
  btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
  Btrfs: add write_flags for compression bio
2017-11-29 14:26:50 -08:00
Filipe Manana
ea37d5998b Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:

Parent snapshot

 .                                                      (ino 256)
 |---- a/                                               (ino 257)
 |     |---- b/                                         (ino 259)
 |     |     |---- c/                                   (ino 260)
 |     |     |---- f2                                   (ino 261)
 |     |
 |     |---- f2l1                                       (ino 261)
 |
 |---- d/                                               (ino 262)
       |---- f1l1_2                                     (ino 258)
       |---- f2l2                                       (ino 261)
       |---- f1_2                                       (ino 258)

Send snapshot

 .                                                      (ino 256)
 |---- a/                                               (ino 257)
 |     |---- f2l1/                                      (ino 263)
 |             |---- b2/                                (ino 259)
 |                   |---- c/                           (ino 260)
 |                   |     |---- d3                     (ino 262)
 |                   |           |---- f1l1_2           (ino 258)
 |                   |           |---- f2l2_2           (ino 261)
 |                   |           |---- f1_2             (ino 258)
 |                   |
 |                   |---- f2                           (ino 261)
 |                   |---- f1l2                         (ino 258)
 |
 |---- d                                                (ino 261)

When computing the incremental send stream the following steps happen:

1) When processing inode 261, a rename operation is issued that renames
   inode 262, which currently has a path of "d", to an orphan name of
   "o262-7-0". This is done because in the send snapshot, inode 261 has
   one of its hard links with a path of "d" as well.

2) Two link operations are issued that create the new hard links for
   inode 261, whose names are "d" and "f2l2_2", at paths "/" and
   "o262-7-0/" respectively.

3) Still while processing inode 261, unlink operations are issued to
   remove the old hard links of inode 261, with names "f2l1" and "f2l2",
   at paths "a/" and "d/". However path "d/" does not correspond anymore
   to the directory inode 262 but corresponds instead to a hard link of
   inode 261 (link command issued in the previous step). This makes the
   receiver fail with a ENOTDIR error when attempting the unlink
   operation.

The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of the ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
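
The difference between following only the first hard link and following
all of them can be sketched on a toy in-memory model. Everything here
(struct toy_inode, find_inode(), both is_ancestor_*() helpers) is
hypothetical and only mirrors the idea, not the send code itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 4
#define ROOT_INO  256

/* Toy inode: one parent directory inode number per hard link. */
struct toy_inode {
    int ino;
    int nr_links;
    int parent_ino[MAX_LINKS];
};

static const struct toy_inode *find_inode(const struct toy_inode *tbl,
                                          size_t n, int ino)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].ino == ino)
            return &tbl[i];
    return NULL;
}

/* Old behaviour: walk up through the first hard link only. */
static int is_ancestor_first_link(const struct toy_inode *tbl, size_t n,
                                  int ino, int anc)
{
    const struct toy_inode *in = find_inode(tbl, n, ino);

    if (in == NULL || ino == ROOT_INO || in->nr_links == 0)
        return 0;
    return in->parent_ino[0] == anc ||
           is_ancestor_first_link(tbl, n, in->parent_ino[0], anc);
}

/* Fixed behaviour: anc is an ancestor if ANY hard link leads up to it. */
static int is_ancestor_all_links(const struct toy_inode *tbl, size_t n,
                                 int ino, int anc)
{
    const struct toy_inode *in = find_inode(tbl, n, ino);

    if (in == NULL || ino == ROOT_INO)
        return 0;
    for (int i = 0; i < in->nr_links; i++) {
        int parent = in->parent_ino[i];

        if (parent == anc || is_ancestor_all_links(tbl, n, parent, anc))
            return 1;
    }
    return 0;
}
```

Fed with the parent snapshot above, where inode 261 has links in
directories 259, 257 and 262, the first-link walk never reaches 262,
while the all-links walk does.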

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-28 17:15:30 +01:00
Qu Wenruo
69fc6cbbac btrfs: tree-checker: Fix false panic for sanity test
[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:

------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
 btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
 setup_items_for_insert+0x385/0x650 [btrfs]
 __btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----

[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.

However, quite a few btrfs_mark_buffer_dirty() callers(*) don't really
initialize their item data but only initialize their item pointers, leaving
item data uninitialized.

This makes the tree-checker flag uninitialized data as an error, causing
such a panic.

*: These callers include but are not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()

[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, the item data check will be skipped,
falling back to the old btrfs_check_leaf() behavior.

So we can still get early warning if we screw up item pointers, and
avoid false panic.

Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-28 14:59:09 +01:00
Linus Torvalds
1751e8a6cb Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 13:05:09 -08:00
Liu Bo
ebb70442cd Btrfs: fix list_add corruption and soft lockups in fsync
Xfstests btrfs/146 revealed this corruption,

[   58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[   58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[   58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[   58.153518] ------------[ cut here ]------------
[   58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[   58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[   58.161956] Call Trace:
[   58.162264]  btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[   58.163583]  btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[   58.164003]  btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[   58.164393]  vfs_fsync_range+0x5f/0xd0
[   58.164898]  do_fsync+0x5a/0x90
[   58.165170]  SyS_fsync+0x10/0x20
[   58.165395]  entry_SYSCALL_64_fastpath+0x1f/0xbe
...

It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e.  btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'.  Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with an object alive on the
list. Then a future list_add will throw the above warning.

This returns the correct error in the above case.

Jeff also reported this while testing against his fsync error
patch set[1].

[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"

Fixes: 8407f55326 ("Btrfs: fix data corruption after fast fsync and writeback error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 17:41:19 +01:00
Qu Wenruo
eae8d82529 btrfs: Fix wild memory access in compression level parser
[BUG]
Kernel panic when mounting with "-o compress" mount option.
KASAN will report like:
------
==================================================================
BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
Read of size 1 at addr d86735fce994f800 by task mount/662
...
Call Trace:
 dump_stack+0xe3/0x175
 kasan_report+0x163/0x370
 __asan_load1+0x47/0x50
 strncmp+0x31/0xc0
 btrfs_compress_str2level+0x20/0x70 [btrfs]
 btrfs_parse_options+0xff4/0x1870 [btrfs]
 open_ctree+0x2679/0x49f0 [btrfs]
 btrfs_mount+0x1b7f/0x1d30 [btrfs]
 mount_fs+0x49/0x190
 vfs_kern_mount.part.29+0xba/0x280
 vfs_kern_mount+0x13/0x20
 btrfs_mount+0x31e/0x1d30 [btrfs]
 mount_fs+0x49/0x190
 vfs_kern_mount.part.29+0xba/0x280
 do_mount+0xaad/0x1a00
 SyS_mount+0x98/0xe0
 entry_SYSCALL_64_fastpath+0x1f/0xbe
------

[Cause]
For 'compress' and 'compress_force' options, its token doesn't expect
any parameter so its args[0] contains uninitialized data.
Accessing args[0] will cause above wild memory access.

[Fix]
For Opt_compress and Opt_compress_force, set compression level to
the default.
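
A rough sketch of the idea, assuming a hypothetical parser
parse_compress_level() and a hypothetical DEFAULT_LEVEL; a NULL argument
models an option token that carries no parameter. This is not the
btrfs_compress_str2level() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical default level, chosen only for this illustration. */
#define DEFAULT_LEVEL 3

/*
 * arg is NULL when the option was given without a value (plain
 * "compress"). In that case the default is returned up front and the
 * argument is never dereferenced.
 */
static int parse_compress_level(const char *arg)
{
    if (arg == NULL)
        return DEFAULT_LEVEL;
    /* Accept "zlib:N" with a single digit N in 1..9. */
    if (strncmp(arg, "zlib:", 5) == 0 &&
        arg[5] >= '1' && arg[5] <= '9' && arg[6] == '\0')
        return arg[5] - '0';
    return DEFAULT_LEVEL;
}
```

Setting the default before looking at any argument means a token without
a parameter never leads to a read of uninitialized data.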

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ set the default in advance ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 17:01:11 +01:00
Josef Bacik
b77000ed55 btrfs: fix deadlock when writing out space cache
If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.
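
The unlock-on-error pattern can be sketched with a toy lock. struct
toy_rwsem, its helpers and write_out_cache() are hypothetical; the real
code uses the kernel's rw_semaphore and differs in detail.

```c
#include <assert.h>
#include <errno.h>

/* Toy lock that only tracks whether it is held. */
struct toy_rwsem {
    int held;
};

static void toy_down_write(struct toy_rwsem *s)
{
    assert(!s->held);   /* taking it twice would deadlock a real lock */
    s->held = 1;
}

static void toy_up_write(struct toy_rwsem *s)
{
    assert(s->held);
    s->held = 0;
}

/* Every exit path, including the failure path, goes through the unlock. */
static int write_out_cache(struct toy_rwsem *sem, int prepare_fails)
{
    int ret = 0;

    toy_down_write(sem);
    if (prepare_fails) {
        ret = -ENOMEM;      /* e.g. page preparation ran out of memory */
        goto out_unlock;
    }
    /* ... write the cache entries while the lock is held ... */
out_unlock:
    toy_up_write(sem);
    return ret;
}
```

Routing the failure through a shared unlock label keeps the lock state
the same on success and on error.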

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add label and use existing unlocking code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 15:50:07 +01:00
Josef Bacik
8e138e0d92 btrfs: clear space cache inode generation always
We discovered a box that had double allocations, and suspected the space
cache may be to blame.  While auditing the write out path I noticed that
if we've already setup the space cache we will just carry on.  This
means that any error we hit after cache_save_setup before we go to
actually write the cache out we won't reset the inode generation, so
whatever was already written will be considered correct, except it'll be
stale.  Fix this by _always_ resetting the generation on the block group
inode, this way we only ever have valid or invalid cache.

With this patch I was no longer able to reproduce cache corruption with
dm-log-writes and my bpf error injection tool.

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-20 20:43:39 +01:00
Linus Torvalds
487e2c9f44 AFS development

Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "kAFS filesystem driver overhaul.

  The major points of the overhaul are:

   (1) Preliminary groundwork is laid for supporting network-namespacing
       of kAFS. The remainder of the namespacing work requires some way
       to pass namespace information to submounts triggered by an
       automount. This requires something like the mount overhaul that's
       in progress.

   (2) sockaddr_rxrpc is used in preference to in_addr for holding
       addresses internally and add support for talking to the YFS VL
       server. With this, kAFS can do everything over IPv6 as well as
       IPv4 if it's talking to servers that support it.

   (3) Callback handling is overhauled to be generally passive rather
       than active. 'Callbacks' are promises by the server to tell us
       about data and metadata changes. Callbacks are now checked when
       we next touch an inode rather than actively going and looking for
       it where possible.

   (4) File access permit caching is overhauled to store the caching
       information per-inode rather than per-directory, shared over
       subordinate files. Whilst older AFS servers only allow ACLs on
       directories (shared to the files in that directory), newer AFS
       servers break that restriction.

       To improve memory usage and to make it easier to do mass-key
       removal, permit combinations are cached and shared.

   (5) Cell database management is overhauled to allow lighter locks to
       be used and to make cell records autonomous state machines that
       look after getting their own DNS records and cleaning themselves
       up, in particular preventing races in acquiring and relinquishing
       the fscache token for the cell.

   (6) Volume caching is overhauled. The afs_vlocation record is got rid
       of to simplify things and the superblock is now keyed on the cell
       and the numeric volume ID only. The volume record is tied to a
       superblock and normal superblock management is used to mediate
       the lifetime of the volume fscache token.

   (7) File server record caching is overhauled to make server records
       independent of cells and volumes. A server can be in multiple
       cells (in such a case, the administrator must make sure that the
       VL services for all cells correctly reflect the volumes shared
       between those cells).

       Server records are now indexed using the UUID of the server
       rather than the address since a server can have multiple
       addresses.

   (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
       similar), VOFFLINE and VNOVOL indications and to handle rotation
       both of servers and addresses of those servers. The rotation will
       also wait and retry if the server says it is busy.

   (9) Data writeback is overhauled. Each inode no longer stores a list
       of modified sections tagged with the key that authorised it in
       favour of noting the modified region of a page in page->private
       and storing a list of keys that made modifications in the inode.

       This simplifies things and allows other keys to be used to
       actually write to the server if a key that made a modification
       becomes useless.

  (10) Writable mmap() is implemented. This allows a kernel to be built
       entirely on AFS.

  Note that Pre AFS-3.4 servers are no longer supported, though this can
  be added back if necessary (AFS-3.4 was released in 1998)"

* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
  afs: Protect call->state changes against signals
  afs: Trace page dirty/clean
  afs: Implement shared-writeable mmap
  afs: Get rid of the afs_writeback record
  afs: Introduce a file-private data record
  afs: Use a dynamic port if 7001 is in use
  afs: Fix directory read/modify race
  afs: Trace the sending of pages
  afs: Trace the initiation and completion of client calls
  afs: Fix documentation on # vs % prefix in mount source specification
  afs: Fix total-length calculation for multiple-page send
  afs: Only progress call state at end of Tx phase from rxrpc callback
  afs: Make use of the YFS service upgrade to fully support IPv6
  afs: Overhaul volume and server record caching and fileserver rotation
  afs: Move server rotation code into its own file
  afs: Add an address list concept
  afs: Overhaul cell database management
  afs: Overhaul permit caching
  afs: Overhaul the callback handling
  afs: Rename struct afs_call server member to cm_server
  ...
2017-11-16 11:41:22 -08:00
Mel Gorman
8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
4006f437f9 btrfs: use pagevec_lookup_range_tag()
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages().  Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Filipe Manana
e3b8a48585 Btrfs: fix reported number of inode blocks after buffered append writes
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:

  $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
  $ du -h foobar
  0	foobar
  $ sync
  $ du -h foobar
  64K	foobar

The first version of that patch didn't have this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.

This fixes the regression by setting the new delalloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular delalloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaningful place, as we should set the new delalloc bit when we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().

This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:

  $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt

Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 17:27:46 +01:00
Filipe Manana
f48bf66b66 Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Move the definition of the function btrfs_find_new_delalloc_bytes() closer
to the function btrfs_dirty_pages(), because in a future commit it will be
used exclusively by btrfs_dirty_pages(). This just moves the function's
definition, with no functional changes at all.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 17:27:44 +01:00
Liu Bo
56a0e706fc Btrfs: bail out gracefully rather than BUG_ON
If a file's DIR_ITEM key is invalid (due to memory errors) and gets
written to disk, a future lookup_path can end up with kernel panic due
to BUG_ON().

This gets rid of the BUG_ON(); instead, output the corrupted key and
return ENOENT if it's invalid.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Guillaume Bouchard <bouchard@mercs-eng.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:47:01 +01:00
David Sterba
619c47f3d4 btrfs: dev_alloc_list is not protected by RCU, use normal list_del
The dev_alloc_list list could be protected by various mutexes,
depending on the context. The list tracks devices that can take part in
allocating new chunks, so the closest mutex is chunk_mutex. Adding a new
device from inside the ADD_DEV ioctl will need device_list_mutex and
registering a new device from the ioctl needs uuid_mutex.

All mutexes naturally guarantee exclusivity against the same context.
The device ownership can move between the contexts and the exclusivity
is guaranteed by other means, eg. during the mount with the uuid_mutex.

There's no RCU involved for dev_alloc_list.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:46:12 +01:00
David Sterba
3065ae5b85 btrfs: add missing device::flush_bio puts
This fixes potential bio leaks in several error paths. Unfortunately
the device structure freeing is opencoded in many places and I missed
them when introducing the flush_bio.

Most of the time, devices get freed through call_rcu(..., free_device),
so at least it's not that easy to hit the leak, but it's still
possible through the path that frees stale devices.

Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:45:26 +01:00
Nikolay Borisov
5e9f2ad5b2 btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
btrfs_rm_dev_item calls several functions under an active transaction,
however it fails to abort it if an error happens. Fix this by adding
explicit btrfs_abort_transaction/btrfs_end_transaction calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:44 +01:00
Liu Bo
f82b735936 Btrfs: add write_flags for compression bio
Compression code path has only flagged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.

It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttling.

This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.
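
The flag selection described above can be sketched in userspace C. This is only an illustration under stated assumptions: the `sketch_` types, the flag values and the helper name are hypothetical stand-ins, not the kernel's definitions. The assumed rule is that a data-integrity writeback (such as one started by fsync) is marked as a sync write, while periodic background writeback is not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the block layer's request flag bits. */
#define SKETCH_REQ_OP_WRITE 0x1u
#define SKETCH_REQ_SYNC     0x2u

/* Hypothetical, reduced view of a writeback control structure. */
enum sketch_sync_mode {
	SKETCH_WB_SYNC_NONE,	/* periodic background writeback */
	SKETCH_WB_SYNC_ALL	/* data-integrity writeback, e.g. fsync */
};

struct sketch_wbc {
	enum sketch_sync_mode sync_mode;
};

/* Derive the flags for a compressed write bio from the writeback context. */
static unsigned int sketch_write_flags(const struct sketch_wbc *wbc)
{
	unsigned int flags = SKETCH_REQ_OP_WRITE;

	assert(wbc != NULL);
	if (wbc->sync_mode == SKETCH_WB_SYNC_ALL)
		flags |= SKETCH_REQ_SYNC;
	return flags;
}
```

With flags derived this way, the two kinds of writes stay distinguishable further down the stack, which is the property the commit message says the block layer relies on.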

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:31 +01:00
Linus Torvalds
5cea7647e6 Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "There are some new user features and the usual load of invisible
  enhancements or cleanups.

  New features:

   - extend mount options to specify zlib compression level, -o
     compress=zlib:9

   - v2 of ioctl "extent to inode mapping", addressing a usecase where
     we want to retrieve more but inaccurate results and do the
     postprocessing in userspace, aiding defragmentation or
     deduplication tools

   - populate compression heuristics logic, do data sampling and try to
     guess compressibility by: looking for repeated patterns, counting
     unique byte values and distribution, calculating Shannon entropy;
     this will need more benchmarking and possibly fine tuning, but the
     base should be good enough

   - enable indexing for btrfs as lower filesystem in overlayfs

   - speedup page cache readahead during send on large files

  Internal enhancements:

   - more sanity checks of b-tree items when reading them from disk

   - more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
     errno or error handling fixes

   - remove some homegrown IO-related logic, that's been obsoleted by
     core block layer changes (batching, plug/unplug, own counters)

   - add ref-verify, optional debugging feature to verify extent
     reference accounting

   - simplify code handling outstanding extents, make it more clear
     where and how the accounting is done

   - make delalloc reservations per-inode, simplify the code and make
     the logic more straightforward

   - extensive cleanup of delayed refs code

  Notable fixes:

   - fix send ioctl on 32bit with 64bit kernel"

* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
  btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
  Btrfs: heuristic: add Shannon entropy calculation
  Btrfs: heuristic: add byte core set calculation
  Btrfs: heuristic: add byte set calculation
  Btrfs: heuristic: add detection of repeated data patterns
  Btrfs: heuristic: implement sampling logic
  Btrfs: heuristic: add bucket and sample counters and other defines
  Btrfs: compression: separate heuristic/compression workspaces
  btrfs: move btrfs_truncate_block out of trans handle
  btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
  btrfs: track refs in a rb_tree instead of a list
  btrfs: add a comp_refs() helper
  btrfs: switch args for comp_*_refs
  btrfs: make the delalloc block rsv per inode
  btrfs: add tracepoints for outstanding extents mods
  Btrfs: rework outstanding_extents
  btrfs: increase output size for LOGICAL_INO_V2 ioctl
  btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
  btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
  btrfs: send: remove unused code
  ...
2017-11-14 13:35:29 -08:00
David Howells
5e4def2038 Pass mode to wait_on_atomic_t() action funcs and provide default actions
Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
extra argument and make it 'unsigned int' throughout.

Also, consolidate a bunch of identical action functions into a default
function that can do the appropriate thing for the mode.

Also, change the argument name in the bit_wait*() function declarations to
reflect the fact that it's the mode and not the bit number.

[Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
should be done differently, though he's not immediately sure as to how]

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
cc: Ingo Molnar <mingo@kernel.org>
2017-11-13 15:38:16 +00:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.
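
The script itself is not shown in this log. As a rough sketch of the "different comment types" step only, the following C helper formats the SPDX line for a file name. It assumes the kernel convention of a `//` comment in .c sources and a `/* */` comment in headers; the function names are hypothetical and other file types are simply rejected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return nonzero if name ends with the given suffix. */
static int sketch_has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Format the SPDX tag line for a file into out.
 * Returns the snprintf() result, or -1 for file types this sketch
 * does not handle.
 */
static int sketch_spdx_line(const char *filename, const char *license,
			    char *out, size_t outsz)
{
	assert(filename && license && out);
	if (sketch_has_suffix(filename, ".c"))
		return snprintf(out, outsz,
				"// SPDX-License-Identifier: %s", license);
	if (sketch_has_suffix(filename, ".h"))
		return snprintf(out, outsz,
				"/* SPDX-License-Identifier: %s */", license);
	return -1;
}
```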

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Gu JinXiang
d28e649a5c btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
Fix bug of commit 74d46992e0 ("block: replace bi_bdev with a gendisk
pointer and partitions index").

bio_dev(bio) is used to find the dev state in function
__btrfsic_submit_bio. But when dev_state is added to the hashtable, it
is using dev_t of block_device.

bio_dev(bio) returns a dev_t of part0 which is different from dev_t in
block_device(bd_dev). bd_dev in block_device represents the exact
partition.

block_device.bd_dev =
	bio->bi_partno (same as block_device.bd_partno) + bio_dev(bio).

When adding a dev_state into hashtable, we use the exact partition dev_t.
So when looking it up, it should also use the exact partition dev_t.

Reproducer of this bug:

Use MOUNT_OPTIONS="-o check_int" and run btrfs/001 in fstests.
Then there will be WARNING like below.

WARNING:
btrfs: attempt to write superblock which references block M @29523968 (sda7     /1111654400/2) which is never written!

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
19562430c6 Btrfs: heuristic: add Shannon entropy calculation
Byte distribution check in heuristic will filter edge data cases and
sometimes fail to classify input data.

Let's fix that by adding Shannon entropy calculation, that will cover
classification of most other data types.

As Shannon entropy needs log2 with some precision to work, let's use
ilog2(N) and, for increased precision, compute ilog2(pow(N, 4)).

Shannon entropy has been slightly changed to avoid signed numbers and
division.

The calculation follows the formula directly, replacing a precalculated
table or chains of if-else.

The accuracy errors of ilog2 are compensated by

@ENTROPY_LVL_ACEPTABLE 70 -> 65
@ENTROPY_LVL_HIGH      85 -> 80
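
A minimal userspace sketch of the integer-only entropy estimate described above, assuming a 256-entry table of per-byte counts. The `sketch_` names are hypothetical, and `sketch_ilog2()` is a plain loop standing in for the kernel's ilog2(). The sum p * (log2(size) - log2(p)) avoids negative terms, and the only divisions are by the sample size and by the 8 bits/byte maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUCKET_SIZE 256

/* Integer floor(log2(n)) for n > 0. */
static uint32_t sketch_ilog2(uint64_t n)
{
	uint32_t r = 0;

	assert(n > 0);
	while (n >>= 1)
		r++;
	return r;
}

/* ilog2(n^4): four times the resolution of ilog2(n); n must fit 16 bits. */
static uint32_t sketch_ilog2_w(uint64_t n)
{
	assert(n <= 65535);
	return sketch_ilog2(n * n * n * n);
}

/*
 * Entropy of the sample as a percentage (0..100) of the 8 bits/byte
 * maximum. counts[] holds the occurrences of each byte value and must
 * sum to sample_size.
 */
static uint32_t sketch_entropy_percent(const uint32_t counts[SKETCH_BUCKET_SIZE],
				       uint32_t sample_size)
{
	const uint32_t entropy_max = 8 * sketch_ilog2_w(2);
	uint64_t sum = 0;
	uint32_t sz_base;
	size_t i;

	assert(sample_size > 0);
	sz_base = sketch_ilog2_w(sample_size);
	for (i = 0; i < SKETCH_BUCKET_SIZE; i++) {
		if (counts[i] == 0)
			continue;
		/* p * (log2(size) - log2(p)), scaled by 4, never negative */
		sum += (uint64_t)counts[i] * (sz_base - sketch_ilog2_w(counts[i]));
	}
	return (uint32_t)(sum / sample_size * 100 / entropy_max);
}
```

Because ilog2() rounds down, the result is only an approximation of the true entropy, which is the inaccuracy the adjusted thresholds above are said to compensate for.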

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
858177d38d Btrfs: heuristic: add byte core set calculation
Calculate byte core set for data sample:
- sort buckets' numbers in decreasing order
- count how many values cover 90% of the sample

If the core set size is low (<=25%), data are easily compressible.
If the core set size is high (>=80%), data are not compressible.
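
The two steps listed above can be sketched as follows. This is an illustrative userspace version with hypothetical names; the exact rounding at the 90% boundary is an assumption (here the core set is complete as soon as the covered share reaches 90%).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUCKET_SIZE 256

/* qsort() comparator: larger counts first. */
static int sketch_cmp_desc(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

	return (x < y) - (x > y);
}

/*
 * Number of most frequent byte values needed to cover at least 90% of
 * the sample. counts[] must sum to sample_size.
 */
static uint32_t sketch_core_set_size(const uint32_t counts[SKETCH_BUCKET_SIZE],
				     uint32_t sample_size)
{
	uint32_t sorted[SKETCH_BUCKET_SIZE];
	uint64_t covered = 0;
	uint32_t i;

	assert(sample_size > 0);
	memcpy(sorted, counts, sizeof(sorted));
	qsort(sorted, SKETCH_BUCKET_SIZE, sizeof(sorted[0]), sketch_cmp_desc);

	for (i = 0; i < SKETCH_BUCKET_SIZE && sorted[i] > 0; i++) {
		covered += sorted[i];
		if (covered * 100 >= (uint64_t)sample_size * 90)
			return i + 1;
	}
	return i;
}
```

Comparing the result against 25% and 80% of the 256 possible byte values then gives the "easily compressible" and "not compressible" verdicts mentioned above.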

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
a288e92cac Btrfs: heuristic: add byte set calculation
Calculate byte set size for data sample:
- calculate how many unique bytes have been in the sample
- for all bytes count > 0, check if we're still in the low count range
  (~25%), such data are easily compressible, otherwise further analysis
  is needed
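
As a small illustration of the byte set check, the sketch below counts the distinct byte values in a sample and compares the count with a threshold. The threshold of 64 (25% of 256) and all names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold: 25% of the 256 possible byte values. */
#define SKETCH_BYTE_SET_THRESHOLD 64

/* Count how many distinct byte values occur in the sample. */
static uint32_t sketch_byte_set_size(const uint8_t *sample, size_t len)
{
	uint8_t seen[256] = {0};
	uint32_t unique = 0;
	size_t i;

	assert(sample != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if (!seen[sample[i]]) {
			seen[sample[i]] = 1;
			unique++;
		}
	}
	return unique;
}

/* Nonzero if the byte set is small enough to call the data compressible. */
static int sketch_byte_set_is_small(const uint8_t *sample, size_t len)
{
	return sketch_byte_set_size(sample, len) < SKETCH_BYTE_SET_THRESHOLD;
}
```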

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
1fe4f6fa5a Btrfs: heuristic: add detection of repeated data patterns
Walk over data sample and use memcmp to detect repeated patterns, like
zeros, but a bit more general.
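
One simple way to realize such a check, shown here purely as an assumption about the approach, is to compare the first half of the sample with the second half. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true if the second half of the sample repeats the first half.
 * This catches all-zero data as well as any pattern whose period
 * divides half of the sample length.
 */
static bool sketch_sample_repeated_patterns(const uint8_t *sample, size_t len)
{
	size_t half = len / 2;

	assert(sample != NULL);
	if (half == 0)
		return false;
	return memcmp(sample, sample + half, half) == 0;
}
```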

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
a440d48c7f Btrfs: heuristic: implement sampling logic
Copy sample data from the input data range to sample buffer then
calculate byte value count for that sample into bucket.
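
A rough userspace sketch of this sampling step follows. The read size, the interval, the sample limit and every name are assumptions chosen for illustration: small chunks are copied at a fixed stride into a sample buffer, and each sampled byte then increments its bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sampling parameters for this sketch. */
#define SKETCH_SAMPLING_READ_SIZE 16
#define SKETCH_SAMPLING_INTERVAL  256
#define SKETCH_MAX_SAMPLE_SIZE    8192
#define SKETCH_BUCKET_SIZE        256

struct sketch_heuristic_ws {
	uint8_t sample[SKETCH_MAX_SAMPLE_SIZE];	/* partial copy of the input */
	uint32_t sample_size;			/* bytes used in sample[] */
	uint32_t bucket[SKETCH_BUCKET_SIZE];	/* count per byte value */
};

/* Copy a strided sample of data into ws, then count byte values. */
static void sketch_collect_sample(struct sketch_heuristic_ws *ws,
				  const uint8_t *data, size_t len)
{
	size_t off;
	uint32_t i;

	assert(ws != NULL && data != NULL);
	memset(ws, 0, sizeof(*ws));

	for (off = 0;
	     off + SKETCH_SAMPLING_READ_SIZE <= len &&
	     ws->sample_size + SKETCH_SAMPLING_READ_SIZE <= SKETCH_MAX_SAMPLE_SIZE;
	     off += SKETCH_SAMPLING_INTERVAL) {
		memcpy(ws->sample + ws->sample_size, data + off,
		       SKETCH_SAMPLING_READ_SIZE);
		ws->sample_size += SKETCH_SAMPLING_READ_SIZE;
	}

	for (i = 0; i < ws->sample_size; i++)
		ws->bucket[ws->sample[i]]++;
}
```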

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
17b5a6c17e Btrfs: heuristic: add bucket and sample counters and other defines
Add basic defines and structures for data sampling.

Added macros:
 - For future sampling algo
 - For bucket size

Heuristic workspace:
 - Add bucket for storing byte type counters
 - Add sample array for storing partial copy of input data range
 - Add counter to store the current sample size in the workspace

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes, comments updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets
4e439a0b18 Btrfs: compression: separate heuristic/compression workspaces
The compression heuristic itself is not a compression type. As the current
infrastructure provides workspaces for several compression types, it's
difficult to just add a heuristic workspace.

Just refactor the code to support compression/heuristic workspaces with
maximum code sharing and minimum changes in it.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
ddfae63cc8 btrfs: move btrfs_truncate_block out of trans handle
Since we do a delalloc reserve in btrfs_truncate_block we can deadlock
with freeze.  If somebody else is trying to allocate metadata for this
inode and it gets stuck in start_delalloc_inodes because of freeze we
will deadlock.  Be safe and move this outside of a trans handle.  This
also has a side-effect of making sure that we're not leaving stale data
behind in the other_encoding or encryption case.  Not an issue now since
nobody uses it, but it would be a problem in the future.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
ce8ea7cc6e btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
We're holding the sb_start_intwrite lock at this point, and doing async
filemap_flush of the inodes will result in a deadlock if we freeze the
fs during this operation.  This is because we could do a
btrfs_join_transaction() in the thread we are waiting on which would
block at sb_start_intwrite, and thus deadlock.  Using
writeback_inodes_sb() side steps the problem by not introducing all of
these extra locking dependencies.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
0e0adbcfdc btrfs: track refs in a rb_tree instead of a list
If we get a significant amount of delayed refs for a single block (think
modifying multiple snapshots) we can end up spending an ungodly amount
of time looping through all of the entries trying to see if they can be
merged.  This is because we only add them to a list, so we have O(2n)
for every ref head.  This doesn't make any sense as we likely have refs
for different roots, and so they cannot be merged.  Tracking in a tree
will allow us to break as soon as we hit an entry that doesn't match,
making our worst case O(n).

With this we can also merge entries more easily.  Before we had to hope
that matching refs were on the ends of our list, but with the tree we
can search down to exact matches and merge them at insert time.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
1d148e5939 btrfs: add a comp_refs() helper
Instead of open-coding the delayed ref comparisons, add a helper to do
the comparisons generically and use that everywhere.  We compare
sequence numbers last for following patches.
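
To illustrate the idea of one generic comparison helper that checks the sequence number last, here is a sketch with a hypothetical, heavily reduced ref structure. The field names and their order are assumptions and do not mirror the kernel's delayed ref layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced delayed ref used only for this sketch. */
struct sketch_ref {
	uint64_t root;
	uint64_t parent;
	uint64_t seq;
};

/* Three-way comparison of two unsigned values: -1, 0 or 1. */
static int sketch_cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/*
 * Compare two refs field by field. The sequence number is compared
 * last, so refs that differ only in seq end up next to each other in
 * a sorted structure.
 */
static int sketch_comp_refs(const struct sketch_ref *a,
			    const struct sketch_ref *b)
{
	int ret;

	assert(a != NULL && b != NULL);
	ret = sketch_cmp_u64(a->root, b->root);
	if (ret)
		return ret;
	ret = sketch_cmp_u64(a->parent, b->parent);
	if (ret)
		return ret;
	return sketch_cmp_u64(a->seq, b->seq);
}
```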

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
c7ad7c8439 btrfs: switch args for comp_*_refs
Make it more consistent, we want the inserted ref to be compared against
what's already in there.  This will make the order go from lowest seq ->
highest seq, which will make us more likely to make forward progress if
there's a seqlock currently held.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
69fe2d75dd btrfs: make the delalloc block rsv per inode
The way we handle delalloc metadata reservations has gotten
progressively more complicated over the years.  There is so much cruft
and weirdness around keeping the reserved count and outstanding counters
consistent and handling the error cases that it's impossible to
understand.

Fix this by making the delalloc block rsv per-inode.  This way we can
calculate the actual size of the outstanding metadata reservations every
time we make a change, and then reserve the delta based on that amount.
This greatly simplifies the code everywhere, and makes the error
handling in btrfs_delalloc_reserve_metadata far less terrifying.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
dd48d4072e btrfs: add tracepoints for outstanding extents mods
This is handy for tracing problems with modifying the outstanding
extents counters.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik
8b62f87bad Btrfs: rework outstanding_extents
Right now we do a lot of weird hoops around outstanding_extents in order
to keep the extent count consistent.  This is because we logically
transfer the outstanding_extent count from the initial reservation
through the set_delalloc_bits.  This makes it pretty difficult to get a
handle on how and when we need to mess with outstanding_extents.

Fix this by revamping the rules of how we deal with outstanding_extents.
Now instead everybody that is holding on to a delalloc extent is
required to increase the outstanding extents count for itself.  This
means we'll have something like this

btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
 btrfs_set_extent_delalloc	- outstanding_extents = 2
btrfs_release_delalloc_extents	- outstanding_extents = 1

for an initial file write.  Now take the append write where we extend an
existing delalloc range but still under the maximum extent size

btrfs_delalloc_reserve_metadata - outstanding_extents = 2
  btrfs_set_extent_delalloc
    btrfs_set_bit_hook		- outstanding_extents = 3
    btrfs_merge_extent_hook	- outstanding_extents = 2
btrfs_delalloc_release_extents	- outstanding_extents = 1

In order to make the ordered extent transition we of course must now
make ordered extents carry their own outstanding_extent reservation, so
for cow_file_range we end up with

btrfs_add_ordered_extent	- outstanding_extents = 2
clear_extent_bit		- outstanding_extents = 1
btrfs_remove_ordered_extent	- outstanding_extents = 0

This makes all manipulations of outstanding_extents much more explicit.
Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
combined with btrfs_release_delalloc_extents, even in the error case, as
that is the only function that actually modifies the
outstanding_extents counter.
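
The counter sequences above can be replayed with a toy model. Everything here is hypothetical (the structure, the helper and the idea of a single modify function); it only mirrors the arithmetic of the walk-through, where every holder of a delalloc extent takes its own count and drops it again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-inode counter. */
struct sketch_inode {
	int outstanding_extents;
};

/* Apply a delta to the counter and return the new value. */
static int sketch_mod_outstanding(struct sketch_inode *inode, int delta)
{
	assert(inode != NULL);
	inode->outstanding_extents += delta;
	/* A negative count would mean a release without a reservation. */
	assert(inode->outstanding_extents >= 0);
	return inode->outstanding_extents;
}
```

For an initial write the calls would be +1 (reserve), +1 (set delalloc), -1 (release), leaving 1. The ordered extent transition then goes +1, -1, -1 and ends at 0, matching the cow_file_range sequence listed above.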

The drawback to this is now we are much more likely to have transient
cases where outstanding_extents is much larger than it actually should
be.  This could happen before as we manipulated the delalloc bits, but
now it happens basically at every write.  This may put more pressure on
the ENOSPC flushing code, but I think making this code simpler is worth
the cost.  I have another change coming to mitigate this side-effect
somewhat.

I also added trace points for the counter manipulation.  These were used
by a bpf script I wrote to help track down leak issues.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Zygo Blaxell
b115e3bc81 btrfs: increase output size for LOGICAL_INO_V2 ioctl
Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).
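
The 2730 figure can be reproduced with simple arithmetic. The sketch assumes that each reference occupies three 64-bit values (24 bytes) and that the result buffer starts with a 16-byte header; both sizes are assumptions stated for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 16-byte container header, three 64-bit values per ref. */
#define SKETCH_CONTAINER_HEADER 16u
#define SKETCH_REF_SIZE         24u

/* How many references fit into a result buffer of buf_size bytes. */
static uint64_t sketch_max_refs(uint64_t buf_size)
{
	assert(buf_size >= SKETCH_CONTAINER_HEADER);
	return (buf_size - SKETCH_CONTAINER_HEADER) / SKETCH_REF_SIZE;
}
```

Under these assumptions a 64K buffer holds 2730 references and a 16M buffer holds 699050.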

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Zygo Blaxell
d24a67b2d9 btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the existing
reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized.  The lack of a zero check also means that new programs
have no way to know whether the kernel is honoring the flags field.

To avoid these problems, define a new ioctl LOGICAL_INO_V2.  We can
use the same argument layout as LOGICAL_INO, but shorten the reserved[]
array by one element and turn it into the 'flags' field.  The V2 ioctl
explicitly checks that reserved fields and unsupported flag bits are zero
so that userspace can negotiate future feature bits as they are defined.

Since the memory layouts of the two ioctls' arguments are compatible,
there is no need for a separate function for logical_to_ino_v2 (contrast
with tree_search_v2 vs tree_search where the layout and code are quite
different).  A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.
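
The "version parameter and an 'if' statement" approach might look roughly like the following userspace sketch. The structure is modeled on the description above (a reserved[] array shortened by one element to make room for 'flags'), but the `sketch_` names, the exact field order and the flag's bit position are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit position of the only flag defined so far. */
#define SKETCH_LOGICAL_INO_ARGS_IGNORE_OFFSET (1ULL << 0)

/* Hypothetical layout following the commit message's description. */
struct sketch_logical_ino_args {
	uint64_t logical;
	uint64_t size;
	uint64_t reserved[3];
	uint64_t flags;		/* was a reserved field in the V1 layout */
	uint64_t inodes;
};

/*
 * Validate the arguments. Version 1 never checked these fields, so
 * they are ignored; version 2 rejects nonzero reserved fields and
 * unknown flag bits.
 */
static int sketch_check_logical_ino_args(const struct sketch_logical_ino_args *a,
					 int version)
{
	size_t i;

	assert(a != NULL);
	assert(version == 1 || version == 2);

	if (version == 1)
		return 0;

	for (i = 0; i < sizeof(a->reserved) / sizeof(a->reserved[0]); i++)
		if (a->reserved[i] != 0)
			return -EINVAL;
	if (a->flags & ~SKETCH_LOGICAL_INO_ARGS_IGNORE_OFFSET)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits up front is what lets userspace probe for future flags: a kernel that does not know a bit returns an error instead of silently ignoring it.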

Motivation and background, copied from the patchset cover letter:

Suppose we have a file with one extent:

    root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
    root@tester:~# sync

Split the extent by overwriting it in the middle:

    root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:

    root@tester:~# btrfs-debug-tree /dev/vdc -t 2
    [...]
            item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
                    extent refs 2 gen 29 flags DATA
                    extent data backref root 5 objectid 261 offset 0 count 2
    [...]
            item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
                    extent refs 1 gen 30 flags DATA
                    extent data backref root 5 objectid 261 offset 8192 count 1
    [...]

and the ref tree looks like:

    root@tester:~# btrfs-debug-tree /dev/vdc -t 5
    [...]
            item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
                    extent data disk byte 1103101952 nr 73728
                    extent data offset 0 nr 8192 ram 73728
                    extent compression(none)
            item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
                    extent data disk byte 1103175680 nr 4096
                    extent data offset 0 nr 4096 ram 4096
                    extent compression(none)
            item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
                    extent data disk byte 1103101952 nr 73728
                    extent data offset 12288 nr 61440 ram 73728
                    extent compression(none)
    [...]

There are two references to the same extent with different, non-overlapping
byte offsets:

    [------------------72K extent at 1103101952----------------------]
    [--8K----------------|--4K unreachable----|--60K-----------------]
    ^                                         ^
    |                                         |
    [--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                         |
                         v
                         [-----4K extent-----] at 1103175680

We want to find all of the references to extent bytenr 1103101952.

Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:

    root@tester:~# btrfs ins log 1103101952 -P /test/
    Using LOGICAL_INO
    inode 261 offset 0 root 5

    root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
    inode 261 offset 0 root 5
    inode 261 offset 4096 root 5   <- same extent ref as offset 0
                                   (offset 8192 returns empty set, not reachable)
    inode 261 offset 12288 root 5
    inode 261 offset 16384 root 5  \
    inode 261 offset 20480 root 5  |
    inode 261 offset 24576 root 5  |
    inode 261 offset 28672 root 5  |
    inode 261 offset 32768 root 5  |
    inode 261 offset 36864 root 5  \
    inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
    inode 261 offset 45056 root 5  /  More processing required in userspace
    inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
    inode 261 offset 53248 root 5  |
    inode 261 offset 57344 root 5  |
    inode 261 offset 61440 root 5  |
    inode 261 offset 65536 root 5  |
    inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.
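
The call counts quoted above are just the extent length divided by the block size. A quick check of that arithmetic, assuming the 4K block size used throughout this example (the helper name is made up for illustration):

```python
BLOCK_SIZE = 4096  # 4K blocks, as in the example above


def calls_needed(extent_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of per-block LOGICAL_INO calls needed to cover one extent."""
    # Round up so that a partial trailing block still costs one call.
    return -(-extent_bytes // block_size)


print(calls_needed(73728))              # the 72K extent from the example
print(calls_needed(128 * 1024 * 1024))  # worst case: a 128M extent
```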

With the patch, we just use one call to map all refs to the extent at once:

    root@tester:~# btrfs ins log 1103101952 -P /test/
    Using LOGICAL_INO_V2
    inode 261 offset 0 root 5
    inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references.  Userspace can use this information to make
better choices to dedup or defrag.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
[ copy background and motivation from cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Zygo Blaxell
c995ab3cda btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a, b),
userspace has to run a loop like this pseudocode:

	for (i = a; i < b; ++i)
		extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.
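
The effect of the flag can be illustrated with a toy model in Python.  This is only a sketch of the filtering idea, not the kernel code: the ExtentRef type and the logical_ino() helper are invented for illustration, and the two refs are taken from the btrfs-debug-tree output earlier in this log.

```python
from typing import List, NamedTuple, Tuple


class ExtentRef(NamedTuple):
    inode: int
    file_offset: int    # key offset of the EXTENT_DATA item
    extent_offset: int  # "extent data offset" within the extent
    num_bytes: int      # "nr" bytes referenced by this ref


def logical_ino(extent_bytenr: int, refs: List[ExtentRef], logical: int,
                ignore_offset: bool = False) -> List[Tuple[int, int]]:
    """Toy model: return (inode, file offset) pairs for a logical address."""
    wanted = logical - extent_bytenr  # offset of 'logical' inside the extent
    found = []
    for ref in refs:
        if ignore_offset:
            # No filtering: report every ref to the extent.
            found.append((ref.inode, ref.file_offset))
        elif ref.extent_offset <= wanted < ref.extent_offset + ref.num_bytes:
            # Filtering: keep only refs that cover the requested block.
            found.append((ref.inode,
                          ref.file_offset + wanted - ref.extent_offset))
    return found


EXTENT = 1103101952
REFS = [ExtentRef(261, 0, 0, 8192), ExtentRef(261, 12288, 12288, 61440)]

print(logical_ino(EXTENT, REFS, EXTENT + 8192))  # the unreachable 4K block
print(logical_ino(EXTENT, REFS, EXTENT, ignore_offset=True))
```

With filtering, the unreachable block yields an empty list; with ignore_offset set, a single call reports both refs.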

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
Nikolay Borisov
eb7b9d6a46 btrfs: send: remove unused code
This code was first introduced in 31db9f7c23 ("Btrfs: introduce
BTRFS_IOC_SEND for btrfs send/receive") and it was not functional, then
it got slightly refactored in e938c8ad54 ("Btrfs: code cleanups for
send/receive"), alas it was still dead. So let's remove it for good!

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
Anand Jain
6dd38f81f9 btrfs: remove BUG_ON in btrfs_rm_dev_replace_free_srcdev()
That was only an extra check to tackle a few bugs around this area; now
it's safe to remove it.  Replace it with an ASSERT.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
Adam Borowski
fa4d885a48 btrfs: allow setting zlib compression level via :9
This is bikeshedding, but it seems people are drastically more likely to
understand "zlib:9" as compression level rather than an algorithm
version compared to "zlib9".

Based on feedback on the mailing list, ":9" will be the only accepted
syntax. The level must be a single digit. An unrecognized format will
result in the default level, for forward compatibility, in a similar way
to how the compression algorithm specifier was relaxed in commit
a7164fa4e0 ("btrfs: prepare for extensions in compression
options").
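
The accepted format can be sketched as follows.  This is a hypothetical userspace model in Python, not the kernel parser; the function name is invented, and using 0 to mean "default level" is an assumption of the sketch.

```python
DEFAULT_LEVEL = 0  # assumption of this sketch: 0 stands for "use the default"


def parse_zlib_level(value: str) -> int:
    """Return the level requested by a 'zlib...' compress option value."""
    if not value.startswith("zlib"):
        raise ValueError(f"not a zlib option: {value!r}")
    rest = value[len("zlib"):]
    # Only ':' followed by exactly one ASCII digit sets a level.
    if len(rest) == 2 and rest[0] == ":" and rest[1] in "0123456789":
        return int(rest[1])
    # Anything else ("zlib", "zlib9", "zlib:", "zlib:10") uses the default.
    return DEFAULT_LEVEL


for option in ("zlib", "zlib:9", "zlib9", "zlib:10"):
    print(option, "->", parse_zlib_level(option))
```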

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ tighten the accepted format ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
David Sterba
f51d2b5912 btrfs: allow to set compression level for zlib
Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compress=zlib                # default
$ mount -o compress=zlib0               # same
$ mount -o compress=zlib9               # level 9, slower sync, less data
$ mount -o compress=zlib1               # level 1, faster sync, more data
$ mount -o remount,compress=zlib3	# level set by remount

The compress-force option works the same as compress.  The level is
visible in the same format in /proc/mounts. Setting the level via a file
property does not work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:29 +01:00
Nikolay Borisov
d4417e2255 btrfs: Replace opencoded sizes with their symbolic constants
Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
Let's unify the code base to always use the symbolic constants. No functional
changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Gu JinXiang
859a58a207 btrfs: Use bd_dev to generate index when dev_state_hashtable add items.
Fix missing change from commit f8f84b2dfd
("btrfs: index check-integrity state hash by a dev_t").

Function btrfsic_dev_state_hashtable_lookup uses dev_t to generate hashval
when looking up a btrfsic_dev_state in the hash table. So when we add a
btrfsic_dev_state into the hash table, it should also use dev_t.
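
The invariant is that insertion and lookup must derive the bucket from the same key.  A toy Python table (names invented, unrelated to the kernel data structures) makes the point by routing both paths through one index function:

```python
class DevStateTable:
    """Toy bucketed hash table keyed by a device number (dev_t)."""

    def __init__(self, buckets: int = 16):
        self._buckets = [[] for _ in range(buckets)]

    def _index(self, dev: int) -> int:
        # One hash function shared by add() and lookup(); if add() hashed
        # a different value, lookup() would search the wrong bucket.
        return dev % len(self._buckets)

    def add(self, dev: int, state: str) -> None:
        self._buckets[self._index(dev)].append((dev, state))

    def lookup(self, dev: int):
        for key, state in self._buckets[self._index(dev)]:
            if key == dev:
                return state
        return None


table = DevStateTable()
table.add(0x800010, "state for device 8:16")
print(table.lookup(0x800010))
```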

Reproducer of this bug:
With MOUNT_OPTIONS="-o check_int" when running xfstests, the device cannot
be mounted successfully, so xfstests cannot run.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Anand Jain
102ed2c5ff btrfs: fix false EIO for missing device
When one of the devices is missing, bbio_error() takes care of setting
the error status. But if it is the only IO still pending in that stripe, it
fails to check the status of the other IO at %bbio_error before setting
the error %bi_status for the %orig_bio. Fix this by checking whether
%bbio->error has exceeded %bbio->max_errors.

With the reproducer below, the fdatasync error is seen intermittently.

 mount -o degraded /dev/sdc /btrfs
 dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync

 dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error

 The problem is intermittent because the following conditions have to
 be met, and they depend on timing:
 In btrfs_map_bio()
  - in the RAID1, the missing device has to be at %dev_nr = 1
 In bbio_error()
  - before bbio_error() is called, the bio of the not-missing
    device at %dev_nr = 0 must be completed so that the below
    condition is true
     if (atomic_dec_and_test(&bbio->stripes_pending)) {
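
The intended decision can be modelled in a few lines of Python.  This is a simplified sketch, not the kernel function: final_status() is an invented name and the stripe results are reduced to booleans.

```python
def final_status(stripe_ok, max_errors):
    """Toy model: status of the original bio once every stripe has finished.

    stripe_ok:  list of booleans, True if that stripe's IO succeeded.
    max_errors: number of failed stripes the profile can tolerate.
    """
    errors = sum(1 for ok in stripe_ok if not ok)
    # Fail only when more stripes failed than the profile tolerates,
    # regardless of which stripe happened to complete last.
    return "EIO" if errors > max_errors else "OK"


# Degraded RAID1: one copy missing, one copy written successfully.
print(final_status([True, False], max_errors=1))
print(final_status([False, True], max_errors=1))
# Both copies failed.
print(final_status([False, False], max_errors=1))
```

The first two calls differ only in completion order and give the same answer, which is what the fix is meant to guarantee.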

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Anand Jain
de48373454 btrfs: use need_full_stripe() in __btrfs_map_block()
A cleanup patch, use need_full_stripe() to replace the open code.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Goldwyn Rodrigues
79f015f216 btrfs: cleanup extent locking sequence
Code cleanup for better understanding:
Variable needs_unlock to be called extent_locked to show state as
opposed to action. Changed the type to int, to reduce code in the
critical path.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Anand Jain
2dbe0c7718 btrfs: use BLK_STS defines where needed
In a few places we could use BLK_STS_OK and BLK_STS_NOTSUPP.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ dropped first hunk btrfs_endio_direct_read ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Josef Bacik
bf2681cb94 btrfs: add assertions for releasing trans handle reservations
These are useful for debugging problems where we mess with
trans->block_rsv to make sure we're not screwing something up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Josef Bacik
3b60d436a1 btrfs: remove type argument from comp_tree_refs
We can get this from the ref we've passed in.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
d278850eff btrfs: remove delayed_ref_node from ref_head
This is just excessive information in the ref_head, and makes the code
complicated.  It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case.  With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
c1103f7a5d btrfs: move all ref head cleanup to the helper function
We do a couple different cleanup operations on the ref head.  We adjust
counters, we'll free any reserved space if we didn't end up using the
ref, and we clear the pending csum bytes.  Move all these disparate
things into cleanup_ref_head and clean up the logic in
__btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner,
as well as making run_one_delayed_ref() only deal with real refs and not
the ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
1ce7a5ec44 btrfs: move ref_mod modification into the if (ref) logic
We only use this logic if our ref isn't a ref_head, so move it up into
the if (ref) case since we know that this is a normal ref and not a
delayed ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
194ab0bc21 btrfs: breakout empty head cleanup to a helper
Move this code out to a helper function to further simplify
__btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
b00e62507e btrfs: move extent_op cleanup to a helper
Move the extent_op cleanup for an empty head ref to a helper function to
help simplify __btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
2eadaa22c1 btrfs: add a helper to return a head ref
Simplify the error handling in __btrfs_run_delayed_refs by breaking out
the code used to return a head back to the delayed_refs tree for
processing into a helper function.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
7c777430e8 Btrfs: only check delayed ref usage in should_end_transaction
We were only doing btrfs_check_space_for_delayed_refs() if the metadata
space was full, ie we couldn't allocate chunks.  This assumes we'll be
able to allocate chunks during transaction commit, but since nothing
does a LIMIT flush during the transaction commit this won't actually
happen unless we happen to run shy of actual space.  We already take
into account a full fs in btrfs_check_space_for_delayed_refs() so just
kill this extra check to make sure we're ending the transaction when we
need to.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
fd708b81d9 Btrfs: add a extent ref verify tool
We were having corruption issues that were tied back to problems with
the extent tree.  In order to track them down I built this tool to try
and find the culprit, which was pretty successful.  If you compile with
this tool on it will live verify every ref update that the fs makes and
make sure it is consistent and valid.  I've run this through with
xfstests and haven't gotten any false positives.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update error messages, add fixup from Dan Carpenter to handle errors
  of read_tree_block ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
84f7d8e624 btrfs: pass root to various extent ref mod functions
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead.  This will be used in
a subsequent patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
fb592373cd btrfs: add ref-verify mount option
This adds the infrastructure for turning ref verify on and off for a
mount, to be used by a later patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance btrfs_print_mod_info to print if ref-verify is compiled in ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
David Sterba
6273b7f8ed btrfs: get rid of sector_t and use u64 offset in submit_extent_page
The use of sector_t in the callchain of submit_extent_page is not
necessary.  Switch to u64, rename the variable and use byte units
instead of 512-byte sectors, i.e. dropping the >> 9 shifts and avoiding
the con(tro)versions of sector_t.
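
For reference, the ">> 9" shift is the conversion between byte offsets and 512-byte sector numbers.  A small Python illustration (helper names invented for this note):

```python
SECTOR_SHIFT = 9  # 1 << 9 == 512 bytes per sector


def bytes_to_sectors(offset: int) -> int:
    """Convert a byte offset to a 512-byte sector number (rounds down)."""
    return offset >> SECTOR_SHIFT


def sectors_to_bytes(sector: int) -> int:
    """Convert a 512-byte sector number back to a byte offset."""
    return sector << SECTOR_SHIFT


offset = 1103101952  # a sector-aligned byte offset
sector = bytes_to_sectors(offset)
print(sector, sectors_to_bytes(sector) == offset)
```

Keeping everything in bytes avoids these round trips and the possibility of mixing up the two units.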

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
David Sterba
6c5a4e2c12 btrfs: rename page offset parameter in submit_extent_page
We're going to remove sector_t and will use 'offset', so this patch
frees the name.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
David Sterba
6aa21263e3 btrfs: scrub: get rid of sector_t
The use of sector_t is not necessary, it's just for a warning.  Switch to
u64, rename the variable and use byte units instead of 512-byte sectors,
i.e. dropping the >> 9 shifts.  The messages are adjusted as well.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik
2351f431f7 btrfs: fix send ioctl on 32bit with 64bit kernel
We pass in a pointer in our send arg struct, which means the struct size
doesn't match between 32bit user space and 64bit kernel space.  Fix this
by adding a compat mode and doing the appropriate conversion.
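
The size mismatch is easy to reproduce with Python's struct module.  The layout below is a hypothetical three-field struct (not the real send args), packed without padding so that only the pointer width differs:

```python
import struct

# Hypothetical argument struct: 8-byte field, user pointer, 8-byte field.
# "=" selects standard sizes with no padding.
LAYOUT_32BIT_USER = "=qIq"    # pointer stored as 4 bytes
LAYOUT_64BIT_KERNEL = "=qQq"  # pointer stored as 8 bytes


def to_native(raw32: bytes) -> bytes:
    """Compat-style conversion: unpack the 32-bit layout, repack as 64-bit."""
    first, pointer, last = struct.unpack(LAYOUT_32BIT_USER, raw32)
    return struct.pack(LAYOUT_64BIT_KERNEL, first, pointer, last)


print(struct.calcsize(LAYOUT_32BIT_USER), struct.calcsize(LAYOUT_64BIT_KERNEL))
raw32 = struct.pack(LAYOUT_32BIT_USER, 3, 0x1000, 7)
print(struct.unpack(LAYOUT_64BIT_KERNEL, to_native(raw32)))
```

Because the sizes differ, the two layouts need separate handling, with a field-by-field conversion in between.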

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move structure to the beginning, next to receive 32bit compat ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain
2b902dfc89 btrfs: fix use of error or warning for missing device
When a device is missing without the -o degraded option, it's an error,
so report it as an error instead of a warning.  And when the -o degraded
option is provided, log the missing device as a warning.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch error to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain
5a2b8e601c btrfs: declare btrfs_report_missing_device() static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain
45dbdbc9f6 btrfs: fix EIO misuse to report missing degraded option
EIO is only for the IO failure to the device, avoid it. Use ENOENT as
that's the closest error code describing what happened.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain
adfb69af7d btrfs: add_missing_dev() should return the actual error
add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can
be used to check for the actual error that occurred in the function.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ minor error message adjustment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Christos Gkekas
9e882d6d05 btrfs: Clean up unused variables in free-space-tree.c
Remove variables 'start' and 'end', which are set but never used.

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Arnd Bergmann
709a95c3eb btrfs: tree-checker: use %zu format string for size_t
We now get a harmless compile-time warning on 32-bit architectures:

fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
fs/btrfs/tree-checker.c:189:70: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Werror=format=]

This changes the format string to use %zu instead of %lu for size_t.

Fixes: c1f6520bf360 ("btrfs: tree-checker: Enhance output for check_extent_data_item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Liu Bo
736cd52e0c Btrfs: remove nr_async_submits and async_submit_draining
Now that we have the combo of flushing twice, which can make sure IO
has started since the second flush will wait for the page lock, which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.

Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.

However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Liu Bo
80e03a2c51 Btrfs: do not make defrag wait on async_delalloc_pages
By setting compression for a defrag task, the task will start IO at
the end of defrag.

After the combo of filemap_flush(), we've already made sure that
dirty pages have made progress via async compress thread because the
second filemap_flush() will wait for page lock, which won't be
unlocked until those pages have been marked as writeback and ordered
extents have been queued.

And this is for per-inode defrag, it's not helpful to wait on a global
%async_delalloc_pages and %nr_async_submits from fs_info.

Although waiting on %nr_async_submits means that all bios are
submitted down to per-device schedule IO lists, it doesn't wait for
their completions, thus users still need to do fsync/sync to make sure
the data is on disk.  While with this change, it makes sure that pages
are marked with writeback bits and will be submitted asynchronously
shortly, therefore, the behavior of defrag option '-c' remains unchanged.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Liu Bo
f851689b5a Btrfs: remove nr_async_bios
This was intended to congest higher layers to not send bios, but as

1) the congested bit has been taken by writeback

Async bios come from buffered writes and DIO writes.

For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how many dirty pages we
can have.

2) and no one is waiting for %nr_async_bios down to zero,

Historically, it was introduced along with changes which let the
checksumming workload spread across different cpus.  And at that time,
pdflush was used instead of per-bdi flushing; perhaps pdflush did not
have the necessary information for writeback to do throttling.

We can safely remove them now.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo
8806d7185b btrfs: tree-checker: Enhance output for check_extent_data_item
Output the invalid member name and its bad value, along with its
expected value range or alignment.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo
d508c5f07c btrfs: tree-checker: Enhance output for check_csum_item
Output the bad value and expected good value (or its alignment).

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ unindent long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo
478d01b3fc btrfs: tree-checker: Enhance output for btrfs_check_leaf
Enhance the output to print:
1) the reason
2) the bad value, if reason is not sufficient
3) good value (range)

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording, unindented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo
bba4f29896 btrfs: tree-checker: Enhance btrfs_check_node output
Use inline function to replace macro since we don't need
stringification.
(Macro still exists until all callers get updated)

And add more info about the error, and replace EIO with EUCLEAN.

For nr_items error, report if it's too large or too small, and output
the valid value range.

For node block pointer, added a new alignment checker.

For key order, also output the next key to make the problem more
obvious.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments, unindented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo
557ea5dd00 btrfs: Move leaf and node validation checker to tree-checker.c
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Timofey Titovets
1170862d78 Btrfs: compress_file_range remove dead variable num_bytes
Remove dead assignment of num_bytes.

Also, as num_bytes is only used in the will_compress block as a copy of
total_in, just replace it with total_in and drop num_bytes entirely.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Rakesh Pandit
a7e3c5f2f7 btrfs: use appropriate replacements for __sb_{start,end}_write calls
Commit a53f4f8e9c ("btrfs: Don't call btrfs_start_transaction() on
frozen fs to avoid deadlock.") started using internal calls and we
replace them with more suitable ones.

Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Hans van Kranenburg
a969f4cc13 btrfs: prefix sysfs attribute struct names
Currently struct names for sysfs are generated only based on the
attribute names. This means that attribute names cannot be reused in
multiple places throughout the complete btrfs sysfs hierarchy.

E.g. allocation/data/total_bytes and allocation/data/single/total_bytes
result in the same struct name btrfs_attr_total_bytes. A workaround for
this case was made in the past by ad hoc creating an extra macro
wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct
name.

Instead of polluting sysfs.h with such kind of extra macro definitions,
and only doing so when there are collisions, use a prefix which gets
inserted in the struct name, so we keep everything nicely grouped
together by default.

Current collections of attributes are:
* (the toplevel, empty prefix)
* allocation
* space_info
* raid
* features

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Thomas Meyer
897ca8194c btrfs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Nikolay Borisov
efd38150af btrfs: Refactor transaction handling in received subvolume ioctl
If btrfs_commit_transaction fails it will proceed to call
cleanup_transaction, which in turn already does btrfs_abort_transaction.
So let's remove the unnecessary code duplication. Also let's be explicit
about handling failure of btrfs_uuid_tree_add by calling
btrfs_end_transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Nikolay Borisov
9417ebc8a6 btrfs: Explicitly handle btrfs_update_root failure
btrfs_update_root can fail and it aborts the transaction; the correct
way to handle an aborted transaction is to explicitly end it with
btrfs_end_transaction.  Even now the code is correct since
btrfs_commit_transaction would handle an aborted transaction but this is
more of an implementation detail. So let's be explicit in handling
failure in btrfs_update_root.

Furthermore btrfs_commit_transaction can also fail and by ignoring its
return value we could have left the in-memory copy of the root item in
an inconsistent state. So capture the error value which allows us to
correctly revert the RO/RW flags in case of commit failure.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Anand Jain
7132a26259 btrfs: error out if btrfs_attach_transaction() fails
btrfs_init_new_device() calls btrfs_attach_transaction() to
commit sys chunks, and it should error out if it fails.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Anand Jain
d31c32f674 btrfs: fix BUG_ON in btrfs_init_new_device()
Instead of BUG_ON return error to the caller. And handle the fail
condition by calling the abort transaction and going through the
error path.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Anand Jain
0af2c4bf5a btrfs: undo writable superblock when sprouting fails
When a new device is being added to a seed FS, the seed FS is marked
writable, but when we fail to bring in the new device, we fail to undo the
writable part. This patch fixes it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo
4b865cab96 btrfs: Add checker for EXTENT_CSUM
EXTENT_CSUM checker is a relatively easy one, only needs to check:

1) Objectid
   Fixed to BTRFS_EXTENT_CSUM_OBJECTID

2) Key offset alignment
   Must be aligned to sectorsize

3) Item size alignment
   Must be aligned to csum size
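
The three checks can be written down as a short Python sketch.  This is an illustration only, not the kernel function: the objectid constant is assumed to be -10 as an unsigned 64-bit value, the default sector and csum sizes (4096 and 4) are assumptions, and returning -EUCLEAN is a choice made for this sketch.

```python
EUCLEAN = 117  # Linux errno value for "Structure needs cleaning"
# Assumed value of BTRFS_EXTENT_CSUM_OBJECTID: -10 as an unsigned 64-bit number.
EXTENT_CSUM_OBJECTID = (1 << 64) - 10


def check_csum_item(objectid: int, key_offset: int, item_size: int,
                    sectorsize: int = 4096, csum_size: int = 4) -> int:
    """Return 0 if the EXTENT_CSUM item looks sane, -EUCLEAN otherwise."""
    # 1) Objectid is fixed.
    if objectid != EXTENT_CSUM_OBJECTID:
        return -EUCLEAN
    # 2) Key offset must be aligned to sectorsize.
    if key_offset % sectorsize != 0:
        return -EUCLEAN
    # 3) Item size must be aligned to csum size.
    if item_size % csum_size != 0:
        return -EUCLEAN
    return 0


print(check_csum_item(EXTENT_CSUM_OBJECTID, 1103101952, 72))
print(check_csum_item(EXTENT_CSUM_OBJECTID, 1103101952 + 1, 72))
```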

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo
40c3c40947 btrfs: Add sanity check for EXTENT_DATA when reading out leaf
Add extra checks for items with EXTENT_DATA type.  This checks the
following things:

0) Key offset
   All key offsets must be aligned to sectorsize.
   Inline extent must have 0 for key offset.

1) Item size
   Uncompressed inline file extent size must match item size.
   (Compressed inline file extent has no information about its on-disk size.)
   Regular/preallocated file extent size must be a fixed value.

2) Every member of regular file extent item
   Including alignment for bytenr and offset, possible value for
   compression/encryption/type.

3) Type/compression/encode must be one of the valid values.

This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
  BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo
7f43d4affb btrfs: Check if item pointer overlaps with the item itself
Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.

Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo
c3267bbaa9 btrfs: Refactor check_leaf function for later expansion
Current check_leaf() function does a good job checking key order and
item offset/size.

However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.

So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.

And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.

This makes later item/key offset checks and item size checks easier to
implement.

Also, make check_leaf() return -EUCLEAN rather than -EIO to indicate an
error.
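
The loop described above can be sketched roughly as follows. This is a
simplified model, not the kernel implementation: struct demo_key, struct
demo_item and demo_check_leaf() are hypothetical, and item offsets are
taken relative to a data area of data_size bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, used as a fallback where errno.h lacks it */
#endif

/* Hypothetical, simplified stand-ins for the btrfs key and item header. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct demo_item {
	struct demo_key key;
	uint32_t offset;	/* start of the item data in the data area */
	uint32_t size;		/* length of the item data */
};

/* Compare keys by (objectid, type, offset); returns <0, 0 or >0. */
static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

int demo_check_leaf(const struct demo_item *items, size_t nritems,
		    uint32_t data_size)
{
	struct demo_key prev_key = { 0, 0, 0 };	/* all-zero initial key */
	uint32_t expected_end = data_size;	/* slot 0 ends at the leaf end */
	size_t slot;

	for (slot = 0; slot < nritems; slot++) {
		const struct demo_item *item = &items[slot];

		/* Keys must be strictly increasing, starting above zero. */
		if (demo_key_cmp(&prev_key, &item->key) >= 0)
			return -EUCLEAN;
		/* Reject items that do not fit, avoiding overflow below. */
		if (item->size > data_size ||
		    item->offset > data_size - item->size)
			return -EUCLEAN;
		/* The item must end where the previous item starts. */
		if (item->offset + item->size != expected_end)
			return -EUCLEAN;

		prev_key = item->key;
		expected_end = item->offset;
	}
	return 0;
}
```

Because every slot, including the last one, goes through the same loop
body, a later per-item check only needs to be added in one place.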

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Timofey Titovets
6018ba0a0e Btrfs: cleanup 'start' subtraction from try uncompressed inline extent
The subtraction was added in:
  c8b978188c
  "Btrfs: Add zlib compression support"
and has survived until now (since 08.10.2008).

Because 'start' is checked for zero before the branch, it's safe to
remove that subtraction.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Josef Bacik
996478ca9c btrfs: change how we decide to commit transactions during flushing
Nikolay reported that generic/273 was failing currently with ENOSPC.
Turns out this is because we get to the point where the outstanding
reservations are greater than the pinned space on the fs.  This is a
mistake, previously we used the current reservation amount in
may_commit_transaction, not the entire outstanding reservation amount.
Fix this to find the minimum byte size needed to make progress in
flushing, and pass that into may_commit_transaction.  From there we can
make a smarter decision on whether to commit the transaction or not.
This fixes the failure in generic/273.

From Nikolai, IOW: when we go to the final stage of deciding whether to
do trans commit, instead of passing all the reservations from all
tickets we just pass the reservation for the current ticket. Otherwise,
in case all reservations exceed pinned space, then we don't commit
transaction and fail prematurely. Before we passed num_bytes from
flush_space, where num_bytes was the sum of all pending reservations,
but now all we do is take the first ticket and commit the trans if we
can satisfy that.
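
A rough sketch of the difference, under strong simplifications: the real
may_commit_transaction() considers more than pinned space, and struct
demo_ticket plus all three helpers below are hypothetical names used only
to illustrate "sum of all tickets" versus "first ticket only".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pending reservation ticket. */
struct demo_ticket {
	uint64_t bytes;
};

/* Old behaviour: the request is the sum of all pending reservations. */
uint64_t demo_sum_of_tickets(const struct demo_ticket *t, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += t[i].bytes;
	return sum;
}

/* New behaviour: only the first ticket has to be satisfiable. */
uint64_t demo_first_ticket(const struct demo_ticket *t, size_t n)
{
	return n ? t[0].bytes : 0;
}

/* Committing is worthwhile if the pinned space covers the request. */
bool demo_should_commit(uint64_t pinned_bytes, uint64_t needed_bytes)
{
	return pinned_bytes >= needed_bytes;
}
```

With tickets of 4, 8 and 16 units and 10 units pinned, the summed request
(28) would skip the commit, while the first-ticket request (4) would
allow it.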

Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
[ added Nikolai's comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Kuanling Huang
eef16ba269 Btrfs: send, apply asynchronous page cache readahead to enhance page read
By profiling btrfs send with perf, we found that it spends a large
amount of CPU time in page_cache_sync_readahead. This cost can be
reduced by switching to the asynchronous variant. The overall
performance gain on HDD and SSD was 9 and 15 percent, respectively,
when simply sending a large file.

Signed-off-by: Kuanling Huang <peterh@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Liu Bo
785884fc31 Btrfs: fix memory leak in raid56
The local bio_list may still have pending bios when doing cleanup,
which ends up as a memory leak if they don't get freed.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Colin Ian King
315d8e98aa btrfs: make array types static const, reduces object code size
Don't populate the read-only array types on the stack, instead make
it static const.  Makes the object code smaller by nearly 60 bytes:

Before:
   text	   data	    bss	    dec	    hex	filename
  90536	   6552	     64	  97152	  17b80	fs/btrfs/ioctl.o

After:
   text	   data	    bss	    dec	    hex	filename
  90414	   6616	     64	  97094	  17b46	fs/btrfs/ioctl.o
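
For illustration, the pattern looks roughly like this. The array contents
and function names are hypothetical and not taken from ioctl.c; whether a
given compiler actually builds the first array on the stack depends on
its optimizations.

```c
#include <stddef.h>

/* Automatic array: may be initialized on the stack on every call. */
int demo_type_on_stack(size_t i)
{
	const int types[] = { 1, 2, 4, 8 };

	return i < sizeof(types) / sizeof(types[0]) ? types[i] : -1;
}

/* static const array: lives in read-only data, nothing to populate. */
int demo_type_static(size_t i)
{
	static const int types[] = { 1, 2, 4, 8 };

	return i < sizeof(types) / sizeof(types[0]) ? types[i] : -1;
}
```

Both functions return the same values; only where the table is stored
differs.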

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Allen Pais
3afb0c5014 btrfs: return -ENOMEM on allocation failure in btrfsic
Forward the correct return value -ENOMEM from btrfsic_dev_state_alloc()
too.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Liu Bo
6939f66724 Btrfs: fix confusing worker helper info in stacktrace
We've seen the following backtrace stack in ftrace or dmesg log,

  kworker/u16:10-4244  [000] 241942.480955: function:             btrfs_put_ordered_extent
  kworker/u16:10-4244  [000] 241942.480956: kernel_stack:         <stack trace>
=> finish_ordered_fn (ffffffffa0384475)
=> btrfs_scrubparity_helper (ffffffffa03ca577)        <-----"incorrect"
=> btrfs_freespace_write_helper (ffffffffa03ca98e)    <-----"correct"
=> process_one_work (ffffffff81117b2f)
=> worker_thread (ffffffff81118c2a)
=> kthread (ffffffff81121de0)
=> ret_from_fork (ffffffff81d7087a)

btrfs_freespace_write_helper is actually calling normal_worker_helper
instead of btrfs_scrubparity_helper, so somehow the kernel has parsed
the incorrect function address while unwinding the stack;
btrfs_scrubparity_helper really shouldn't show up.

It's caused by the compiler inlining our helper function; adding a
noinline tag fixes that.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Liu Bo
18fdc67900 Btrfs: remove bio_flags which indicates a meta block of log-tree
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.

We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, and so does checksumming.

Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.

1) while log tree is flushing its dirty blocks via
   btrfs_write_marked_extents(), in which %sync_writers is increased
   by one.

2) in the meantime, some writeback operations may happen upon btrfs's
   metadata inode, so these writes go synchronously, too.

However, AFAICS, the overhead is not a big one, while the win is that
we unify the two places that need the synchronous way and remove a
special hack/flag.

This removes the bio_flags related stuff for writing log-tree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Liu Bo
6300463b14 Btrfs: make plug in writing meta blocks really work
We start a plug in btrfs_write_and_wait_marked_extents(), but the
generated IOs actually go to the device's scheduled IO list, where the
work is done in another task, so the started plug has no effect.

And since we wait for IOs immediately after writing meta blocks, it's
the same case as writing the log tree: doing sync submit can merge more
IOs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Satoru Takeuchi
d8953d69bc btrfs: convert all mount option checking code to use btrfs_test_opt
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Colin Ian King
3993b112da btrfs: avoid null pointer dereference on fs_info when calling btrfs_crit
There are checks on fs_info in __btrfs_panic to avoid dereferencing a
null fs_info, however, there is a call to btrfs_crit that may also
dereference a null fs_info. Fix this by adding a check to see if fs_info
is null and only print the s_id if fs_info is non-null.
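
A minimal sketch of the guard, using hypothetical stand-in structs rather
than the real btrfs_fs_info and super_block; the "<unknown>" placeholder
string is an assumption made for this example.

```c
#include <stddef.h>

/* Hypothetical, reduced versions of the kernel structures. */
struct demo_super_block {
	char s_id[32];
};

struct demo_fs_info {
	struct demo_super_block *sb;
};

/* Return a printable id without dereferencing a NULL fs_info. */
const char *demo_fs_id(const struct demo_fs_info *fs_info)
{
	return fs_info ? fs_info->sb->s_id : "<unknown>";
}
```

The caller can then pass demo_fs_id(fs_info) to its print routine
regardless of whether fs_info is set.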

Detected by CoverityScan CID#401973 ("Dereference after null check")

Fixes: efe120a067 ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Christos Gkekas
fa0d0888bd btrfs: Clean up dead code in root-tree
The value of variable 'can_recover' is never used after being set, thus
it should be removed, as it was never used since the first commit
68a7342c51 ("Btrfs: cleanup orphaned root orphan item").

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Christophe JAILLET
9ca2e97fa3 btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
If 'btrfs_alloc_path()' fails, we must free the resources already
allocated, as done in the other error handling paths in this function.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov
c434d21c64 btrfs: Remove redundant argument of __link_block_group
__link_block_group is called from only 2 places and at each call site the
space_info being passed is the same as the space info assigned to the passed
cache struct. Let's remove the redundant argument and make the function
reference the space_info from the passed block_group_cache. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ renamed to link_block_group ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov
1efb72a3c3 btrfs: Rework error handling of add_extent_mapping in __btrfs_alloc_chunk
Currently the code executes add_extent_mapping and if it is successful
it links the new mapping, it then proceeds to unlock the extent mapping
tree and check for failure and handle them. Instead, rework the code to
only perform a single check if add_extent_mapping has failed and handle
it, otherwise the code continues in a linear fashion. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov
8c70c9f81e btrfs: Remove unused parameter from check_direct_IO
Introduced by 5a5f79b570 ("Btrfs: allow unaligned DIO") and never
used. The buffered fallback from unaligned DIO works as expected.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov
ee8c494f88 btrfs: Remove unused arguments from btrfs_changed_cb_t
btrfs_changed_cb_t represents the signature of the callback being passed
to btrfs_compare_trees. Currently there is only one such callback,
namely changed_cb in send.c. This function doesn't really use the first
2 parameters, i.e. the roots. Since there are no other functions
implementing btrfs_changed_cb_t, let's remove the unused parameters
from the prototype and implementation.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov
a0357511f2 btrfs: Remove unused parameters from various functions
iterate_dir_item:found_key - introduced in 31db9f7c23 ("Btrfs:
  introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used.

record_ref:num - ditto

This is a first pass with the low-hanging fruit. There are still quite a
few unused parameters in some functions which have to abide by a callback
interface.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Nikolay Borisov
8ca199501e btrfs: Remove unused variable
Src was initially part of 31ff1cd25d ("Btrfs: Copy into the log tree in
big batches"), however 16e7549f04 ("Btrfs: incompatible format change
to remove hole extents") changed parameters passed to copy_items which
made the src variable redundant.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo
9b4a9b283d Btrfs: do not async submit for nodatasum inodes
While we submit direct writes, if the inode is flagged with nodatasum,
there's no benefit to submit asynchronously, because

a) we don't have to calculate checksum across processors,

b) and direct IO has started a plug, but async submit makes us queue
IO on each device's scheduled IO list instead of DIO's plug list, so
IOs get far fewer merges in general.

Let's use sync submit for nodatasum inodes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo
9cd3a7eb85 Btrfs: search parity device wisely
After mapping block with BTRFS_MAP_WRITE, parities have been sorted to
the end position, so this search can start from the first parity
stripe.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied changelog as a comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Anand Jain
ee87cf5ed9 btrfs: copy fsid to super_block s_uuid
We didn't copy fsid to struct super_block.s_uuid so Overlay disables
index feature with btrfs as the lower FS.

kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.

Fix this by publishing the fsid through struct super_block.s_uuid.

[ dsterba: I think that setting s_uuid is the last missing bit. Overlay
  needs the file handle encoding support from the lower filesystem, which
  is supported. Filling the whole filesystem id is correct, the subvolume
  id is encoded in the file handle buffer from inside btrfs_encode_fh. ]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Omar Sandoval
718dc5fade Btrfs: fix __user casting in ioctl.c
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Omar Sandoval
c9162bdfd6 Btrfs: make some volumes.c functions static
These aren't used outside of volumes.c.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Nikolay Borisov
f78541ddb1 btrfs: Remove redundant forward declarations
Some static functions are needlessly forward declared. Let's remove those
declarations since they add no value.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo
49e83f5735 Btrfs: protect conditions within root->log_mutex while waiting
Both wait_for_commit() and wait_for_writer() check the condition
outside of the mutex lock.

This refactors code a bit to be lock safe.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo
45bac0f3d2 Btrfs: use wait_event instead of a single function
Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the
same job.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo
69cc7151ee Btrfs: move finish_wait out of the loop
If we're still going to wait after schedule(), we don't have to do
finish_wait() to remove our %wait_queue_entry since prepare_to_wait()
won't add the same %wait_queue_entry twice.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo
219d33b26a Btrfs: remove batch plug in run_scheduled_IO
Block layer has a limit on plug, ie. BLK_MAX_REQUEST_COUNT == 16, so
we don't gain benefits by batching 64 bios here.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Matthew Garrett
357fdad075 Convert fs/*/* to SB_I_VERSION
[AV: in addition to the fix in previous commit]

Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-18 18:51:27 -04:00
Linus Torvalds
bf2db0b9f5 Merge branch 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "Two more fixes for bugs introduced in 4.13.

  The sector_t problem with 32bit architecture and !LBDAF config seems
  serious but the number of affected deployments is hopefully low.

  The clashing status bits could lead to a confusing in-memory state of
  the whole-filesystem operations if used with the quota override sysfs
  knob"

* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix overlap of fs_info::flags values
  btrfs: avoid overflow when sector_t is 32 bit
2017-10-06 09:03:08 -07:00
Tsutomu Itoh
69ad59767d Btrfs: fix overlap of fs_info::flags values
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

  commit 171938e528 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

  commit f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16;
this is safe because the flags are internal.
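
To illustrate why the clash matters, here is a small sketch with
hypothetical helpers (demo_set_bit() and demo_test_bit() stand in for the
kernel bitops, and the DEMO_* names mirror the two flags): setting one
flag makes a test for the other succeed when both use bit 14.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
	DEMO_QUOTA_OVERRIDE = 14,
	DEMO_EXCL_OP_OLD = 14,	/* clashes with DEMO_QUOTA_OVERRIDE */
	DEMO_EXCL_OP_NEW = 16,	/* value after the fix */
};

static void demo_set_bit(unsigned int nr, uint64_t *flags)
{
	*flags |= UINT64_C(1) << nr;
}

static bool demo_test_bit(unsigned int nr, uint64_t flags)
{
	return (flags >> nr) & 1;
}

/*
 * Set only the quota override flag, then report whether an exclusive
 * operation appears to be running for the given EXCL_OP bit number.
 */
bool demo_excl_op_seen_after_quota_override(unsigned int excl_op_bit)
{
	uint64_t flags = 0;

	demo_set_bit(DEMO_QUOTA_OVERRIDE, &flags);
	return demo_test_bit(excl_op_bit, flags);
}
```

With the old value the function reports a running exclusive operation
that was never started; with the new value it does not.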

Fixes: f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04 16:44:18 +02:00
Goffredo Baroncelli
2d8ce70a08 btrfs: avoid overflow when sector_t is 32 bit
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.
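
The effect can be reproduced in user space with a 32-bit stand-in for
sector_t. This is only a sketch: demo_sector_t and both functions are
hypothetical, and the shift by 9 is the multiplication by 512 mentioned
above.

```c
#include <stdint.h>

/* Models sector_t on a 32-bit machine without CONFIG_LBDAF. */
typedef uint32_t demo_sector_t;

/* Result is truncated to 32 bits, so byte offsets wrap at 4GB. */
uint64_t demo_offset_wrapping(demo_sector_t sector)
{
	demo_sector_t bytes = (demo_sector_t)(sector << 9);

	return bytes;
}

/* Widen first, then shift: the byte offset is computed in 64 bits. */
uint64_t demo_offset_fixed(demo_sector_t sector)
{
	return (uint64_t)sector << 9;
}
```

Sector 8388608 (2^23) is the first one whose byte offset no longer fits
in 32 bits: the wrapping version yields 0, the widened one 4294967296.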

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04 16:22:56 +02:00
Linus Torvalds
5ba88cd6e9 Merge branch 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've collected a bunch of isolated fixes, for crashes, user-visible
  behaviour or missing bits from other subsystem cleanups from the past.

  The overall number is not small but I was not able to make it
  significantly smaller. Most of the patches are supposed to go to
  stable"

* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: log csums for all modified extents
  Btrfs: fix unexpected result when dio reading corrupted blocks
  btrfs: Report error on removing qgroup if del_qgroup_item fails
  Btrfs: skip checksum when reading compressed data if some IO have failed
  Btrfs: fix kernel oops while reading compressed data
  Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
  Btrfs: do not backup tree roots when fsync
  btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
  btrfs: propagate error to btrfs_cmp_data_prepare caller
  btrfs: prevent to set invalid default subvolid
  Btrfs: send: fix error number for unknown inode types
  btrfs: fix NULL pointer dereference from free_reloc_roots()
  btrfs: finish ordered extent cleaning if no progress is found
  btrfs: clear ordered flag on cleaning up ordered extents
  Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
  Btrfs: do not reset bio->bi_ops while writing bio
  Btrfs: use the new helper wbc_to_write_flags
2017-09-29 12:57:35 -07:00
Josef Bacik
8c6c592831 btrfs: log csums for all modified extents
Amir reported a bug discovered by his cleaned up version of my
dm-log-writes xfstests where we were missing csums at certain replay
points.  This is because fsx was doing an msync(), which essentially
fsync()'s a specific range of a file.  We will log all modified extents,
but only search for the checksums in the range we are being asked to
sync.  We cannot simply log the extents in the range we're being asked
because we are logging the inode item as it is currently, which if it
has had a i_size update before the msync means we will miss extents when
replaying.  We could possibly get around this by marking the inode with
the transaction that extended the i_size to see if we have this case,
but this would be racy and we'd have to lock the whole range of the
inode to make sure we didn't have an ordered extent outside of our range
that was in the middle of completing.

Fix this simply by keeping track of the modified extents range and
logging the csums for the entire range of extents that we are logging.
This makes the xfstest pass.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:16 +02:00
Liu Bo
99c4e3b96c Btrfs: fix unexpected result when dio reading corrupted blocks
commit 4246a0b63b ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.

For single stripe dio read, %bio->bi_status reflects the error before
verifying checksum, and now we're updating it when data block matches
with its checksum, while in the mismatching case, %bio->bi_status is
not updated to reflect that.

When some blocks in a file have been corrupted on disk, reading such a
file ends up with

1) checksum errors are reported in kernel log
2) read(2) returns successfully with some content being 0x01.

In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.

Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:07 +02:00
Sargun Dhillon
36b96fdc6b btrfs: Report error on removing qgroup if del_qgroup_item fails
Previously, we were calling del_qgroup_item, and ignoring the return code
resulting in a potential to have divergent in-memory state without an
error. Perhaps, it makes sense to handle this error code, and put the
filesystem into a read only, or similar state.

This patch only adds reporting of the error if the error is fatal,
(any error other than qgroup not found).

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:01 +02:00
Liu Bo
e6311f240c Btrfs: skip checksum when reading compressed data if some IO have failed
Currently even if the underlying disk reports failure on IO,
compressed read endio still gets to verify checksum and reports it as
a checksum error.

In fact, if some IO has failed while reading a compressed data extent,
there's no way the checksum could match; therefore, we can skip
verification in order to return the error quickly to the upper layer.

Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work
properly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:26 +02:00
Liu Bo
cf1167d5c1 Btrfs: fix kernel oops while reading compressed data
The kernel oops happens at

kernel BUG at fs/btrfs/extent_io.c:2104!
...
RIP: clean_io_failure+0x263/0x2a0 [btrfs]

It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.

With this, btrfs's read-repair can work properly on reading compressed
data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Paul Jones <paul@pauljones.id.au>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:23 +02:00
Liu Bo
bd7d63c2ce Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
This seems to be a leftover of commit cf8cddd38b ("btrfs: don't
abuse REQ_OP_* flags for btrfs_map_block").

It should use btrfs_op() helper to provide one of 'enum btrfs_map_op'
types.

Fixes: cf8cddd38b ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:17 +02:00
Liu Bo
fed3b38114 Btrfs: do not backup tree roots when fsync
It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:04 +02:00
Misono, Tomohiro
c2faff790c btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.

When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be dropped
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.

BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.

So, let's remove BTRFS_FS_QUOTA_DISABLING flag.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:57 +02:00
Naohiro Aota
78ad4ce014 btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs_cmp_data_prepare() (almost) always returns 0, i.e. it ignores errors
from gather_extent_pages(). While the pages are freed by
btrfs_cmp_data_free(), cmp->num_pages is still > 0. Then
btrfs_extent_same() tries to access the already freed pages, causing faults
(or violating the PageLocked assertion).

This patch just returns the error as is so that the caller stops the process.
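
The shape of the bug and of the fix can be sketched as follows. All
names are hypothetical: demo_gather() stands in for
gather_extent_pages() and demo_free() for the cleanup helper, and the
control flow is a simplified guess at the pattern, not the kernel source.

```c
#include <errno.h>

/* Stand-in for gather_extent_pages(); fails when asked to. */
static int demo_gather(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Stand-in for the cleanup helper; a real one would release pages. */
static void demo_free(void)
{
}

/* Buggy shape: cleans up on error but still reports success. */
int demo_prepare_buggy(int fail_dst)
{
	int ret;

	ret = demo_gather(0);
	if (ret)
		goto out;
	ret = demo_gather(fail_dst);
out:
	if (ret)
		demo_free();
	return 0;
}

/* Fixed shape: the error reaches the caller, which can stop early. */
int demo_prepare_fixed(int fail_dst)
{
	int ret;

	ret = demo_gather(0);
	if (ret)
		goto out;
	ret = demo_gather(fail_dst);
out:
	if (ret)
		demo_free();
	return ret;
}
```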

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: f441460202 ("btrfs: fix deadlock with extent-same and readpage")
Cc: <stable@vger.kernel.org> # 4.2
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:31 +02:00
satoru takeuchi
6d6d282932 btrfs: prevent to set invalid default subvolid
`btrfs sub set-default` succeeds in setting an ID which doesn't correspond
to any fs/file tree. If such a bad ID is set on a filesystem, we can't mount
this filesystem without specifying the `subvol` or `subvolid` mount options.

Fixes: 6ef5ed0d38 ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
Cc: <stable@vger.kernel.org>
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:25 +02:00
Tsutomu Itoh
ca6842bf01 Btrfs: send: fix error number for unknown inode types
ENOTSUPP should not be returned to the user program.
 (cf. include/linux/errno.h)
Therefore, EOPNOTSUPP is used instead of ENOTSUPP.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:06 +02:00
Naohiro Aota
bb166d7207 btrfs: fix NULL pointer dereference from free_reloc_roots()
__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereferences reloc_root->node, causing
the system BUG.

Fixes: 6bdf131fac ("Btrfs: don't leak reloc root nodes on error")
Cc: <stable@vger.kernel.org> # 4.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:51:49 +02:00
Naohiro Aota
67c003f90f btrfs: finish ordered extent cleaning if no progress is found
__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with the direct IO path, because
before the function is called, it's ensured that there are ordered extents
filling the whole range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to hit an error in the middle of the
loop in e.g. run_delalloc_nocow(), so that there exists a range not covered
by any ordered extents. When cleaning such an "incomplete" range,
__endio_write_update_ordered() gets stuck at the offset where there are no
ordered extents.

Since the ordered extents are created from head to tail, we can stop the
search if there is no offset progress.
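
The "stop when no progress is made" idea can be modelled like this. It
is a user-space sketch with hypothetical names (struct demo_extent,
demo_cleanup_range()); the real function operates on the inode's ordered
extent tree rather than on an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ordered extent covering [start, start + len). */
struct demo_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Walk [start, end) and advance past every extent covering the current
 * offset. Returns the offset where the walk stopped: @end if the whole
 * range was covered, or the first offset with no extent otherwise.
 */
uint64_t demo_cleanup_range(const struct demo_extent *ext, size_t n,
			    uint64_t start, uint64_t end)
{
	uint64_t cur = start;

	while (cur < end) {
		uint64_t last = cur;
		size_t i;

		for (i = 0; i < n; i++) {
			if (ext[i].start <= cur &&
			    cur < ext[i].start + ext[i].len) {
				cur = ext[i].start + ext[i].len;
				break;
			}
		}
		/* No extent at this offset: bail out instead of spinning. */
		if (cur == last)
			break;
	}
	return cur < end ? cur : end;
}
```

Without the "cur == last" check, a range whose tail has no extents would
keep the loop at the same offset forever.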

Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:49:06 +02:00
Naohiro Aota
63d71450c8 btrfs: clear ordered flag on cleaning up ordered extents
Commit 524272607e ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).

BUG: Bad page state in process btrfs  pfn:3fa787
page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping:          (null) index:0xd
flags: 0x8000000000002008(uptodate|private_2)
raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)

This patch clears the flag for every page in the specified range, the same
way as the other places that call btrfs_dec_test_ordered_pending().
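
As a rough user-space illustration of the fix, the sketch below models page flags as integers, using the bit values decoded in the dump above (0x8 for uptodate, 0x2000 for private_2). The dictionary of pages and the function names are hypothetical.

```python
PG_UPTODATE = 0x8       # bit values as decoded in the BUG dump above
PG_PRIVATE_2 = 0x2000   # the "ordered" bit

def cleanup_range(page_flags, first, last):
    """Clear the ordered (Private2) bit on pages first..last inclusive."""
    for index in range(first, last + 1):
        if index in page_flags:
            page_flags[index] &= ~PG_PRIVATE_2
    return page_flags

def bad_at_free(flags):
    """Model of the free-time check: Private2 must not be set any more."""
    return bool(flags & PG_PRIVATE_2)
```

A page left with flags 0x2008 fails the check at free time; after `cleanup_range()` only the uptodate bit remains and the page can be freed.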

Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:49:00 +02:00
Omar Sandoval
bea7eafdbd Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
fs_info->super_copy->{node,sector}size are little-endian, but the ioctl
should return the values in native endianness. Use the cached values in
btrfs_fs_info instead. Found with sparse.
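
The effect of skipping the conversion can be shown with a small user-space sketch. The nodesize value is only an example, and reading the bytes as big-endian stands in for what a big-endian host would see.

```python
import struct

nodesize = 16384

# The superblock stores the field as a 32-bit little-endian value.
on_disk = struct.pack("<I", nodesize)

# A big-endian host that copies the raw field out sees the bytes reversed.
raw_on_big_endian = struct.unpack(">I", on_disk)[0]

# Converting from little-endian first gives the right answer on any host.
converted = struct.unpack("<I", on_disk)[0]
```

On a little-endian machine the raw copy happens to be correct, which is why this kind of bug tends to be found by sparse rather than by testing.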

Fixes: 80a773fbfc ("btrfs: retrieve more info from FS_INFO ioctl")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:48:50 +02:00
Liu Bo
5f14efd3d4 Btrfs: do not reset bio->bi_opf while writing bio
flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but that's not needed at all since bio->bi_opf has already been set up
properly in both __extent_writepage() and write_one_eb(). In the case of
write_one_eb(), it also sets REQ_META, which we would lose in
flush_epd_write_bio().

This removes the unnecessary bio->bi_opf setting.
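
A minimal model of how the flag gets lost, assuming made-up bit positions (the real REQ_* values differ) and hypothetical function names that only mirror the old and new behaviour:

```python
# Hypothetical bit positions, for illustration only.
OP_WRITE = 1 << 0
REQ_SYNC = 1 << 1
REQ_META = 1 << 2

def write_one_eb_opf(sync):
    """Flags as set up by the metadata write path."""
    return OP_WRITE | REQ_META | (REQ_SYNC if sync else 0)

def flush_overwriting(opf, sync):
    """Old behaviour: rebuild the flags from scratch, dropping REQ_META."""
    return OP_WRITE | (REQ_SYNC if sync else 0)

def flush_preserving(opf, sync):
    """New behaviour: leave the flags exactly as the submitter set them."""
    return opf
```

Overwriting is harmless for data writes, where only REQ_SYNC matters, but silently strips the metadata hint from extent buffer writes.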

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:48:30 +02:00
Liu Bo
ff40adf7fb Btrfs: use the new helper wbc_to_write_flags
This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.

Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:48:14 +02:00
Linus Torvalds
e253d98f5b Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nowait read support from Al Viro:
 "Support IOCB_NOWAIT for buffered reads and block devices"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block_dev: support RWF_NOWAIT on block device nodes
  fs: support RWF_NOWAIT for buffered reads
  fs: support IOCB_NOWAIT in generic_file_buffered_read
  fs: pass iocb to do_generic_file_read
2017-09-14 19:29:55 -07:00
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save
  quite a bit of headache next cycle"
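
The whole-word renaming done by the sed script above can be sketched in Python, where `\b` plays roughly the role of GNU sed's `\<` and `\>`. The `convert()` helper is hypothetical and works on a string rather than on files.

```python
import re

# The flag suffixes listed in the sed invocation above.
SUFFIXES = [
    "RDONLY", "NOSUID", "NODEV", "NOEXEC", "SYNCHRONOUS", "MANDLOCK",
    "DIRSYNC", "NOATIME", "NODIRATIME", "SILENT", "POSIXACL",
    "KERNMOUNT", "I_VERSION", "LAZYTIME",
]
RENAMES = {"MS_" + s: "SB_" + s for s in SUFFIXES}

# Match only whole identifiers, so MS_RDONLY_FOO is left untouched.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")

def convert(source):
    """Replace each listed MS_* constant with its SB_* counterpart."""
    return PATTERN.sub(lambda m: RENAMES[m.group(1)], source)
```

The word boundaries matter: a plain substring replacement would also rewrite longer identifiers that merely start with one of the listed names.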

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds
581bfce969 Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more set_fs removal from Al Viro:
 "Christoph's 'use kernel_read and friends rather than open-coding
  set_fs()' series"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: unexport vfs_readv and vfs_writev
  fs: unexport vfs_read and vfs_write
  fs: unexport __vfs_read/__vfs_write
  lustre: switch to kernel_write
  gadget/f_mass_storage: stop messing with the address limit
  mconsole: switch to kernel_read
  btrfs: switch write_buf to kernel_write
  net/9p: switch p9_fd_read to kernel_write
  mm/nommu: switch do_mmap_private to kernel_read
  serial2002: switch serial2002_tty_write to kernel_{read/write}
  fs: make the buf argument to __kernel_write a void pointer
  fs: fix kernel_write prototype
  fs: fix kernel_read prototype
  fs: move kernel_read to fs/read_write.c
  fs: move kernel_write to fs/read_write.c
  autofs4: switch autofs4_write to __kernel_write
  ashmem: switch to ->read_iter
2017-09-14 18:13:32 -07:00
Linus Torvalds
e7cdb60fd2 Merge branch 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull zstd support from Chris Mason:
 "Nick Terrell's patch series to add zstd support to the kernel has been
  floating around for a while. After talking with Dave Sterba, Herbert
  and Phillip, we decided to send the whole thing in as one pull
  request.

  zstd is a big win in speed over zlib and in compression ratio over
  lzo, and the compression team here at FB has gotten great results
  using it in production. Nick will continue to update the kernel side
  with new improvements from the open source zstd userland code.

  Nick has a number of benchmarks for the main zstd code in his lib/zstd
  commit:

      I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB
      of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel
      Core i7 processor, 16 GB of RAM, and an SSD. I benchmarked using
      `silesia.tar` [3], which is 211,988,480 B large. Run the following
      commands for the benchmark:

        sudo modprobe zstd_compress_test
        sudo mknod zstd_compress_test c 245 0
        sudo cp silesia.tar zstd_compress_test

      The time is reported by the time of the userland `cp`.
      The MB/s is computed with

        211,988,480 B / time(buffer size, hash)

      which includes the time to copy from userland.
      The Adjusted MB/s is computed with

        211,988,480 B / (time(buffer size, hash) - time(buffer size, none)).

      The memory reported is the amount of memory the compressor
      requests.

        | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
        |----------|-----------|----------|-------|---------|----------|----------|
        | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
        | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
        | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
        | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
        | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
        | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
        | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
        | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
        | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
        | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
        | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

      I benchmarked zstd decompression using the same method on the same
      machine. The benchmark file is located in the upstream zstd repo
      under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The
      memory reported is the amount of memory required to decompress
      data compressed with the given compression level. If you know the
      maximum size of your input, you can reduce the memory usage of
      decompression irrespective of the compression level.

        | Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
        |----------|----------|---------|---------------|-------------|
        | none     |    0.025 | 8479.54 |             - |           - |
        | zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
        | zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
        | zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
        | zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
        | zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
        | zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
        | zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
        | zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
        | zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
        | zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

  I ran a long series of tests and benchmarks on the btrfs side and the
  gains are very similar to the core benchmarks Nick ran"
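
The derived columns of the compression table quoted above can be reproduced from the file size and the measured times. This is a small sketch, assuming MB means 10^6 bytes and using the 211,988,480 B size of `silesia.tar` stated in the quote; the function names are made up.

```python
FILE_SIZE_B = 211_988_480  # size of silesia.tar as stated in the quote

def mb_per_s(seconds):
    """Throughput including the time to copy from userland."""
    return FILE_SIZE_B / 1e6 / seconds

def adjusted_mb_per_s(seconds, baseline_seconds):
    """Throughput with the 'none' (copy only) time subtracted."""
    return FILE_SIZE_B / 1e6 / (seconds - baseline_seconds)

def ratio(compressed_size_b):
    """Compression ratio: original size over compressed size."""
    return FILE_SIZE_B / compressed_size_b
```

For the `zstd -1` row, `mb_per_s(1.044)` and `adjusted_mb_per_s(1.044, 0.100)` come out at about 203.05 and 224.56, and `ratio(73645762)` at about 2.878, matching the table.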

* 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  squashfs: Add zstd support
  btrfs: Add zstd support
  lib: Add zstd modules
  lib: Add xxhash module
2017-09-14 17:30:49 -07:00
Linus Torvalds
66ba772ee3 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The changes range through all types: cleanups, core changes, sanity
  checks, fixes, other user visible changes, detailed list below:

   - deprecated: user transaction ioctl

   - mount option ssd does not change allocation alignments

   - degraded read-write mount is allowed if all the raid profile
     constraints are met, now based on more accurate check

   - defrag: do not reset compression afterwards; the NOCOMPRESS flag
     can be now overridden by defrag

   - prep work for better extent reference tracking (related to the
     qgroup slowness with balance)

   - prep work for compression heuristics

   - memory allocation reductions (may help latencies on a loaded
     system)

   - better accounting for io waiting states

   - error handling improvements (removed BUGs)

   - added more sanity checks for shared refs

   - fix readdir vs pagefault deadlock under some circumstances

   - fix for 'no-hole' mode, certain combination of compressed and
     inline extents

   - send: fix emission of invalid clone operations

   - fixup file mode if setting acls fails

   - more fixes from fuzzing

   - other cleanups"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
  btrfs: submit superblock io with REQ_META and REQ_PRIO
  btrfs: remove unnecessary memory barrier in btrfs_direct_IO
  btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
  btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
  btrfs: pass fs_info to btrfs_del_root instead of tree_root
  Btrfs: add one more sanity check for shared ref type
  Btrfs: remove BUG_ON in __add_tree_block
  Btrfs: remove BUG() in add_data_reference
  Btrfs: remove BUG() in print_extent_item
  Btrfs: remove BUG() in btrfs_extent_inline_ref_size
  Btrfs: convert to use btrfs_get_extent_inline_ref_type
  Btrfs: add a helper to retrieve extent inline ref type
  btrfs: scrub: simplify scrub worker initialization
  btrfs: scrub: clean up division in scrub_find_csum
  btrfs: scrub: clean up division in __scrub_mark_bitmap
  btrfs: scrub: use bool for flush_all_writes
  btrfs: preserve i_mode if __btrfs_set_acl() fails
  btrfs: Remove extraneous chunk_objectid variable
  btrfs: Remove chunk_objectid argument from btrfs_make_block_group
  btrfs: Remove extra parentheses from condition in copy_items()
  ...
2017-09-09 13:27:51 -07:00
Linus Torvalds
a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device removal. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Christoph Hellwig
8e93157bdd btrfs: switch write_buf to kernel_write
Instead of playing with the addressing limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig
91f9943e1c fs: support RWF_NOWAIT for buffered reads
This is based on the old idea and code from Milosz Tanski.  With the aio
nowait code it becomes mostly trivial now.  Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:23 -04:00
Omar Sandoval
58efbc9f54 Btrfs: fix blk_status_t/errno confusion
This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.

In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.

The changes were identified using 'sparse'; we don't have reports of the
buggy behaviour.
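
The confusion can be illustrated with a toy model of the two value spaces. The status numbers below are illustrative assumptions, and the two conversion helpers are hypothetical stand-ins for the kernel's conversion functions.

```python
import errno

# Illustrative status codes; treat the exact numbers as assumptions.
BLK_STS_OK = 0
BLK_STS_NOSPC = 3
BLK_STS_IOERR = 10

_ERRNO_TO_STATUS = {
    0: BLK_STS_OK,
    -errno.ENOSPC: BLK_STS_NOSPC,
    -errno.EIO: BLK_STS_IOERR,
}
_STATUS_TO_ERRNO = {status: err for err, status in _ERRNO_TO_STATUS.items()}

def errno_to_status(err):
    """Map a negative errno to a block status, defaulting to an I/O error."""
    return _ERRNO_TO_STATUS.get(err, BLK_STS_IOERR)

def status_to_errno(status):
    """Map a block status back to a negative errno, defaulting to -EIO."""
    return _STATUS_TO_ERRNO.get(status, -errno.EIO)
```

Success is 0 in both spaces, so a missing conversion goes unnoticed until an error occurs, at which point the raw number means something different on each side.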

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-24 17:19:02 +02:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Christoph Hellwig
f8f84b2dfd btrfs: index check-integrity state hash by a dev_t
We won't have the struct block_device available in the bio soon, so switch
to the numerical dev_t instead of the block_device pointer for looking up
the check-integrity state.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:47 -06:00
David Sterba
db95c876c5 btrfs: submit superblock io with REQ_META and REQ_PRIO
The superblock is also metadata of the filesystem so the relevant IO
should be tagged as such. We also tag it as high priority, as it's the
last block committed for metadata from a given transaction. Any delays
would effectively block the whole transaction, also blocking any other
operation holding the device_list_mutex.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-22 13:22:05 +02:00
Nikolay Borisov
dc59215d4f btrfs: remove unnecessary memory barrier in btrfs_direct_IO
Commit 38851cc19a ("Btrfs: implement unlocked dio write") implemented
unlocked dio write, allowing multiple dio writers to write to
non-overlapping, and non-eof-extending regions. In doing so it also
introduced a broken memory barrier. It is broken due to 2 things:

1. Memory barriers _MUST_ always be paired, this is clearly not the case
   here

2. Checkpatch actually produces a warning if a memory barrier is
   introduced that doesn't have a comment explaining how it's being
   paired.

Specifically for inode::i_dio_count that's wrapped inside
inode_dio_begin, there are no explicit barrier semantics attached, so
removing it is fine as the atomic is used in the common waiter/wakeup
pattern.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:49:21 +02:00
Nikolay Borisov
b5d9071c4f btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
Currently this function is always called with the object id of the root
key of the chunk_tree, which is always BTRFS_CHUNK_TREE_OBJECTID. So
let's subsume it straight into the function itself. No functional
change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:30:30 +02:00
Nikolay Borisov
0ca00afb2b btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
The function is always called with chunk_objectid set to
BTRFS_FIRST_CHUNK_TREE_OBJECTID. Let's collapse the parameter in the
function itself. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:30:16 +02:00
Jeff Mahoney
1cd5447eb6 btrfs: pass fs_info to btrfs_del_root instead of tree_root
btrfs_del_root always uses the tree_root.  Let's pass fs_info instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:49:54 +02:00
Liu Bo
64ecdb647d Btrfs: add one more sanity check for shared ref type
Every shared ref has a parent tree block, which can be obtained from
btrfs_extent_inline_ref_offset().  The tree block must be aligned
to the nodesize, so we know this inline ref is not valid if this
block's bytenr is not aligned to the nodesize, in which case, most
likely the ref type has been misused.
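
The alignment test itself is simple. A minimal sketch, with hypothetical function names and an example nodesize of 16384:

```python
def is_aligned(value, alignment):
    """True if `value` is a multiple of `alignment`."""
    return value % alignment == 0

def shared_ref_parent_valid(parent_bytenr, nodesize):
    """A shared ref's parent must be a tree block, so it must start on a
    nodesize boundary. Anything else means the ref is not a valid shared
    ref (most likely the type was misused)."""
    return is_aligned(parent_bytenr, nodesize)
```

The check is cheap and catches a ref whose offset field was actually meant to be interpreted as something other than a parent block address.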

This adds the above mentioned check and also updates
print_extent_item() called by btrfs_print_leaf() to point out the
invalid ref while printing the tree structure.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
cdccee993f Btrfs: remove BUG_ON in __add_tree_block
The BUG_ON() can be triggered when the caller is processing an invalid
extent inline ref, e.g.

a shared data ref is offered instead of an extent data ref, such that
it tries to find a non-existent tree block and then btrfs_search_slot
returns 1 for no such item.

This replaces the BUG_ON() with a WARN() followed by calling
btrfs_print_leaf() to show more details about what's going on and
returning -EINVAL to upper callers.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
b14c55a191 Btrfs: remove BUG() in add_data_reference
Now that we have a helper to report invalid value of extent inline ref
type, we need to quit gracefully instead of throwing out a kernel panic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
07638ea598 Btrfs: remove BUG() in print_extent_item
btrfs_print_leaf() is used in btrfs_get_extent_inline_ref_type, so
here we really want to print the invalid value of ref type instead of
causing a kernel panic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
4335958de2 Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Now that btrfs_get_extent_inline_ref_type() can report if type is a
valid one and all callers can gracefully deal with that, we don't need
to crash here.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
3de28d579e Btrfs: convert to use btrfs_get_extent_inline_ref_type
Since we have a helper which can do the sanity check, this converts all
btrfs_extent_inline_ref_type callers to it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
167ce953ca Btrfs: add a helper to retrieve extent inline ref type
An invalid value of extent inline ref type may be read from a
malicious image, which may force btrfs to crash.

This adds a helper which does a sanity check for the ref type: if the
type is sane, it returns the type, otherwise it returns an error.
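
A rough sketch of such a validating helper follows. The key values reflect my understanding of the btrfs on-disk format and should be treated as assumptions, as should the `REF_TYPE_INVALID` sentinel, the `is_data` parameter, and the function name.

```python
# Key type values as I understand the on-disk format (assumptions).
BTRFS_TREE_BLOCK_REF_KEY = 176
BTRFS_EXTENT_DATA_REF_KEY = 178
BTRFS_SHARED_BLOCK_REF_KEY = 182
BTRFS_SHARED_DATA_REF_KEY = 184

REF_TYPE_INVALID = 0  # hypothetical "error" sentinel

VALID_BLOCK_TYPES = {BTRFS_TREE_BLOCK_REF_KEY, BTRFS_SHARED_BLOCK_REF_KEY}
VALID_DATA_TYPES = {BTRFS_EXTENT_DATA_REF_KEY, BTRFS_SHARED_DATA_REF_KEY}

def get_inline_ref_type(raw_type, is_data):
    """Return `raw_type` if it is sane for this kind of extent,
    otherwise REF_TYPE_INVALID so the caller can bail out gracefully."""
    valid = VALID_DATA_TYPES if is_data else VALID_BLOCK_TYPES
    return raw_type if raw_type in valid else REF_TYPE_INVALID
```

Callers then check for the sentinel and return an error instead of dereferencing a structure of the wrong kind.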

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimal tweak const types, causing warnings due to other cleanup patches ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
af1cbe0a66 btrfs: scrub: simplify scrub worker initialization
Minor simplification, merge calls to one.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
1d1bf92d9d btrfs: scrub: clean up division in scrub_find_csum
Use proper helpers for 64bit division.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
7736b0a431 btrfs: scrub: clean up division in __scrub_mark_bitmap
Use proper helpers for 64bit division and then cast to narrower type.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
2073c4c2e5 btrfs: scrub: use bool for flush_all_writes
flush_all_writes is an atomic but does not use the atomic semantics at
all; it's just an on/off indicator, so we can use bool.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Ernesto A. Fernández
d7d8249665 btrfs: preserve i_mode if __btrfs_set_acl() fails
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.
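
The restore-on-failure pattern can be sketched as follows. This is a user-space model only: the `Inode` class, `set_acl()` and the two storage callbacks are hypothetical, and the xattr write is represented by a callable returning 0 or a negative errno.

```python
import errno

class Inode:
    def __init__(self, mode):
        self.mode = mode

def set_acl(inode, new_mode, store_xattr):
    """Update the mode first, then store the ACL; undo the mode on failure."""
    old_mode = inode.mode
    inode.mode = new_mode
    ret = store_xattr()
    if ret:
        # The ACL was not stored, so the mode must not keep the mask bits.
        inode.mode = old_mode
    return ret

def failing_store():
    """Stand-in for an xattr write that runs out of space."""
    return -errno.ENOSPC

def working_store():
    return 0
```

With `failing_store`, the inode ends up with exactly the mode it started with, so the mask bits are never mistaken for real group permissions.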

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Nikolay Borisov
408fbf19ad btrfs: Remove extraneous chunk_objectid variable
BTRFS_FIRST_CHUNK_TREE_OBJECTID is the only objectid being used in the
chunk_tree. So remove a variable which is always set to that value and collapse
its usage in callees which are passed this variable. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Nikolay Borisov
0174484d61 btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs_make_block_group is always called with chunk_objectid set to
BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will
change anytime soon, so let's remove the argument and decrease the cognitive
load when reading the code path. No functional change

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Matthias Kaehlcke
0dde10bed2 btrfs: Remove extra parentheses from condition in copy_items()
There is no need for the extra pair of parentheses, remove it. This
fixes the following warning when building with clang:

fs/btrfs/tree-log.c:3694:10: warning: equality comparison with extraneous
  parentheses [-Wparentheses-equality]
                if ((i == (nr - 1)))
                     ~~^~~~~~~~~~~

Also remove the unnecessary parentheses around the subtraction.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Nikolay Borisov
0ce1dd2a4a btrfs: Remove redundant setting of uuid in btrfs_block_header
btrfs_alloc_dev_extent currently unconditionally sets the uuid in the
leaf block header the function is working with. This is unnecessary
since this operation is performed by the core btree handling code
(splitting a node, allocating a new btree block etc). So let's remove
it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Hans van Kranenburg
583b723151 btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd.  In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode.  The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings.  The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

    The SSD story...

In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
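
The difference between the two placement strategies can be sketched with a toy free-space list. This is a deliberately simplified model with hypothetical function names: it only covers the "fully free cluster" case described for ssd_spread, and does not model plain ssd mode composing a cluster from several free extents.

```python
MiB = 1024 * 1024

def alloc_tetris(free_extents, size):
    """nossd-style: use the first free fragment the write fits into."""
    for start, length in free_extents:
        if length >= size:
            return start
    return None

def alloc_clustered(free_extents, size, cluster_size=2 * MiB):
    """Cluster-style: only accept a free extent that can hold a whole
    cluster. Returning None stands for 'allocate a new chunk instead'."""
    needed = max(size, cluster_size)
    for start, length in free_extents:
        if length >= needed:
            return start
    return None
```

With free space of `[(0, 512 KiB), (4 MiB, 1 MiB)]`, a 256 KiB write fits at offset 0 in tetris mode, while the clustered allocator finds nothing and would ask for a new chunk, leaving both fragments unused.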

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

    However, there's a number of problems with this approach.

First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.

Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
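
For reference, this is roughly how a userspace tool could read the flag.
A minimal sketch: read_rotational_flag() is a hypothetical helper, and the
caller is assumed to pass a path of the form
/sys/block/<device>/queue/rotational. As argued above, the value says
nothing reliable about whether the device is a locally attached ssd.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the file reports a rotational
 * device, 0 if it reports a non-rotational one, and -1 if the file
 * cannot be read or holds something else.
 */
static int read_rotational_flag(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return (value == 0 || value == 1) ? value : -1;
}
```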

The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.

    Changes made in the code

1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.

    Backporting notes

Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
  remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
  for example, instead of using fs_info, it's root->fs_info in
  extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Lu Fengqi
43a0111103 btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flush
Although this bio has no data attached, it will reach this condition
(bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state
in __btrfsic_submit_bio. So we should still submit it through integrity
checker. Otherwise, the integrity checker will throw the following warning
when I mount a newly created btrfs filesystem.

[10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
[10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Filipe Manana
72610b1b40 Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send it's possible that the computed send stream
contains clone operations that will fail on the receiver if the receiver
has compression enabled and the clone operations target a sector sized
extent that starts at a zero file offset, is not compressed on the source
filesystem but ends up being compressed and inlined at the destination
filesystem.

Example scenario:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt

  # By doing a direct IO write, the data is not compressed.
  $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar
  $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1

  $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar
  $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

  $ btrfs send -f /tmp/1.snap /mnt/mysnap1
  $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdc
  $ mount -o compress /dev/sdc /mnt
  $ btrfs receive -f /tmp/1.snap /mnt
  $ btrfs receive -f /tmp/2.snap /mnt
  ERROR: failed to clone extents to foobar
  Operation not supported

The same could be achieved by mounting the source filesystem without
compression and doing a buffered IO write instead of a direct IO one,
and mounting the destination filesystem with compression enabled.

So fix this by issuing regular write operations in the send stream
instead of clone operations when the source offset is zero and the
range has a length matching the sector size.
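
The condition described above can be sketched as a small predicate. This
is only an illustration in plain C, not the code from the patch:
must_send_as_write() and its parameter names are made up here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: true when a would-be clone operation should
 * be replaced by a regular write in the send stream, because the
 * receiver may have stored that range as a compressed inline extent,
 * which cannot be a clone source.
 */
static bool must_send_as_write(uint64_t clone_src_offset, uint64_t len,
			       uint32_t sectorsize)
{
	return clone_src_offset == 0 && len == sectorsize;
}
```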

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Liu Bo
f716abd55d Btrfs: fix out of bounds array access while reading extent buffer
There is a corner case that slips through the checkers in functions
reading extent buffer, ie.

if (start < eb->len) and (start + len > eb->len),
then

a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,

b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).

The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.

It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref".  And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.

This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.
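
The kind of check that closes this corner case can be sketched in
userspace C as follows. struct eb_sketch and eb_range_ok() are
hypothetical stand-ins, not the kernel's extent buffer code; the point is
only that the range is compared against the buffer length itself, in a
way that cannot wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of an extent buffer. */
struct eb_sketch {
	size_t len;	/* number of valid bytes in the buffer */
};

/*
 * Return true only if [start, start + len) lies entirely inside the
 * buffer. The subtraction form avoids overflow in start + len.
 */
static bool eb_range_ok(const struct eb_sketch *eb, size_t start, size_t len)
{
	assert(eb != NULL);
	if (start > eb->len)
		return false;
	return len <= eb->len - start;
}
```

A range with start < eb->len but start + len > eb->len, which is the case
described above, is rejected by the second comparison.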

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:14 +02:00
Nikolay Borisov
c59efa7eb2 btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2
The buffer passed to btrfs_ioctl_tree_search* functions has to be at least
sizeof(struct btrfs_ioctl_search_header). If this is not the case then the
ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum
required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW
error with ->buf_size being set to the value passed by user space. Fix this by
removing the size check and relying on search_ioctl, which already includes it
and correctly sets buf_size.
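
The intended contract can be sketched like this, in plain userspace C.
The struct and check_search_buf() are hypothetical stand-ins for
illustration and do not mirror the kernel's search_ioctl code: when the
buffer is too small, the caller gets -EOVERFLOW and the minimum size is
written back.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_ioctl_search_header. */
struct search_header_sketch {
	unsigned long long transid, objectid, offset;
	unsigned int type, len;
};

/*
 * Validate a caller-supplied buffer size. On failure, report the
 * minimum required size through *buf_size so the caller can retry.
 */
static int check_search_buf(size_t *buf_size)
{
	assert(buf_size != NULL);
	if (*buf_size < sizeof(struct search_header_sketch)) {
		*buf_size = sizeof(struct search_header_sketch);
		return -EOVERFLOW;
	}
	return 0;
}
```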

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
e6961cac73 btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio
Currently the code checks whether we should do data checksumming in
btrfs_submit_direct and the boolean result of this check is passed to
btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
actually consumes it. The last function actually has all the necessary context
to figure out whether to skip the check or not, so let's move the check closer
to where it's being consumed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Filipe Manana
6399fb5a0b Btrfs: fix assertion failure during fsync in no-holes mode
When logging an inode in full mode that has an inline compressed extent
that represents a range with a size matching the sector size (currently
the same as the page size), has a trailing hole and the no-holes feature
is enabled, we end up failing an assertion leading to a trace like the
following:

[141812.031528] assertion failed: len == i_size, file: fs/btrfs/tree-log.c, line: 4453
[141812.033069] ------------[ cut here ]------------
[141812.034330] kernel BUG at fs/btrfs/ctree.h:3452!
[141812.035137] invalid opcode: 0000 [#1] PREEMPT SMP
[141812.035932] Modules linked in: btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod dax ppdev evdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 tpm_tis psmouse crypto_simd parport_pc sg pcspkr tpm_tis_core cryptd parport serio_raw glue_helper tpm i2c_piix4 i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod ata_generic virtio_scsi ata_piix floppy crc32c_intel libata scsi_mod virtio_pci virtio_ring e1000 virtio [last unloaded: btrfs]
[141812.036790] CPU: 3 PID: 845 Comm: fdm-stress Tainted: G    B   W       4.12.3-btrfs-next-52+ #1
[141812.036790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[141812.036790] task: ffff8801e6694180 task.stack: ffffc90009004000
[141812.036790] RIP: 0010:assfail.constprop.18+0x1c/0x1e [btrfs]
[141812.036790] RSP: 0018:ffffc90009007bc0 EFLAGS: 00010282
[141812.036790] RAX: 0000000000000046 RBX: ffff88017512c008 RCX: 0000000000000001
[141812.036790] RDX: ffff88023fd95201 RSI: ffffffff8182264c RDI: 00000000ffffffff
[141812.036790] RBP: ffffc90009007bc0 R08: 0000000000000001 R09: 0000000000000001
[141812.036790] R10: 0000000000001000 R11: ffffffff82f5a0c9 R12: ffff88014e5947e8
[141812.036790] R13: 00000000000b4000 R14: ffff8801b234d008 R15: 0000000000000000
[141812.036790] FS:  00007fdba6ffd700(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
[141812.036790] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[141812.036790] CR2: 00007fdb9c000010 CR3: 000000016efa2000 CR4: 00000000001406e0
[141812.036790] Call Trace:
[141812.036790]  btrfs_log_inode+0x9f0/0xd3d [btrfs]
[141812.036790]  ? __mutex_lock+0x120/0x3ce
[141812.036790]  btrfs_log_inode_parent+0x224/0x685 [btrfs]
[141812.036790]  ? lock_acquire+0x16b/0x1af
[141812.036790]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
[141812.036790]  btrfs_sync_file+0x32e/0x3f8 [btrfs]
[141812.036790]  vfs_fsync_range+0x8a/0x9d
[141812.036790]  vfs_fsync+0x1c/0x1e
[141812.036790]  do_fsync+0x31/0x4a
[141812.036790]  SyS_fdatasync+0x13/0x17
[141812.036790]  entry_SYSCALL_64_fastpath+0x18/0xad
[141812.036790] RIP: 0033:0x7fdbac41a47d
[141812.036790] RSP: 002b:00007fdba6ffce30 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[141812.036790] RAX: ffffffffffffffda RBX: ffffffff81092c9f RCX: 00007fdbac41a47d
[141812.036790] RDX: 0000004cf0160a40 RSI: 0000000000000000 RDI: 0000000000000006
[141812.036790] RBP: ffffc90009007f98 R08: 0000000000000000 R09: 0000000000000010
[141812.036790] R10: 00000000000002e8 R11: 0000000000000293 R12: ffffffff8110cd90
[141812.036790] R13: ffffc90009007f78 R14: 0000000000000000 R15: 0000000000000000
[141812.036790]  ? time_hardirqs_off+0x9/0x14
[141812.036790]  ? trace_hardirqs_off_caller+0x1f/0xa3
[141812.036790] Code: c7 d6 61 6b a0 48 89 e5 e8 ba ef a8 e0 0f 0b 55 89 f1 48 c7 c2 6d 65 6b a0 48 89 fe 48 c7 c7 81 65 6b a0 48 89 e5 e8 9c ef a8 e0 <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89
[141812.036790] RIP: assfail.constprop.18+0x1c/0x1e [btrfs] RSP: ffffc90009007bc0
[141812.084448] ---[ end trace 44e472684c7a32cc ]---

Which happens because the code that logs a trailing hole when the no-holes
feature is enabled, did not consider that a compressed inline extent can
represent a range with a size matching the sector size, in which case
expanding the inode's i_size, through a truncate operation, won't lead
to padding with zeroes the page that represents the inline extent, and
therefore the inline extent remains after the truncation.

Fix this by adapting the assertion to accept inline extents representing
data with a sector size length if, and only if, the inline extents are
compressed.
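
In other words, the relaxed condition has roughly the following shape.
This is a sketch with made-up names (inline_extent_len_ok() and its
parameters), not the assertion as it appears in tree-log.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: an inline extent is acceptable here if it
 * covers the whole file, or if it is compressed and represents exactly
 * one sector of data.
 */
static bool inline_extent_len_ok(uint64_t len, uint64_t i_size,
				 uint32_t sectorsize, bool compressed)
{
	return len == i_size || (len == sectorsize && compressed);
}
```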

A sample and trivial reproducer (for systems with a 4K page size) for this
issue:

  mkfs.btrfs -O no-holes -f /dev/sdc
  mount -o compress /dev/sdc /mnt
  xfs_io -f -c "pwrite -S 0xab 0 4K" /mnt/foobar
  sync
  xfs_io -c "truncate 32K" /mnt/foobar
  xfs_io -c "fsync" /mnt/foobar

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Filipe Manana
4a4b964f42 Btrfs: avoid unnecessarily locking inode when clearing a range
If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Colin Ian King
938e1c77f8 btrfs: remove redundant check on ret being non-zero
The error return variable ret is initialized to zero and then is
checked to see if it is non-zero in the if-block that follows it.
It is therefore impossible for ret to be non-zero after the if-block
hence the check is redundant and can be removed.

Detected by CoverityScan, CID#1021040 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
2d77ab3cfb btrfs: expose internal free space tree routine only if sanity tests are enabled
The internal free space tree management routines are always exposed for
testing purposes. Make them dependent on SANITY_TESTS being on so that
they are exposed only when they really have to.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
db7c942ce8 btrfs: Remove unused sectorsize variable from struct map_lookup
This variable was added in 1abe9b8a13 ("Btrfs: add initial tracepoint
support for btrfs"), yet it never really got used, only assigned to. So
let's remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
92ac58ec99 btrfs: Remove never-reached WARN_ON
We have a WARN_ON(!var) inside an if branch which is executed (among
others) only when var is true.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Anand Jain
dc2f29212a btrfs: remove unused BTRFS_COMPRESS_LAST
We aren't using this define, so removing it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Anand Jain
44880fdc91 btrfs: use appropriate define for the fsid
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we
should use the matching constant for the fsid buffer.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Josef Bacik
42e9cc46fb btrfs: increase ctx->pos for delayed dir index
Our dir_context->pos is supposed to hold the next position we're
supposed to look at.  If we successfully insert a delayed dir index we
could end up with a duplicate entry because we don't increase ctx->pos
after doing the dir_emit.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:20 +02:00
Josef Bacik
23b5ec7494 btrfs: fix readdir deadlock with pagefault
Readdir does dir_emit while under the btree lock.  dir_emit can trigger
a page fault which means we can deadlock.  Fix this by allocating a
buffer on opening a directory and copying the readdir into this buffer
and doing dir_emit from outside of the tree lock.

Thread A
readdir  <holding tree lock>
  dir_emit
    <page fault>
      down_read(mmap_sem)

Thread B
mmap write
  down_write(mmap_sem)
    page_mkwrite
      wait_ordered_extents

Process C
finish_ordered_extent
  insert_reserved_file_extent
   try to lock leaf <hang>

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy the deadlock scenario to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Nikolay Borisov
8d8aafeea2 btrfs: Simplify math in should_alloc chunk
Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to
signify the total amount of bytes in this space info. However, given
Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation
of num_allocated it becomes a lot more clear to just eliminate num_bytes
altogether and add the bytes_readonly to the amount of used space. That
way we don't change the results of the following statements. In the
process also start using btrfs_space_info_used.
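
The algebra behind "we don't change the results" can be checked with a
small sketch. The struct and both helpers are hypothetical and only model
the free space difference, not the full should_alloc_chunk logic; they
also assume the counters never exceed total_bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the space info counters. */
struct space_info_sketch {
	uint64_t total_bytes, bytes_used, bytes_reserved,
		 bytes_pinned, bytes_readonly, bytes_may_use;
};

/* Old form: take read-only bytes off the total first. */
static uint64_t free_bytes_old(const struct space_info_sketch *s)
{
	uint64_t num_bytes = s->total_bytes - s->bytes_readonly;
	uint64_t allocated = s->bytes_used + s->bytes_reserved +
			     s->bytes_pinned + s->bytes_may_use;

	assert(allocated <= num_bytes);
	return num_bytes - allocated;
}

/* New form: count read-only bytes as used, compare with the total. */
static uint64_t free_bytes_new(const struct space_info_sketch *s)
{
	uint64_t used = s->bytes_used + s->bytes_reserved +
			s->bytes_pinned + s->bytes_may_use +
			s->bytes_readonly;

	assert(used <= s->total_bytes);
	return s->total_bytes - used;
}
```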

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Jeff Mahoney
f44d2287d2 btrfs: account for pinned bytes in should_alloc_chunk
In a heavy write scenario, we can end up with a large number of pinned bytes.
This can translate into (very) premature ENOSPC because pinned bytes
must be accounted for when allowing a reservation but aren't accounted for
when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
a7164fa4e0 btrfs: prepare for extensions in compression options
This is a minimal patch intended to be backported to older kernels.
We're going to extend the string specifying the compression method and
this would fail on kernels before that change (the string is compared
exactly).

Relax the string matching only to the prefix, ie. ignoring anything that
goes after "zlib" or "lzo", regardless of the format extension we decide
to use. This applies to the mount options and properties.
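
A minimal sketch of prefix-only matching, in userspace C. The enum and
parse_compress_prefix() are hypothetical names, and the "zlib:9" string
used in the note below is only an example of some future suffix, not a
format defined by this patch.

```c
#include <assert.h>
#include <string.h>

enum compress_sketch { COMPRESS_UNKNOWN, COMPRESS_ZLIB, COMPRESS_LZO };

/*
 * Hypothetical parser: only the leading "zlib" or "lzo" is compared,
 * anything after the prefix is ignored.
 */
static enum compress_sketch parse_compress_prefix(const char *s)
{
	assert(s != NULL);
	if (strncmp(s, "zlib", 4) == 0)
		return COMPRESS_ZLIB;
	if (strncmp(s, "lzo", 3) == 0)
		return COMPRESS_LZO;
	return COMPRESS_UNKNOWN;
}
```

With an exact comparison, a string such as "zlib:9" would be rejected;
with the prefix comparison it is still recognized as zlib.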

That way, patched old kernels could be booted on systems already
utilizing the new compression spec.

Applicable since commit 63541927c8, v3.14.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
1e20d1c45f btrfs: allow defrag compress to override NOCOMPRESS attribute
Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a
given file, except when the mount is force-compress. As users have
reported on IRC, this will also prevent compression when requested by
defrag (btrfs fi defrag -c file).

The nocompress flag is set automatically by filesystem when the ratios
are bad and the user would have to manually drop the bit in order to
make defrag -c work. This is not good from the usability perspective.

This patch will raise priority for the defrag -c over nocompress, ie.
any file with NOCOMPRESS bit set will get defragmented. The bit will
remain untouched.

Alternate option was to also drop the nocompress bit and keep the
decision logic as is, but I think this is not the right solution.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
1e2ef46d89 btrfs: defrag: cleanup checking for compression status
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
eec63c65dc btrfs: separate defrag and property compression
Add new value for compression to distinguish between defrag and
property. Previously, a single variable was used and this caused clashes
when the per-file 'compression' was set and a defrag -c was called.

The property-compression is loaded when the file is open, defrag will
overwrite the same variable and reset it to 0 (ie. NONE) when the file
defragmentation is finished. That's considered a usability bug.

Now we won't touch the property value, use the defrag-compression. The
precedence of defrag is higher than for property (and whole-filesystem).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
b52aa8c93e btrfs: rename variable holding per-inode compression type
This is preparatory for separating inode compression requested by defrag
and set via properties. This will fix a usability bug when defrag will
reset compression type to NONE. If the file has compression set via
property, it will not apply anymore (until next mount or reset through
command line).

We're going to fix that by adding another variable just for the defrag
call and won't touch the property. The defrag will have higher priority
when deciding whether to compress the data.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Timofey Titovets
c2fcdcdf36 Btrfs: add skeleton code for compression heuristic
Add skeleton code for compression heuristics. Now it iterates over all
the pages, but in the end always says "yes, compress please", ie it does
not change the current behaviour.

In the future we're going to add various heuristics to analyze the data.
This patch can be used as a baseline for measuring the effectiveness
and performance.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhanced changelog, modified comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
131ce4367a btrfs: account that we're waiting for IO in scrub_submit_raid56_bio_wait
Correctly account for IO when waiting for a submitted bio in scrub. This
is only for accounting purposes and should not change other behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
9c17f6cda1 btrfs: account that we're waiting for DIO read
Correctly account for IO when waiting for a submitted DIO read, the case
when we're retrying.  This is only for accounting purposes and should
not change other behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
4958aa6821 btrfs: drop chunk locks at the end of close_ctree
The pinned chunks might be left over so we clean them up, but at this
point of close_ctree there's no one to race with, so the locking can be
removed.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
d3c0bab563 btrfs: remove trivial wrapper btrfs_force_ra
It's a simple call to page_cache_sync_readahead, with the same arguments
in the same order.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
35dc313046 btrfs: drop ancient page flag mappings
There's no PageFsMisc. Added by patch 4881ee5a2e in 2008, the flag is
not present in current kernels.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
ea14b57fd1 btrfs: fix spelling of snapshotting
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
Nikolay Borisov
e38ae7a086 btrfs: Make flush_space return void
The return value of flush_space used to have significance in the
early days when the code was first introduced and before the ticketed
enospc rework. Since the latter got introduced the return value lost any
significance whatsoever to its callers. So let's remove it. While at it
also remove the unused ticket variable in
btrfs_async_reclaim_metadata_space. It was used in the initial version
of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem
with this and fixed it in ce129655c9 ("btrfs: introduce tickets_id to
determine whether asynchronous metadata reclaim work makes progress").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
Nikolay Borisov
3558d4f88e btrfs: Deprecate userspace transaction ioctls
Userspace transactions were introduced in commit 6bf13c0cc8 ("Btrfs:
transaction ioctls") to provide semantics that Ceph's object store
required. However, things have changed significantly since then, to the
point where btrfs is no longer suitable as a backend for ceph and in
fact it's actively advised against such usages. Considering this, there
doesn't seem to be a widespread, legit use case of userspace
transaction. They also clutter the file->private pointer.

So to end the agony let's nuke the userspace transaction ioctls. As a
first step let's give time for people to voice their objection by just
WARN()ing when the userspace transaction is used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the warning past perm checks, keep the has-been-printed state;
  we're ok with just one warning over all filesystems ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
9f6d251033 btrfs: use named constant for bdev blocksize
Superblock is read and written using buffer heads, we need to set the
bdev blocksize. The magic constant has been hardcoded in several places,
so replace it with a named constant.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
abbb3b8ebf btrfs: split write_dev_supers to two functions
There are two independent parts, one that writes the superblocks and
another that waits for completion. No functional changes, but cleanups,
reformatting and comment updates.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
35c70103a5 btrfs: refactor find_device helper
Polish the helper:
* drop underscores, no special meaning here
* pass fs_devices, as this is what the API implements
* drop noinline, no apparent reason for such simple helper
* constify uuid
* add comment

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
2dfeca9bfb btrfs: merge alloc_device helpers
There are two helpers called in chain from one location, we can merge the
functionality.

Originally, alloc_fs_devices could fill the device uuid randomly if we
didn't give the uuid buffer. This happens for seed devices but the
fsid is generated in btrfs_prepare_sprout, so we can remove it.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
4b81ba48c6 btrfs: merge REQ_OP and REQ_ flags to one parameter in submit_extent_page
The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.
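
The idea of carrying the operation and its flags in one value can be
sketched as follows. The bit layout and every name ending in _SKETCH are
assumptions made for this example, not the block layer's definitions.

```c
#include <assert.h>

/* Hypothetical layout: low 8 bits hold the operation, the rest flags. */
#define OP_BITS_SKETCH		8
#define OP_MASK_SKETCH		((1u << OP_BITS_SKETCH) - 1)

enum { OP_READ_SKETCH = 0, OP_WRITE_SKETCH = 1 };
#define FLAG_SYNC_SKETCH	(1u << OP_BITS_SKETCH)
#define FLAG_META_SKETCH	(1u << (OP_BITS_SKETCH + 1))

/* Combine an operation and its flags into a single parameter. */
static unsigned int make_opf(unsigned int op, unsigned int op_flags)
{
	/* The two halves must not overlap. */
	assert((op & ~OP_MASK_SKETCH) == 0);
	assert((op_flags & OP_MASK_SKETCH) == 0);
	return op | op_flags;
}

/* Recover the operation from the combined value. */
static unsigned int opf_op(unsigned int opf)
{
	return opf & OP_MASK_SKETCH;
}

/* Recover the flags from the combined value. */
static unsigned int opf_flags(unsigned int opf)
{
	return opf & ~OP_MASK_SKETCH;
}
```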

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
f1c77c55cd btrfs: cleanup types storing REQ_*
Unify types of local variables and parameters that store various
REQ_* values to unsigned int.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
abe60ba45c btrfs: get fs_info from eb in btrfs_print_tree, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
a4f78750ef btrfs: get fs_info from eb in btrfs_print_leaf, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
f1b8a1e8c0 btrfs: simplify btrfs_dev_replace_kthread
This function prints an informative message and then continues
dev-replace. The message contains a progress percentage which is read
from the status. The status is allocated dynamically, about 2600 bytes,
just to read the single value. That's an overkill. We'll use the new
helper and drop the allocation.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
74b595fe67 btrfs: factor reading progress out of btrfs_dev_replace_status
We'll want to read the percentage value from dev_replace elsewhere, move
the logic to a separate helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
0a52d10808 btrfs: defrag: make readahead state allocation failure non-fatal
All sorts of readahead errors are not considered fatal. We can continue
defragmentation without it, with some potential slow down, which will
last only for the current inode.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
63e727ecd2 btrfs: use GFP_KERNEL in btrfs_defrag_file
We can safely use GFP_KERNEL, the function is called from two contexts:

- ioctl handler, called directly, no locks taken
- cleaner thread, running all queued defrag work, outside of any locks

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
3ec8362111 btrfs: use GFP_KERNEL in mount and remount
We don't need to restrict the allocation flags in btrfs_mount or
_remount. No big filesystem locks are held (possibly s_umount but that
does not count here).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
e3f3ad1268 btrfs: Remove never reached error handling code in __add_reloc_root
One of the error handling paths in __add_reloc_root contains btrfs_panic()
followed by some other code. As the name implies, it prints an error
message and calls BUG, so naturally what follows afterwards is never
invoked. So remove this extra code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
e4ff5fb5dc btrfs: Remove unused parameters from volume.c functions
This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.

btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3ab
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit

btrfs_is_parity_mirror's mirror_num - same as above

chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c ("Btrfs:
devid subset filter") and never used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
110840bb62 btrfs: Remove unused variables
clear_super - usage was removed in commit cea67ab92d ("btrfs: clean
the old superblocks before freeing the device") but that change forgot
to remove the actual variable.

max_key - commit 6174d3cb43 ("Btrfs: remove unused max_key arg from
btrfs_search_forward") removed the max_key parameter but it forgot to
remove references from callers.

stripe_len - this one was added by e06cd3dd7c ("Btrfs: add validadtion
checks for chunk loading") but even then it wasn't used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
500ceed807 btrfs: Remove find_raid56_stripe_len
find_raid56_stripe_len statically returns SZ_64K which equals BTRFS_STRIPE_LEN.
Its sole caller is __btrfs_alloc_chunk and it assigns the return value to a
variable which is already set to BTRFS_STRIPE_LEN. So remove the function
invocation altogether and remove the function itself. Also remove the variable
since it's only aliasing BTRFS_STRIPE_LEN and use the define directly. Use
the occasion to simplify the rounding down of stripe_size now that the value
we want it aligned to is a power of 2.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
47f08b9699 btrfs: Use explicit round_down macro in btrfs resize ioctl handler
No functional changes, just make the code more self-explanatory.
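
As a rough userspace illustration of what rounding down to a power-of-two
boundary does: the macro below is a simplified stand-in written for this
sketch, not the kernel's round_down() definition, and align_new_size() is a
hypothetical helper name.

```c
#include <stdint.h>

/*
 * Simplified stand-in for the kernel's round_down(); only valid when
 * `align` is a power of two.
 */
#define ROUND_DOWN_POW2(x, align) ((x) & ~((uint64_t)(align) - 1))

/* Hypothetical helper: clamp a requested size to a multiple of sectorsize. */
uint64_t align_new_size(uint64_t new_size, uint32_t sectorsize)
{
	return ROUND_DOWN_POW2(new_size, sectorsize);
}
```

With a 4096-byte sector size, a request for 10000 bytes would be clamped to
8192, while a request below 4096 would be clamped to 0.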

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Anand Jain
19aee8dea3 btrfs: btrfs_inherit_iflags() can be static
btrfs_new_inode() is the only consumer, so move it from ioctl.c
to inode.c.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nick Terrell
26b28dce50 btrfs: Keep one more workspace around
find_workspace() allocates up to num_online_cpus() + 1 workspaces.
free_workspace() will only keep num_online_cpus() workspaces. When
(de)compressing we will allocate num_online_cpus() + 1 workspaces, then
free one, and repeat. Instead, we can just keep num_online_cpus() + 1
workspaces around, and never have to allocate/free another workspace in the
common case.
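
The effect of the keep limit can be sketched in userspace. This is a
simplified model under stated assumptions, not the btrfs implementation:
struct pool, pool_get(), pool_put() and count_allocs() are hypothetical
names, and there is no locking or waiting.

```c
#include <stdlib.h>

/* Hypothetical, simplified workspace and pool. */
struct workspace {
	struct workspace *next;
};

struct pool {
	struct workspace *idle;	/* cached, currently unused workspaces */
	int free_ws;		/* number of entries on the idle list */
	int keep;		/* how many idle workspaces may be cached */
	int allocs;		/* counts requests that had to allocate */
};

/* Reuse a cached workspace if there is one, otherwise allocate. */
struct workspace *pool_get(struct pool *p)
{
	struct workspace *ws = p->idle;

	if (ws) {
		p->idle = ws->next;
		p->free_ws--;
		return ws;
	}
	p->allocs++;
	return calloc(1, sizeof(*ws));
}

/* Cache the workspace while below the keep limit, otherwise free it. */
void pool_put(struct pool *p, struct workspace *ws)
{
	if (!ws)
		return;
	if (p->free_ws < p->keep) {
		ws->next = p->idle;
		p->idle = ws;
		p->free_ws++;
		return;
	}
	free(ws);
}

/* Hold `in_flight` workspaces at once, return them all, repeat. */
int count_allocs(int keep, int in_flight, int rounds)
{
	struct pool p = { .keep = keep };
	struct workspace *held[16];

	if (in_flight < 0 || in_flight > 16)
		return -1;
	for (int r = 0; r < rounds; r++) {
		for (int i = 0; i < in_flight; i++)
			held[i] = pool_get(&p);
		for (int i = 0; i < in_flight; i++)
			pool_put(&p, held[i]);
	}
	/* Release whatever is still cached. */
	while (p.idle) {
		struct workspace *ws = p.idle;

		p.idle = ws->next;
		free(ws);
	}
	return p.allocs;
}
```

In this model, with 2 CPUs and 3 workspaces in flight, keeping only 2 costs
one extra allocation per round, while keeping 3 allocates only the initial 3.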

I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
a workspace was allocated or freed. Then I copied vmlinux (527 MB) to the
partition. Before the patch, during the copy it would allocate and free 5-6
workspaces. After, it only allocated the initial 3. This held true for lzo,
zlib, and zstd. The time it took to execute cp vmlinux /mnt/btrfs && sync
dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s
for zstd compression.

Signed-off-by: Nick Terrell <terrelln@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
David Sterba
913e153572 btrfs: drop newlines from strings when using btrfs_* helpers
The helpers append "\n" so we can keep the actual strings shorter. The
extra newline will print an empty line.  Some messages have been
slightly modified to be more consistent with the rest (lowercase first
letter).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
b6e6bca51e btrfs: qgroups: Fix BUG_ON condition in tree level check
The current code was erroneously checking for
root_level > BTRFS_MAX_LEVEL. If we had a root_level of 8 then the check
won't trigger and we could potentially hit a buffer overflow. The
correct check should be root_level >= BTRFS_MAX_LEVEL.
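
The off-by-one can be illustrated with a small sketch. It assumes
BTRFS_MAX_LEVEL is 8, as the example above implies; struct level_slots and
the two helper names are hypothetical.

```c
#include <stdbool.h>

/* Assumed value for this sketch, matching the "root_level of 8" case. */
#define BTRFS_MAX_LEVEL 8

/* Hypothetical stand-in for any per-level array indexed by tree level. */
struct level_slots {
	int slot[BTRFS_MAX_LEVEL];	/* valid indices: 0 .. BTRFS_MAX_LEVEL - 1 */
};

/* Old condition: lets root_level == BTRFS_MAX_LEVEL through. */
bool level_rejected_old(int root_level)
{
	return root_level > BTRFS_MAX_LEVEL;
}

/* Fixed condition: rejects every index past the end of the array. */
bool level_rejected_new(int root_level)
{
	return root_level >= BTRFS_MAX_LEVEL;
}
```

A root_level of 8 passes the old condition even though slot[8] would be one
element past the end of the array; the new condition rejects it.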

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
c550245148 btrfs: Enhance message when a device is missing during mount
For a missing device, btrfs will just refuse to mount with an almost
meaningless kernel message like:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

This patch will print a new message about the missing device:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b is missing
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
bc3cce2378 btrfs: Cleanup num_tolerated_disk_barrier_failures
As we now use the per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.

We can now remove it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
d10b82fe29 btrfs: Allow barrier_all_devices to do chunk level device check
The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it can be easily changed to the new per-chunk degradable check
framework.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
b382cfe889 btrfs: Do chunk level check for degraded remount
Just as for the mount time check, use btrfs_check_rw_degradable() to
check if we are OK to be remounted rw.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
4330e183c9 btrfs: Do chunk level check for degraded rw mount
Now use the btrfs_check_rw_degradable() to check if we can mount in the
degraded mode.

With this patch, we can mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only on sdb, it's OK to mount as
 degraded, as missing one device is OK for RAID1.

But still fail in the following case as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only on sdb, it's not OK to mount it as
 degraded.

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reported-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
21634a19f6 btrfs: Introduce a function to check if all chunks are OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs are OK for degraded rw mount.

It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount checks, rather than the old one-size-fits-all method.

Btrfs currently uses num_tolerated_disk_barrier_failures to do a global
check for tolerated missing devices.

Although the one-size-fits-all solution is quite safe, it's too strict
if data and metadata have different duplication levels.

For example, if one uses Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.

But in fact, sometimes all single chunks may be on the existing
device and in that case, we should allow it to be rw degraded mounted.

Such case can be easily reproduced using the following script:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -f /dev/sdc
 # mount /dev/sdb -o degraded,rw

If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.

This patchset introduces a new per-chunk degradable check for
btrfs, allowing the above case to succeed, and it's quite small anyway.
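
The idea of a per-chunk check can be sketched as follows. This is a minimal
userspace model under stated assumptions, not the kernel function: struct
chunk, check_rw_degradable() and example_degradable() are hypothetical, and
device index 0 stands for sdb and index 1 for sdc from the reproducer above.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_STRIPES 4	/* arbitrary bound for this sketch */

/* Hypothetical, reduced view of a chunk. */
struct chunk {
	int tolerated_failures;		/* e.g. 0 for single, 1 for RAID1 */
	int num_stripes;
	int stripe_dev[MAX_STRIPES];	/* device index holding each stripe */
};

/*
 * Return true if every chunk has lost no more devices than its own
 * profile tolerates. `missing` is indexed by device index.
 */
bool check_rw_degradable(const struct chunk *chunks, size_t nr_chunks,
			 const bool *missing)
{
	for (size_t i = 0; i < nr_chunks; i++) {
		int lost = 0;

		for (int s = 0; s < chunks[i].num_stripes; s++)
			if (missing[chunks[i].stripe_dev[s]])
				lost++;
		if (lost > chunks[i].tolerated_failures)
			return false;
	}
	return true;
}

/* RAID1 metadata on both devices, single data only on device 0 (sdb). */
const struct chunk example_chunks[] = {
	{ .tolerated_failures = 1, .num_stripes = 2, .stripe_dev = { 0, 1 } },
	{ .tolerated_failures = 0, .num_stripes = 1, .stripe_dev = { 0 } },
};

/* Check the example layout with exactly one device missing. */
bool example_degradable(int missing_dev)
{
	bool missing[2] = { false, false };

	missing[missing_dev] = true;
	return check_rw_degradable(example_chunks, 2, missing);
}
```

In this model, losing sdc leaves every chunk within its tolerance, so the
check passes; losing sdb takes away the only copy of the single data chunk,
so the check fails. A single global tolerance of 0 would reject both cases.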

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
  solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Liu Bo
0d1e0bead6 Btrfs: report errors when checksum is not found
When btrfs fails the checksum check, it'll fill the whole page with
"1".

However, if %csum_expected is 0 (which means there is no checksum), then
for some unknown reason, we just pretend that the read is correct, so
userspace is left confused: the read is reported as successful, yet it
returns a page whose whole content is "1".

This can happen due to a bug in btrfs-convert.

This fixes it by always returning errors if checksum doesn't match.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
69f03f137a btrfs: Prevent possible ERR_PTR() dereference
In btrfs_full_stripe_len/btrfs_is_parity_mirror we have similar code which
gets the chunk map for a particular range via get_chunk_map. However,
get_chunk_map can return an ERR_PTR value and while the 2 callers do catch
this with a WARN_ON they then proceed to indiscriminately dereference the
extent map. This of course leads to a crash. Fix the offenders by making the
dereference conditional on IS_ERR.
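
The pattern can be sketched in userspace. The ERR_PTR(), PTR_ERR() and
IS_ERR() helpers below are imitations written for this example, and
struct chunk_map, lookup_chunk_map() and full_stripe_len() are hypothetical
stand-ins, with 0 as an arbitrary fallback value.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitations of the kernel helpers, for illustration only. */
#define MAX_ERRNO 4095

void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, much-reduced chunk map. */
struct chunk_map {
	unsigned long stripe_len;
};

struct chunk_map the_map = { .stripe_len = 65536 };

/* Hypothetical lookup: fails with -ENOENT for offsets it does not know. */
struct chunk_map *lookup_chunk_map(unsigned long logical)
{
	if (logical >= 1024)
		return ERR_PTR(-ENOENT);
	return &the_map;
}

/* Only dereference the result once IS_ERR() has ruled out an error. */
unsigned long full_stripe_len(unsigned long logical)
{
	struct chunk_map *map = lookup_chunk_map(logical);
	unsigned long len = 0;	/* arbitrary fallback for this sketch */

	if (!IS_ERR(map))
		len = map->stripe_len;
	return len;
}
```

Warning about the error and then dereferencing anyway would read through a
pointer that encodes a small negative errno rather than a real address.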

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
1174cade81 btrfs: Remove redundant checks from btrfs_alloc_data_chunk_ondemand
Many commits ago the data space_info in alloc_data_chunk_ondemand used to be
acquired from the inode. At that point commit
33b4d47f5e ("Btrfs: deal with NULL space info") got introduced to deal with
spurious cases where the space info could be null, following a rebalance.
Nowadays, however, the space info is referenced directly from the btrfs_fs_info
struct which is initialised at filesystem mount time. This makes the null
checks redundant, so remove them.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
7bdd6277e0 btrfs: Remove redundant argument of flush_space
All callers of flush_space pass the same number for orig/num_bytes
arguments. Let's remove one of the numbers and also modify the trace
point to show only a single number - bytes requested.

It seems that the last point where the two parameters were treated
differently was before the ticketed enospc rework.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Aleksa Sarai
6c6b5a39c4 btrfs: resume qgroup rescan on rw remount
Several distributions mount the "proper root" as ro during initrd and
then remount it as rw before pivot_root(2). Thus, if a rescan had been
aborted by a previous shutdown, the rescan would never be resumed.

This issue would manifest itself as several btrfs ioctl(2)s causing the
entire machine to hang when btrfs_qgroup_wait_for_completion was hit
(due to the fs_info->qgroup_rescan_running flag being set but the rescan
itself not being resumed). Notably, Docker's btrfs storage driver makes
regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT
(causing this problem to be manifested on boot for some machines).

Cc: <stable@vger.kernel.org> # v3.11+
Cc: Jeff Mahoney <jeffm@suse.com>
Fixes: b382a324b6 ("Btrfs: fix qgroup rescan resume on mount")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
01747e92a9 btrfs: clean up extraneous computations in add_delayed_refs
Repeating the same computation in multiple places is not
necessary.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
3ec4d3238a btrfs: allow backref search checks for shared extents
When called with a struct share_check, find_parent_nodes()
will detect a shared extent and immediately return with
BACKREF_SHARED_FOUND.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
9dd14fd696 btrfs: add cond_resched() calls when resolving backrefs
Since backref resolution is CPU-intensive, the cond_resched calls
should help alleviate soft lockup occurrences.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Jeff Mahoney
00142756e1 btrfs: backref, add tracepoints for prelim_ref insertion and merging
This patch adds a tracepoint event for prelim_ref insertion and
merging.  For each, the ref being inserted or merged and the count
of tree nodes is issued.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Jeff Mahoney
6c336b212b btrfs: add a node counter to each of the rbtrees
This patch adds counters to each of the rbtrees so that we can tell
how large they are growing for a given workload.  These counters
will be exported by tracepoints in the next patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
86d5f99442 btrfs: convert preliminary reference tracking to use rbtrees
It's been known for a while that the use of multiple lists
that are periodically merged was an algorithmic problem within
btrfs.  There are several workloads that don't complete in any
reasonable amount of time (e.g. btrfs/130) and others that cause
soft lockups.

The solution is to use a set of rbtrees that do insertion merging
for both indirect and direct refs, with the former converting
refs into the latter.  The result is that a btrfs/130 workload that
used to take several hours now takes about half of that. This
runtime still isn't acceptable and a future patch will address that
by moving the rbtrees higher in the stack so the lookups can be
shared across multiple calls to find_parent_nodes.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:11:55 +02:00
Edmund Nadolski
f6954245d9 btrfs: remove ref_tree implementation from backref.c
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
the ref_tree code in backref.c to reduce backref searching for
shared extents under the FIEMAP ioctl. This code will not be
compatible with the upcoming rbtree changes for improved backref
searching, so this patch removes the ref_tree code.  The rbtree
changes will provide the equivalent functionality for FIEMAP.

The above commit also introduced transaction semantics around calls to
btrfs_check_shared() in order to accurately account for delayed refs.
This functionality needs to be retained, so a complete revert of the
above commit is not desirable. This patch therefore removes the
ref_tree portion of the commit as above, however it does not remove
the transaction portion.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Edmund Nadolski
bb739cf08e btrfs: btrfs_check_shared should manage its own transaction
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
e0c476b128 btrfs: backref, cleanup __ namespace abuse
We typically use __ to indicate a helper routine that shouldn't be
called directly without understanding the proper context required
to do so.  We use static functions to indicate that a function is
private to a particular C file.  The backref code uses static
function and __ prefixes on nearly everything, which makes the code
difficult to read and establishes a pattern for future code that
shouldn't be followed.  This patch drops all the unnecessary prefixes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
4dae077a83 btrfs: backref, add unode_aux_to_inode_list helper
Replacing the double cast and ternary conditional with a helper makes
the code easier on the eyes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
73980becae btrfs: backref, constify some arguments
This constifies a few buffers used in the backref code.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
9a35b63728 btrfs: constify tracepoint arguments
Tracepoint arguments are all read-only.  If we mark the arguments
as const, we're able to keep or convert those arguments to const
where appropriate.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
1cbb1f454e btrfs: struct-funcs, constify readers
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only.  We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.

No impact on code, but serves as documentation that a buffer is intended
not to be modified.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Nikolay Borisov
23d1f73788 btrfs: remove unused sectorsize member
The sectorsize member of btrfs_block_group_cache is unused. So remove it, this
reduces the number of holes in the struct.

With patch:
/* size: 856, cachelines: 14, members: 40 */
/* sum members: 837, holes: 4, sum holes: 19 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 24 bytes */

Without patch:
/* size: 864, cachelines: 14, members: 41 */
/* sum members: 841, holes: 5, sum holes: 23 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 32 bytes */

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Nikolay Borisov
f148ef4d3a btrfs: Be explicit about usage of min()
__btrfs_alloc_chunk contains code which boils down to:

    ndevs = min(ndevs, devs_max)

It's conditional upon devs_max not being 0. However, it cannot really be 0
since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or
BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use
min explicitly. This has no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:52 +02:00
Nikolay Borisov
e5600fd6fc btrfs: Use explicit round_down call rather than open-coding it
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:52 +02:00
Nikolay Borisov
ebcc9301ea btrfs: convert while loop to list_for_each_entry
No functional changes, just make the loop a bit more readable

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:52 +02:00
Nick Terrell
5c1aab1dd5 btrfs: Add zstd support
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios: copying
a set of files to btrfs and then reading the files, and copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression MB/s  |
|---------|-------|------------------|---------------------|
| None    |  0.99 |              504 |                 686 |
| lzo     |  1.66 |              398 |                 442 |
| zlib    |  2.58 |               65 |                 241 |
| zstd 1  |  2.57 |              260 |                 383 |
| zstd 3  |  2.71 |              174 |                 408 |
| zstd 6  |  2.87 |               70 |                 398 |
| zstd 9  |  2.92 |               43 |                 406 |
| zstd 12 |  2.93 |               21 |                 408 |
| zstd 15 |  3.01 |               11 |                 354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None   |      0.97 |          0.78 |    0.981 |      5.501 |    8.807 |
| lzo    |      2.06 |          1.38 |    1.631 |      8.458 |    8.585 |
| zlib   |      3.40 |          1.86 |    7.750 |     21.544 |   11.744 |
| zstd 1 |      3.57 |          1.85 |    2.579 |     11.479 |    9.389 |

[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-08-15 09:02:09 -07:00
Linus Torvalds
0a2a1330d2 Merge branch 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "Fixes addressing problems reported by users, and there's one more
  regression fix"

* 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: round down size diff when shrinking/growing device
  Btrfs: fix early ENOSPC due to delalloc
  btrfs: fix lockup in find_free_extent with read-only block groups
  Btrfs: fix dir item validation when replaying xattr deletes
2017-07-28 12:26:59 -07:00
Nikolay Borisov
0e4324a4c3 btrfs: round down size diff when shrinking/growing device
Further testing showed that the fix introduced in 7dfb8be11b ("btrfs:
Round down values which are written for total_bytes_size") was
insufficient and it could still lead to discrepancies between the
total_bytes in the super block and the device total bytes. So this patch
also ensures that the difference between old/new sizes when
shrinking/growing is also rounded down. This ensures that we won't be
subtracting/adding non-sectorsize multiples to the superblock/device
total sizes.
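
A small sketch of why the difference itself needs rounding, assuming a
4096-byte sector size; SECTORSIZE, round_down_sector() and grow_total() are
hypothetical names used only here.

```c
#include <stdint.h>

#define SECTORSIZE 4096ULL	/* assumed sector size for this sketch */

/* Round down to a multiple of SECTORSIZE (a power of two). */
uint64_t round_down_sector(uint64_t v)
{
	return v & ~(SECTORSIZE - 1);
}

/* Grow a running total by the rounded-down difference between two sizes. */
uint64_t grow_total(uint64_t total, uint64_t old_size, uint64_t new_size)
{
	uint64_t diff = round_down_sector(new_size - old_size);

	return total + diff;
}
```

If the total starts out sector-aligned, adding only rounded-down differences
keeps it aligned, whereas adding the raw difference would not.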

Fixes: 7dfb8be11b ("btrfs: Round down values which are written for total_bytes_size")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-24 16:05:00 +02:00
Omar Sandoval
17024ad0a0 Btrfs: fix early ENOSPC due to delalloc
If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-24 16:04:26 +02:00
Jeff Mahoney
144439376b btrfs: fix lockup in find_free_extent with read-only block groups
If we have a block group that is all of the following:
1) uncached in memory
2) is read-only
3) has a disk cache state that indicates we need to recreate the cache

AND the file system has enough free space fragmentation such that the
request for an extent of a given size can't be honored;

AND have a single CPU core;

AND it's the block group with the highest starting offset such that
there are no opportunities (like reading from disk) for the loop to
yield the CPU;

We can end up with a lockup.

The root cause is simple.  Once we're in the position that we've read in
all of the other block groups directly and none of those block groups
can honor the request, there are no more opportunities to sleep.  We end
up trying to start a caching thread which never gets run if we only have
one core.  This *should* present as a hung task waiting on the caching
thread to make some progress, but it doesn't.  Instead, it degrades into
a busy loop because of the placement of the read-only check.

During the first pass through the loop, block_group->cached will be set
to BTRFS_CACHE_STARTED and have_caching_bg will be set.  Then we hit the
read-only check and short circuit the loop.  We're not yet in
LOOP_CACHING_WAIT, so we skip that loop back before going through the
loop again for other raid groups.

Then we move to LOOP_CACHING_WAIT state.

During this pass through the loop, ->cached will still be
BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter
cache_block_group, do a lot of nothing, and return, and also set
have_caching_bg again.  Then we hit the read-only check and short circuit
the loop.  The same thing happens as before except now we DO trigger
the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the
top.  We do this forever.

There are two fixes in this patch since they address the same underlying
bug.

The first is to add a cond_resched to the end of the loop to ensure
that the caching thread always has an opportunity to run.  This will
fix the soft lockup issue, but find_free_extent will still loop doing
nothing until the thread has completed.

The second is to move the read-only check to the top of the loop.  We're
never going to return an allocation within a read-only block group so
we may as well skip it early.  The check for ->cached == BTRFS_CACHE_ERROR
would cause the same problem except that BTRFS_CACHE_ERROR is considered
a "done" state and we won't re-set have_caching_bg again.

Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in
the testing process.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-24 16:04:02 +02:00
Filipe Manana
e33bf72361 Btrfs: fix dir item validation when replaying xattr deletes
We were passing an incorrect slot number to the function that validates
directory items when we are replaying xattr deletes from a log tree. The
correct slot is stored at variable 'i' and not at 'path->slots[0]', so
the call to the validation function was only correct for the first
iteration of the loop, when 'i == path->slots[0]'.
After this fix, the fstest generic/066 passes again.
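
The class of bug can be shown with a reduced model. struct leaf, struct path
and validate_item() below are hypothetical stand-ins for this sketch, not
the btrfs types or functions.

```c
#include <stdbool.h>

#define NR_ITEMS 4

/* Hypothetical leaf: item_ok[n] says whether item n would pass validation. */
struct leaf {
	bool item_ok[NR_ITEMS];
};

/* Hypothetical path: slots[0] is where the search landed in the leaf. */
struct path {
	int slots[1];
};

bool validate_item(const struct leaf *leaf, int slot)
{
	return leaf->item_ok[slot];
}

/* Buggy: every iteration validates the slot the search landed on. */
bool all_valid_buggy(const struct leaf *leaf, const struct path *path)
{
	for (int i = path->slots[0]; i < NR_ITEMS; i++)
		if (!validate_item(leaf, path->slots[0]))
			return false;
	return true;
}

/* Fixed: each iteration validates the item it is about to process. */
bool all_valid_fixed(const struct leaf *leaf, const struct path *path)
{
	for (int i = path->slots[0]; i < NR_ITEMS; i++)
		if (!validate_item(leaf, i))
			return false;
	return true;
}

/* Example data: the third item (slot 2) is invalid. */
const struct leaf example_leaf = { .item_ok = { true, true, false, true } };
const struct path example_path = { .slots = { 0 } };
```

With the example data, the buggy loop only ever checks slot 0 and misses the
invalid item in slot 2, while the fixed loop catches it.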

Fixes: 8ee8c2d62d ("btrfs: Verify dir_item in replay_xattr_deletes")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-19 20:38:16 +02:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
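
Why the cast matters when a masked value is compared against a bool can be
sketched as below. DEMO_FLAG is a hypothetical flag placed on a bit other
than bit 0 so that the difference is visible, and flag_is_set() is a
stand-in for a bool-returning helper such as sb_rdonly().

```c
#include <stdbool.h>

#define DEMO_FLAG 0x4UL	/* hypothetical flag, deliberately not bit 0 */

/* Stand-in for a helper that returns bool. */
bool flag_is_set(unsigned long s_flags)
{
	return s_flags & DEMO_FLAG;
}

/* Compares a raw mask result (0 or 4) with a bool (0 or 1). */
bool differs_raw(unsigned long a, unsigned long s_flags)
{
	return (a & DEMO_FLAG) != flag_is_set(s_flags);
}

/* Normalises the mask result to 0 or 1 before comparing. */
bool differs_cast(unsigned long a, unsigned long s_flags)
{
	return (bool)(a & DEMO_FLAG) != flag_is_set(s_flags);
}
```

When both sides have the flag set, the raw comparison evaluates 4 != 1 and
reports a difference that is not there; the cast version compares 1 with 1.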

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Linus Torvalds
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Linus Torvalds
bc243704fb Merge branch 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've identified and fixed a silent corruption (introduced by code in
  the first pull), a fixup after the blk_status_t merge and two fixes to
  incremental send that Filipe has been hunting for some time"

* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix unexpected return value of bio_readpage_error
  btrfs: btrfs_create_repair_bio never fails, skip error handling
  btrfs: cloned bios must not be iterated by bio_for_each_segment_all
  Btrfs: fix write corruption due to bio cloning on raid5/6
  Btrfs: incremental send, fix invalid memory access
  Btrfs: incremental send, fix invalid path for link commands
2017-07-14 22:55:52 -07:00
Liu Bo
c3cfb65630 Btrfs: fix unexpected return value of bio_readpage_error
With the blk_status_t conversion (now present in master),
bio_readpage_error() may return 1, because ->submit_bio_hook() may now
not set %ret if it runs without problems.

This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
the patch is applicable to both codebases with and without blk_status_t.
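
The pattern described above can be sketched in user space as follows.
This is a simplified model of the control flow, not the btrfs code: the
demo_* names are hypothetical, and demo_submit() merely stands in for
->submit_bio_hook().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for ->submit_bio_hook(): returns 0 on success. */
static int demo_submit(bool fail)
{
	return fail ? -EIO : 0;
}

/* Before: the "repairable" answer (1) travels through the same variable. */
static int demo_readpage_error_old(bool repairable, bool submit_fails)
{
	int ret = repairable ? 1 : 0;
	int status;

	if (!ret)
		return -EIO;
	status = demo_submit(submit_fails);
	if (status)
		ret = status;	/* only overwritten on failure */
	return ret;		/* success path leaks the stale 1 */
}

/* After: the check is a bool, so the return value is 0 or -errno only. */
static int demo_readpage_error_new(bool repairable, bool submit_fails)
{
	if (!repairable)
		return -EIO;
	return demo_submit(submit_fails);
}

static void readpage_error_demo_check(void)
{
	assert(demo_readpage_error_old(true, false) == 1);	/* unexpected */
	assert(demo_readpage_error_new(true, false) == 0);
	assert(demo_readpage_error_new(true, true) == -EIO);
	assert(demo_readpage_error_new(false, false) == -EIO);
}
```

Separating the yes/no answer from the error code is what keeps a stale
positive value from reaching the caller.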

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:42:37 +02:00
David Sterba
e8f5b395d5 btrfs: btrfs_create_repair_bio never fails, skip error handling
As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.

Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:42:08 +02:00
David Sterba
c09abff87f btrfs: cloned bios must not be iterated by bio_for_each_segment_all
We've started using cloned bios more in 4.13, and there are some specifics
regarding their iteration.  Filipe found [1] that the raid56 code iterated
a cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions.  This patch adds assertions to all remaining
bio_for_each_segment_all cases.

[1] https://patchwork.kernel.org/patch/9838535/

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:39:31 +02:00
Filipe Manana
6592e58c6b Btrfs: fix write corruption due to bio cloning on raid5/6
The recent changes to make bio cloning faster (added in the 4.13 merge
window) by using the bio_clone_fast() API introduced a regression on
raid5/6 modes, because cloned bios have an invalid bi_vcnt field
(therefore it can not be used) and the raid5/6 code uses the
bio_for_each_segment_all() API to iterate the segments of a bio, and this
API uses a bio's bi_vcnt field.

The issue is very simple to trigger by doing for example a direct IO write
against a raid5 or raid6 filesystem and then attempting to read what we
wrote before:

  $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
  $ mount /dev/sdc /mnt
  $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
  $ od -t x1 /mnt/foobar
  od: /mnt/foobar: read error: Input/output error

For that example, the following is also reported in dmesg/syslog:

  [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
  [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
  [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
  [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1

A scrub will also fail to correct the bad copies, reporting the following
in dmesg/syslog:

  [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
  [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
  [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
  [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
  [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
  [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
  [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
  [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
  [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
  [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
  [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
  [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
  [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
  [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
  [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
  [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
  [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
  [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
  [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
  [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
  [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
  [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
  [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde

So fix this by using the bio_for_each_segment() API and, if we are
processing a cloned bio in the raid5/6 code, first setting the bio's
bi_iter field to the value of the corresponding btrfs bio container's
saved iterator (the same code processes both cloned and non-cloned bios).
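
The difference between the two iteration styles can be sketched with a
small user-space model. All demo_* types and helpers are hypothetical
simplifications, not the kernel's struct bio or struct bvec_iter, and
the invalid bi_vcnt of a clone is modeled here as 0, which is an
assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; not the kernel's struct bio / struct bvec_iter. */
struct demo_iter {
	size_t idx;	/* first segment this bio covers */
	size_t count;	/* number of segments it covers */
};

struct demo_bio {
	const int *vec;		/* segment table, shared with the parent */
	size_t vcnt;		/* only meaningful for the bio owning vec */
	struct demo_iter iter;	/* always describes what this bio covers */
};

/* Fast clone: share the table, copy the iterator, leave vcnt unset. */
static struct demo_bio demo_clone_fast(const struct demo_bio *src,
				       size_t skip, size_t count)
{
	struct demo_bio clone = { .vec = src->vec, .vcnt = 0 };

	clone.iter.idx = src->iter.idx + skip;
	clone.iter.count = count;
	return clone;
}

/* Walk by vcnt, in the style of bio_for_each_segment_all(). */
static int demo_sum_all(const struct demo_bio *bio)
{
	int sum = 0;

	for (size_t i = 0; i < bio->vcnt; i++)
		sum += bio->vec[i];
	return sum;
}

/* Walk by a saved iterator, in the style of bio_for_each_segment(). */
static int demo_sum_iter(const struct demo_bio *bio, struct demo_iter it)
{
	int sum = 0;

	for (size_t i = 0; i < it.count; i++)
		sum += bio->vec[it.idx + i];
	return sum;
}

static void bio_iter_demo_check(void)
{
	static const int segs[4] = { 10, 20, 30, 40 };
	struct demo_bio parent = { .vec = segs, .vcnt = 4, .iter = { 0, 4 } };
	struct demo_bio clone = demo_clone_fast(&parent, 1, 2);

	assert(demo_sum_all(&parent) == 100);	/* owner: vcnt is valid */
	assert(demo_sum_all(&clone) == 0);	/* clone: vcnt is not */
	assert(demo_sum_iter(&clone, clone.iter) == 50);	/* 20 + 30 */
}
```

In this model the vcnt-based walk silently processes nothing for the
clone, while the iterator-based walk visits exactly the segments the
clone covers.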

This incorrect iteration of cloned bios was also causing some occasional
BUG_ONs when running fstest btrfs/064, which have a trace like the
following:

  [ 6674.416156] ------------[ cut here ]------------
  [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
  [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
  [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
  [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
  [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
  [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
  [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
  [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
  [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
  [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
  [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
  [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
  [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
  [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
  [ 6674.416235] FS:  0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
  [ 6674.416235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
  [ 6674.416239] Call Trace:
  [ 6674.416259]  __raid56_parity_recover+0xfc/0x16e [btrfs]
  [ 6674.416276]  raid56_parity_recover+0x157/0x16b [btrfs]
  [ 6674.416293]  btrfs_map_bio+0xe0/0x259 [btrfs]
  [ 6674.416310]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
  [ 6674.416327]  end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
  [ 6674.416331]  bio_endio+0x17d/0x1b3
  [ 6674.416346]  end_workqueue_fn+0x3c/0x3f [btrfs]
  [ 6674.416362]  btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
  [ 6674.416379]  btrfs_endio_helper+0xe/0x10 [btrfs]
  [ 6674.416381]  process_one_work+0x276/0x4b6
  [ 6674.416384]  worker_thread+0x1ac/0x266
  [ 6674.416386]  ? rescuer_thread+0x278/0x278
  [ 6674.416387]  kthread+0x106/0x10e
  [ 6674.416389]  ? __list_del_entry+0x22/0x22
  [ 6674.416391]  ret_from_fork+0x27/0x40
  [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
  [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
  [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-07-13 19:26:01 +01:00
Linus Torvalds
6618a24ab2 Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "This fixes a user-visible bug introduced by the nowait-aio patches
  merged in this cycle"

* 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: nowait aio: Correct assignment of pos
2017-07-10 10:27:48 -07:00
Goldwyn Rodrigues
ff0fa73247 btrfs: nowait aio: Correct assignment of pos
Assigning pos early breaks append mode, where pos is re-assigned in
generic_write_checks(). Assign pos later, from iocb->ki_pos, to get the
correct position to write to.

Since check_can_nocow also uses the value of pos, we shift
generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
are present in generic_write_checks(), so checking for IOCB_NOWAIT is
enough.

Also, put locking sequence in the fast path.

This fixes a user visible bug, as reported:

"apparently breaks several shell related features on my system.
In zsh history stopped working, because no new entries are added
anymore.
I first noticed the issue when I tried to build mplayer. It uses a shell
script to generate a help_mp.h file:
[...]

Here is a simple testcase:

 % echo "foo" >> test
 % echo "foo" >> test
 % cat test
 foo
 %
"
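
The reporter's shell testcase can be approximated in C with plain POSIX
calls. This is only a sketch: the helper names are hypothetical, and it
assumes that each "echo foo >> test" amounts to one O_APPEND open
followed by a 4-byte write.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Append one line to path, roughly as "echo foo >> path" would. */
static int append_line(const char *path)
{
	static const char line[] = "foo\n";
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, line, sizeof(line) - 1);
	if (close(fd) < 0)
		return -1;
	return n == (ssize_t)(sizeof(line) - 1) ? 0 : -1;
}

/* Start from an empty file, append twice, return the final size or -1. */
static long long size_after_two_appends(const char *path)
{
	struct stat st;

	unlink(path);	/* ignore ENOENT: the file may not exist yet */
	if (append_line(path) || append_line(path))
		return -1;
	if (stat(path, &st) < 0)
		return -1;
	unlink(path);
	return (long long)st.st_size;
}

static void append_demo_check(const char *path)
{
	/* Two 4-byte lines: the second write must land at offset 4, not 0. */
	assert(size_after_two_appends(path) == 8);
}
```

With working append semantics the file ends up 8 bytes long; a final
size of 4 would correspond to the single "foo" shown in the report.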

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: Jens Axboe <axboe@kernel.dk>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-10 15:29:44 +02:00
Linus Torvalds
088737f44b Writeback error handling fixes (pile #2)

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling updates from Jeff Layton:
 "This pile represents the bulk of the writeback error handling fixes
  that I have for this cycle. Some of the earlier patches in this pile
  may look trivial but they are prerequisites for later patches in the
  series.

  The aim of this set is to improve how we track and report writeback
  errors to userland. Most applications that care about data integrity
  will periodically call fsync/fdatasync/msync to ensure that their
  writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags
  in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
  writeback error occurs (via mapping_set_error) and are cleared as a
  side-effect of filemap_check_errors (as you noted yesterday). This
  model really sucks for userland.

  Only the first task to call fsync (or msync or fdatasync) will see the
  error. Any subsequent task calling fsync on a file will get back 0
  (unless another writeback error occurs in the interim). If I have
  several tasks writing to a file and calling fsync to ensure that their
  writes got stored, then I need to have them coordinate with one
  another. That's difficult enough, but in a world of containerized
  setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the
  call stack, and there are internal callers of filemap_write_and_wait
  and the like that also end up clearing those errors. Many of those
  callers ignore the error return from that function or return it to
  userland at nonsensical times (e.g. truncate() or stat()). If I get
  back -EIO on a truncate, there is no reason to think that it was
  because some previous writeback failed, and a subsequent fsync() will
  (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be
      reported to userland on a subsequent fsync/fdatasync/msync call,
      regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at
      the time that the error occurred. This is a user-visible change,
      but I think most applications are written to assume this behavior
      anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback
      error. Today, there is very little consistency between them, and a
      lot of cargo-cult copying. We need to make it very clear what
      filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then
  builds new writeback error tracking infrastructure around that. Once
  all of that is in place, we change the filesystems to use the new
  infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess.
  There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the
      initial set is in, then I think most other fs' will be fairly
      simple to convert. Hopefully most of those can go in via individual
      filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for
      detecting errors instead of relying on the AS_* flags. I have some
      draft patches for this for ext4, but they are not quite ready for
      prime time yet.

  This was a discussion topic this year at LSF/MM too. If you're
  interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"
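
The two reporting models described in the pull message above can be
contrasted with a small user-space sketch. It is a deliberately
simplified model with hypothetical names: it does not reproduce the
kernel's errseq_t encoding or API, only the idea of a per-file cursor
compared against a per-mapping sequence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Old model: one shared flag, cleared by whoever checks it first. */
struct flag_mapping {
	bool eio;
};

static int flag_check(struct flag_mapping *m)
{
	bool had_error = m->eio;

	m->eio = false;	/* test-and-clear: later callers see nothing */
	return had_error ? -EIO : 0;
}

/* New model: the mapping records the latest error and a sequence. */
struct seq_mapping {
	unsigned int seq;
	int err;
};

/* Each open file remembers the sequence number it last observed. */
struct seq_file {
	unsigned int since;
};

static void seq_set_error(struct seq_mapping *m, int err)
{
	m->err = err;
	m->seq++;
}

static struct seq_file seq_open(const struct seq_mapping *m)
{
	struct seq_file f = { .since = m->seq };

	return f;
}

/* Report the error once per open file, then advance that file's cursor. */
static int seq_check_and_advance(const struct seq_mapping *m,
				 struct seq_file *f)
{
	if (f->since == m->seq)
		return 0;
	f->since = m->seq;
	return m->err;
}

static void wb_error_demo_check(void)
{
	struct flag_mapping fm = { .eio = true };
	struct seq_mapping sm = { 0, 0 };
	struct seq_file a = seq_open(&sm), b = seq_open(&sm);

	assert(flag_check(&fm) == -EIO);	/* first fsync sees the error */
	assert(flag_check(&fm) == 0);		/* second one sees success */

	seq_set_error(&sm, -EIO);
	assert(seq_check_and_advance(&sm, &a) == -EIO);
	assert(seq_check_and_advance(&sm, &b) == -EIO);	/* b sees it too */
	assert(seq_check_and_advance(&sm, &a) == 0);	/* once per file */
}
```

Because checking no longer clears shared state, one caller consuming
the error cannot hide it from another file description that was open
when the error occurred.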

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07 19:38:17 -07:00
Filipe Manana
24e52b11e0 Btrfs: incremental send, fix invalid memory access
When doing an incremental send, while processing an extent that changed
between the parent and send snapshots and that extent was an inline extent
in the parent snapshot, it's possible to access a memory region beyond
the end of leaf if the inline extent is very small and it is the first
item in a leaf.

An example scenario is described below.

The send snapshot has the following leaf:

 leaf 33865728 items 33 free space 773 generation 46 owner 5
 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
        (...)
        item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                generation 36 type 1 (regular)
                extent data disk byte 12791808 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                generation 36 type 1 (regular)
                extent data disk byte 138170368 nr 225280
                extent data offset 0 nr 225280 ram 225280
                extent compression 0 (none)
        (...)

And the parent snapshot has the following leaf:

 leaf 31272960 items 17 free space 17 generation 31 owner 5
 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
        item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                generation 31 type 0 (inline)
                inline extent data size 23 ram_bytes 613 compression 1 (zlib)
        (...)

When computing the send stream, it is detected that the extent of inode
335, at file offset 0, has changed, and at
fs/btrfs/send.c:is_extent_unchanged() we grab the leaf from the parent
snapshot and access the inline extent item.
However, before jumping to the 'out' label, we access the 'offset' and
'disk_bytenr' fields of the extent item, which should not be done for
inline extents since the inlined data starts at the offset of the
'disk_bytenr' field and can be very small. For example accessing the
'offset' field of the file extent item results in the following trace:

[  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
[  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
[  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
[  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
[  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
[  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
[  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
[  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
[  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
[  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
[  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
[  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
[  599.709340] Call Trace:
[  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
[  599.709340]  ? printk+0x48/0x50
[  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
[  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
[  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
[  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
[  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
[  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
[  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
[  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
[  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
[  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
[  599.709340]  vfs_ioctl+0x21/0x38
[  599.709340]  ? vfs_ioctl+0x21/0x38
[  599.709340]  do_vfs_ioctl+0x611/0x645
[  599.709340]  ? rcu_read_unlock+0x5b/0x5d
[  599.709340]  ? __fget+0x6d/0x79
[  599.709340]  SyS_ioctl+0x57/0x7b
[  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
[  599.709340] RIP: 0033:0x7f51945eec47
[  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
[  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
[  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
[  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
[  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
[  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
[  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
[  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
[  599.762057] ---[ end trace fe00d7af61b9f49e ]---

This is because the 'offset' field starts at an offset of 37 bytes
(offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
bytes and therefore attempting to read it causes a 1 byte access beyond
the end of the leaf, as the first item's content in a leaf is located
at the tail of the leaf, the item size is 44 bytes and the offset of
that field plus its length (37 + 8 = 45) goes beyond the item's size
by 1 byte.

So fix this by accessing the 'offset' and 'disk_bytenr' fields after
jumping to the 'out' label if we are processing an inline extent. We
move the reading operation of the 'disk_bytenr' field too because we
have the same problem as for the 'offset' field explained above when
the inline data is less than 8 bytes. The access to the 'generation'
field is also moved but just for the sake of grouping access to all
the fields.
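
The arithmetic above can be checked with a user-space sketch. The struct
below is a hypothetical mirror of the packed on-disk layout, with the
field order assumed rather than taken from the kernel headers; only the
offsets 21 and 37 and the sizes 23 and 44 come from the text and the
leaf dump above. It relies on the GCC/Clang packed attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the packed on-disk layout (assumed field order). */
struct demo_file_extent_item {
	uint64_t generation;
	uint64_t ram_bytes;
	uint8_t compression;
	uint8_t encryption;
	uint16_t other_encoding;
	uint8_t type;
	/* Inline extents store their data starting at this offset. */
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
	uint64_t num_bytes;
} __attribute__((packed));

/* Bytes by which a read of [field_off, field_off + len) passes the item. */
static size_t overrun(size_t item_size, size_t field_off, size_t field_len)
{
	size_t end = field_off + field_len;

	return end > item_size ? end - item_size : 0;
}

static void extent_item_demo_check(void)
{
	size_t hdr = offsetof(struct demo_file_extent_item, disk_bytenr);
	size_t off = offsetof(struct demo_file_extent_item, offset);
	size_t item_size = hdr + 23;	/* 23 bytes of inline data (item 0) */

	assert(hdr == 21);
	assert(off == 37);
	assert(item_size == 44);
	/* Reading 'offset' (37 + 8 = 45) passes the 44 byte item by 1. */
	assert(overrun(item_size, off, sizeof(uint64_t)) == 1);
	/* 'disk_bytenr' only overruns once inline data is under 8 bytes. */
	assert(overrun(item_size, hdr, sizeof(uint64_t)) == 0);
	assert(overrun(hdr + 7, hdr, sizeof(uint64_t)) == 1);
}
```

Since item 0 sits at the tail of the leaf, any non-zero overrun in this
model corresponds to an access beyond the end of the leaf itself.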

Fixes: e1cbfd7bf6 ("Btrfs: send, fix file hole not being preserved due to inline extent")
Cc: <stable@vger.kernel.org>  # v4.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-07-06 23:02:30 +01:00
Filipe Manana
f59627810e Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.

Consider the following example scenario where this issue happens.

Parent snapshot:

  .                                                      (ino 256)
  |
  |--- dir1/                                             (ino 257)
  |      |--- dir2/                                      (ino 258)
  |             |--- dir3/                               (ino 259)
  |                   |--- file1                         (ino 261)
  |                   |--- dir4/                         (ino 262)
  |
  |--- dir5/                                             (ino 260)

Send snapshot:

  .                                                      (ino 256)
  |
  |--- dir1/                                             (ino 257)
         |--- dir2/                                      (ino 258)
         |      |--- dir3/                               (ino 259)
         |            |--- dir4                          (ino 261)
         |
         |--- dir6/                                      (ino 263)
                |--- dir44/                              (ino 262)
                       |--- file11                       (ino 261)
                       |--- dir55/                       (ino 260)

When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:

  receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
  utimes
  utimes dir1
  utimes dir1/dir2/dir3
  utimes
  rename dir1/dir2/dir3/dir4 -> o262-7-0
  link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
  link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
  ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory

The following steps happen during the computation of the incremental send
stream that lead to this issue:

1) When processing inode 261, we orphanize inode 262 due to a name/location
   collision with one of the new hard links for inode 261 (created in the
   second step below).

2) We create one of the 2 new hard links for inode 261, the one whose
   location is at "dir1/dir2/dir3/dir4".

3) We then attempt to create the other new hard link for inode 261, which
   has inode 262 as its parent directory. Because the path for this new
   hard link was computed before we started processing the new references
   (hard links), it reflects the old name/location of inode 262, that is,
   it does not account for the orphanization step that happened when
   we started processing the new references for inode 261, whence it is
   no longer valid, causing the receiver to fail.

So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-07-06 23:02:18 +01:00
Linus Torvalds
a4c20b9a57 Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "These are the percpu changes for the v4.13-rc1 merge window. There are
  a couple visibility related changes - tracepoints and allocator stats
  through debugfs, along with __ro_after_init markings and a cosmetic
  rename in percpu_counter.

  Please note that the simple O(#elements_in_the_chunk) area allocator
  used by percpu allocator is again showing scalability issues,
  primarily with bpf allocating and freeing large number of counters.
  Dennis is working on the replacement allocator and the percpu
  allocator will be seeing increased churns in the coming cycles"

* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix static checker warnings in pcpu_destroy_chunk
  percpu: fix early calls for spinlock in pcpu_stats
  percpu: resolve err may not be initialized in pcpu_alloc
  percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
  percpu: add tracepoint support for percpu memory
  percpu: expose statistics about percpu memory via debugfs
  percpu: migrate percpu data structures to internal header
  percpu: add missing lockdep_assert_held to func pcpu_free_area
  mark most percpu globals as __ro_after_init
2017-07-06 08:59:41 -07:00
Jeff Layton
333427a505 btrfs: minimal conversion to errseq_t writeback error reporting on fsync
Just check and advance the errseq_t in the file before returning, and
use an errseq_t based check for writeback errors.

Other internal callers of filemap_* functions are left as-is.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06 07:02:31 -04:00
David Howells
c3d98ea082 VFS: Don't use save/replace_mount_options if not using generic_show_options
btrfs, debugfs, reiserfs and tracefs call save_mount_options() and reiserfs
calls replace_mount_options(), but they then implement their own
->show_options() methods and don't touch s_options, rendering the saved
options unnecessary.  I'm trying to eliminate s_options to make it easier
to implement a context-based mount where the mount options can be passed
individually over a file descriptor.

Remove the calls to save/replace_mount_options() in these cases.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chris Mason <clm@fb.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: linux-btrfs@vger.kernel.org
cc: reiserfs-devel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:31:46 -04:00
Linus Torvalds
8c27cb3566 Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The core updates improve error handling (mostly related to bios), with
  the usual incremental work on the GFP_NOFS (mis)use removal,
  refactoring or cleanups. Except the two top patches, all have been in
  for-next for an extensive amount of time.

  User visible changes:

   - statx support

   - quota override tunable

   - improved compression thresholds

   - obsoleted mount option alloc_start

  Core updates:

   - bio-related updates:
       - faster bio cloning
       - no allocation failures
       - preallocated flush bios

   - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

   - prep work for btree_inode removal

   - dir-item validation

   - qgroup fixes and updates

   - cleanups:
       - removed unused struct members, unused code, refactoring
       - argument refactoring (fs_info/root, caller -> callee sink)
       - SEARCH_TREE ioctl docs"

* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
  btrfs: Remove false alert when fiemap range is smaller than on-disk extent
  btrfs: Don't clear SGID when inheriting ACLs
  btrfs: fix integer overflow in calc_reclaim_items_nr
  btrfs: scrub: fix target device intialization while setting up scrub context
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
  btrfs: qgroup: Return actually freed bytes for qgroup release or free data
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Add quick exit for non-fs extents
  Btrfs: rework delayed ref total_bytes_pinned accounting
  Btrfs: return old and new total ref mods when adding delayed refs
  Btrfs: always account pinned bytes when dropping a tree block ref
  Btrfs: update total_bytes_pinned when pinning down extents
  Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
  Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
  btrfs: fix validation of XATTR_ITEM dir items
  btrfs: Verify dir_item in iterate_object_props
  btrfs: Check name_len before in btrfs_del_root_ref
  btrfs: Check name_len before reading btrfs_get_name
  ...
2017-07-05 16:41:23 -07:00
Linus Torvalds
c6b1e36c8f Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
 "This is the main pull request for the block layer for 4.13. Not a huge
  round in terms of features, but there's a lot of churn related to some
  core cleanups.

  Note this depends on the UUID tree pull request, that Christoph
  already sent out.

  This pull request contains:

   - A series from Christoph, unifying the error/stats codes in the
     block layer. We now use blk_status_t everywhere, instead of using
     different schemes for different places.

   - Also from Christoph, some cleanups around request allocation and IO
     scheduler interactions in blk-mq.

   - And yet another series from Christoph, cleaning up how we handle
     and do bounce buffering in the block layer.

   - A blk-mq debugfs series from Bart, further improving on the support
     we have for exporting internal information to aid debugging IO
     hangs or stalls.

   - Also from Bart, a series that cleans up the request initialization
     differences across types of devices.

   - A series from Goldwyn Rodrigues, allowing the block layer to return
     failure if we will block and the user asked for non-blocking.

   - Patch from Hannes for supporting setting loop devices block size to
     that of the underlying device.

   - Two series of patches from Javier, fixing various issues with
     lightnvm, particularly around pblk.

   - A series from me, adding support for write hints. This comes with
     NVMe support as well, so applications can help guide data placement
     on flash to improve performance, latencies, and write
     amplification.

   - A series from Ming, improving and hardening blk-mq support for
     stopping/starting and quiescing hardware queues.

   - Two pull requests for NVMe updates. Nothing major on the feature
     side, but lots of cleanups and bug fixes. From the usual crew.

   - A series from Neil Brown, greatly improving the bio rescue set
     support. Most notably, this kills the bio rescue work queues, if we
     don't really need them.

   - Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
  lightnvm: pblk: set line bitmap check under debug
  lightnvm: pblk: verify that cache read is still valid
  lightnvm: pblk: add initialization check
  lightnvm: pblk: remove target using async. I/Os
  lightnvm: pblk: use vmalloc for GC data buffer
  lightnvm: pblk: use right metadata buffer for recovery
  lightnvm: pblk: schedule if data is not ready
  lightnvm: pblk: remove unused return variable
  lightnvm: pblk: fix double-free on pblk init
  lightnvm: pblk: fix bad le64 assignations
  nvme: Makefile: remove dead build rule
  blk-mq: map all HWQ also in hyperthreaded system
  nvmet-rdma: register ib_client to not deadlock in device removal
  nvme_fc: fix error recovery on link down.
  nvmet_fc: fix crashes on bad opcodes
  nvme_fc: Fix crash when nvme controller connection fails.
  nvme_fc: replace ioabort msleep loop with completion
  nvme_fc: fix double calls to nvme_cleanup_cmd()
  nvme-fabrics: verify that a controller returns the correct NQN
  nvme: simplify nvme_dev_attrs_are_visible
  ...
2017-07-03 10:34:51 -07:00
Qu Wenruo
848c23b78f btrfs: Remove false alert when fiemap range is smaller than on-disk extent
Commit 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.

However such warning doesn't take the following case into consideration:

0			4K			8K
|<---- fiemap range --->|
|<----------- On-disk extent ------------------>|

In this case, the whole 0~8K extent is cached, and since it's larger than
the fiemap range, it breaks the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, so it is caught by
the final fiemap extent sanity check, causing a kernel warning.

This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have cached
fiemap extent.

Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Fixes: 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:25:20 +02:00
Jan Kara
b7f8a09f80 btrfs: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dsterba@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:24:59 +02:00
Chris Mason
6374e57ad8 btrfs: fix integer overflow in calc_reclaim_items_nr
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.
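
The 64->32 bit truncation described above can be sketched in userspace C.
This is only an illustration, not the kernel code: the function names and the
bytes_per_item parameter are hypothetical, and the negative result relies on
the two's-complement narrowing that GCC and Clang implement.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Narrow variant (hypothetical): the 64-bit quotient is squeezed into a
 * signed 32-bit value. With GCC/Clang the conversion wraps modulo 2^32,
 * so a large enough quotient comes out negative.
 */
static int32_t items_nr_narrow(uint64_t to_reclaim, uint64_t bytes_per_item)
{
	assert(bytes_per_item != 0);
	uint64_t nr = to_reclaim / bytes_per_item;

	return (int32_t)(uint32_t)nr;
}

/*
 * Wide variant (hypothetical): keep the result in 64 bits all the way to
 * the caller, so no truncation can happen.
 */
static uint64_t items_nr_wide(uint64_t to_reclaim, uint64_t bytes_per_item)
{
	assert(bytes_per_item != 0);
	return to_reclaim / bytes_per_item;
}
```

With to_reclaim = 1ULL << 43 and bytes_per_item = 4096 the quotient is 2^31,
which no longer fits in a signed 32-bit integer; the narrow variant then
returns a negative number, the kind of value that would trip a
WARN_ON(nr < 0) style check.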

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
David Sterba
ded56184a5 btrfs: scrub: fix target device intialization while setting up scrub context
The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
helper but wrongly sets up the target device. Incidentally there's a
local variable with the same name as a parameter in the previous
function, so this got caught during runtime as crash in test btrfs/027.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
bc42bda223 btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)

         Task A                  |             Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)           |
|- check_data_free_space()       |
|  |- qgroup_reserve_data()      |
|     Range aligned to page      |
|     range [0, 4K)          <<< |
|     4K bytes reserved      <<< |
|- copy pages to page cache      |
                                 | Buffered_write [2K, 4K)
                                 | |- check_data_free_space()
                                 | |  |- qgroup_reserve_data()
                                 | |     Range aligned to page
                                 | |     range [0, 4K)
                                 | |     Already reserved by A <<<
                                 | |     0 bytes reserved      <<<
                                 | |- delalloc_reserve_metadata()
                                 | |  And it *FAILED* (Maybe EQUOTA)
                                 | |- free_reserved_data_space()
                                      |- qgroup_free_data()
                                         Range aligned to page range
                                         [0, 4K)
                                         Freeing 4K
(Special thanks to Chandan for the detailed report and analysis)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.
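
The idea of freeing only what the current caller reserved can be modelled in
a few lines of C. This is a simplified sketch under stated assumptions: the
structures, the page-granular bitmap and the function names are hypothetical
and do not mirror the kernel's extent_changeset or extent_io_tree.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ  4096u
#define NR_PAGES 64u   /* the model only covers the first 64 pages of a file */

/* Hypothetical qgroup model: which pages hold a reservation, and how much. */
struct qgroup_model {
	uint64_t reserved_map;   /* bit i set: page i has reserved space */
	uint64_t reserved_bytes;
};

/* Hypothetical changeset: pages reserved by one particular caller. */
struct changeset {
	uint64_t map;
};

/* Reserve [start, start + len), page aligned; return newly reserved bytes. */
static uint64_t reserve_data(struct qgroup_model *q, struct changeset *cs,
			     uint64_t start, uint64_t len)
{
	assert(len > 0 && (start + len - 1) / PAGE_SZ < NR_PAGES);
	uint64_t first = start / PAGE_SZ;
	uint64_t last = (start + len - 1) / PAGE_SZ;
	uint64_t newly = 0;

	for (uint64_t i = first; i <= last; i++) {
		uint64_t bit = 1ULL << i;

		if (q->reserved_map & bit)
			continue;        /* already reserved by someone else */
		q->reserved_map |= bit;
		cs->map |= bit;          /* remember that *we* reserved it */
		newly += PAGE_SZ;
	}
	q->reserved_bytes += newly;
	return newly;
}

/* Free only the pages recorded in the caller's changeset. */
static uint64_t free_data(struct qgroup_model *q, struct changeset *cs)
{
	uint64_t freed = 0;

	for (uint64_t i = 0; i < NR_PAGES; i++) {
		uint64_t bit = 1ULL << i;

		if (!(cs->map & bit))
			continue;
		q->reserved_map &= ~bit;
		freed += PAGE_SZ;
	}
	cs->map = 0;
	q->reserved_bytes -= freed;
	return freed;
}
```

In this model, Task A reserving [0, 2K) gets 4K; Task B reserving [2K, 4K)
gets 0 bytes and an empty changeset, so its error path frees nothing and the
4K that belongs to Task A stays accounted.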

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
364ecf3651 btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset, for
btrfs_qgroup_reserve_data() and its callers.

Such an extent_changeset was already used inside
btrfs_qgroup_reserve_data() to record which ranges the current call
reserved, so that it can free them in error paths.

The reason we need to export it to callers is that, in the buffered write
error path, without knowing exactly which ranges we reserved in the current
allocation, we can free space which was not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
a12b877b55 btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
[BUG]
Under the following case, we can underflow qgroup reserved space.

            Task A                |            Task B
---------------------------------------------------------------
 Quota disabled                   |
 Buffered write                   |
 |- btrfs_check_data_free_space() |
 |  *NO* qgroup space is reserved |
 |  since quota is *DISABLED*     |
 |- All pages are copied to page  |
    cache                         |
                                  | Enable quota
                                  | Quota scan finished
                                  |
                                  | Sync_fs
                                  | |- run_delalloc_range
                                  | |- Write pages
                                  | |- btrfs_finish_ordered_io
                                  |    |- insert_reserved_file_extent
                                  |       |- btrfs_qgroup_release_data()
                                  |          Since no qgroup space is
                                             reserved in Task A, we
                                             underflow qgroup reserved
                                             space
This can be detected by fstest btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
size of qgroup reserved_space in all cases.
And btrfs_qgroup_release_data() will check if quotas are enabled.

However in the above case, the buffered write happens before quota is
enabled, so we don't have the reserved space for that range.

[FIX]
In insert_reserved_file_extent(), we tell qgroup to release only the actual
number of bytes it had reserved.
In the above case, since we don't have the reserved space, we tell
qgroups to release 0 byte, so the problem can be fixed.

And thanks to the @reserved parameter introduced by the qgroup rework,
and previous patch to return released bytes, the fix can be as small as
10 lines.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ changelog updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
7bc329c183 btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs_qgroup_release/free_data() only returns 0 or a negative error
number (ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the exact byte
count it freed/released.

Change it to return the number of bytes actually released/freed instead
of 0 for success.
And slightly modify related extent_changeset structure, since in btrfs
one no-hole data extent won't be larger than 128M, so "unsigned int"
is large enough for the use case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
d1b8b94a2b btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
Quite a lot of qgroup corruption happens due to wrong time of calling
btrfs_qgroup_prepare_account_extents().

Since the safest time is to call it just before
btrfs_qgroup_account_extents(), there is no need to separate these 2
functions.

Merging them will make code cleaner and less bug prone.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ changelog and comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
5edfd9fdc6 btrfs: qgroup: Add quick exit for non-fs extents
Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents.

The quick exit condition is:
1) The extent belongs to a non-fs tree
   Only fs-tree extents can affect qgroup numbers and is the only case
   where extent can be shared between different trees.

   Although strictly speaking extent in data-reloc or tree-reloc tree
   can be shared, data/tree-reloc root won't appear in the result of
   btrfs_find_all_roots(), so we can ignore such case.

   So we can check the first root in old_roots/new_roots ulist.
   - if we find the 1st root is not a fs/subvol root, then we can skip
     the extent
   - if we find the 1st root is a fs/subvol root, then we must continue
     calculation

OR

2) both 'nr_old_roots' and 'nr_new_roots' are 0
   This means either such extent got allocated then freed in current
   transaction or it's a new reloc tree extent, whose nr_new_roots is 0.
   Either way it won't affect qgroup accounting and can be skipped
   safely.
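
The two quick exit conditions above can be sketched as a small predicate.
This is an illustrative model only: the function names are hypothetical, the
roots are passed as plain arrays instead of ulists, and the root id ranges
(5 for the top-level fs tree, 256 up to -256ULL for subvolumes) are
assumptions that follow the usual btrfs object id conventions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FS_TREE_ID       5ULL                 /* assumed top-level fs tree id */
#define FIRST_SUBVOL_ID  256ULL               /* assumed first subvolume id */
#define LAST_SUBVOL_ID   (UINT64_MAX - 255)   /* assumed last id, i.e. -256ULL */

/* Hypothetical helper: does this root id name a fs or subvolume tree? */
static bool is_fs_or_subvol_root(uint64_t root_id)
{
	return root_id == FS_TREE_ID ||
	       (root_id >= FIRST_SUBVOL_ID && root_id <= LAST_SUBVOL_ID);
}

/* Return true when the extent cannot change any qgroup number. */
static bool can_skip_accounting(const uint64_t *old_roots, size_t nr_old,
				const uint64_t *new_roots, size_t nr_new)
{
	assert(nr_old == 0 || old_roots != NULL);
	assert(nr_new == 0 || new_roots != NULL);

	/* Condition 2: no roots before and none after. */
	if (nr_old == 0 && nr_new == 0)
		return true;

	/* Condition 1: checking the first root of either list is enough. */
	if (nr_old > 0 && !is_fs_or_subvol_root(old_roots[0]))
		return true;
	if (nr_new > 0 && !is_fs_or_subvol_root(new_roots[0]))
		return true;

	return false;
}
```

An extent owned by root 2 (the extent tree) with nr_old_roots=0 and
nr_new_roots=1 is skipped by this predicate, while an extent whose first
root is 5 or 257 still goes through the full calculation.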

Such quick exit can make trace output quieter and less confusing:
(example with fs uuid and time stamp removed)

Before:
------
add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1
------
Extent tree block will trigger btrfs_qgroup_account_extent() trace point
while no qgroup number is changed, as extent tree won't affect qgroup
accounting.

After:
------
add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
------
Now such unrelated extent won't trigger btrfs_qgroup_account_extent()
trace point, making the trace less noisy.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ changelog and comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Omar Sandoval
d7eae3403f Btrfs: rework delayed ref total_bytes_pinned accounting
The total_bytes_pinned counter is completely broken when accounting
delayed refs:

- If two drops for the same extent are merged, we will decrement
  total_bytes_pinned twice but only increment it once.
- If an add is merged into a drop or vice versa, we will decrement the
  total_bytes_pinned counter but never increment it.
- If multiple references to an extent are dropped, we will account it
  multiple times, potentially vastly over-estimating the number of bytes
  that will be freed by a commit and doing unnecessary work when we're
  close to ENOSPC.

The last issue is relatively minor, but the first two make the
total_bytes_pinned counter leak or underflow very often. These
accounting issues were introduced in b150a4f10d ("Btrfs: use a percpu
to keep track of possibly pinned bytes"), but they were papered over by
zeroing out the counter on every commit until d288db5dc0 ("Btrfs: fix
race of using total_bytes_pinned").

We need to make sure that an extent is accounted as pinned exactly once
if and only if we will drop references to it when the transaction
is committed. Ideally we would only add to total_bytes_pinned when the
*last* reference is dropped, but this information isn't readily
available for data extents. Again, this over-estimation can lead to
extra commits when we're close to ENOSPC, but it's not as bad as before.

The fix implemented here is to increment total_bytes_pinned when the
total refmod count for an extent goes negative and decrement it if the
refmod count goes back to non-negative or after we've run all of the
delayed refs for that extent.
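
The rule described above can be sketched as a pair of small functions. This
is a simplified userspace model, not the kernel implementation: the struct
and function names are hypothetical, and the per-extent total ref mod is
passed in explicitly as old and new values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-space-info pinned byte counter. */
struct pinned_counter {
	int64_t total_bytes_pinned;
};

/*
 * Called whenever merging a delayed ref changes an extent's total ref mod
 * from old_mod to new_mod. Only sign transitions touch the counter, so
 * merged drops are counted once and an add that cancels a drop undoes it.
 */
static void account_ref_mod_change(struct pinned_counter *c, int old_mod,
				   int new_mod, uint64_t num_bytes)
{
	assert(c != NULL);
	if (old_mod >= 0 && new_mod < 0)
		c->total_bytes_pinned += (int64_t)num_bytes;
	else if (old_mod < 0 && new_mod >= 0)
		c->total_bytes_pinned -= (int64_t)num_bytes;
}

/*
 * Called once all delayed refs for the extent have been run: if the extent
 * was still accounted as pinned, drop that accounting.
 */
static void account_refs_done(struct pinned_counter *c, int final_mod,
			      uint64_t num_bytes)
{
	assert(c != NULL);
	if (final_mod < 0)
		c->total_bytes_pinned -= (int64_t)num_bytes;
}
```

In this model two merged drops of the same 4K extent (0 -> -1 -> -2) add 4K
exactly once, and adds that bring the total back to 0 remove it again, so the
counter neither leaks nor underflows for these sequences.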

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
7be07912b3 Btrfs: return old and new total ref mods when adding delayed refs
We need this to decide when to account pinned bytes.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
0a16c7d7ae Btrfs: always account pinned bytes when dropping a tree block ref
Currently, we only increment total_bytes_pinned in
btrfs_free_tree_block() when dropping the last reference on the block.
However, when the delayed ref is run later, we will decrement
total_bytes_pinned regardless of whether it was the last reference or
not. This causes the counter to underflow when the reference we dropped
was not the last reference. Fix it by incrementing the counter
unconditionally, which is what btrfs_free_extent() does. This makes
total_bytes_pinned an overestimate when references to shared extents are
dropped, but in the worst case this will just make us try to commit the
transaction to try to free up space and find we didn't free enough.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
4da8b76d34 Btrfs: update total_bytes_pinned when pinning down extents
The extents marked in pin_down_extent() will be unpinned later in
unpin_extent_range(), which decrements total_bytes_pinned.
pin_down_extent() must increment the counter to avoid underflowing it.
Also adjust btrfs_free_tree_block() to avoid accounting for the same
extent twice.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
55e8196a57 Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
The value of flags is one of DATA/METADATA/SYSTEM, and the matching space
info must exist when add_pinned_bytes() is called.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
0d9f824df3 Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
There are a few places where we pass in a negative num_bytes, so make it
signed for clarity. Also move it up in the file since later patches will
need it there.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
David Sterba
1164a9fb9c btrfs: fix validation of XATTR_ITEM dir items
The XATTR_ITEM is a type of a directory item so we use the common
validator helper. Unlike other dir items, it can have data. The way the
name len validation is currently implemented does not reflect that. We'd
have to adjust by the data_len when comparing the read and item limits.

However, this will not work for multi-item xattr dir items.

Example from tree dump of generic/337:

        item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147
                location key (0 UNKNOWN.0 0) type XATTR
                transid 8 data_len 3 name_len 11
                name: user.foobar
                data 123
                location key (0 UNKNOWN.0 0) type XATTR
                transid 8 data_len 6 name_len 13
                name: user.WvG1c1Td
                data qwerty
                location key (0 UNKNOWN.0 0) type XATTR
                transid 8 data_len 5 name_len 19
                name: user.J3__T_Km3dVsW_
                data hello

At the point of btrfs_is_name_len_valid call we don't have access to the
data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf,
di) would always return 3, although we'd need to get 6 and 5 respectively to
get the calculations right. (read_end + name_len + data_len vs item_end)

We'd have to also pass data_len externally, which is not the point of the
name validation. The last check is supposed to test if there's at least
one dir item space after the one we're processing. I don't think this is
particularly useful, validation of the next item would catch that too.
So the check is removed and we don't weaken the validation. Now tests
btrfs/048, btrfs/053, generic/273 and generic/337 pass.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:06:11 +02:00
Jens Axboe
e6959b9350 btrfs: add support for passing in write hints for buffered writes
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:52 -06:00
Su Yue
fbc326159a btrfs: Verify dir_item in iterate_object_props
Call verify_dir_item before memcmp_extent_buffer reading name from
dir_item.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
64c7b01446 btrfs: Check name_len before in btrfs_del_root_ref
btrfs_del_root_ref calls btrfs_search_slot and reads name from root_ref.
Call btrfs_is_name_len_valid before memcmp.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
488d7c4566 btrfs: Check name_len before reading btrfs_get_name
btrfs_get_name calls btrfs_search_slot and then reads the name from
inode_ref/root_ref.

Call btrfs_is_name_len_valid in btrfs_get_name.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
59b0a7f2c7 btrfs: Check name_len before read in iterate_dir_item
Since iterate_dir_item checks name_len in its own way, use
btrfs_is_name_len_valid instead of 'verify_dir_item' to make the name_len
check stricter.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switched ENAMETOOLONG to EIO ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
3c1d418448 btrfs: Check name_len in btrfs_check_ref_name_override
In btrfs_log_inode, btrfs_search_forward gets the buffer and then
btrfs_check_ref_name_override will read name from ref/extref for the
first time.

Call btrfs_is_name_len_valid before reading name.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
8ee8c2d62d btrfs: Verify dir_item in replay_xattr_deletes
replay_xattr_deletes calls btrfs_search_slot to get buffer and reads
name.

Call verify_dir_item to check name_len in replay_xattr_deletes to avoid
reading out of boundary.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
26a836cec2 btrfs: Check name_len on add_inode_ref call path
replay_one_buffer first reads buffers and dispatches items according to
the item type.
In this patch, add_inode_ref handles inode_ref and inode_extref.
Then add_inode_ref calls ref_get_fields and extref_get_fields to read
ref/extref name for the first time.
So checking name_len before reading those two is fine.

add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir.
The call graph includes btrfs_match_dir_item_name to read dir_item name
in the parent dir.
Checking first dir_item is not enough. Change it to verify every
dir_item while doing matches.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
e79a33270d btrfs: Check name_len with boundary in verify dir_item
Originally, verify_dir_item verifies name_len of dir_item with fixed
values but not item boundary.
If corrupted name_len was not bigger than the fixed value, for example
255, the function will think the dir_item is fine. And then reading
beyond boundary will cause crash.

Example:
	1. Corrupt one dir_item name_len to be 255.
        2. Run 'ls -lar /mnt/test/ > /dev/null'
dmesg:
[   48.451449] BTRFS info (device vdb1): disk space caching is enabled
[   48.451453] BTRFS info (device vdb1): has skinny extents
[   48.489420] general protection fault: 0000 [#1] SMP
[   48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
[   48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
[   48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
[   48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
[   48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
[   48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
[   48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
[   48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
[   48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
[   48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
[   48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
[   48.490008] FS:  00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
[   48.490008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
[   48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   48.490008] Call Trace:
[   48.490008]  btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
[   48.490008]  iterate_dir+0x181/0x1b0
[   48.490008]  SyS_getdents+0xa7/0x150
[   48.490008]  ? fillonedir+0x150/0x150
[   48.490008]  entry_SYSCALL_64_fastpath+0x18/0xad
[   48.490008] RIP: 0033:0x7fef5032546b
[   48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
[   48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
[   48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
[   48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
[   48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
[   48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
[   48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
[   48.499455] ---[ end trace 321920d8e8339505 ]---

Fix it by adding a parameter @slot and check name_len with item boundary
by calling btrfs_is_name_len_valid.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
rev
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
19c6dcbfa7 btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary
Introduce function btrfs_is_name_len_valid.

The function compares parameter @name_len with item boundary then
returns true if name_len is valid.
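
The boundary comparison can be illustrated with a minimal userspace sketch.
It is not the kernel function: the struct, the function name and the flat
offset model (an item occupying [item_start, item_start + item_size) inside
a leaf) are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of where an item's bytes live in a leaf. */
struct item_bounds {
	uint32_t item_start;   /* offset of the first byte of the item */
	uint32_t item_size;    /* number of bytes the item occupies */
};

/*
 * Return true if a name of name_len bytes starting at offset name_start
 * lies entirely inside the item. The sums are done in 64 bits so that a
 * corrupted, very large name_len cannot wrap around and pass the check.
 */
static bool name_len_fits(const struct item_bounds *it, uint32_t name_start,
			  uint32_t name_len)
{
	assert(it != NULL);
	uint64_t item_end = (uint64_t)it->item_start + it->item_size;
	uint64_t name_end = (uint64_t)name_start + name_len;

	if (name_start < it->item_start)
		return false;
	return name_end <= item_end;
}
```

For an item at offset 100 with size 40, a 10-byte name starting at offset
130 fits, while a corrupted name_len of 255 at the same offset is rejected
because it would read past the end of the item, even though 255 passes a
fixed-limit check.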

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
David Sterba
66b4993e95 btrfs: move dev stats accounting out of wait_dev_flush
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:39 +02:00
David Sterba
2980d5745f btrfs: account as waiting for IO, while waiting fot the flush bio completion
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:39 +02:00
David Sterba
e0ae999414 btrfs: preallocate device flush bio
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.

Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).

The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.

Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:38 +02:00
Filipe Manana
fdb1388994 Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).

Consider the following example scenario where this issue happens.

Parent snapshot:

 .                                                      (ino 256)
 |
 |--- dir1/                                             (ino 257)
       |--- dir2/                                       (ino 258)
       |     |--- file1                                 (ino 259)
       |     |--- file3                                 (ino 261)
       |
       |--- dir3/                                       (ino 262)
             |--- file22                                (ino 260)
             |--- dir4/                                 (ino 263)

Send snapshot:

 .                                                      (ino 256)
 |
 |--- dir1/                                             (ino 257)
       |--- dir2/                                       (ino 258)
       |--- dir3                                        (ino 260)
       |--- file3/                                      (ino 262)
             |--- dir4/                                 (ino 263)
                   |--- file11                          (ino 269)
                   |--- file33                          (ino 261)

When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:

 receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
 utimes
 utimes dir1
 utimes dir1/dir2
 link dir1/dir3/dir4/file11 -> dir1/dir2/file1
 unlink dir1/dir2/file1
 utimes dir1/dir2
 truncate dir1/dir3/dir4/file11 size=0
 utimes dir1/dir3/dir4/file11
 rename dir1/dir3 -> o262-7-0
 link dir1/dir3 -> o262-7-0/file22
 unlink dir1/dir3/file22
 ERROR: unlink dir1/dir3/file22 failed. Not a directory

The following steps happen during the computation of the incremental send
stream that lead to this issue:

1) Before we start processing the new and deleted references for inode
   260, we compute the full path of the deleted reference
   ("dir1/dir3/file22") and cache it in the list of deleted references
   for our inode.

2) We then start processing the new references for inode 260, for which
   there is only one new, located at "dir1/dir3". When processing this
   new reference, we check that inode 262, which was not yet processed,
   collides with the new reference and because of that we orphanize
   inode 262 so its new full path becomes "o262-7-0".

3) After the orphanization of inode 262, we create the new reference for
   inode 260 by issuing a link command with a target path of "dir1/dir3"
   and a source path of "o262-7-0/file22".

4) We then start processing the deleted references for inode 260, for
   which there is only one with the base name of "file22", and issue
   an unlink operation containing the target path computed at step 1,
   which is wrong because that path no longer exists and should be
   replaced with "o262-7-0/file22".

So fix this issue by recomputing the full path of deleted references if,
while processing the new references for an inode, we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
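
The stale path from steps 1 to 4 can be reproduced with a toy model. This is
a sketch in Python, not the send code: the dictionary layout and the
full_path() helper are assumptions made for illustration, and only the inode
numbers and names come from the example above.

```python
ROOT_INO = 256

def full_path(tree, ino):
    """Build the path of `ino` by walking parent links up to the root."""
    parts = []
    while ino != ROOT_INO:
        parent, name = tree[ino]
        parts.append(name)
        ino = parent
    return "/".join(reversed(parts))

# ino -> (parent ino, name), as in the parent snapshot of the example
tree = {257: (256, "dir1"), 262: (257, "dir3"), 260: (262, "file22")}

cached = full_path(tree, 260)        # step 1: path computed and cached
tree[262] = (256, "o262-7-0")        # step 2: inode 262 is orphanized
recomputed = full_path(tree, 260)    # path recomputed after orphanization

print(cached)      # dir1/dir3/file22
print(recomputed)  # o262-7-0/file22
```

The cached value is the path the failing unlink used, and the recomputed
value is the path that still exists on the receiver at that point.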

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 16:53:10 +02:00
Filipe Manana
72c3668fed Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:

Parent snapshot

 .                  (ino 256)
 |
 |--- f1            (ino 257)
 |--- f2            (ino 258)
 |--- f3            (ino 259)

Send snapshot

 .                  (ino 256)
 |
 |--- f2            (ino 257)
 |--- f3            (ino 258)
 |--- f4            (ino 259)
 |--- f5            (ino 258)

The following steps happen when computing the incremental send stream:

1) When processing inode 257, inode 258 is orphanized (renamed to
   "o258-7-0"), because its current reference has the same name as the
   new reference for inode 257;

2) When processing inode 258, we iterate over all its new references,
   which have the names "f3" and "f5". The first iteration sees name
   "f5" and renames the inode from its orphan name ("o258-7-0") to
   "f5", while the second iteration sees the name "f3" and, incorrectly,
   issues a link operation with a target name matching the orphan name,
   which no longer exists. The first iteration had reset the current
   valid path of the inode to "f5", but in the second iteration we lost
   it because we found another inode, with a higher number of 259, which
   has a reference named "f3" as well, so we orphanized inode 259 and
   recomputed the current valid path of inode 258 to its old orphan
   name because inode 259 could be an ancestor of inode 258 and therefore
   the current valid path could contain the pre-orphanization name of
   inode 259. However in this case inode 259 is not an ancestor of inode
   258 so the current valid path should not be recomputed.
   This makes the receiver fail with the following error:

   ERROR: link f3 -> o258-7-0 failed: No such file or directory

So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher than the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
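
The condition described above amounts to an ancestor test. A hedged sketch in
Python follows; is_ancestor() and the parent map are hypothetical and only
model the decision, not how send walks references.

```python
def is_ancestor(parents, candidate, ino, root=256):
    """Return True if `candidate` appears on the parent chain of `ino`."""
    cur = parents.get(ino)
    while cur is not None:
        if cur == candidate:
            return True
        if cur == root:
            break
        cur = parents.get(cur)
    return False

# Example from this commit: every inode lives directly under the root.
flat = {257: 256, 258: 256, 259: 256}
# Orphanizing 259 must not invalidate the current path of 258.
print(is_ancestor(flat, 259, 258))   # False

# A nested layout, where the orphanized directory 262 is a parent of 260.
nested = {257: 256, 262: 257, 260: 262}
print(is_ancestor(nested, 262, 260)) # True
```

Only in the second case would the current valid path need to be recomputed.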

A test case for fstests will follow soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 16:53:03 +02:00
Filipe Manana
609805d809 Btrfs: fix invalid extent maps due to hole punching
While punching a hole in a range that is not aligned with the sector size
(currently the same as the page size) we can end up leaving an extent map
in memory with a length that is smaller than the sector size or with a
start offset that is not aligned to the sector size. Both cases are not
expected and can lead to problems. This issue is easily detected
after the patch from commit a7e3b975a0 ("Btrfs: fix reported number of
inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
following for example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
  $ xfs_io -c "fpunch 60K 90K" /mnt/foo
  $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
  $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
  $ umount /mnt

After the unmount operation we can see several warnings emitted due to
underflows related to space reservation counters:

[ 2837.443299] ------------[ cut here ]------------
[ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.462379] Call Trace:
[ 2837.462379]  dump_stack+0x68/0x92
[ 2837.462379]  __warn+0xc2/0xdd
[ 2837.462379]  warn_slowpath_null+0x1d/0x1f
[ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.462379]  destroy_inode+0x3d/0x55
[ 2837.462379]  evict+0x177/0x17e
[ 2837.462379]  dispose_list+0x50/0x71
[ 2837.462379]  evict_inodes+0x132/0x141
[ 2837.462379]  generic_shutdown_super+0x3f/0xeb
[ 2837.462379]  kill_anon_super+0x12/0x1c
[ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.462379]  deactivate_locked_super+0x30/0x68
[ 2837.462379]  deactivate_super+0x36/0x39
[ 2837.462379]  cleanup_mnt+0x58/0x76
[ 2837.462379]  __cleanup_mnt+0x12/0x14
[ 2837.462379]  task_work_run+0x77/0x9b
[ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
[ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
[ 2837.596256] ------------[ cut here ]------------
[ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.663359] Call Trace:
[ 2837.663359]  dump_stack+0x68/0x92
[ 2837.663359]  __warn+0xc2/0xdd
[ 2837.663359]  warn_slowpath_null+0x1d/0x1f
[ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.663359]  ? evict_inodes+0x132/0x141
[ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.663359]  generic_shutdown_super+0x6a/0xeb
[ 2837.663359]  kill_anon_super+0x12/0x1c
[ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.663359]  deactivate_locked_super+0x30/0x68
[ 2837.663359]  deactivate_super+0x36/0x39
[ 2837.663359]  cleanup_mnt+0x58/0x76
[ 2837.663359]  __cleanup_mnt+0x12/0x14
[ 2837.663359]  task_work_run+0x77/0x9b
[ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
[ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
[ 2837.745595] ------------[ cut here ]------------
[ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.758526] Call Trace:
[ 2837.758925]  dump_stack+0x68/0x92
[ 2837.759383]  __warn+0xc2/0xdd
[ 2837.759383]  warn_slowpath_null+0x1d/0x1f
[ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.759383]  ? evict_inodes+0x132/0x141
[ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.759383]  generic_shutdown_super+0x6a/0xeb
[ 2837.759383]  kill_anon_super+0x12/0x1c
[ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.759383]  deactivate_locked_super+0x30/0x68
[ 2837.759383]  deactivate_super+0x36/0x39
[ 2837.759383]  cleanup_mnt+0x58/0x76
[ 2837.759383]  __cleanup_mnt+0x12/0x14
[ 2837.759383]  task_work_run+0x77/0x9b
[ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
[ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
[ 2837.778235] ------------[ cut here ]------------
[ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.800118] Call Trace:
[ 2837.800515]  dump_stack+0x68/0x92
[ 2837.801015]  __warn+0xc2/0xdd
[ 2837.801471]  warn_slowpath_null+0x1d/0x1f
[ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.801698]  ? evict_inodes+0x132/0x141
[ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.801698]  generic_shutdown_super+0x6a/0xeb
[ 2837.801698]  kill_anon_super+0x12/0x1c
[ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.801698]  deactivate_locked_super+0x30/0x68
[ 2837.801698]  deactivate_super+0x36/0x39
[ 2837.801698]  cleanup_mnt+0x58/0x76
[ 2837.801698]  __cleanup_mnt+0x12/0x14
[ 2837.801698]  task_work_run+0x77/0x9b
[ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
[ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
[ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
[ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0

What happens in the above example is the following:

1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
   is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
   This results in the creation of an extent map with a length of 2Kb
   starting at file offset 148Kb, through find_first_non_hole() ->
   btrfs_get_extent().

2) The second write (first write after the hole punch operation), sets
   the range [50Kb, 152Kb[ to delalloc.

3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
   map covering the range [148Kb, 150Kb[ and ends up calling
   set_extent_bit() for the same range, which results in splitting an
   existing extent state record, covering the range [148Kb, 152Kb[ into
   two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
   [150Kb, 152Kb[.

4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
   btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
   range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
   callback being invoked against the two 2Kb extent state records that
   cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
   the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
   with a length argument of 2048 bytes. That function rounds up the length
   to a sector size aligned length, so it ends up considering a length of
   4096 bytes, and then calls calc_csum_metadata_size() which results in
   decrementing the inode's csum_bytes counter by 4096 bytes, so afterwards
   it holds a value of 0 bytes. Then the same happens when
   btrfs_clear_bit_hook() is called against the second extent state that
   has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
   rounded up to 4096 and calc_csum_metadata_size() ends up being called
   to decrement 4096 bytes from the inode's csum_bytes counter, which
   at that time has a value of 0, leading to an underflow, which is
   exactly what triggers the first warning, at btrfs_destroy_inode().
   All the other warnings relate to several space accounting counters
   that underflow as well due to similar reasons.
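
The arithmetic in step 4 can be checked with a small model. This is a sketch
in Python under stated assumptions: release() is a hypothetical stand-in for
the accounting, and the starting counter value of 4096 is inferred from the
description above. The 64-bit mask emulates unsigned wraparound in C.

```python
SECTOR_SIZE = 4096
U64_MASK = (1 << 64) - 1

def round_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align

def release(counter, length):
    """Subtract a sector-aligned length from a u64 counter, wrapping like C."""
    return (counter - round_up(length, SECTOR_SIZE)) & U64_MASK

csum_bytes = 4096                       # assumed value before clearing delalloc
csum_bytes = release(csum_bytes, 2048)  # first 2Kb state: rounded to 4096 -> 0
after_first = csum_bytes
csum_bytes = release(csum_bytes, 2048)  # second 2Kb state: wraps below zero
print(after_first)  # 0
print(csum_bytes)   # 18446744073709547520
```

The wrapped value is 2^64 - 4096, the same number reported as may_use in the
space_info line of the log above.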

A similar case but where the hole punching operation creates an extent map
with a start offset not aligned to the sector size is the following:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
  $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
  $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
  $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
  $ umount /mnt

During the unmount operation we get similar traces for the same reasons as
in the first example.

So fix the hole punching operation to make sure it never creates extent
maps with a length that is not aligned to the sector size nor with a start
offset that is not aligned to the sector size, as this breaks all
assumptions and it's a land mine.

Fixes: d77815461f ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
Cc: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 16:52:45 +02:00
Jeff Mahoney
cddf3b2cb3 btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items
On an uncontended system, we can end up hitting soft lockups while
doing replace_path.  At the core, and frequently called is
btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
there.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 15:48:01 +02:00
Dan Carpenter
0e9350de2e btrfs: use new block error code
This function is supposed to return blk_status_t error codes now but
there was a stray -ENOMEM left behind.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 07:47:34 -06:00
Nikolay Borisov
104b4e5139 percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable.  Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add.  In terms of context-safety,
they're equivalent.  The only difference is that the __ version takes
a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
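
To illustrate the relationship between the two entry points, here is a toy,
single-threaded model in Python. The class, its method names, and the default
batch of 32 are assumptions for illustration; it does not model preemption or
locking.

```python
class PercpuCounterModel:
    """Toy model of a batched per-CPU counter."""

    def __init__(self, ncpus):
        self.count = 0              # shared, approximate total
        self.local = [0] * ncpus    # per-CPU deltas not yet folded in

    def add_batch(self, cpu, amount, batch):
        delta = self.local[cpu] + amount
        if abs(delta) >= batch:
            # Fold the accumulated local delta into the shared count.
            self.count += delta
            self.local[cpu] = 0
        else:
            self.local[cpu] = delta

    def add(self, cpu, amount, default_batch=32):
        # The plain variant only supplies a default batch size.
        self.add_batch(cpu, amount, default_batch)

    def total(self):
        """Exact value: shared count plus all pending local deltas."""
        return self.count + sum(self.local)
```

In this model add() and add_batch() share one code path, which mirrors the
point that the two differ only in the batch parameter.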

This patch doesn't cause any functional changes.

tj: Minor updates to patch description for clarity.  Cosmetic
    indentation updates.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
2017-06-20 15:42:32 -04:00
Goldwyn Rodrigues
edf064e7c6 btrfs: nowait aio support
Return EAGAIN if any of the following checks fail
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Nikolay Borisov
7dfb8be11b btrfs: Round down values which are written for total_bytes_size
We got an internal report about a file system not wanting to mount
following 99e3ecfcb9 ("Btrfs: add more validation checks for
superblock").

BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
fs_devices total_rw_bytes 1000203820544

Subtracting the numbers we get a difference of less than 4kb. Upon
closer inspection it became apparent that mkfs actually rounds down the
size of the device to a multiple of sector size. However, the same
cannot be said for various functions which modify the total size and are
called from btrfs_balance as well as when adding a new device. So this
patch ensures that values being saved into on-disk data structures are
always rounded down to a multiple of sectorsize.
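
The numbers in the error message can be checked directly. A short Python
sketch, assuming a sectorsize of 4096 bytes; round_down() here is a local
helper written for this example.

```python
def round_down(value, sectorsize):
    """Round value down to a multiple of sectorsize."""
    return value - (value % sectorsize)

total_rw_bytes = 1000203820544      # fs_devices total_rw_bytes from the error
super_total_bytes = 1000203816960   # super_total_bytes from the error

print(total_rw_bytes - super_total_bytes)   # 3584
print(round_down(total_rw_bytes, 4096))     # 1000203816960
```

Rounding the larger value down to the sector size yields exactly the value
stored in the superblock, so the two would agree after this change.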

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:48 +02:00
Nikolay Borisov
eca152edf5 btrfs: Manually implement device_total_bytes getter/setter
The device->total_bytes member needs to always be rounded down to sectorsize
so that it corresponds to the value of super->total_bytes. However, there are
multiple places where the setter is fed a value which is not rounded which
can cause a fs to be unmountable due to the check introduced in
99e3ecfcb9 ("Btrfs: add more validation checks for superblock"). This patch
implements the getter/setter manually so that in a later patch I can add
necessary code to catch offenders.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:48 +02:00
David Sterba
0d0c71b317 btrfs: obsolete and remove mount option alloc_start
The mount option alloc_start was used in the past for debugging and
stressing the chunk allocator. Not meant to be used by users, so we're
not breaking anybody's setup.

There was some added complexity handling changes of the value and when
it was not same as default. Such code has likely been untested and I
think it's better to remove it.

This patch kills all use of alloc_start, and by doing that also fixes
a bug when alloc_size is set, potentially called from statfs:

in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
protection is temporarily dropped so btrfs_account_dev_extents_size can
be called and then RCU is locked again! Doing that inside
list_for_each_entry_rcu is just asking for trouble, but unlikely to be
observed in practice.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:48 +02:00
David Sterba
fac03c8dae btrfs: move fs_info::fs_frozen to the flags
We can keep the state among the other fs_info flags, there's no reason
why fs_frozen would need to be separate.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:42 +02:00
David Sterba
79b4f4c605 btrfs: cleanup duplicate return value in insert_inline_extent
The pattern when err is used for function exit and ret is used for
return values of callees is not used here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:12 +02:00
David Sterba
6165572c11 btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev
The function is called from ioctl context and we don't hold any locks
that take part in writeback. Right now it's only fs_info::volume_mutex.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
6a44517d79 btrfs: use GFP_KERNEL in btrfs_calc_avail_data_space
We don't hold any locks here. Indirectly called from statfs.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Nikolay Borisov
0eee8a494e btrfs: Use btrfs_space_info_used instead of opencoding it
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
4fc6441aac btrfs: wait part of the write_dev_flush() can be separated out
Submit and wait parts of write_dev_flush() can be split into two
separate functions for better readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
cea7c8bf77 btrfs: remove redundant null bdev counting during flush submission
There is no extra benefit to count null bdev during the submit loop,
as these null devices will be anyway checked during command
completion device loop just after the submit loop. We are holding the
device_list_mutex, the device->bdev status won't change in between.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
12b9bf0b94 btrfs: write_dev_flush does not return ENOMEM anymore
Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling"
write_dev_flush will not return ENOMEM in the sending part. We do not
need to check for it in the callers.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Timofey Titovets
170607ebd9 Btrfs: compression must free at least one sector size
We already skip storing data where compression does not make the result
at least one byte less.  Let's make the logic better and check
that compression frees at least one sector size of bytes, otherwise it's
not that useful.
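
A sketch of the rule in Python. The function name and the default sector size
of 4096 are assumptions for illustration; the exact comparison used in the
patch is not shown in this message.

```python
def compression_is_worthwhile(total_in, total_compressed, sectorsize=4096):
    """Keep the compressed result only if it frees at least one full sector."""
    return total_compressed + sectorsize <= total_in

print(compression_is_worthwhile(8192, 4096))  # True: one sector saved
print(compression_is_worthwhile(8192, 8191))  # False: only one byte saved
```

Under the old one-byte rule the second case would have been stored
compressed even though it occupies the same number of sectors on disk.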

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
c5e4c3d750 btrfs: sink gfp parameter to btrfs_io_bio_alloc
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more important,
contexts and the bio allocating API now looks more consistent.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
184f999e12 btrfs: add helper to initialize the non-bio part of btrfs_io_bio
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
fa1bcbe0a5 btrfs: document mandatory order of bio in btrfs_io_bio
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
Liu Bo
ef7cdac101 Btrfs: skip checksum verification if IO error occurs
Currently dio read also goes on to verify the checksum if -EIO has been
returned. Although that verification usually fails, it is not necessary at
all: we can directly check if there is another copy to read.

And with this, the behavior of dio read is now consistent with that of
buffered read.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use bool for uptodate ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
Liu Bo
e3d37faba2 Btrfs: tolerate errors if we have retried successfully
With raid1 profile, dio read isn't tolerating IO errors if read length is
less than the stripe length (64K).

Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags &
BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read
length is less than 64k.  In this case, if the underlying device returns
error somehow, bio->bi_error has recorded that error.

If we could recover the correct data from another copy in profile raid1/10/5/6,
with btrfs_subio_endio_read() returning 0, bio would have the correct data in
its vector, but bio->bi_error is not updated accordingly so that the following
dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed.

This fixes the problem by setting bio's error to 0 if a good copy has been
found.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
c821e7f3da btrfs: pass bytes to btrfs_bio_alloc
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callers.
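
A minimal sketch of hiding the conversion in a helper, assuming 512-byte sectors (a shift by 9). demo_bio, demo_bio_init and DEMO_SECTOR_SHIFT are hypothetical names for illustration, not the btrfs API.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9	/* assumption: 512-byte sectors */

/* Hypothetical stand-in that only records the starting sector. */
struct demo_bio {
	uint64_t bi_sector;
};

/* Callers pass a byte offset; the helper converts it to sectors. */
static void demo_bio_init(struct demo_bio *bio, uint64_t first_byte)
{
	bio->bi_sector = first_byte >> DEMO_SECTOR_SHIFT;
}
```

With this shape, a caller that already holds a byte offset no longer has to repeat the shift at every call site.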

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
9886b17433 btrfs: opencode trivial compressed_bio_alloc, simplify error handling
compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no
point keeping it. The error handling can be simplified, as we know
btrfs_bio_alloc will never fail.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
9f2179a5e7 btrfs: remove redundant parameters from btrfs_bio_alloc
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.

submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
8b6c1d56f2 btrfs: sink gfp parameter to btrfs_bio_clone
All callers pass GFP_NOFS.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
e4f5690386 btrfs: btrfs_io_bio_alloc never fails, skip error handling
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
3aa8e074ab btrfs: btrfs_bio_clone never fails, skip error handling
Update direct callers of btrfs_bio_clone that do error handling, that we
can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
0c4dd97c5e btrfs: btrfs_bio_alloc never fails, skip error handling
Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
6e707bcd1f btrfs: bioset allocations will never fail, adapt our helpers
Christoph pointed out that bio allocations backed by a bioset will never
fail.  As we always use a bioset for all bio allocations, we can skip
the error handling.  This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.

CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
6acafd1eff btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace
The compression workspace buffers are larger than a page so we use
vmalloc, unconditionally. This is not always necessary as there might be
contiguous memory available.

Let's use the kvmalloc helpers that will try kmalloc first and fall back
to vmalloc. For that they require GFP_KERNEL flags. As we now have the
alloc_workspace calls protected by memalloc_nofs in the critical
contexts, we can safely use GFP_KERNEL.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
389a6cfc2a btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace
As alloc_workspace is now protected by memalloc_nofs where needed,
we can switch the kmalloc to use GFP_KERNEL.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
fe30853307 btrfs: add memalloc_nofs protections around alloc_workspace callback
The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.

To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.
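
The scoped protection can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel implementation: the demo_* names and flag values are hypothetical, and a plain static variable stands in for the per-task flags.

```c
#include <assert.h>

#define DEMO_GFP_IO	0x1u
#define DEMO_GFP_FS	0x2u
#define DEMO_GFP_KERNEL	(DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_PF_NOFS	0x1u

/* Stand-in for the per-task flags word. */
static unsigned int demo_task_flags;

/* Enter a NOFS scope and return the previous state of the NOFS bit. */
static unsigned int demo_nofs_save(void)
{
	unsigned int old = demo_task_flags & DEMO_PF_NOFS;

	demo_task_flags |= DEMO_PF_NOFS;
	return old;
}

/* Leave the scope by putting the saved NOFS bit back. */
static void demo_nofs_restore(unsigned int old)
{
	demo_task_flags = (demo_task_flags & ~DEMO_PF_NOFS) | old;
}

/* The mask an allocator would honour for a requested mask. */
static unsigned int demo_effective_gfp(unsigned int gfp)
{
	if (demo_task_flags & DEMO_PF_NOFS)
		gfp &= ~DEMO_GFP_FS;
	return gfp;
}
```

Because the previous state is saved and restored, scopes nest: leaving an inner scope does not drop the protection set up by an outer one, and code inside the scope can request the full mask unconditionally.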

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
adf0212396 btrfs: adjust includes after vmalloc removal
As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now
use the proper header that defines kvmalloc.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
f54de068dd btrfs: use GFP_KERNEL in init_ipath
Now that init_ipath is called either from a safe context or with
memalloc_nofs protection, we can switch to GFP_KERNEL allocations in
init_path and init_data_container.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
de2491fdef btrfs: scrub: add memalloc_nofs protection around init_ipath
init_ipath is called from a safe ioctl context and from scrub when
printing an error.  The protection is added for three reasons:

* init_data_container calls vmalloc and this does not work as expected
  in the GFP_NOFS context, so this silently does GFP_KERNEL and might
  deadlock in some cases
* keep the context constraint of GFP_NOFS, used by scrub
* we want to use GFP_KERNEL unconditionally inside init_ipath or its
  callees

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
f11f74416a btrfs: send: use kvmalloc in iterate_dir_item
We use a growing buffer for xattrs larger than a page size, at some
point vmalloc is unconditionally used for larger buffers. We can still
try to avoid it using the kvmalloc helper.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
818e010bf9 btrfs: replace opencoded kvzalloc with the helper
The logic of kmalloc and vmalloc fallback is opencoded in
several places, we can now use the existing helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Timofey Titovets
1e9d7291e5 Btrfs: lzo: compressed data size must be less then input size
Logic already skips if compression makes data bigger, let's sync lzo
with zlib and also return error if compressed size is equal to
input size.
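
The rule can be sketched as a single comparison. demo_check_compressed_size is a hypothetical helper, and the use of -E2BIG as the error code is an assumption made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Accept the compressed result only if it is strictly smaller than the
 * input; an equal or larger result is reported as an error.
 */
static int demo_check_compressed_size(size_t in_len, size_t out_len)
{
	if (out_len >= in_len)
		return -E2BIG;	/* assumed error code */
	return 0;
}
```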

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Guoqing Jiang
054ec2f626 btrfs: simplify code with bio_io_error
bio_io_error was introduced in the commit 4246a0b63b
("block: add a bi_error field to struct bio"), so use it to simplify
code.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Omar Sandoval
25ff17e82f Btrfs: use memalloc_nofs and kvzalloc() for free space tree bitmaps
First, instead of open-coding the vmalloc() fallback, use the new
kvzalloc() helper. Second, use memalloc_nofs_{save,restore}() instead of
GFP_NOFS, as vmalloc() uses some GFP_KERNEL allocations internally which
could lead to deadlocks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba
4b5faeac46 btrfs: use generic slab for for btrfs_transaction
Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the number goes down to
1.  This matches the expectations of the transaction lifetime.

For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.

We'll lose the stats about the object use, which could be added later if
we really need them.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba
3fb99303c6 btrfs: scrub: embed scrub_wr_ctx into scrub context
The structure scrub_wr_ctx is not used anywhere but the scrub context,
we can move the members there. The tgtdev is renamed so it's more clear
that it belongs to the "wr" part.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba
25cc1226c1 btrfs: scrub: use fs_info::sectorsize and drop it from scrub context
As we now have the node/block sizes in fs_info, we can use them and can
drop the local copies.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Yonghong Song
04a87e3472 Btrfs: add statx support
Return enhanced file attributes from the btrfs, including:
  (1). inode creation time as stx_btime, and
  (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags.

Example output:
	[root@localhost ~]# cat t.sh
	touch t
	chattr +aic t
	~/linux/samples/statx/test-statx t
	chattr -aic t
	touch t
	echo "========================================"
	~/linux/samples/statx/test-statx t
	/bin/rm t
	[root@localhost ~]# ./t.sh
	statx(t) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:13.999856591-0700
	Modify: 2017-05-11 16:03:13.999856591-0700
	Change: 2017-05-11 16:03:14.000856663-0700
	 Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000034 (........ ........ ........ ........ ........ ........ ........ .-ai.c..)
	========================================
	statx(t) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:14.006857097-0700
	Modify: 2017-05-11 16:03:14.006857097-0700
	Change: 2017-05-11 16:03:14.006857097-0700
	 Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000000 (........ ........ ........ ........ ........ ........ ........ .---.-..)
	[root@localhost ~]#

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Timofey Titovets
036b0217ad Btrfs: lzo: fix typo in error message after failed deflate
Fix copy paste typo in debug message for lzo.c, lzo is not deflate.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Jeff Layton
3189ff7786 btrfs: btrfs_wait_tree_block_writeback can be void return
Nothing checks its return value.

Is it safe to skip checking return value of btrfs_wait_tree_block_writeback?

Liu Bo: I think yes, it's used in walk_log_tree which is called in two
places, free_log_tree and log replay.  For free_log_tree, it waits for
any running writeback of the extent buffer under freeing to finish in
case we need to access the eb pointer from page->private, and it's OK to
not check the return value, while for log replay, it doesn't wait
because wc->wait is not set. So neither cares about the writeback error.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ added more explanation to changelog, from Liu Bo ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Nikolay Borisov
118c701e20 btrfs: remove __BTRFS_LEAF_DATA_SIZE
__BTRFS_LEAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the
latter subsume the former.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Nikolay Borisov
3d9ec8c49a btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET
Commit 5f39d397df ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.
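
An offset of this kind can be expressed as a constant with offsetof(), with no function and no unused parameter. The sketch below uses hypothetical demo_* types whose sizes are placeholders; it is not the on-disk btrfs layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder header and item types; the sizes are illustrative only. */
struct demo_header {
	uint8_t raw[101];
};

struct demo_item {
	uint8_t raw[25];
};

/* A leaf is a header followed by an array of items. */
struct demo_leaf {
	struct demo_header header;
	struct demo_item items[];
};

/* Offset of the item area from the start of the leaf. */
#define DEMO_LEAF_DATA_OFFSET offsetof(struct demo_leaf, items)
```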

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Anand Jain
e1ddce71d6 btrfs: reduce arguments for decompress_bio ops
struct compressed_bio pointer can be used instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Anand Jain
8140dc30a4 btrfs: btrfs_decompress_bio() could accept compressed_bio instead
Instead of sending each argument of struct compressed_bio, send
the compressed_bio itself.

Also by having struct compressed_bio in btrfs_decompress_bio()
it would help tracing.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Nikolay Borisov
d2006e6d28 btrfs: Refactor update_space_info
Following the factoring out of the creation code update_space_info can
only be called for already-existing space_info structs. As such it
cannot fail.  Remove superfluous error handling and make the function
return void.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Nikolay Borisov
2be12ef79f btrfs: Separate space_info create/update
Currently the struct space_info creation code is intermixed in the
update_space_info function. There are well-defined points at which
we actually want to create brand-new space_info structs (e.g. during
mount of the filesystem as well as sometimes when adding/initialising
new chunks). In such cases update_space_info is called with 0 as the
bytes parameter. All of this makes for spaghetti code.

Fix it by factoring out the creation code into a separate
create_space_info function. This also allows simplifying the internals.
Also remove BUG_ON from do_alloc_chunk since the callers handle errors.
Furthermore it will make the update_space_info function not fail,
allowing us to remove error handling in callers. This will come in a
follow up patch.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Liu Bo
555ba411aa Btrfs: let btrfs_print_leaf print more about block group
This adds chunk_objectid and flags, with flags we can recognize whether
the block group is about data or metadata.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Liu Bo
28785f70ef Btrfs: skip commit transaction if we don't have enough pinned bytes
We commit transaction in order to reclaim space from pinned bytes because
it could process delayed refs, and in may_commit_transaction(), we check
first if pinned bytes are enough for the required space, we then check if
that plus bytes reserved for delayed insert are enough for the required
space.

This changes the code to the above logic.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reported-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba
4e2814ef04 btrfs: scrub: simplify cleanup of wr_ctx in scrub_free_ctx
We don't need to take the mutex and zero out wr_cur_bio, as this is
called after the scrub finished.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba
e241ddeb9c btrfs: scrub: inline helper scrub_free_wr_ctx
The helper scrub_free_wr_ctx is used only once and fits into
scrub_free_ctx as it continues sctx shutdown, no need to keep it
separate.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba
8fcdac3f20 btrfs: scrub: inline helper scrub_setup_wr_ctx
The helper scrub_setup_wr_ctx is used only once and fits into
scrub_setup_ctx as it continues initialization, no need to keep it
separate.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Jeff Mahoney
c1c4919b11 btrfs: remove root usage from can_overcommit
can_overcommit using the root to determine the allocation profile
is the only use of a root in the call graph below reserve_metadata_bytes.

It turns out that we only need to know whether the allocation is for
the chunk root or not -- and we can pass that around as a bool instead.

This allows us to pull root usage out of the reservation path all the
way up to reserve_metadata_bytes itself, which uses it only to compare
against fs_info->chunk_root to set the bool.  In turn, this eliminates
a bunch of races where we use a particular root too early in the mount
process.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Jeff Mahoney
1b86826d12 btrfs: cleanup root usage by btrfs_get_alloc_profile
There are two places where we don't already know what kind of alloc
profile we need before calling btrfs_get_alloc_profile, but we need
access to a root everywhere we call it.

This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
and relegates btrfs_system_alloc_profile to a static for use in those
two cases.  The next patch will eliminate one of those.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
e03733da5a btrfs: fix bool type in btrfs_page_exists_in_range
We use only a simple bool indicator, int is not a problem here.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
c9fed2bb61 btrfs: remove unused member list from btrfs_end_io_wq
The end io work queue items have been tracked by the work queues since
"Btrfs: Add async worker threads for pre and post IO checksumming"
(8b71284292) (2008).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
ee4ea69852 btrfs: remove unused members dir_path from recorded_ref
The two members do not seem to be used since the initial commit.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
b297c9f68f btrfs: remove unused member list from async_submit_bio
The list used to track checksums in the early version (2.6.29), but I
was not able to pinpoint the commit that stopped using it. Everything
apparently works without it for a long time.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
106204f191 btrfs: remove unused member err from reada_extent
Seems to be unused since the initial commit, we ignore readahead errors
anyway, the full read will handle that if necessary.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Sahil Kang
0bef71093d btrfs: Remove unnecessary branching in free-space-tree.c
Both btrfs_create_free_space_tree and btrfs_clear_free_space_tree
contain:

  if (ret)
          return ret;

  return 0;

The if statement is only false when ret equals zero, and since we return
zero in such cases, we can safely remove the branching.

Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
e477094f0d Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial
We only pass GFP_NOFS to btrfs_bio_clone_partial, so let's hardcode it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Arnd Bergmann
3c91ee6964 Btrfs: work around maybe-uninitialized warning
A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:

fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Where the 'bio' variable was previously initialized unconditionally, it
is now set in the "while (submit_len > 0)" loop that would never execute
if submit_len is zero.

Assuming this cannot happen in practice, we can avoid the warning
by simply replacing the while{} loop with a do{}while() loop so
the compiler knows that it will always be entered at least once.
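
The pattern can be illustrated with a small standalone function. demo_last_chunk is hypothetical and unrelated to the btrfs code; it assumes chunk is greater than zero. Because the body of a do/while loop always runs once, the variable assigned inside it is provably set before use.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split submit_len into pieces of at most 'chunk' bytes and return the
 * size of the last piece. Assumes chunk > 0.
 */
static int demo_last_chunk(size_t submit_len, size_t chunk)
{
	int last;	/* deliberately not initialized up front */

	do {
		size_t n = submit_len < chunk ? submit_len : chunk;

		last = (int)n;
		submit_len -= n;
	} while (submit_len > 0);

	return last;
}
```

The trade-off is that the body also runs once when submit_len is zero, which is why the change relies on that case not happening in practice.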

Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to
simplify DIO submit".

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
3892ac9086 Btrfs: unify naming of btrfs_io_bio
All dio endio functions are using io_bio for struct btrfs_io_bio, this
makes btrfs_submit_direct follow this convention.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
11b5616516 Btrfs: check-integrity use bvec_iter
Some check-integrity code depends on bio->bi_vcnt, this changes it to use
bio segments because some bios passing here may not have a reliable
bi_vcnt.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
629ebf4fad Btrfs: record error if one block has failed to retry
In the nocsum case of dio read endio, it returns immediately if an error
gets returned when repairing, which leaves the rest of the blocks
unrepaired.  The behavior is different from how buffered read endio works
in the same case.  This changes it to record the error only and go on
repairing the rest of the blocks.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
17347cec15 Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

For dio reads which don't get split, we also need to save a copy of the
bio iterator in btrfs_bio_clone to let __bio_for_each_segment access
each bvec in dio read's endio.  Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
725130bac5 Btrfs: use bio_clone_bioset_partial to simplify DIO submit
Currently when mapping bio to limit bio to a single stripe length, we
split bio by adding page to bio one by one, but later we don't modify
the vector of bio at all, thus we can use bio_clone_fast to use the
original bio vector directly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Liu Bo
2f8e914042 Btrfs: new helper btrfs_bio_clone_partial
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Liu Bo
015c1bd9f1 Btrfs: use bio_clone_fast to clone our bio
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.

Right now we use bio_clone_bioset to create a clone bio by iterating over
bi_io_vec to initialize it.  This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.
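
The difference between the two cloning styles can be sketched in userspace as a deep copy versus a pointer-sharing copy. All demo_* names are hypothetical, and the owns_vec field exists only so this sketch knows what to free; it is not part of the block layer API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_vec {
	void *page;
	unsigned int len;
};

struct demo_bio {
	struct demo_vec *io_vec;
	unsigned short vcnt;
	int owns_vec;	/* sketch-only: whether io_vec must be freed */
};

/* Copying clone: allocate a new vector and copy every entry. */
static int demo_clone_copy(struct demo_bio *dst, const struct demo_bio *src)
{
	dst->io_vec = malloc(src->vcnt * sizeof(*src->io_vec));
	if (!dst->io_vec)
		return -1;
	memcpy(dst->io_vec, src->io_vec, src->vcnt * sizeof(*src->io_vec));
	dst->vcnt = src->vcnt;
	dst->owns_vec = 1;
	return 0;
}

/* Fast clone: share the original vector, copy only the pointer. */
static void demo_clone_fast(struct demo_bio *dst, const struct demo_bio *src)
{
	dst->io_vec = src->io_vec;
	dst->vcnt = src->vcnt;
	dst->owns_vec = 0;
}

/* Free the vector only if this bio allocated it. */
static void demo_bio_release(struct demo_bio *bio)
{
	if (bio->owns_vec)
		free(bio->io_vec);
	bio->io_vec = NULL;
}
```

The fast variant is only safe while the original vector outlives the clone and neither side modifies it.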

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
7870d0822b Btrfs: don't pass the inode through clean_io_failure
Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
6ec656bc0f btrfs: remove inode argument from repair_io_failure
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
c6100a4b4e Btrfs: replace tree->mapping with tree->private_data
For extent_io trees we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks.  This works fine when everything that has
an extent_io_tree has an inode.  But we are going to remove the
btree_inode, so we need to change this.  Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead.  This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Sargun Dhillon
2723480a0f btrfs: Add quota_override knob into sysfs
This patch adds the read-write attribute quota_override into sysfs.
Any process which has CAP_SYS_RESOURCE can set this flag to on, and
once it is set to true, processes with CAP_SYS_RESOURCE can exceed
the quota.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Sargun Dhillon
f29efe2921 btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE
This patch introduces the quota override flag to btrfs_fs_info, and a
change to quota limit checking code to temporarily allow for quota to be
overridden for processes with CAP_SYS_RESOURCE.

It's useful for administrative programs, such as log rotation, that may
need to temporarily use more disk space in order to free up a greater
amount of overall disk space without yielding more disk space to the
rest of userland.

Eventually, we may want to add the idea of an operator-specific quota,
operator reserved space, or something else to allow for administrative
override, but this is perhaps the simplest solution.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Nikolay Borisov
a5ed45f822 btrfs: Convert fs_info->free_chunk_space to atomic64_t
The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else.  Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.
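
A userspace analogue of the conversion, using C11 atomics in place of the kernel's atomic64_t. The demo_* names are hypothetical and the sketch only illustrates that a lone counter needs no dedicated lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The counter is atomic, so no separate lock guards it. */
struct demo_fs_info {
	_Atomic uint64_t free_chunk_space;
};

static void demo_add_free_space(struct demo_fs_info *fs, uint64_t bytes)
{
	atomic_fetch_add(&fs->free_chunk_space, bytes);
}

static void demo_sub_free_space(struct demo_fs_info *fs, uint64_t bytes)
{
	atomic_fetch_sub(&fs->free_chunk_space, bytes);
}

static uint64_t demo_read_free_space(struct demo_fs_info *fs)
{
	return atomic_load(&fs->free_chunk_space);
}
```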

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Anand Jain
401b41e5a8 btrfs: add framework to handle device flush error as a volume
This adds comments to the flush error handling part of the code, and
hopes to maintain the same logic with a framework which can be used to
handle the errors at the volume level.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Daichou
6b349dfe80 Btrfs: remove obsolete FIXMEs in qgroup ioctls
These FIXMEs were already addressed in 2013. All functions check for
qgroup existence:

* btrfs_add_qgroup_relation
* btrfs_ioctl_qgroup_create
* btrfs_limit_qgroup
* btrfs_del_qgroup_relation

Signed-off-by: Daichou <tommy0705c@gmail.com>
[ enhance and reformat changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Dan Carpenter
97d038562a Btrfs: remove an unused variable
"item" is never used.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
Fabian Frederick
977ec79271 btrfs: kmap() can't fail
Remove NULL test on kmap() as it will always return a valid pointer.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
NeilBrown
011067b056 blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
Linus Torvalds
54ed0f71f0 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a bug on sparc where we may dereference freed stack memory"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15 17:54:51 +09:00
Jens Axboe
8f66439eec Linux 4.12-rc5

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Linus Torvalds
66cea28a94 Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.

  We've been hitting an early enospc problem on production machines that
  Omar tracked down to an old int->u64 mistake. I waited a bit on this
  pull to make sure it was really the problem from production, but it's
  on ~2100 hosts now and I think we're good.

  Omar also noticed a commit in the queue would make new early ENOSPC
  problems. I pulled that out for now, which is why the top three
  commits are younger than the rest.

  Otherwise these are all fixes, some explaining very old bugs that
  we've been poking at for a while"

* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix delalloc accounting leak caused by u32 overflow
  Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
  btrfs: tree-log.c: Wrong printk information about namelen
  btrfs: fix race with relocation recovery and fs_root setup
  btrfs: fix memory leak in update_space_info failure path
  btrfs: use correct types for page indices in btrfs_page_exists_in_range
  btrfs: fix incorrect error return ret being passed to mapping_set_error
  btrfs: Make flush bios explicitely sync
  btrfs: fiemap: Cache and merge fiemap extent before submit it to user
2017-06-10 11:06:05 -07:00
Omar Sandoval
70e7af244f Btrfs: fix delalloc accounting leak caused by u32 overflow
btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().

In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().

The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:

    WARN_ON(fs_info->delalloc_block_rsv.size > 0);
    WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
    WARN_ON(space_info->bytes_pinned > 0 ||
            space_info->bytes_reserved > 0 ||
            space_info->bytes_may_use > 0);

Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.
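
The overflow can be reproduced in userspace with a small sketch. This is not the kernel function: the helper names and `SKETCH_MAX_LEVEL` are assumptions for illustration (the value 8 matches the 16k-item threshold quoted above for a 16kB nodesize), and it assumes a platform where `unsigned int` is 32 bits wide.

```c
#include <stdint.h>

/* Assumed constant for this sketch, chosen to match the arithmetic above. */
#define SKETCH_MAX_LEVEL 8u

/* Buggy form: every operand is 32 bits wide, so the product wraps
 * modulo 2^32 before it is widened to the 64-bit return type. */
static uint64_t calc_metadata_size_u32(uint32_t nodesize, uint32_t num_items)
{
    return nodesize * SKETCH_MAX_LEVEL * 2u * num_items;
}

/* Fixed form: casting nodesize to 64 bits first makes the whole
 * multiplication happen in 64-bit arithmetic. */
static uint64_t calc_metadata_size_u64(uint32_t nodesize, uint32_t num_items)
{
    return (uint64_t)nodesize * SKETCH_MAX_LEVEL * 2u * num_items;
}
```

With nodesize = 16384 and num_items = 16384 the true product is exactly 2^32, so the 32-bit version loses the whole value while the 64-bit version keeps it.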

Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:36 -07:00
Liu Bo
452e62b71f Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
Before this change we used 'filled' mode here, i.e. the EXTENT_DEFRAG
bits were cleared only if the whole range was filled with them. If the
defrag range joins the adjacent delalloc range, the EXTENT_DEFRAG bits
stay in extent_state until this inode's pages are released, and that
prevents extent_data from being freed.

This clears the bit if any was found within the ordered extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:29 -07:00
Su Yue
286b92f43c btrfs: tree-log.c: Wrong printk information about namelen
verify_dir_item is meant to print the name_len of the dir_item, but it
actually prints data_len.

Fix it by calling btrfs_dir_name_len instead of btrfs_dir_data_len.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:07 -07:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep around at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
4055351cdb fs: remove the unused error argument to dio_end_io()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
David Miller
d41519a69b crypto: Work around deallocated stack frame reference gcc bug on sparc.
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory.  The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.

It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.

For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:

        return  %i7+8
         lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],

%o5 holds the base of the on-stack area allocated for the shash
descriptor.  But the return released the stack frame and the
register window.

So if an interrupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.

Add a data compiler barrier to work around this problem.  This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)

With crucial insight from Eric Sandeen.

Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-06-08 17:36:03 +08:00
Jeff Mahoney
a9b3311ef3 btrfs: fix race with relocation recovery and fs_root setup
If we have to recover relocation during mount, we'll ultimately have to
evict the orphan inode.  That goes through the reservation dance, where
priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
to be valid.  That's the next thing to be set up during mount, so we
crash, almost always in flush_space trying to join the transaction
but priority_reclaim_metadata_space is possible as well.  This call
path has been problematic in the past WRT whether ->fs_root is valid
yet.  Commit 957780eb27 (Btrfs: introduce ticketed enospc
infrastructure) added new users that are called in the direct path
instead of the async path that had already been worked around.

The thing is that we don't actually need the fs_root, specifically, for
anything.  We either use it to determine whether the root is the
chunk_root for use in choosing an allocation profile or as a root to pass
btrfs_join_transaction before immediately committing it.  Anything that
isn't the chunk root works in the former case and any root works in
the latter.

A simple fix is to use a root we know will always be there: the
extent_root.

Cc: <stable@vger.kernel.org> # v4.8+
Fixes: 957780eb27 (Btrfs: introduce ticketed enospc infrastructure)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:55 +02:00
Jeff Mahoney
896533a7da btrfs: fix memory leak in update_space_info failure path
If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.

Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:31 +02:00
David Sterba
cc2b702c52 btrfs: use correct types for page indices in btrfs_page_exists_in_range
Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)

What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail

The function will return false, ie. "there's no page in the range",
although there is at least one.
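
The failing comparison can be sketched in userspace as follows. The helper names are hypothetical, 4 KiB pages are assumed, and the truncated index is modelled with `uint32_t` (dropping the high bits) rather than `int` so that the sketch has well-defined behaviour.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Buggy form: the end index is kept in a 32-bit variable, so for
 * offsets at or above 1 << 44 the high bits are silently dropped. */
static int page_found_buggy(uint64_t page_index, uint64_t end_offset)
{
    uint32_t end_idx = (uint32_t)(end_offset >> SKETCH_PAGE_SHIFT);

    return page_index <= end_idx;
}

/* Fixed form: the end index keeps its full width. */
static int page_found_fixed(uint64_t page_index, uint64_t end_offset)
{
    uint64_t end_idx = end_offset >> SKETCH_PAGE_SHIFT;

    return page_index <= end_idx;
}
```

For a page just below the 16TiB boundary and a range end just above it, the buggy check reports that no page is in the range, while the fixed check finds it.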

btrfs_page_exists_in_range is used to prevent races in:

* in hole punching, where we make sure there are no pages in the
  truncated range, otherwise we'll wait for them to finish and redo
  truncation, but we're going to replace the pages with holes anyway so
  the only problem is the intermediate state

* lock_extent_direct: we want to make sure there are no pages before we
  lock and start DIO, to prevent stale data reads

For practical occurrence of the bug, there are several constraints.  The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match.  Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.

DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.

CC: Liu Bo <bo.li.liu@oracle.com>
CC: stable@vger.kernel.org	# 3.16+
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:17 +02:00
Colin Ian King
bff5baf8aa btrfs: fix incorrect error return ret being passed to mapping_set_error
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.

Detected by CoverityScan, CID#1414312 ("Logically dead code")

Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:42:10 +02:00
Jan Kara
8d91012528 btrfs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Fixes: b685d3d65a
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:42:01 +02:00
Qu Wenruo
4751832da9 btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG]
Cycle-mounting btrfs can cause fiemap to return a different result.
Like:
 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..31]:         25088..25119        32   0x0
   1: [32..63]:        25120..25151        32   0x0
   2: [64..95]:        25152..25183        32   0x0
   3: [96..127]:       25184..25215        32   0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=0 len=64k)
   |        |  Found on-disk (ino, EXTENT_DATA, 0)
   |        |- add_extent_mapping()
   |        |- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=16k len=48k)
   |        |  Found on-disk (ino, EXTENT_DATA, 16k)
   |        |- add_extent_mapping()
   |        |  |- try_merge_map()
   |        |     Merge with previous em start=0 len=16k
   |        |     resulting em start=0 len=32k
   |        |- Return (em->start=0, len=32K)    << Merged result
   |- Strip off the unrelated range (0~16K) of the returned em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
      ^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

We always try to merge the current fiemap extent into the fiemap_cache
before calling fiemap_fill_next_extent().
Only when the current fiemap extent cannot be merged with the cached one
do we call fiemap_fill_next_extent() to submit the cached one.

So by this method, we can merge all fiemap extents.

It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
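
The cache-and-merge idea can be sketched in userspace like this. It is a simplified model, not the btrfs code: the struct and function names are hypothetical, extent flags are left out entirely, and writing to an output array stands in for the call to fiemap_fill_next_extent().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record for this sketch; flags are omitted. */
struct sketch_extent {
    uint64_t logical;
    uint64_t physical;
    uint64_t len;
};

/* Keep one cached extent. Merge the next extent into it when it is
 * contiguous both logically and physically; otherwise emit the cached
 * extent and start caching the new one. Returns the number emitted. */
static size_t sketch_merge_extents(const struct sketch_extent *in, size_t n,
                                   struct sketch_extent *out)
{
    struct sketch_extent cached = {0, 0, 0};
    int have_cached = 0;
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (have_cached &&
            cached.logical + cached.len == in[i].logical &&
            cached.physical + cached.len == in[i].physical) {
            cached.len += in[i].len; /* extend the cached extent */
            continue;
        }
        if (have_cached)
            out[written++] = cached; /* could not merge: submit cached */
        cached = in[i];
        have_cached = 1;
    }
    if (have_cached)
        out[written++] = cached; /* flush the last cached extent */
    return written;
}
```

Fed the four 32-block extents from the example output above, this yields the single merged extent [0..127] at 25088..25215.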

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:41:53 +02:00
Linus Torvalds
1176032cb1 Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has fixes and cleanups Dave Sterba collected for the merge
  window.

  The biggest functional fixes are between btrfs raid5/6 and scrub, and
  raid5/6 and device replacement. Some of our pending qgroup fixes are
  included as well while I bash on the rest in testing.

  We also have the usual set of cleanups, including one that makes
  __btrfs_map_block() much more maintainable, and conversions from
  atomic_t to refcount_t"

* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits)
  btrfs: fix the gfp_mask for the reada_zones radix tree
  Btrfs: fix reported number of inode blocks
  Btrfs: send, fix file hole not being preserved due to inline extent
  Btrfs: fix extent map leak during fallocate error path
  Btrfs: fix incorrect space accounting after failure to insert inline extent
  Btrfs: fix invalid attempt to free reserved space on failure to cow range
  btrfs: Handle delalloc error correctly to avoid ordered extent hang
  btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  btrfs: check if the device is flush capable
  btrfs: delete unused member nobarriers
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: Use ktime_get_real_ts for root ctime
  Btrfs: handle only applicable errors returned by btrfs_get_extent
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  btrfs: use q which is already obtained from bdev_get_queue
  Btrfs: switch to div64_u64 if with a u64 divisor
  Btrfs: update scrub_parity to use u64 stripe_len
  Btrfs: enable repair during read for raid56 profile
  btrfs: use clear_page where appropriate
  ...
2017-05-10 08:33:17 -07:00
Michal Hocko
19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Chris Mason
9bcaaea741 btrfs: fix the gfp_mask for the reada_zones radix tree
Commits cc8385b59e and 7ef70b4d99 added preallocation for the
reada radix trees and also switched them over to GFP_KERNEL for the
default gfp mask.

Since we're doing radix tree insertions under spinlocks, we need
to make sure the mask doesn't allow sleeping.  This fix keeps
the radix preallocation but switches back to the original gfp_mask.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-05-04 16:56:11 -07:00
Linus Torvalds
694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similarly to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable prioritization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how we handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived its usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Linus Torvalds
28b2013587 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have one more fix for btrfs.

  This gets rid of a new WARN_ON from rc1 that ended up making more
  noise than we really want. The larger fix for the underflow got
  delayed a bit and it's better for now to put it under
  CONFIG_BTRFS_DEBUG"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: qgroup: move noisy underflow warning to debugging build
2017-04-28 10:13:17 -07:00
Chris Mason
bce19f9d23 Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12
2017-04-27 14:13:09 -07:00
Filipe Manana
a7e3b975a0 Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
inode. The latter covers bytes that both fall within allocated regions
of the file and holes.

Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)

  # The following should have reported 64K...
  $ du -h /mnt/sdc/foo1
  128K	/mnt/sdc/foo1

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo1
  64K	/mnt/sdc/foo1

  $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 65536
  64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)

  # The following should have reported 128K...
  $ du -h /mnt/sdc/foo2
  192K	/mnt/sdc/foo2

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo2
  128K	/mnt/sdc/foo2

So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.

Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.
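
The accounting difference can be sketched with two small helpers. This is an illustrative model only: the function names are hypothetical, block-size alignment is ignored, and the byte counts come from the foo1 and foo2 scenarios above.

```c
#include <stdint.h>

/* Old behaviour: all delalloc bytes are added, including those that
 * overwrite ranges the inode already accounts for. */
static uint64_t reported_bytes_old(uint64_t inode_bytes,
                                   uint64_t delalloc_bytes)
{
    return inode_bytes + delalloc_bytes;
}

/* New behaviour: only delalloc bytes that fall within holes or beyond
 * eof (the new counter) are added. */
static uint64_t reported_bytes_new(uint64_t inode_bytes,
                                   uint64_t new_delalloc_bytes)
{
    return inode_bytes + new_delalloc_bytes;
}
```

For foo1, 64K is already allocated and the 64K buffered overwrite lies entirely inside it, so the old sum is 128K and the new one is 64K. For foo2, 128K is accounted and the 64K write lands in the preallocated range, giving 192K before and 128K after.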

Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DELALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimize it a lot by shortening that time
window, as the new delalloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:26 +01:00
Filipe Manana
e1cbfd7bf6 Btrfs: send, fix file hole not being preserved due to inline extent
Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar

In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7b
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.

So teach the incremental send code to deal with this particular case.

The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:25 +01:00
Filipe Manana
be2d253cc9 Btrfs: fix extent map leak during fallocate error path
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2017-04-26 16:27:24 +01:00
Filipe Manana
1c81ba237b Btrfs: fix incorrect space accounting after failure to insert inline extent
When using compression, if we fail to insert an inline extent we
incorrectly end up attempting to free the reserved data space twice,
once through extent_clear_unlock_delalloc(), because we pass it the
flag EXTENT_DO_ACCOUNTING, and once through a direct call to
btrfs_free_reserved_data_space_noquota(). This results in a trace
like the following:

[  834.576240] ------------[ cut here ]------------
[  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[  834.596103] Call Trace:
[  834.596103]  dump_stack+0x67/0x90
[  834.596103]  __warn+0xc2/0xdd
[  834.596103]  warn_slowpath_null+0x1d/0x1f
[  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
[  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
[  834.596103]  async_cow_start+0x32/0x4d [btrfs]
[  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
[  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[  834.596103]  process_one_work+0x273/0x4e4
[  834.596103]  worker_thread+0x1eb/0x2ca
[  834.596103]  ? rescuer_thread+0x2b6/0x2b6
[  834.596103]  kthread+0x100/0x108
[  834.596103]  ? __list_del_entry+0x22/0x22
[  834.596103]  ret_from_fork+0x2e/0x40
[  834.611656] ---[ end trace 719902fe6bdef08f ]---

So fix this by not calling btrfs_free_reserved_data_space_noquota()
directly if an error happened.
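
The double free can be modelled in userspace as follows. This is a
sketch under stated assumptions: the struct, the helper names and the
"warn when freeing more than is reserved" rule are hypothetical
stand-ins for the kernel accounting, chosen only to show why releasing
the same reservation through two paths trips a warning.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the data space counter. */
struct toy_space_info {
    uint64_t bytes_may_use;
    int underflow_warnings;
};

/* Release len bytes; count a warning instead of wrapping around. */
static void toy_free_reserved(struct toy_space_info *s, uint64_t len)
{
    if (s->bytes_may_use < len) {
        s->underflow_warnings++;    /* assumed analogue of the WARNING */
        s->bytes_may_use = 0;
        return;
    }
    s->bytes_may_use -= len;
}

/*
 * Error path after a failed inline extent insertion. The first release
 * models the accounting done while clearing the range; the second one
 * models the extra direct call that the fix removes.
 */
static int toy_inline_extent_error(int also_free_directly)
{
    struct toy_space_info s = { .bytes_may_use = 4096 };

    toy_free_reserved(&s, 4096);
    if (also_free_directly)
        toy_free_reserved(&s, 4096);
    return s.underflow_warnings;
}
```

With the direct call kept, toy_inline_extent_error(1) reports one
warning; without it, toy_inline_extent_error(0) reports none.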

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:23 +01:00
Filipe Manana
a315e68f6e Btrfs: fix invalid attempt to free reserved space on failure to cow range
When attempting to COW a file range (we are starting writeback and doing
COW), if we manage to reserve an extent for the range we will write into
but fail after reserving it and before creating the respective ordered
extent, we end up in an error path where we attempt to decrement the
data space's bytes_may_use counter after we already did it while
reserving the extent, leading to a warning/trace like the following:

[  847.621524] ------------[ cut here ]------------
[  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
[  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  847.648601] Call Trace:
[  847.648601]  dump_stack+0x67/0x90
[  847.648601]  __warn+0xc2/0xdd
[  847.648601]  warn_slowpath_null+0x1d/0x1f
[  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
[  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
[  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
[  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
[  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
[  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
[  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
[  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
[  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
[  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  ? mark_lock+0x24/0x201
[  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
[  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
[  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
[  847.648601]  do_writepages+0x23/0x2c
[  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
[  847.648601]  filemap_fdatawrite_range+0x13/0x15
[  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
[  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
[  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
[  847.648601]  vfs_fsync_range+0x8c/0x9e
[  847.648601]  vfs_fsync+0x1c/0x1e
[  847.648601]  do_fsync+0x31/0x4a
[  847.648601]  SyS_fsync+0x10/0x14
[  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
[  847.648601] RIP: 0033:0x7f5b05200800
[  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
[  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
[  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
[  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
[  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
[  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
[  847.685787] ---[ end trace 2a4a3e15382508e8 ]---

So fix this by not attempting to decrement the data space info's
bytes_may_use counter if we already reserved the extent and an error
happened before creating the ordered extent. We are already correctly
freeing the reserved extent if an error happens, so there's no additional
measure needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:22 +01:00
Qu Wenruo
524272607e btrfs: Handle delalloc error correctly to avoid ordered extent hang
[BUG]
If run_delalloc_range() returns an error after some ordered extents
have already been created, btrfs will hang with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|                       |<---------- cleanup range --------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by the error handler of run_delalloc_range(),
which doesn't handle any ordered extents that were already created,
leaving them waiting for btrfs_finish_ordered_io() to finish.

However, after run_delalloc_range() returns an error,
__extent_writepage() won't submit a bio, so
btrfs_writepage_end_io_hook() won't be triggered except for the first
page, and btrfs_finish_ordered_io() won't be triggered for the created
ordered extents either.

So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.

[FIX]
Introduce the btrfs_cleanup_ordered_extents() function to clean up the
created ordered extents and finish them manually.

The function is based on the existing
btrfs_endio_direct_write_update_ordered() function, modified to act
just like btrfs_writepage_end_io_hook() but to handle a specified range
rather than a single page.

After the fix, a delalloc error will be handled like this:

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------  ----------->|<------ old error handler --------->|
 ||          ||
 ||          \_=> Cleaned up by cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()
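
The split shown in the chart can be sketched as a small range
computation. This is an illustration only: toy_range and
toy_cleanup_range are hypothetical names, and the assumption is that
the first page is left to the existing page-level handler while the
remainder of the range is handed to the new cleanup helper.

```c
#include <assert.h>
#include <stdint.h>

struct toy_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Given a range [start, start + len) and the page size, return the part
 * that the cleanup helper would be asked to finish, i.e. everything
 * after the first page. A zero length means nothing is left to clean.
 */
static struct toy_range toy_cleanup_range(uint64_t start, uint64_t len,
                                          uint64_t page_size)
{
    struct toy_range r = { start, 0 };

    if (len <= page_size)
        return r;

    r.start = start + page_size;
    r.len = len - page_size;
    return r;
}
```

For a 16K range starting at 0 with 4K pages, the helper would be given
the 12K starting at offset 4096.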

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:21 +01:00
Qu Wenruo
4dbd80fb91 btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
[BUG]
When btrfs_reloc_clone_csum() reports an error, it can cause a metadata
underflow and lead to a kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under a certain block group layout and with error injection.

a) Create one data block group with one 4K extent in it.
   This avoids the bug that hangs btrfs due to an ordered extent which
   never finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handle errors from
btrfs_reloc_clone_csum() incorrectly:

(The ASCII chart shows a more generic case of this bug than the one
mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So the error handler, which calls extent_clear_unlock_delalloc() with
the EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and
btrfs_finish_ordered_io() will both cover OE n and free its metadata,
causing a metadata underflow.

[FIX]
The fix is to ensure that, after calling btrfs_add_ordered_extent(), we
only call the error handler after increasing the iteration offset, so
that the cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range
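
The effect of advancing the offset before jumping to the error handler
can be sketched as below. The function and its parameters are
hypothetical, equally sized extents are assumed for simplicity, and the
loop is a model of the iteration rather than the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Walk a delalloc range in extent_len steps. The ordered extent with
 * index fail_at is created at [cur, cur + extent_len) and a later step
 * for it then fails. Return where the cleanup range would start.
 */
static uint64_t toy_cleanup_start(uint64_t start, uint64_t extent_len,
                                  unsigned int fail_at, int advance_first)
{
    uint64_t cur = start;
    unsigned int i;

    for (i = 0; i < fail_at; i++)
        cur += extent_len;      /* earlier extents succeeded */

    /* The failing extent already has an ordered extent at cur. */
    if (advance_first)
        cur += extent_len;      /* fixed ordering: skip past it */
    return cur;
}
```

With 4K extents and a failure on the third one, the old ordering
starts cleanup at 8192, which overlaps the ordered extent at
[8192, 12288), while the fixed ordering starts at 12288.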

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:21 +01:00
Jan Kara
9e11ceee23 btrfs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
David Sterba
338bd52f3c btrfs: qgroup: move noisy underflow warning to debugging build
The WARN_ON and warning from report_reserved_underflow can become very
noisy and are visible unconditionally, although they are mainly for
debugging. The patch "btrfs: Add WARN_ON for qgroup reserved underflow"
(18dc22c19b) went to 4.11-rc1 and the plan
was to get the fix as well, but this hasn't happened.

CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-19 12:40:49 +02:00
Anand Jain
c2a9c7ab47 btrfs: check if the device is flush capable
The block layer call chain from submit_bio will check if the write cache
is enabled for the given queue before submitting the flush. This patch
adds code to fail fast if it's not.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog to reflect current code state, blkdev_issue_flush is
  not used yet ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 16:13:27 +02:00
Anand Jain
13e88e1560 btrfs: delete unused member nobarriers
The last consumer of nobarriers was removed by commit [1] and sync
won't fail with EOPNOTSUPP anymore. Thus, when the write cache is
write-through, the flush now just returns success without actually
passing such a request to the block device/LUN.

[1]
commit b25de9d6da
block: remove BIO_EOPNOTSUPP

And, as the device/LUN write cache state may change dynamically, saving
such a state won't help either. So delete the member nobarriers.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 16:12:07 +02:00
Qu Wenruo
28d70e237d btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 filesystem which has recoverable data corruption
(only one data stripe is corrupted), scrub will sometimes report more
csum errors than expected. Sometimes even an unrecoverable error will be
reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct number of csum errors should be 16 (assuming x86_64).
A larger number of csum errors is reported with a 1/3 chance,
and an unrecoverable error is reported with a 1/10 chance.
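
The figure of 16 follows from the reproducer if one checksum covers one
4K sector, which is the assumption behind "x86_64" here: 64K of
corrupted data divided by 4K gives 16 failing checksums. A one-line
sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Number of failing checksums when each checksum covers one sector. */
static unsigned int toy_expected_csum_errors(uint64_t corrupted_bytes,
                                             uint32_t sectorsize)
{
    return (unsigned int)(corrupted_bytes / sectorsize);
}
```

Any count above toy_expected_csum_errors(64 * 1024, 4096) therefore
points at an error being accounted more than once.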

The root cause of the problem is that the RAID5/6 recovery code has a
race condition, due to the fact that a full scrub is initiated per device.

For other mirror-based profiles, each mirror is independent of the
others, so the race won't cause any big problem.

For example:
        Corrupted       |       Correct          |      Correct        |
|   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
------------------------------------------------------------------------
Read out D1             |Read out D2             |Read full stripe     |
Check csum              |Check csum              |Check parity         |
Csum mismatch           |Csum match, continue    |Parity mismatch      |
handle_errored_block    |                        |handle_errored_block |
 Read out full stripe   |                        | Read out full stripe|
 D1 csum error(err++)   |                        | D1 csum error(err++)|
 Recover D1             |                        | Recover D1          |

So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection, and a race can happen.

In an even worse case, for example when D1's recovery code is re-writing
D1/D2/P while P's recovery code is just reading out the full stripe, we
can cause an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so no extra csum errors or unrecoverable
errors are reported.

Reported-by: Goffredo Baroncelli <kreijack@libero.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Qu Wenruo
0966a7b130 btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

Without proper protection, this can easily cause a race condition.

Introduce 2 new functions, lock_full_stripe() and unlock_full_stripe(),
for RAID5/6. They store an rb_tree of mutexes for full stripes, so scrub
callers can use them to lock a full stripe to avoid the race.
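
For two scrub workers to serialize on the same mutex, both must derive
the same key from whatever address they are working on. A plausible key
is the logical start of the full stripe; the sketch below assumes that
choice. The function name and parameters are hypothetical, and it
requires bytenr >= bg_start and a non-zero full_stripe_len.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round bytenr down to the start of the full stripe that contains it,
 * relative to the start of its block group.
 */
static uint64_t toy_full_stripe_start(uint64_t bg_start,
                                      uint64_t full_stripe_len,
                                      uint64_t bytenr)
{
    return bg_start + (bytenr - bg_start) / full_stripe_len * full_stripe_len;
}
```

For a 3-device RAID5 with 64K stripes the full stripe is 128K long, so
addresses in the first and second data stripe of one full stripe map to
the same key, while the next full stripe gets a different one.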

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Deepa Dinamani
fa7aede2ab btrfs: Use ktime_get_real_ts for root ctime
btrfs_root_item maintains the ctime for root updates.  This is not part
of vfs_inode.

Since current_time() takes a struct inode * as an argument, as Linus
suggested, it cannot be used to update root times unless we modify
the signature to use inode.

Since btrfs uses nanosecond time granularity, it can also use
ktime_get_real_ts directly to obtain timestamp for the root. It is
necessary to use the timespec time api here because the same
btrfs_set_stack_timespec_*() apis are used for vfs inode times as well.
These can be transitioned to using timespec64 when btrfs internally
changes to use timespec64 as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Dan Carpenter
9986277e0e Btrfs: handle only applicable errors returned by btrfs_get_extent
btrfs_get_extent() never returns NULL pointers, so this code introduces
a static checker warning.

The btrfs_get_extent() is a bit complex, but trust me that it doesn't
return NULLs and also if it did we would trigger the BUG_ON(!em) before
the last return statement.
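
The distinction the static checker cares about is between a NULL
pointer and an error encoded in the pointer value. The helpers below
are simplified userspace re-creations written for this illustration
(the toy_ prefix marks them as hypothetical); they are not the kernel
macros themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode the errno value again. */
static long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True only for pointers in the top TOY_MAX_ERRNO addresses. */
static int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}
```

Note that toy_is_err(NULL) is false: a function that only ever returns
valid or error-encoded pointers needs the error check alone, and an
extra NULL test on its result is dead code.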

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ updated subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Qu Wenruo
82bafb38c2 btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
[BUG]
The easiest way to reproduce the bug is:
------
 # mkfs.btrfs -f $dev -n 16K
 # mount $dev $mnt -o inode_cache
 # btrfs quota enable $mnt
 # btrfs quota rescan -w $mnt
 # btrfs qgroup show $mnt
qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
             ^^ Twice the correct value
------

The fstests btrfs qgroup test group can easily detect this with the
inode_cache mount option, although some of the failures are false
alerts since old test cases use fixed golden output, while new test
cases use "btrfs check" to detect qgroup mismatches.

[CAUSE]
The inode_cache mount option makes commit_fs_roots() call
btrfs_save_ino_cache() to update fs/subvol trees and generate new
delayed refs.

However, we call btrfs_qgroup_prepare_account_extents() too early, before
commit_fs_roots().
This makes the "old_roots" for newly generated extents always NULL.
In the extent freeing case, this makes both new_roots and old_roots
empty, while the correct old_roots should not be empty.
This causes qgroup numbers not to be decreased correctly.

[FIX]
Modify the timing of calling btrfs_qgroup_prepare_account_extents() to
just before btrfs_qgroup_account_extents(), and add the needed
delayed_refs handler.
So qgroup can handle the inode_cache mount option correctly.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Anand Jain
e884f4f06e btrfs: use q which is already obtained from bdev_get_queue
We have already assigned q from bdev_get_queue() so use it.
Also rearrange the code for better readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
42c61ab676 Btrfs: switch to div64_u64 if with a u64 divisor
This is fixing code pieces where we use div_u64 when passing a u64 divisor.
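
Why the helper must match the divisor width can be shown with two
userspace stand-ins that mirror the parameter types (u64 dividend with
a u32 or u64 divisor). The toy_ functions are written for this
illustration and are not the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a helper that takes a 32-bit divisor. */
static uint64_t toy_div_u64(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/* Stand-in for a helper that takes a 64-bit divisor. */
static uint64_t toy_div64_u64(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}
```

A divisor of 2^32 + 2 is truncated to 2 when squeezed into 32 bits, so
2^33 divided by it yields 2^32 instead of 1. A divisor of exactly 2^32
would be truncated to 0 and divide by zero.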

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
972d721939 Btrfs: update scrub_parity to use u64 stripe_len
Commit 3d8da67817 ("Btrfs: fix divide error upon chunk's stripe_len")
changed stripe_len in struct map_lookup to u64, but didn't update
stripe_len in struct scrub_parity.

This updates the type and switches to div64_u64_rem to match u64 divisor.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
c725328c55 Btrfs: enable repair during read for raid56 profile
Now that scrub can fix data errors with the help of parity for the
raid56 profile, repair during read is able to do so as well.

Although the mirror num in the raid56 scenario has a different meaning,
i.e.
0 or 1: read data directly
> 1:    do recover with parity,
it can fit into how we repair a bad block during read.

The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
device and position on it.

Cc: David Sterba <dsterba@suse.cz>
Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
619a974292 btrfs: use clear_page where appropriate
There's a helper to clear a whole page, with arch-specific optimized
code. The replaced cases do not seem to be in performance-critical code,
but we still might get some percent gain.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
e501bfe323 btrfs: Prevent scrub recheck from racing with dev replace
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
bbio without protection of bio_counter.

This can lead to use-after-free if racing with dev replace cancel.

Fix it by increasing bio_counter before calling btrfs_map_sblock() and
decreasing the bio_counter when corresponding recover is finished.

Cc: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
ae6529c35b btrfs: Wait for in-flight bios before freeing target device for raid56
When raid56 dev-replace is cancelled by running scrub, we will free the
target device without waiting for in-flight bios, causing the following
NULL pointer dereference or general protection fault.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
 IP: generic_make_request_checks+0x4d/0x610
 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 task: ffff88002875b4c0 task.stack: ffffc90001334000
 RIP: 0010:generic_make_request_checks+0x4d/0x610
 Call Trace:
  ? generic_make_request+0xc7/0x360
  generic_make_request+0x24/0x360
  ? generic_make_request+0xc7/0x360
  submit_bio+0x64/0x120
  ? page_in_rbio+0x4d/0x80 [btrfs]
  ? rbio_orig_end_io+0x80/0x80 [btrfs]
  finish_rmw+0x3f4/0x540 [btrfs]
  validate_rbio_for_rmw+0x36/0x40 [btrfs]
  raid_rmw_end_io+0x7a/0x90 [btrfs]
  bio_endio+0x56/0x60
  end_workqueue_fn+0x3c/0x40 [btrfs]
  btrfs_scrubparity_helper+0xef/0x620 [btrfs]
  btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  process_one_work+0x2af/0x720
  ? process_one_work+0x22b/0x720
  worker_thread+0x4b/0x4f0
  kthread+0x10f/0x150
  ? process_one_work+0x720/0x720
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8

In btrfs_dev_replace_finishing(), we call
btrfs_rm_dev_replace_blocked() to wait for bios before destroying the
target device when scrub finishes normally.

However, when dev-replace is aborted, either due to an error or because
it was cancelled by scrub, we didn't wait for bios. This can lead to
use-after-free if there are bios holding the target device.

Furthermore, for raid56 scrub, at least 2 places are calling
btrfs_map_sblock() without protection of bio_counter, leading to the
problem.

This patch fixes the problem:
1) Wait for bio_counter before freeing target device when canceling
   replace
2) When calling btrfs_map_sblock() for raid56, use bio_counter to
   protect the call.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
9a33944bdf btrfs: scrub: Don't append on-disk pages for raid56 scrub
In the following situation, scrub will calculate a wrong parity and
overwrite the correct one with it:

RAID5 full stripe:

Before
|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:

|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The reason is that after the raid56 read rebuild, rbio->stripe_pages are
all correctly recovered (0xcd for data stripes).

However, when we check and repair parity in
scrub_parity_check_and_repair(), we append the pages in the
sparity->spages list to rbio->bio_pages[], which contain the old on-disk
data.

And when we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first, and only fall back to rbio->stripe_pages[] if
the page is not found in rbio->bio_pages[].

The patch fixes it by not appending pages from sparity->spages, so
finish_parity_scrub() will use rbio->stripe_pages[], which is correct.
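
The table above can be reproduced with one byte per stripe. The sketch
assumes plain XOR parity for RAID5 and models the "bio page first, then
stripe page" lookup with a nullable pointer; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Prefer the bio page if one was attached, else use the stripe page. */
static uint8_t toy_pick_byte(const uint8_t *bio_page,
                             const uint8_t *stripe_page)
{
    return bio_page ? *bio_page : *stripe_page;
}

/*
 * Compute the parity byte for the first 4K row of the example.
 * append_on_disk_page selects whether the stale on-disk copy of data
 * stripe 1 is attached as a bio page (old behaviour) or not (fixed).
 */
static uint8_t toy_scrub_parity(int append_on_disk_page)
{
    const uint8_t on_disk_d1 = 0x00;    /* stale, corrupted copy */
    const uint8_t recovered_d1 = 0xcd;  /* rebuilt stripe page */
    const uint8_t d2 = 0xcd;
    const uint8_t *bio_d1 = append_on_disk_page ? &on_disk_d1 : NULL;

    return (uint8_t)(toy_pick_byte(bio_d1, &recovered_d1) ^ d2);
}
```

With the stale page attached the result is 0xcd, the bad parity from
the "After" table; without it the result is 0x00, the correct parity
of two 0xcd data stripes.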

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
d51ea5dd22 btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint
Newly introduced qgroup reserved space trace points are normally nested
within several common qgroup operations.

However, some other trace points are not well placed to co-operate with
them, causing confusing output.

This patch re-arranges the trace_btrfs_qgroup_release_data() and
trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered
before the reserved space ones.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
3159fe7bae btrfs: qgroup: Add trace point for qgroup reserved space
Introduce the following trace points:
qgroup_update_reserve
qgroup_meta_reserve

These trace points are handy for tracing problems related to qgroup
reserved space.

Also export the btrfs_qgroup structure, as we now pass it directly to
the trace points.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
825ad4c964 btrfs: drop redundant parameters from btrfs_map_sblock
All callers pass 0 for mirror_num and 1 for need_raid_map.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
bcc8e07f9e btrfs: sink GFP flags parameter to tree_mod_log_insert_root
All (1) callers pass the same value.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
176ef8f5e6 btrfs: sink GFP flags parameter to tree_mod_log_insert_move
All (1) callers pass the same value.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
abad60c601 Btrfs: fix wrong failed mirror_num of read-repair on raid56
In the raid56 scenario, after trying parity recovery, we didn't set
mirror_num for the btrfs_bio with the failed mirror_num, hence
end_bio_extent_readpage() will report a random mirror_num in the dmesg
log.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
1bcd7aa17f Btrfs: set scrub page's io_error if failing to submit io
Scrub repairs data by the unit called scrub_block, which may contain
several pages.  Scrub always tries to look up a good copy of a whole
block, but if there's no such copy, it tries to do repair page by page.

If we don't set page's io_error when checking this bad copy, in the last
step, we may skip this page when repairing bad copy from good copy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
171938e528 btrfs: track exclusive filesystem operation in flags
There are several operations, usually started from ioctls, that cannot
run concurrently. The status is tracked in
mutually_exclusive_operation_running as an atomic_t. We can easily track
the status as one of the per-filesystem flag bits with same
synchronization guarantees.

The conversion replaces:

* atomic_xchg(..., 1)    ->   test_and_set_bit(FLAG, ...)
* atomic_set(..., 0)     ->   clear_bit(FLAG, ...)
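
The equivalence of the two forms can be sketched in userspace with C11
atomics. The bit number, the flags word and the toy_ helpers are
hypothetical stand-ins for the kernel bit operations; only the
"claim if not already claimed" semantics are the point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_FS_EXCL_OP 0    /* hypothetical flag bit number */

static atomic_ulong toy_fs_flags;

/* Atomically set bit nr and report whether it was already set. */
static bool toy_test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Atomically clear bit nr. */
static void toy_clear_bit(unsigned int nr, atomic_ulong *addr)
{
    atomic_fetch_and(addr, ~(1UL << nr));
}

/* Returns 0 if the exclusive operation was claimed, -1 if one is running. */
static int toy_start_exclusive_op(void)
{
    return toy_test_and_set_bit(TOY_FS_EXCL_OP, &toy_fs_flags) ? -1 : 0;
}

static void toy_finish_exclusive_op(void)
{
    toy_clear_bit(TOY_FS_EXCL_OP, &toy_fs_flags);
}
```

A second toy_start_exclusive_op() fails until toy_finish_exclusive_op()
has run, the same behaviour the atomic_xchg(..., 1) form provided,
without needing a dedicated atomic_t.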

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Goldwyn Rodrigues
48a89bc4f2 btrfs: qgroups: Retry after commit on getting EDQUOT
We are facing the same problem with EDQUOT that was experienced with
ENOSPC. Not sure if we require a full ticketing system such as for
ENOSPC, but here is a quick fix, which may be too big a hammer.

Quotas are reserved at the start of an operation, incrementing
qg->reserved. However, the reservation is written to disk in a
commit_transaction, which could take as long as commit_interval. In the
meantime there could be deletions which are not accounted for, because
deletions are accounted for only when committed (free_refroot). So, when
we get an EDQUOT, flush the data to disk and try again.
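
The retry can be sketched as a toy model. The struct fields, the commit
step and the single retry are assumptions made for illustration; the
sketch only shows that space freed by not-yet-committed deletions
becomes visible after a commit, so a second attempt can succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical qgroup model: frees only count after a commit. */
struct toy_qgroup {
    uint64_t limit;
    uint64_t reserved;
    uint64_t pending_free;
};

static int toy_reserve(struct toy_qgroup *qg, uint64_t bytes)
{
    if (qg->reserved + bytes > qg->limit)
        return -EDQUOT;
    qg->reserved += bytes;
    return 0;
}

/* Apply deletions that were waiting for a transaction commit. */
static void toy_commit(struct toy_qgroup *qg)
{
    if (qg->pending_free > qg->reserved)
        qg->reserved = 0;
    else
        qg->reserved -= qg->pending_free;
    qg->pending_free = 0;
}

/* On EDQUOT, commit once and try again. */
static int toy_reserve_with_retry(struct toy_qgroup *qg, uint64_t bytes)
{
    int ret = toy_reserve(qg, bytes);

    if (ret == -EDQUOT) {
        toy_commit(qg);
        ret = toy_reserve(qg, bytes);
    }
    return ret;
}

/* Limit 100, 90 reserved, then ask for 20 more. */
static int toy_scenario(uint64_t pending_free)
{
    struct toy_qgroup qg = {
        .limit = 100, .reserved = 90, .pending_free = pending_free,
    };

    return toy_reserve_with_retry(&qg, 20);
}
```

toy_scenario(50) succeeds after the commit, while toy_scenario(0) still
returns -EDQUOT because nothing was waiting to be freed.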

This fixes fstests btrfs/139.

Here is a sample script which shows this issue.

DEVICE=/dev/vdb
MOUNTPOINT=/mnt
TESTVOL=$MOUNTPOINT/tmp
QUOTA=5
PROG=btrfs
DD_BS="4k"
DD_COUNT="256"
RUN_TIMES=5000

mkfs.btrfs -f $DEVICE
mount -o commit=240 $DEVICE $MOUNTPOINT
$PROG subvolume create $TESTVOL
$PROG quota enable $TESTVOL
$PROG qgroup limit ${QUOTA}G $TESTVOL

typeset -i DD_RUN_GOOD
typeset -i QUOTA

function _check_cmd() {
        if [[ ${?} -ne 0 ]]; then
                echo -n "$(date) E: Running previous command"
                echo ${*}
                echo "Without sync"
                $PROG qgroup show -pcreFf ${TESTVOL}
                echo "With sync"
                $PROG qgroup show -pcreFf --sync ${TESTVOL}
                exit 1
        fi
}

while true; do
  DD_RUN_GOOD=$RUN_TIMES

  while (( ${DD_RUN_GOOD} != 0 )); do
        dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
        _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
        DD_RUN_GOOD=$((DD_RUN_GOOD - 1))
  done

  $PROG qgroup show -pcref $TESTVOL
  echo "----------- Cleanup ---------- "
  rm $TESTVOL/quotatest*

done

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Edmund Nadolski
de47c9d3ff btrfs: replace hardcoded value with SEQ_LAST macro
Define the SEQ_LAST macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.
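
A sketch of what such a definition looks like in userspace. The u64
typedef and the toy_is_special_seq() predicate are added for
illustration; what the special case does inside the ref search is not
modelled here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Named replacement for the bare (u64)-1 sentinel. */
#define SEQ_LAST ((u64)-1)

/* Hypothetical call-site check that now reads as intent, not magic. */
static int toy_is_special_seq(u64 time_seq)
{
    return time_seq == SEQ_LAST;
}
```

The value is unchanged (all 64 bits set); only the spelling at the call
sites becomes self-describing.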

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Edmund Nadolski
f58d88b336 btrfs: provide enumeration for __merge_refs mode argument
Replace hardcoded numeric values for __merge_refs 'mode' argument
with descriptive constants.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
f486135eba btrfs: remove unused qgroup members from btrfs_trans_handle
The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562d), there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
994a5d2bc7 btrfs: remove local blocksize variable in reada_find_extent
The name is misleading and the local variable serves no purpose.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
5721b8ad26 btrfs: remove redundant parameter from reada_start_machine_dev
We can read fs_info from dev.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
0ceaf28213 btrfs: remove redundant parameter from reada_find_zone
We can read fs_info from dev.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
d48d71aa99 btrfs: remove redundant parameter from btree_readahead_hook
We can read fs_info from eb.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
7ef70b4d99 btrfs: preallocate radix tree node for global readahead tree
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the global radix tree are initialized to
 GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, because readahead is optional and not on any
critical writeout path.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
cc8385b59e btrfs: preallocate radix tree node for readahead
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the per-device radix tree are initialized to
 GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, both because an allocation above already does
so and because readahead is optional and not on any critical writeout
path.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Goldwyn Rodrigues
4d339d0106 btrfs: No need to check !(flags & MS_RDONLY) twice
Code cleanup.
The code block is for !(*flags & MS_RDONLY). We don't need
to check it again.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
1a79c1f246 Btrfs: update comments in cache_save_setup
We also don't bother to flush the free space cache when the free space
tree is in use.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
539b50d2f6 Btrfs: convert BUG_ON to WARN_ON
These two BUG_ON()s can never be true; the callers' logic ensures it.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
2b19a1fef7 Btrfs: helper for ops that requires full stripe
This adds a helper to show directly whether ops require full stripe.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
6fad823f49 Btrfs: do not add extra mirror when dev_replace target dev is not available
With this, we can avoid allocating memory for dev replace copies if the
target dev is not available.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
73c0f22825 Btrfs: handle operations for device replace separately
Since this part is mostly independent, this moves it to a separate
function.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
5ab56090b8 Btrfs: introduce a function to get extra mirror from replace
As the part of getting extra mirror in __btrfs_map_block is
independent, this puts it into a separate function.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
0b3d4cd371 Btrfs: separate DISCARD from __btrfs_map_block
Since DISCARD is not as important as an operation like write, we don't
copy it to target device during replace, and it makes __btrfs_map_block
less complex.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
592d92eeab Btrfs: create a helper for getting chunk map
We have similar code in several places; this merges it into a helper.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
09ed2f165c Btrfs: add file item tracepoints
While debugging truncate problems, I found that these tracepoints could
help us quickly see what went wrong.

Two sets of tracepoints are created to track regular/prealloc file items
and inline file items respectively. Inline gets its own set because
inline file items have far less to report than regular ones.

This adds four tracepoints:
- btrfs_get_extent_show_fi_regular
- btrfs_get_extent_show_fi_inline
- btrfs_truncate_show_fi_regular
- btrfs_truncate_show_fi_inline

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ formatting adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
dec95574f4 btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.
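
The difference can be sketched in userspace C as follows. This is a
simplified, non-atomic model written for illustration only: the
demo_refcount type and its helpers are hypothetical names, and the real
refcount_t API is atomic and more involved. The point it shows is that a
saturating counter sticks at its maximum instead of wrapping to zero, so
an overflow turns into a leak rather than a premature free.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, non-atomic model of a saturating reference counter. */
struct demo_refcount {
	unsigned int refs;
};

void demo_refcount_set(struct demo_refcount *r, unsigned int n)
{
	r->refs = n;
}

/* Take a reference; stick at UINT_MAX instead of wrapping to 0. */
void demo_refcount_inc(struct demo_refcount *r)
{
	if (r->refs != UINT_MAX)
		r->refs++;
}

/* Drop a reference; return true when the last one is gone. */
bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	if (r->refs == UINT_MAX)
		return false;	/* saturated: never free, leak instead */
	if (r->refs == 0)
		return false;	/* underflow attempt: refuse to go below 0 */
	return --r->refs == 0;
}
```

With a plain wrapping counter, enough increments would bring the value
back to a small number, and a later decrement could free an object that
still has users.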

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
99f4cdb16f btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
78a764504d btrfs: convert scrub_parity.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
186debd6ed btrfs: convert scrub_block.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
6f615018b3 btrfs: convert scrub_recover.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
a50299ae7c btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
b7ac31b7b2 btrfs: convert extent_state.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
0700cea7c8 btrfs: convert btrfs_root.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
089e77e10d btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
6de5f18e7b btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
6df8cdf5bd btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
1e4f4714d5 btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
e76edab7f0 btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
490b54d6fb btrfs: convert extent_map.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
9b64f57ddf btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
140475ae4a btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
reference counter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Liu Bo
f95fda8751 Btrfs: remove ASSERT in btrfs_truncate_inode_items
After 76b42abbf7 ("Btrfs: fix data loss after truncate when using the
no-holes feature"), for either NO_HOLES or inline extents, we set
last_size to newsize to avoid data loss after a remount or after the
inode is evicted and read again. Thus, we don't need this check anymore.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Adam Borowski
1450612797 btrfs: fix a bogus warning when converting only data or metadata
If your filesystem has, eg, data:raid0 metadata:raid1, and you run "btrfs
balance -dconvert=raid1", the meta.target field will be uninitialized.
That's otherwise ok, as it's unused except for this warning.

Thus, let's use the existing set of raid levels for the comparison.

As a side effect, non-convert balances will now nag about data>metadata.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Linus Torvalds
4b31ac485d Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Dave Sterba collected a few more fixes for the last rc.

  These aren't marked for stable, but I'm putting them in with a batch
  we're testing/sending by hand for this release"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix potential use-after-free for cloned bio
  Btrfs: fix segmentation fault when doing dio read
  Btrfs: fix invalid dereference in btrfs_retry_endio
  btrfs: drop the nossd flag when remounting with -o ssd
2017-04-14 16:53:45 -07:00
Liu Bo
a967efb30b Btrfs: fix potential use-after-free for cloned bio
KASAN reports that there is a use-after-free case of bio in btrfs_map_bio.

If we need to submit IOs to several disks at a time, the original bio
is cloned and mapped to the destination disk, but we really should
use the original bio instead of a cloned bio to do the sanity check
because cloned bios are likely to be freed by their endio.

Reported-by: Diego <diegocg@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:56 +02:00
Liu Bo
97bf5a5589 Btrfs: fix segmentation fault when doing dio read
Commit 2dabb32484 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug while iterating over bio pages in the dio read endio
hook, and it could end in a segmentation fault of the dio reading task.

The cause is 'if (nr_sectors--)', which makes the code assume that
there is one more block in the same page. The page offset is therefore
increased, the bio created to repair the bad block gets an incorrect
bvec.bv_offset, and a later access of the page content triggers a
segmentation fault.
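
The off-by-one can be seen in isolation with the small sketch below. Both
helper names are hypothetical, and the second form is only one way to
avoid the problem, not necessarily the exact change made by this patch.
Each helper answers whether another block follows the current one in a
page that holds nr_sectors blocks.

```c
/*
 * Both helpers answer "is there another block in this page after the
 * current one?" for a page holding nr_sectors blocks. Names are
 * hypothetical.
 */

/* Buggy form: tests the old value, so 1 remaining block reads as "more". */
int more_blocks_post_decrement(int nr_sectors)
{
	if (nr_sectors--)
		return 1;
	return 0;
}

/* Decrement first, then test what is left. */
int more_blocks_pre_decrement(int nr_sectors)
{
	if (--nr_sectors)
		return 1;
	return 0;
}
```

With a single block in the page, the post-decrement form still reports a
following block, which is what pushes the page offset one block too far.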

This also adds ASSERT to check page offset against page size.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:29 +02:00
Liu Bo
2e949b0a55 Btrfs: fix invalid dereference in btrfs_retry_endio
When doing directIO repair, we have this oops:

[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374]  bio_endio+0xed/0x100
[ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224]  process_one_work+0x34d/0xb70
[ 1458.546570]  ? process_one_work+0x29e/0xb70
[ 1458.546938]  worker_thread+0x1cf/0x960
[ 1458.547263]  ? process_one_work+0xb70/0xb70
[ 1458.547624]  kthread+0x17d/0x180
[ 1458.547909]  ? kthread_create_on_node+0x70/0x70
[ 1458.548300]  ret_from_fork+0x31/0x40

It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.

This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.

Also clean up the unused @start (which is too trivial for a separate patch).

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:08 +02:00
Adam Borowski
951e796639 btrfs: drop the nossd flag when remounting with -o ssd
The opposite case was already handled right in the very next switch entry.
And also when turning on nossd, drop ssd_spread.
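
The intended flag handling can be sketched as plain bit operations. The
DEMO_OPT_* values and the two helpers are hypothetical and exist only for
this illustration; the kernel's mount option handling uses its own
definitions.

```c
/* Hypothetical flag bits; the kernel's mount option values differ. */
#define DEMO_OPT_SSD		(1u << 0)
#define DEMO_OPT_SSD_SPREAD	(1u << 1)
#define DEMO_OPT_NOSSD		(1u << 2)

/* -o ssd: enable ssd and drop the contradictory nossd flag. */
unsigned int demo_apply_ssd(unsigned int opts)
{
	opts |= DEMO_OPT_SSD;
	opts &= ~DEMO_OPT_NOSSD;
	return opts;
}

/* -o nossd: enable nossd and drop both ssd and ssd_spread. */
unsigned int demo_apply_nossd(unsigned int opts)
{
	opts |= DEMO_OPT_NOSSD;
	opts &= ~(DEMO_OPT_SSD | DEMO_OPT_SSD_SPREAD);
	return opts;
}
```

Clearing the opposite flags in the same step keeps the option set from
ever holding ssd and nossd at once after a remount.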

Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:48:59 +02:00
Linus Torvalds
fe8e12b503 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have three small fixes queued up in my for-linus-4.11 branch"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix an integer overflow check
  btrfs: Change qgroup_meta_rsv to 64bit
  Btrfs: bring back repair during read
2017-03-31 17:58:48 -07:00
Dan Carpenter
457ae7268b Btrfs: fix an integer overflow check
This isn't super serious because you need CAP_SYS_ADMIN to run this code.

I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks...  There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size.  The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting sctx->clone_roots later in the function after the
access_ok():

	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
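
A check along these lines can be sketched in userspace C. The struct
demo_clone_root layout and both helper names are assumptions for
illustration; the real struct clone_root has different fields, and this
is not claimed to be the exact condition used in the patch. The idea is
to compare the 64-bit count against a limit derived from ULONG_MAX, so
that sizing count + 1 elements cannot overflow.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; the real struct clone_root has other fields. */
struct demo_clone_root {
	void *root;
	uint64_t ino;
	uint64_t offset;
};

/*
 * Return 1 when (count + 1) elements can be sized without overflowing
 * unsigned long, the type access_ok() works on. The comparison is done
 * in 64 bits so nothing is truncated on 32-bit systems.
 */
int clone_sources_count_ok(uint64_t count)
{
	uint64_t limit = ULONG_MAX / sizeof(struct demo_clone_root) - 1;

	return count <= limit;
}

/* Only call after the check above has passed. */
size_t clone_roots_alloc_size(uint64_t count)
{
	return sizeof(struct demo_clone_root) * ((size_t)count + 1);
}
```

Dividing the maximum by the element size, rather than multiplying the
count, keeps the check itself free of overflow.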

Fixes: f5ecec3ce2 ("btrfs: send: silence an integer overflow warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:08 +02:00
Goldwyn Rodrigues
ce0dcee626 btrfs: Change qgroup_meta_rsv to 64bit
Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.

This affects exclusive qgroups only.
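
The effect of a too-narrow counter can be illustrated with the sketch
below. The helper names are hypothetical, and the 64 KiB per-reservation
size used in the usage note is an arbitrary figure chosen for the
example, not the amount the kernel actually reserves. The second helper
models two's-complement wraparound with well-defined unsigned arithmetic
instead of relying on signed overflow.

```c
#include <stdint.h>

/* Sum n reservations of `per` bytes each in a 64-bit counter. */
int64_t reserved_total_64(int n, int64_t per)
{
	int64_t total = 0;

	for (int i = 0; i < n; i++)
		total += per;
	return total;
}

/*
 * Show what a 32-bit signed counter would hold for the same total,
 * assuming two's-complement wraparound. Written with well-defined
 * unsigned arithmetic rather than relying on signed overflow.
 */
int32_t wrap_to_int32(int64_t v)
{
	uint32_t low = (uint32_t)v;	/* reduce modulo 2^32 */

	if (low <= (uint32_t)INT32_MAX)
		return (int32_t)low;
	return (int32_t)((int64_t)low - 4294967296LL);
}
```

For example, 44000 reservations of 65536 bytes total 2883584000 bytes,
which exceeds INT32_MAX; the same value seen through a 32-bit signed
counter is negative.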

TEST CASE:

DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp

umount $SUBVOL
umount $MOUNTPOINT

mkfs.btrfs -f $DEVICE
mount $DEVICE $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount $DEVICE $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL

btrfs quota rescan -w $MOUNTPOINT

for i in $(seq 1 44000); do
  dd if=/dev/zero of=$SUBVOL/test_$i bs=10k count=1
  if [[ $? -ne 0 ]]; then
     btrfs qgroup show -pcref $SUBVOL
     exit 1
  fi
done

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:08 +02:00
Liu Bo
9d0d1c8b1c Btrfs: bring back repair during read
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original semantics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.

This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest.  We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:07 +02:00
Linus Torvalds
131fbf4f9c Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Zygo tracked down a very old bug with inline compressed extents.

  I didn't tag this one for stable because I want to do individual
  tested backports. It's a little tricky and I'd rather do some extra
  testing on it along the way"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: add missing memset while reading compressed inline extents
  Btrfs: fix regression in lock_delalloc_pages
  btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
2017-03-23 11:39:33 -07:00
Zygo Blaxell
e1699d2d7b btrfs: add missing memset while reading compressed inline extents
This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption found in
the wild:

        item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                inode generation 50 transid 50 size 47424 nbytes 49141
                block group 0 mode 100644 links 1 uid 0 gid 0
                rdev 0 flags 0x0(none)
        item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                inode ref index 3 namelen 10 name: DB_File.so
        item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                inline extent data size 1341 ram 4085 compress(zlib)
        item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                extent data disk byte 5367308288 nr 20480
                extent data offset 0 nr 45056 ram 45056
                extent compression(zlib)

Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096.  The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.
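
The required zeroing can be sketched in userspace C as below. The
function and parameter names are hypothetical, and this is a simplified
model rather than the patch itself: it treats the page as a flat buffer
and clears everything past the end of the valid data.

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero everything in a page-sized buffer that lies past the valid data.
 * `page`, `data_len` and `page_size` are hypothetical names.
 */
void zero_page_tail(unsigned char *page, size_t data_len, size_t page_size)
{
	if (data_len < page_size)
		memset(page + data_len, 0, page_size - data_len);
}
```

With data_len = 4085 and page_size = 4096, this clears exactly the 11
bytes described above and leaves the first 4085 bytes untouched.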

Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:

Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:

	# touch foo
	# chattr +c foo
	# xfs_io -f -c "pwrite -W 0 1000" foo
	# xfs_io -f -c "falloc 4 8188" foo
	# od -x foo
	# echo 3 >/proc/sys/vm/drop_caches
	# od -x foo

This produces the following on my box:

Correct output:  file contains 1000 data bytes followed
by zeros:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
	0001760 0000 0000 0000 0000 0000 0000 0000 0000
	*
	0020000

Actual output:  the data after the first 1000 bytes
will be different each run:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-03-17 13:47:10 -07:00
Liu Bo
49d4a33472 Btrfs: fix regression in lock_delalloc_pages
This bug is a regression introduced by commit da2c7009f6 ("btrfs: teach
__process_pages_contig about PAGE_LOCK operation") and commit 76c0021db8
("Btrfs: use helper to simplify lock/unlock pages").

If the dirty pages under writeback were partially truncated before we
locked them, we could not find all pages mapping to the delalloc range.
The code did not return an error, so it carried on, found that the
delalloc range had been truncated, and went on to unlock the dirty
pages. The ASSERT then caught the error and showed:

-----------------------------------------------------------------------------
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-----------------------------------------------------------------------------

This fixes the bug by returning the proper -EAGAIN.

Cc: David Sterba <dsterba@suse.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-17 13:47:09 -07:00
Linus Torvalds
590dce2d49 Merge branch 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'statx()' update from Al Viro.

This adds the new extended stat() interface that internally subsumes our
previous stat interfaces, and allows user mode to specify in more detail
what kind of information it wants.

It also allows for some explicit synchronization information to be
passed to the filesystem, which can be relevant for network filesystems:
is the cached value ok, or do you need open/close consistency, or what?

From David Howells.

Andreas Dilger points out that the first version of the extended statx
interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

* 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  statx: Add a system call to make enhanced file info available
2017-03-03 11:38:56 -08:00
Linus Torvalds
1827adb11a Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the
  <linux/sched.h> header footprint, to speed up the kernel build and to
  have a cleaner header structure.

  After these changes the new <linux/sched.h>'s typical preprocessed
  size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
  lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I
  eliminated quirks, backmerged fixes plus I rebased it to an upstream
  SHA1 from yesterday that includes most changes queued up in -next plus
  all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs,
  and did a bisectability test at a number of random points.

  I tried to test as many build configurations as possible, but some
  build breakage is probably still left - but it should be mostly
  limited to architectures that have no cross-compiler binaries
  available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 10:16:38 -08:00
David Howells
a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were too ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabel might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
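
As a rough illustration of the lstat() emulation mentioned above, the
sketch below queries a path without following a trailing symlink.  It
assumes a userspace where glibc 2.28 or later supplies a statx() wrapper,
struct statx and the STATX_* constants via <sys/stat.h>; the helper name
statx_nofollow is made up for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <string.h>    /* memset */
#include <sys/stat.h>  /* statx(), struct statx, STATX_* (assumes glibc >= 2.28) */

/*
 * Hypothetical helper: lstat()-style query.  Resolves 'path' relative to
 * the current directory, does not follow a trailing symlink, and asks for
 * the basic stat() field set.  Returns 0 on success, -1 with errno set on
 * failure.
 */
static int statx_nofollow(const char *path, struct statx *stx)
{
	memset(stx, 0, sizeof(*stx));
	return statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
		     STATX_BASIC_STATS, stx);
}
```

A caller would declare a struct statx, pass its address, and inspect
stx_mask before trusting any individual field.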

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.
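
The three synchronisation modes above can be tried with a small helper
such as the one sketched here, which requests only the file size and lets
the caller pick the mode.  It assumes glibc 2.28 or later for the statx()
wrapper and the AT_STATX_* constants; size_with_sync_mode is a
hypothetical name, not part of any API.

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD, AT_STATX_* */
#include <sys/stat.h>  /* statx(), struct statx, STATX_SIZE (assumes glibc >= 2.28) */

/*
 * Hypothetical helper: fetch only stx_size for 'path' using the given
 * synchronisation mode (AT_STATX_SYNC_AS_STAT, AT_STATX_FORCE_SYNC or
 * AT_STATX_DONT_SYNC).  Returns 0 and stores the size on success, -1 if
 * statx() failed or the filesystem did not report a size.
 */
static int size_with_sync_mode(const char *path, int sync_mode,
			       unsigned long long *size)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, sync_mode, STATX_SIZE, &stx) != 0)
		return -1;
	if (!(stx.stx_mask & STATX_SIZE))
		return -1;	/* size was not provided */
	*size = stx.stx_size;
	return 0;
}
```

On a local filesystem the modes are expected to behave alike; the
difference should only become visible on a network filesystem, where
AT_STATX_DONT_SYNC may return a cached, approximate size.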

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};
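
The 256-byte buffer size quoted earlier follows from this layout.  As a
check of the arithmetic, the sketch below mirrors the two structures with
fixed-width <stdint.h> types and asserts the offsets and sizes at compile
time.  The *_sketch names and the un-prefixed spare field names are
stand-ins for this example, not kernel names.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */
#include <stdint.h>

struct statx_timestamp_sketch {
	int64_t  tv_sec;
	int32_t  tv_nsec;
	int32_t  reserved;
};

struct statx_sketch {
	uint32_t stx_mask;		/* offset   0 */
	uint32_t stx_blksize;
	uint64_t stx_attributes;
	uint32_t stx_nlink;
	uint32_t stx_uid;
	uint32_t stx_gid;
	uint16_t stx_mode;
	uint16_t spare0[1];
	uint64_t stx_ino;		/* offset  32 */
	uint64_t stx_size;
	uint64_t stx_blocks;
	uint64_t spare1[1];
	struct statx_timestamp_sketch stx_atime;	/* offset  64 */
	struct statx_timestamp_sketch stx_btime;
	struct statx_timestamp_sketch stx_ctime;
	struct statx_timestamp_sketch stx_mtime;
	uint32_t stx_rdev_major;	/* offset 128 */
	uint32_t stx_rdev_minor;
	uint32_t stx_dev_major;
	uint32_t stx_dev_minor;
	uint64_t spare2[14];		/* offset 144, 112 bytes */
};

static_assert(sizeof(struct statx_timestamp_sketch) == 16, "timestamp is 16 bytes");
static_assert(offsetof(struct statx_sketch, stx_ino) == 32, "first 32 bytes");
static_assert(offsetof(struct statx_sketch, stx_atime) == 64, "timestamps start at 64");
static_assert(offsetof(struct statx_sketch, stx_rdev_major) == 128, "device numbers at 128");
static_assert(offsetof(struct statx_sketch, spare2) == 144, "spare space at 144");
static_assert(sizeof(struct statx_sketch) == 256, "whole record is 256 bytes");
```

The sum is 32 + 32 + 4*16 + 16 + 14*8 = 256, with no padding needed
because every 64-bit member already falls on an 8-byte boundary.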

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
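
A tool that wants to display these attributes might decode stx_attributes
along the following lines.  This is only a sketch: it assumes glibc 2.28
or later exposes the STATX_ATTR_* constants through <sys/stat.h>, and the
describe_attrs helper and the label strings are invented for the example.

```c
#define _GNU_SOURCE
#include <stdio.h>     /* snprintf */
#include <sys/stat.h>  /* STATX_ATTR_* (assumes glibc >= 2.28) */

struct attr_name {
	unsigned long long bit;
	const char *name;
};

/* Labels are arbitrary; the order follows the lists in the text above. */
static const struct attr_name attr_names[] = {
	{ STATX_ATTR_COMPRESSED, "compressed" },
	{ STATX_ATTR_IMMUTABLE,  "immutable"  },
	{ STATX_ATTR_APPEND,     "append"     },
	{ STATX_ATTR_NODUMP,     "nodump"     },
	{ STATX_ATTR_ENCRYPTED,  "encrypted"  },
	{ STATX_ATTR_AUTOMOUNT,  "automount"  },
};

/*
 * Hypothetical helper: write a comma-separated list of the attribute
 * names set in 'attrs' into 'out'.  Stops early if the buffer is too
 * small.  Returns the number of characters written, excluding the NUL.
 */
static size_t describe_attrs(unsigned long long attrs, char *out, size_t outlen)
{
	size_t used = 0;

	if (outlen == 0)
		return 0;
	out[0] = '\0';
	for (size_t i = 0; i < sizeof(attr_names) / sizeof(attr_names[0]); i++) {
		int n;

		if (!(attrs & attr_names[i].bit))
			continue;
		n = snprintf(out + used, outlen - used, "%s%s",
			     used ? "," : "", attr_names[i].name);
		if (n < 0 || (size_t)n >= outlen - used)
			break;	/* output truncated */
		used += (size_t)n;
	}
	return used;
}
```

Bits the table does not know about are silently ignored, which keeps the
helper usable if further attribute flags are defined later.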

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.
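
Because a field may be fabricated or zeroed when its bit is absent from
stx_mask, a caller is expected to test the mask before using a value.  A
minimal sketch for the creation time, assuming glibc 2.28 or later for the
statx() wrapper (creation_time is a hypothetical helper name):

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD */
#include <sys/stat.h>  /* statx(), struct statx, STATX_BTIME (assumes glibc >= 2.28) */

/*
 * Hypothetical helper: returns 1 and stores the creation time (seconds)
 * if the filesystem supplied stx_btime, 0 if the call succeeded but no
 * creation time is available, and -1 if statx() itself failed.
 */
static int creation_time(const char *path, long long *sec)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BTIME, &stx) != 0)
		return -1;
	if (!(stx.stx_mask & STATX_BTIME))
		return 0;	/* field not provided by this filesystem */
	*sec = (long long)stx.stx_btime.tv_sec;
	return 1;
}
```

The same pattern applies to the class (1) fields: request the bit, then
check that it came back set.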

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Linus Torvalds
bbe08c0a43 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "Btrfs round two.

  These are mostly a continuation of Dave Sterba's collection of
  cleanups, but Filipe also has some bug fixes and performance
  improvements"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (69 commits)
  btrfs: add dummy callback for readpage_io_failed and drop checks
  btrfs: drop checks for mandatory extent_io_ops callbacks
  btrfs: document existence of extent_io ops callbacks
  btrfs: let writepage_end_io_hook return void
  btrfs: do proper error handling in btrfs_insert_xattr_item
  btrfs: handle allocation error in update_dev_stat_item
  btrfs: remove BUG_ON from __tree_mod_log_insert
  btrfs: derive maximum output size in the compression implementation
  btrfs: use predefined limits for calculating maximum number of pages for compression
  btrfs: export compression buffer limits in a header
  btrfs: merge nr_pages input and output parameter in compress_pages
  btrfs: merge length input and output parameter in compress_pages
  btrfs: constify name of subvolume in creation helpers
  btrfs: constify buffers used by compression helpers
  btrfs: constify input buffer of btrfs_csum_data
  btrfs: constify device path passed to relevant helpers
  btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
  btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
  btrfs: Make btrfs_add_nondir take btrfs_inode
  btrfs: Make btrfs_add_link take btrfs_inode
  ...
2017-03-02 16:03:00 -08:00
Ingo Molnar
f361bf4a66 sched/headers: Prepare for the reduction of <linux/sched.h>'s signal API dependency
Instead of including the full <linux/signal.h>, we are going to include the
types-only <linux/signal_types.h> header in <linux/sched.h>, to further
decouple the scheduler header from the signal headers.

This means that various files which relied on the full <linux/signal.h> need
to be updated to gain an explicit dependency on it.

Update the code that relies on sched.h's inclusion of the <linux/signal.h> header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Chris Mason
e9f467d028 Merge branch 'for-chris-4.11-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11
2017-02-28 14:35:09 -08:00
David Sterba
20a7db8ab3 btrfs: add dummy callback for readpage_io_failed and drop checks
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra call trade off does
not hurt.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba
20c9801d39 btrfs: drop checks for mandatory extent_io_ops callbacks
We know that readpage_end_io_hook, submit_bio_hook and merge_bio_hook are
always defined so we can drop the checks before we call them.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba
4d53dddbec btrfs: document existence of extent_io ops callbacks
Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops do always exist so we don't need to check the
existence before each call. This patch just reorders the definition and
documents which are mandatory/optional.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba
c3988d630a btrfs: let writepage_end_io_hook return void
There's no error path in any of the instances; they always return 0.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba
b9d04c607c btrfs: do proper error handling in btrfs_insert_xattr_item
The space check in btrfs_insert_xattr_item is duplicated in its caller
(do_setxattr) so we won't hit the BUG_ON. Continuing without any check
could be disastrous so turn it into proper error handling.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:27:11 +01:00
David Sterba
fa2529923d btrfs: handle allocation error in update_dev_stat_item
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:27:11 +01:00
David Sterba
047e5e17c1 btrfs: remove BUG_ON from __tree_mod_log_insert
All callers dereference the 'tm' parameter before it gets to this
function, so the NULL check does not make much sense here.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:27:11 +01:00
David Sterba
e5d7490236 btrfs: derive maximum output size in the compression implementation
The value of max_out can be calculated from the parameters passed to the
compressors, which is number of pages and the page size, and we don't
have to needlessly pass it around.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:36 +01:00
David Sterba
069eac7850 btrfs: use predefined limits for calculating maximum number of pages for compression
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba
ff7638665c btrfs: export compression buffer limits in a header
Move the buffer limit definitions out of compress_file_range.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba
4d3a800ebb btrfs: merge nr_pages input and output parameter in compress_pages
The parameter saying how many pages can be allocated at maximum can be
merged with the output page counter, to save some stack space.  The
compression implementation will sink the parameter to a local variable
so everything works as before.

The nr_pages variables can also be simply merged in compress_file_range
into one.
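
The in/out parameter pattern described here can be sketched in plain C.
The names below (compress_sketch, SKETCH_PAGE_SIZE) are invented for
illustration and this is not the btrfs compress_pages code: the caller
passes the page limit through a pointer, the callee copies ("sinks") it to
a local variable, and then writes back the number of pages actually used.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the example */

/*
 * On entry *nr_pages holds the maximum number of output pages; on return
 * it holds the number of pages used.  The "compression" is a stand-in
 * that pretends the output is half the input size.  Returns 0 on success
 * and -1 if the output would not fit in the allowed pages.
 */
static int compress_sketch(size_t in_len, unsigned long *nr_pages)
{
	const unsigned long max_pages = *nr_pages;	/* sink input to a local */
	const size_t out_len = (in_len + 1) / 2;
	const unsigned long used =
		(out_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	*nr_pages = 0;
	if (used > max_pages)
		return -1;
	*nr_pages = used;	/* same variable now carries the output */
	return 0;
}
```

The caller keeps a single variable on its stack instead of separate
"maximum" and "result" counters, which is the saving the patch is after.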

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba
38c3146408 btrfs: merge length input and output parameter in compress_pages
The length parameter is basically duplicated for input and output in the
top level caller of the compress_pages chain. We can simply use one
variable for that and reduce stack consumption. The compression
implementation will sink the parameter to a local variable so everything
works as before.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba
52f75f4fe7 btrfs: constify name of subvolume in creation helpers
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:08 +01:00
David Sterba
14a3357b40 btrfs: constify buffers used by compression helpers
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:07 +01:00
David Sterba
9ed573674a btrfs: constify input buffer of btrfs_csum_data
The function does not modify the input buffer, also update a typecast in
one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:07 +01:00
David Sterba
da353f6b30 btrfs: constify device path passed to relevant helpers
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:07 +01:00
Nikolay Borisov
0b581701d9 btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:12 +01:00
Nikolay Borisov
abcefb1eee btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:12 +01:00
Nikolay Borisov
cef415af20 btrfs: Make btrfs_add_nondir take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:12 +01:00
Nikolay Borisov
db0a669fb0 btrfs: Make btrfs_add_link take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
9e3e97f45c btrfs: Make btrfs_del_delalloc_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
fc4f21b1d8 btrfs: Make get_extent_t take btrfs_inode
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
1c8c9c5216 btrfs: Make check_extent_to_block take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
a2f392e401 btrfs: Make clone_update_extent_map take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
6fc0ef6870 btrfs: Make btrfs_clear_bit_hook take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
9cdc512410 btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov
19df27a9e4 btrfs: make btrfs_log_inode_parent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
aefa6115c0 btrfs: Make check_parent_dirs_for_sync take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
73f2e545b6 btrfs: Make btrfs_orphan_add take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
3d6ae7bb6a btrfs: make btrfs_orphan_del take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
7ab7956ec3 btrfs: make btrfs_free_io_failure_record take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
b30cb441fc btrfs: make clean_io_failure take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
9d4f7f8ad6 btrfs: make repair_io_failure take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov
f898ac6ae3 btrfs: make check_compressed_csum take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov
0970a22e58 btrfs: make btrfs_print_data_csum_error take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov
4ac1f4acd7 btrfs: make free_io_failure take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov
2cff578cfc btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov
85b7ab6705 btrfs: Make check_can_nocow take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov
a776c6fa1f btrfs: Make btrfs_lookup_ordered_range take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov
7a6d706795 btrfs: Make btrfs_mark_extent_written take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov
a012a74e78 btrfs: Make fill_holes take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov
35339c245b btrfs: Make hole_mergeable take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov
dcdbc059f0 btrfs: Make btrfs_drop_extent_cache take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov
46e5979183 btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
6158e1ce1c btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
691fa05967 btrfs: all btrfs_delalloc_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
9f3db423f9 btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
703b391a03 btrfs: Make btrfs_orphan_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
8ed7a2a0e0 btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
0e6bf9b13c btrfs: Make calc_csum_metadata_size take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
baa3ba39b9 btrfs: Make drop_outstanding_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
04f4f91653 btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
70ddc553b5 btrfs: make btrfs_is_free_space_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
6ef06d2790 btrfs: Make btrfs_i_size_write take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
877574e254 btrfs: Make btrfs_set_inode_index take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
4c570655f4 btrfs: make btrfs_set_inode_index_count take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
8e7611cf38 btrfs: Make btrfs_insert_dir_item take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
d0a0b78de4 btrfs: Make btrfs_log_all_parents take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:05 +01:00
Fabian Frederick
93407472a2 fs: add i_blocksize()
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.
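
The replacement amounts to wrapping the open-coded shift in a small
helper.  A userspace sketch of the idea, using a stand-in structure
(inode_sketch) that carries only the one field involved; the real struct
inode and helper live in the kernel headers and are not reproduced here:

```c
/* Stand-in for the single struct inode field used by the helper. */
struct inode_sketch {
	unsigned char i_blkbits;	/* log2 of the block size */
};

/* Illustrative equivalent of "1 << inode->i_blkbits" as a function. */
static inline unsigned int i_blocksize_sketch(const struct inode_sketch *node)
{
	return 1U << node->i_blkbits;
}
```

With i_blkbits set to 12 the helper yields a 4096-byte block size.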

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting a more appropriate function instead
of a macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
  Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Linus Torvalds
9003ed1fed Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has a series of fixes and cleanups that Dave Sterba has been
  collecting.

  There is a pretty big variety here, cleaning up internal APIs and
  fixing corner cases"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (124 commits)
  Btrfs: use the correct type when creating cow dio extent
  Btrfs: fix deadlock between dedup on same file and starting writeback
  btrfs: use btrfs_debug instead of pr_debug in transaction abort
  btrfs: btrfs_truncate_free_space_cache always allocates path
  btrfs: free-space-cache, clean up unnecessary root arguments
  btrfs: convert btrfs_inc_block_group_ro to accept fs_info
  btrfs: flush_space always takes fs_info->fs_root
  btrfs: pass fs_info to (more) routines that are only called with extent_root
  btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
  btrfs: remove unused parameter from adjust_slots_upwards
  btrfs: remove unused parameters from __btrfs_write_out_cache
  btrfs: remove unused parameter from cleanup_write_cache_enospc
  btrfs: remove unused parameter from __add_inode_ref
  btrfs: remove unused parameter from clone_copy_inline_extent
  btrfs: remove unused parameters from btrfs_cmp_data
  btrfs: remove unused parameter from __add_inline_refs
  btrfs: remove unused parameters from scrub_setup_wr_ctx
  btrfs: remove unused parameter from create_snapshot
  btrfs: remove unused parameter from init_first_rw_device
  btrfs: remove unused parameter from __btrfs_alloc_chunk
  ...
2017-02-25 14:53:58 -08:00
Dave Jiang
11bac80004 mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.
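
A rough userspace sketch of the signature change, assuming heavily
simplified stand-ins for struct vm_area_struct and struct vm_fault (only
the fields used here). The handler names and the range check are
hypothetical and exist only to show that the vma can be reached through
vmf.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct vm_area_struct {
	unsigned long vm_start;
	unsigned long vm_end;
};

struct vm_fault {
	struct vm_area_struct *vma; /* the vma already lives here */
	unsigned long address;
};

/* Old style: vma is passed separately even though vmf->vma holds it. */
static int demo_fault_old(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	return vmf->address >= vma->vm_start && vmf->address < vma->vm_end;
}

/* New style: the handler takes only vmf and reads the vma from it. */
static int demo_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	assert(vma != NULL);
	return vmf->address >= vma->vm_start && vmf->address < vma->vm_end;
}
```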

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Filipe Manana
263d3995c9 Btrfs: try harder to migrate items to left sibling before splitting a leaf
Before attempting to split a leaf we try to migrate items from the leaf to
its right and left siblings. We start by trying to move items into the
right sibling and, if the new item is meant to be inserted at the end of
our leaf, we try to free from our leaf an amount of bytes equal to the
number of bytes used by the new item, by setting the variable space_needed
to the byte size of that new item. However if we fail to move enough items
to the right sibling due to lack of space in that sibling, we then try
to move items into the left sibling, and in that case we try to free
an amount equal to the size of the new item from our leaf, when we need
only to free an amount corresponding to the size of the new item minus
the current free space of our leaf. So make sure that before we try to
move items to the left sibling we do set the variable space_needed with
a value corresponding to the new item's size minus the leaf's current
free space.
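
The arithmetic can be sketched as follows. The helper name is
hypothetical, and the precondition (the new item does not already fit in
the leaf) is an assumption made for this example, since a split is only
attempted in that case.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of bytes that must be freed from a leaf
 * before pushing items to the left sibling, so that a new item of
 * data_size bytes fits. leaf_free is the leaf's current free space.
 */
static int demo_space_needed_left(int data_size, int leaf_free)
{
	/* Assumption: the item does not fit yet, otherwise no split. */
	assert(leaf_free >= 0 && data_size > leaf_free);
	return data_size - leaf_free;
}
```

For a 300 byte item and a leaf with 100 bytes free, only 200 bytes need
to be moved out of the leaf rather than the full 300.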

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-24 00:39:44 +00:00
Filipe Manana
76b42abbf7 Btrfs: fix data loss after truncate when using the no-holes feature
Suppose a file has an implicit hole (NO_HOLES feature enabled) followed by
an extent, and buffered (delayed) writes were made to regions of the file
before the hole but have not yet been flushed. If we then truncate the
file to a smaller size that lies inside the hole, we end up persisting a
wrong disk_i_size value for our inode, which leads to data loss after
unmounting and mounting the filesystem again, or after the inode is
evicted and loaded again.

This happens because at inode.c:btrfs_truncate_inode_items() we end up
setting last_size to the offset of the extent that we deleted and that
followed the hole. We then pass that value to btrfs_ordered_update_i_size()
which updates the inode's disk_i_size to a value smaller than the offset
of the buffered (delayed) writes.

Example reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo
 $ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo
 $ xfs_io -c "truncate 60K" /mnt/foo
   --> inode's disk_i_size updated to 0

 $ md5sum /mnt/foo
 3c5ca3c3ab42f4b04d7e7eb0b0d4d806  /mnt/foo

 $ umount /dev/sdb
 $ mount /dev/sdb /mnt

 $ md5sum /mnt/foo
 d41d8cd98f00b204e9800998ecf8427e  /mnt/foo
   --> Empty file, all data lost!

Cc: <stable@vger.kernel.org>  # 3.14+
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-24 00:39:31 +00:00
Filipe Manana
82bfb2e7b6 Btrfs: incremental send, fix unnecessary hole writes for sparse files
When using the NO_HOLES feature, during an incremental send we often issue
write operations for holes when we should not, because that range is already
a hole in the destination snapshot. While that does not change the contents
of the file at the receiver, it prevents the preservation of file holes, leading
to wasted disk space and extra IO during send/receive.

A couple of examples where the holes are not preserved follow.

 $ mkfs.btrfs -O no-holes -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo
 $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar
 $ btrfs subvolume snapshot -r /mnt /mnt/snap1

 # Now add one new extent to our first test file, increasing its size and
 # leaving a 1Mb hole between the first extent and this new extent.
 $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo

 # Now overwrite the last extent of our second test file.
 $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar

 $ btrfs subvolume snapshot -r /mnt /mnt/snap2

 $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
 /mnt/snap2/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..7]:          25088..25095         8 0x2000
   1: [8..2055]:       hole              2048
   2: [2056..2063]:    24576..24583         8 0x2001

 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
 /mnt/snap2/bar:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..7]:          25096..25103         8 0x2000
   1: [8..2055]:       hole              2048
   2: [2056..2063]:    24584..24591         8 0x2001

  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap

  $ umount /mnt
  # It's not relevant to enable no-holes in the new filesystem.
  $ mkfs.btrfs -O no-holes -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ btrfs receive /mnt -f /tmp/1.snap
  $ btrfs receive /mnt -f /tmp/2.snap

  $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
  /mnt/snap2/foo:
  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
    0: [0..7]:          24576..24583         8 0x2000
    1: [8..2063]:       25624..27679      2056   0x1

  $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
  /mnt/snap2/bar:
  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
    0: [0..7]:          24584..24591         8 0x2000
    1: [8..2063]:       27680..29735      2056   0x1

The holes do not exist in the second filesystem and they were replaced
with extents filled with the byte 0x00, making each file take 1032Kb of
space instead of 8Kb.

So fix this by not issuing the write operations consisting of buffers
filled with the byte 0x00 when the destination snapshot already has a
hole for the respective range.

A test case for fstests will follow soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-02-24 00:39:21 +00:00
Filipe Manana
a9b9477db2 Btrfs: fix use-after-free due to wrong order of destroying work queues
Before we destroy all work queues (and wait for their tasks to complete)
we were destroying the work queues used for metadata I/O operations, which
can result in a use-after-free problem because most tasks from all work
queues do metadata I/O operations. For example, the tasks from the caching
workers work queue (fs_info->caching_workers), which is destroyed only
after the work queue used for metadata reads (fs_info->endio_meta_workers)
is destroyed, do metadata reads, which result in attempts to queue tasks
into the latter work queue, triggering a use-after-free with a trace like
the following:

[23114.613543] general protection fault: 0000 [#1] PREEMPT SMP
[23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic
acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16
jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
[23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1
[23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
[23114.616932] task: ffff880221d45780 task.stack: ffffc9000bc50000
[23114.616932] RIP: 0010:[<ffffffffa037c1bf>]  [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs]
[23114.616932] RSP: 0018:ffff88023f443d60  EFLAGS: 00010246
[23114.616932] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000102
[23114.616932] RDX: ffffffffa0419000 RSI: ffff88011df534f0 RDI: ffff880101f01c00
[23114.616932] RBP: ffff88023f443d80 R08: 00000000000f7000 R09: 000000000000ffff
[23114.616932] R10: ffff88023f443d48 R11: 0000000000001000 R12: ffff88011df534f0
[23114.616932] R13: ffff880135963868 R14: 0000000000001000 R15: 0000000000001000
[23114.616932] FS:  0000000000000000(0000) GS:ffff88023f440000(0000) knlGS:0000000000000000
[23114.616932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23114.616932] CR2: 00007f0fb9f8e520 CR3: 0000000001a0b000 CR4: 00000000000006e0
[23114.616932] Stack:
[23114.616932]  ffff880101f01c00 ffff88011df534f0 ffff880135963868 0000000000001000
[23114.616932]  ffff88023f443da0 ffffffffa03470af ffff880149b37200 ffff880135963868
[23114.616932]  ffff88023f443db8 ffffffff8125293c ffff880149b37200 ffff88023f443de0
[23114.616932] Call Trace:
[23114.616932]  <IRQ> [23114.616932]  [<ffffffffa03470af>] end_workqueue_bio+0xd5/0xda [btrfs]
[23114.616932]  [<ffffffff8125293c>] bio_endio+0x54/0x57
[23114.616932]  [<ffffffffa0377929>] btrfs_end_bio+0xf7/0x106 [btrfs]
[23114.616932]  [<ffffffff8125293c>] bio_endio+0x54/0x57
[23114.616932]  [<ffffffff8125955f>] blk_update_request+0x21a/0x30f
[23114.616932]  [<ffffffffa0022316>] scsi_end_request+0x31/0x182 [scsi_mod]
[23114.616932]  [<ffffffffa00235fc>] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
[23114.616932]  [<ffffffffa001ba9d>] scsi_finish_command+0x104/0x10d [scsi_mod]
[23114.616932]  [<ffffffffa002311f>] scsi_softirq_done+0x101/0x10a [scsi_mod]
[23114.616932]  [<ffffffff8125fbd9>] blk_done_softirq+0x82/0x8d
[23114.616932]  [<ffffffff814c8a4b>] __do_softirq+0x1ab/0x412
[23114.616932]  [<ffffffff8105b01d>] irq_exit+0x49/0x99
[23114.616932]  [<ffffffff81035135>] smp_call_function_single_interrupt+0x24/0x26
[23114.616932]  [<ffffffff814c7ec9>] call_function_single_interrupt+0x89/0x90
[23114.616932]  <EOI> [23114.616932]  [<ffffffffa0023262>] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932]  [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932]  [<ffffffff814c596c>] ? _raw_spin_unlock_irq+0x32/0x4a
[23114.616932]  [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932]  [<ffffffffa0023262>] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932]  [<ffffffff8125590e>] __blk_run_queue_uncond+0x22/0x2b
[23114.616932]  [<ffffffff81255930>] __blk_run_queue+0x19/0x1b
[23114.616932]  [<ffffffff8125ab01>] blk_queue_bio+0x268/0x282
[23114.616932]  [<ffffffff81258f44>] generic_make_request+0xbd/0x160
[23114.616932]  [<ffffffff812590e7>] submit_bio+0x100/0x11d
[23114.616932]  [<ffffffff81298603>] ? __this_cpu_preempt_check+0x13/0x15
[23114.616932]  [<ffffffff812a1805>] ? __percpu_counter_add+0x8e/0xa7
[23114.616932]  [<ffffffffa03bfd47>] btrfsic_submit_bio+0x1a/0x1d [btrfs]
[23114.616932]  [<ffffffffa0377db2>] btrfs_map_bio+0x1f4/0x26d [btrfs]
[23114.616932]  [<ffffffffa0348a33>] btree_submit_bio_hook+0x74/0xbf [btrfs]
[23114.616932]  [<ffffffffa03489bf>] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
[23114.616932]  [<ffffffffa03697a9>] submit_one_bio+0x6b/0x89 [btrfs]
[23114.616932]  [<ffffffffa036f5be>] read_extent_buffer_pages+0x170/0x1ec [btrfs]
[23114.616932]  [<ffffffffa03471fa>] ? free_root_pointers+0x64/0x64 [btrfs]
[23114.616932]  [<ffffffffa0348adf>] readahead_tree_block+0x3f/0x4c [btrfs]
[23114.616932]  [<ffffffffa032e115>] read_block_for_search.isra.20+0x1ce/0x23d [btrfs]
[23114.616932]  [<ffffffffa032fab8>] btrfs_search_slot+0x65f/0x774 [btrfs]
[23114.616932]  [<ffffffffa036eff1>] ? free_extent_buffer+0x73/0x7e [btrfs]
[23114.616932]  [<ffffffffa0331ba4>] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
[23114.616932]  [<ffffffffa0331e4f>] btrfs_next_leaf+0x10/0x12 [btrfs]
[23114.616932]  [<ffffffffa0336aa6>] caching_thread+0x22d/0x416 [btrfs]
[23114.616932]  [<ffffffffa037bce9>] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs]
[23114.616932]  [<ffffffffa037c036>] btrfs_cache_helper+0xe/0x10 [btrfs]
[23114.616932]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
[23114.616932]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
[23114.616932]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
[23114.616932]  [<ffffffff81072a81>] kthread+0xd5/0xdd
[23114.616932]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
[23114.616932]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
[23114.616932] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b
64 ff 74 04 f0 ff 43 58 49 83 7c 24 08 00 74 2c 4c 8d 6b
[23114.616932] RIP  [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs]
[23114.616932]  RSP <ffff88023f443d60>
[23114.689493] ---[ end trace 6e48b6bc707ca34b ]---
[23114.690166] Kernel panic - not syncing: Fatal exception in interrupt
[23114.691283] Kernel Offset: disabled
[23114.691918] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

The following diagram shows the sequence of operations that lead to the
use-after-free problem from the above trace:

        CPU 1                               CPU 2                                     CPU 3

                                       caching_thread()
 close_ctree()
   btrfs_stop_all_workers()
     btrfs_destroy_workqueue(
      fs_info->endio_meta_workers)

                                         btrfs_search_slot()
                                          read_block_for_search()
                                           readahead_tree_block()
                                            read_extent_buffer_pages()
                                             submit_one_bio()
                                              btree_submit_bio_hook()
                                               btrfs_bio_wq_end_io()
                                                --> sets the bio's
                                                    bi_end_io callback
                                                    to end_workqueue_bio()
                                               --> bio is submitted
                                                                                  bio completes
                                                                                  and its bi_end_io callback
                                                                                  is invoked
                                                                                   --> end_workqueue_bio()
                                                                                       --> attempts to queue
                                                                                           a task on fs_info->endio_meta_workers

     btrfs_destroy_workqueue(
      fs_info->caching_workers)

So fix this by destroying the queues used for metadata I/O tasks only
after destroying all the other queues.
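
A small userspace sketch of that teardown order, under the assumption
that each queue can simply be flagged as one used for metadata I/O. All
names here are hypothetical and this is not the kernel's
btrfs_stop_all_workers().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a work queue for this example only. */
struct demo_wq {
	const char *name;
	int is_meta_io; /* non-zero for queues used for metadata I/O */
	int destroyed;
};

/*
 * Destroy the queues in two passes: ordinary queues first, metadata I/O
 * queues last, so tasks from the first group can still queue metadata
 * I/O work while they drain. Records the destruction order in order[]
 * and returns how many queues were destroyed.
 */
static size_t demo_stop_all_workers(struct demo_wq *wqs, size_t n,
				    const char **order)
{
	size_t pos = 0;

	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < n; i++) {
			if ((wqs[i].is_meta_io != 0) != pass)
				continue;
			assert(!wqs[i].destroyed);
			wqs[i].destroyed = 1;
			order[pos++] = wqs[i].name;
		}
	}
	return pos;
}
```

Given endio_meta_workers (metadata I/O) and caching_workers in that array
order, the sketch destroys caching_workers first.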

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-24 00:38:56 +00:00
Filipe Manana
5cdd7db6c5 Btrfs: fix assertion failure when freeing block groups at close_ctree()
At close_ctree() we free the block groups and only afterwards wait for
any running worker kthreads to finish and shut down the workqueues. This
behaviour is racy and it triggers an assertion failure when freeing block
groups because while we are doing it we can have for example a block group
caching kthread running, and in that case the block group's reference
count can still be greater than 1 by the time we assert its reference count
is 1, leading to an assertion failure:

[19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799
[19041.200584] ------------[ cut here ]------------
[19041.201692] kernel BUG at fs/btrfs/ctree.h:3418!
[19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP
[19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1
[19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000
[19041.208082] RIP: 0010:[<ffffffffa03e319e>]  [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
[19041.208082] RSP: 0018:ffffc9000ad37d60  EFLAGS: 00010286
[19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001
[19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff
[19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000
[19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170
[19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100
[19041.208082] FS:  00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000
[19041.208082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0
[19041.208082] Stack:
[19041.208082]  ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630
[19041.208082]  ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8
[19041.208082]  ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d
[19041.208082] Call Trace:
[19041.208082]  [<ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs]
[19041.208082]  [<ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs]
[19041.208082]  [<ffffffff811a634d>] ? evict_inodes+0x132/0x141
[19041.208082]  [<ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs]
[19041.208082]  [<ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb
[19041.208082]  [<ffffffff8119004f>] kill_anon_super+0x12/0x1c
[19041.208082]  [<ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs]
[19041.208082]  [<ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68
[19041.208082]  [<ffffffff8118fb34>] deactivate_super+0x36/0x39
[19041.208082]  [<ffffffff811a9946>] cleanup_mnt+0x58/0x76
[19041.208082]  [<ffffffff811a99a2>] __cleanup_mnt+0x12/0x14
[19041.208082]  [<ffffffff81071573>] task_work_run+0x6f/0x95
[19041.208082]  [<ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1
[19041.208082]  [<ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2
[19041.208082]  [<ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad
[19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9
[19041.208082] RIP  [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
[19041.208082]  RSP <ffffc9000ad37d60>
[19041.279264] ---[ end trace 23330586f16f064d ]---

This started happening as of kernel 4.8, since commit f3bca8028b
("Btrfs: add ASSERT for block group's memory leak") introduced these
assertions.

So fix this by freeing the block groups only after waiting for all
worker kthreads to complete and shutting down the workqueues.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-24 00:38:27 +00:00
Filipe Manana
3168021cf9 Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled
We log holes explicitly by using file extent items, however when replaying
a log tree, if a logged file extent item corresponds to a hole and the
NO_HOLES feature is enabled we do not need to copy the file extent item
into the fs/subvolume tree, as the absence of such file extent items is
the purpose of the NO_HOLES feature. So skip the copying of file extent
items representing holes when the NO_HOLES feature is enabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-02-24 00:38:10 +00:00
Robbie Ko
91e1f56a8b Btrfs: fix leak of subvolume writers counter
When falling back from a nocow write to a regular cow write, we were
leaking the subvolume writers counter in 2 situations, preventing
snapshot creation from ever completing in the future, as it waits
for that counter to go down to zero before the snapshot creation
starts.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Improved changelog and subject]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-02-24 00:38:01 +00:00
Filipe Manana
6f546216e9 Btrfs: bulk delete checksum items in the same leaf
Very often we have the checksums for an extent spread in multiple items
in the checksums tree, and currently the algorithm to delete them starts
by looking for them one by one and then deleting them one by one, which
is not optimal since each deletion involves shifting all the other items
in the leaf and when the leaf reaches some low threshold, to move items
off the leaf into its left and right neighbor leafs. Also, after each
item deletion we release our search path and start a new search for other
checksum items.

So optimize this by deleting in bulk all the items in the same leaf that
contain checksums for the extent being freed.
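
A simplified userspace sketch of the idea, assuming a leaf can be modelled
as a sorted array of item keys. Real checksum items cover byte ranges and
live in a btree leaf with headers and data, so the function names and the
flat array are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Remove nr items starting at slot with a single shift of the tail. */
static size_t demo_del_items(uint64_t *keys, size_t nritems, size_t slot,
			     size_t nr)
{
	assert(slot + nr <= nritems);
	memmove(&keys[slot], &keys[slot + nr],
		(nritems - slot - nr) * sizeof(keys[0]));
	return nritems - nr;
}

/*
 * Delete all items whose key lies in [start, end) in one operation,
 * instead of searching for and deleting them one at a time.
 * keys[] must be sorted in ascending order. Returns the new item count.
 */
static size_t demo_bulk_del_range(uint64_t *keys, size_t nritems,
				  uint64_t start, uint64_t end)
{
	size_t first = 0;
	size_t nr = 0;

	while (first < nritems && keys[first] < start)
		first++;
	while (first + nr < nritems && keys[first + nr] < end)
		nr++;
	return demo_del_items(keys, nritems, first, nr);
}
```

The tail of the array is shifted once per leaf, no matter how many items
fall inside the range.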

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-24 00:36:55 +00:00
Robbie Ko
0191410158 Btrfs: incremental send, do not issue invalid rmdir operations
When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.

The following example scenario shows how this happens.

Parent snapshot:

 .
 |---- d259_old/               (ino 259, gen 9)
 |         |---- d1/           (ino 258, gen 9)
 |
 |---- f                       (ino 257, gen 9)

Send snapshot:

 .
 |---- d258/                   (ino 258, gen 7)
 |---- d259/                   (ino 259, gen 7)
         |---- d1/             (ino 257, gen 7)

When the kernel is processing inode 258 it notices that in both snapshots
there is an inode numbered 259 that is a parent of an inode 258. However
it ignores the fact that the inodes numbered 259 have different generations
in both snapshots, which means they are effectively different inodes.
Then it checks that both inodes 259 have a dentry named "d1" and because
of that it issues a rmdir operation with orphanized name of the inode 258
from the parent snapshot. This happens at send.c:process_recorded_refs(),
which calls send.c:did_overwrite_first_ref() that returns true and because
of that later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that has
the same number and is a directory too).
Fix this issue by comparing the generations of parent directory inodes
that have the same number and make send.c:did_overwrite_first_ref() return
false when the generations are different.

The following steps reproduce the problem.

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ touch /mnt/f
 $ mkdir /mnt/d1
 $ mkdir /mnt/d259_old
 $ mv /mnt/d1 /mnt/d259_old/d1
 $ btrfs subvolume snapshot -r /mnt /mnt/snap1
 $ btrfs send /mnt/snap1 -f /tmp/1.snap
 $ umount /mnt

 $ mkfs.btrfs -f /dev/sdc
 $ mount /dev/sdc /mnt
 $ mkdir /mnt/d1
 $ mkdir /mnt/dir258
 $ mkdir /mnt/dir259
 $ mv /mnt/d1 /mnt/dir259/d1
 $ btrfs subvolume snapshot -r /mnt /mnt/snap2
 $ btrfs receive /mnt/ -f /tmp/1.snap
 # Take note that once the filesystem is created, its current
 # generation has value 7 so the inodes from the second snapshot all have
 # a generation value of 7. And after receiving the first snapshot
 # the filesystem is at a generation value of 10, because the call to
 # create the second snapshot bumps the generation to 8 (the snapshot
 # creation ioctl does a transaction commit), the receive command calls
 # the snapshot creation ioctl to create the first snapshot, which bumps
 # the filesystem's generation to 9, and finally when the receive
 # operation finishes it calls an ioctl to transition the first snapshot
 # (snap1) from RW mode to RO mode, which does another transaction commit
 # and bumps the filesystem's generation to 10. This means all the inodes
 # in the first snapshot (snap1) have a generation value of 9.
 $ rm -f /tmp/1.snap
 $ btrfs send /mnt/snap1 -f /tmp/1.snap
 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
 $ umount /mnt

 $ mkfs.btrfs -f /dev/sdd
 $ mount /dev/sdd /mnt
 $ btrfs receive /mnt -f /tmp/1.snap
 $ btrfs receive -vv /mnt -f /tmp/2.snap
 receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
 utimes
 unlink f
 mkdir o257-7-0
 mkdir o259-7-0
 rename o257-7-0 -> o259-7-0/d1
 chown o259-7-0/d1 - uid=0, gid=0
 chmod o259-7-0/d1 - mode=0755
 utimes o259-7-0/d1
 rmdir o258-9-0
 ERROR: rmdir o258-9-0 failed: No such file or directory

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-02-24 00:36:45 +00:00
Filipe Manana
fe9c798dbf Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we are not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only does this result in unnecessarily delaying a
rename operation, but it can also later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.

Here follows an example where this happens.

Parent snapshot:

 .                                                  (ino 256, gen 3)
 |--- dir258/                                       (ino 258, gen 7)
 |       |--- dir257/                               (ino 257, gen 7)
 |
 |--- dir259/                                       (ino 259, gen 7)

Send snapshot:

 .                                                  (ino 256, gen 3)
 |--- file258                                       (ino 258, gen 10)
 |
 |--- new_dir259/                                   (ino 259, gen 10)
          |--- dir257/                              (ino 257, gen 7)

The following steps happen when computing the incremental send stream:

1) When processing inode 257, its new parent is created using its orphan
   name (o259-10-0), and the rename operation for inode 257 is delayed
   because its new parent (inode 259) was not yet processed - this
   decision to delay the rename operation does not make much sense
   because the inode 259 in the send snapshot is a new inode, it's not
   the same as inode 259 in the parent snapshot.

2) When processing inode 258 we end up delaying its rmdir operation,
   because inode 257 was not yet renamed (moved away from the directory
   inode 258 represents). We also create the new inode 258 using its
   orphan name "o258-10-0", then rename it to its final name of "file258"
   and then issue a truncate operation for it. However this truncate
   operation contains an incorrect name, which corresponds to the orphan
   name and not to the final name, which makes the receiver fail. This
   happens because when we attempt to compute the inode's current name
   we verify that there's another inode with the same number (258) that
   has its rmdir operation pending and because of that we generate an
   orphan name for the new inode 258 (we do this in the function
   get_cur_path()).
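
The orphan names that appear in these steps and in the receive output
follow the pattern o<inode number>-<generation>-<index>. A sketch of
building such a name is shown here; the function name is hypothetical and
the meaning of the last field (a sequence number used to avoid
collisions) is an assumption.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format an orphan name such as "o258-10-0" into buf. Returns the
 * length the full name needs, as snprintf() does.
 */
static int demo_orphan_name(char *buf, size_t len, uint64_t ino,
			    uint64_t gen, uint64_t idx)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "o%" PRIu64 "-%" PRIu64 "-%" PRIu64,
			ino, gen, idx);
}
```

Because the generation is part of the name, the old directory inode 258
(generation 7) and the new file inode 258 (generation 10) get distinct
orphan names.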

Fix this by not delaying the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.

The following steps reproduce this example scenario.

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ mkdir /mnt/dir257
 $ mkdir /mnt/dir258
 $ mkdir /mnt/dir259
 $ mv /mnt/dir257 /mnt/dir258/dir257
 $ btrfs subvolume snapshot -r /mnt /mnt/snap1

 $ mv /mnt/dir258/dir257 /mnt/dir257
 $ rmdir /mnt/dir258
 $ rmdir /mnt/dir259

 # Remount the filesystem so that the next created inodes will have the
 # numbers 258 and 259. This is because when a filesystem is mounted,
 # btrfs sets the subvolume's inode counter to a value corresponding to
 # the highest inode number in the subvolume plus 1. This inode counter
 # is used to assign a unique number to each new inode and it's
 # incremented by 1 after every inode creation.
 # Note: we unmount and then mount instead of doing a mount with
 # "-o remount" because otherwise the inode counter remains at value 260.
 $ umount /mnt
 $ mount /dev/sdb /mnt
 $ touch /mnt/file258
 $ mkdir /mnt/new_dir259
 $ mv /mnt/dir257 /mnt/new_dir259/dir257
 $ btrfs subvolume snapshot -r /mnt /mnt/snap2

 $ btrfs send /mnt/snap1 -f /tmp/1.snap
 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap

 $ umount /mnt
 $ mkfs.btrfs -f /dev/sdc
 $ mount /dev/sdc /mnt
 $ btrfs receive /mnt -f /tmp/1.snap
 $ btrfs receive /mnt -f /tmp/2.snap -vv
 receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
 utimes
 mkdir o259-10-0
 rename dir258 -> o258-7-0
 utimes
 mkfile o258-10-0
 rename o258-10-0 -> file258
 utimes
 truncate o258-10-0 size=0
 ERROR: truncate o258-10-0 failed: No such file or directory

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-02-24 00:36:33 +00:00
Robbie Ko
4dd9920d99 Btrfs: send, fix failure to rename top level inode due to name collision
Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.

Consider the following example scenario.

Parent snapshot:

  .                 (ino 256, gen 8)
  |---- a1/         (ino 257, gen 9)
  |---- a2/         (ino 258, gen 9)

Send snapshot:

  .                 (ino 256, gen 3)
  |---- a2/         (ino 257, gen 7)

In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (run in verbose mode, -vv argument):

  rmdir a1
  mkfile o257-7-0
  rename o257-7-0 -> a2
  ERROR: rename o257-7-0 -> a2 failed: Is a directory

What happens when computing the incremental send stream is:

1) An operation to remove the directory with inode number 257 and
   generation 9 is issued.

2) An operation to create the inode with number 257 and generation 7 is
   issued. This creates the inode with an orphanized name of "o257-7-0".

3) An operation to rename the new inode 257 to its final name, "a2", is
   issued. This is incorrect because inode 258, which has the same name
   and is a child of the same parent (root inode 256), was not yet
   processed and therefore no rmdir operation for it was yet issued.
   The rename operation is issued because we fail to detect that the
   name of the new inode 257 collides with inode 258, because their
   parent, a subvolume/snapshot root (inode 256) has a different
   generation in both snapshots.

So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
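
The check described above can be illustrated with a small Python model.
This is only a sketch of the idea, not the code in the kernel's send
implementation: the function names and the (ino, gen) tuple layout are
hypothetical and chosen for illustration.

```python
ROOT_INO = 256  # inode number of a subvolume/snapshot root


def same_parent(parent_a, parent_b):
    """Compare two parents given as (ino, gen) tuples.

    For the root inode the generation is ignored, because the root of
    the parent snapshot and the root of the send snapshot can have
    different generations while still being the same directory.
    """
    ino_a, gen_a = parent_a
    ino_b, gen_b = parent_b
    if ino_a != ino_b:
        return False
    if ino_a == ROOT_INO:
        return True
    return gen_a == gen_b


def name_collides(new_entry, unprocessed_entries):
    """Return True if new_entry's name is still taken by an inode that
    was not yet processed. Entries are ((ino, gen), name) tuples."""
    new_parent, new_name = new_entry
    return any(
        name == new_name and same_parent(parent, new_parent)
        for parent, name in unprocessed_entries
    )


# Scenario from the changelog: "a2" (ino 258) still exists under the
# root of the parent snapshot (gen 8) when the new inode 257 wants the
# name "a2" under the root of the send snapshot (gen 3).
pending = [((256, 8), "a2")]
print(name_collides(((256, 3), "a2"), pending))
```

With a strict generation comparison the two roots would be treated as
different parents and the collision would go unnoticed, which is the
failure described in step 3.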

We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/a1
  $ mkdir /mnt/a2
  $ btrfs subvolume snapshot -r /mnt /mnt/snap1
  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ touch /mnt/a2
  $ btrfs subvolume snapshot -r /mnt /mnt/snap2
  $ btrfs receive /mnt -f /tmp/1.snap
  # Take note that once the filesystem is created, its current
  # generation has value 7 so the inode from the second snapshot has
  # a generation value of 7. And after receiving the first snapshot
  # the filesystem is at a generation value of 10, because the call to
  # create the second snapshot bumps the generation to 8 (the snapshot
  # creation ioctl does a transaction commit), the receive command calls
  # the snapshot creation ioctl to create the first snapshot, which bumps
  # the filesystem's generation to 9, and finally when the receive
  # operation finishes it calls an ioctl to transition the first snapshot
  # (snap1) from RW mode to RO mode, which does another transaction commit
  # and bumps the filesystem's generation to 10.
  $ rm -f /tmp/1.snap
  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt
  $ btrfs receive /mnt -f /tmp/1.snap
  # Receive of snapshot snap2 used to fail.
  $ btrfs receive /mnt -f /tmp/2.snap

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-02-24 00:36:01 +00:00
Liu Bo
6288d6eabc Btrfs: use the correct type when creating cow dio extent
'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch
'Btrfs: specify a new ordered extent type for create_io_em',
but it missed the directIO cow case.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-02-22 15:55:03 -08:00
Filipe Manana
b1517622f2 Btrfs: fix deadlock between dedup on same file and starting writeback
If we are deduping two ranges of the same file we need to make sure that
we lock all pages in ascending order, that is, lock first the pages from
the range with lower offset and then the pages from the other range, as
otherwise we can deadlock with a concurrent task that is starting delalloc
(writeback). Example trace:

[74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds.
[74073.053889]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
[74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74073.056696] kworker/u32:10  D    0 17997      2 0x00000000
[74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176)
[74073.061370]  ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240
[74073.064784]  ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000
[74073.068386]  ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240
[74073.071712] Call Trace:
[74073.072884]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
[74073.075395]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
[74073.077511]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
[74073.079440]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
[74073.081637]  [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14
[74073.083809]  [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197
[74073.086314]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
[74073.100654]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
[74073.102619]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
[74073.104771]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
[74073.106969]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
[74073.108954]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
[74073.110981]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
[74073.112833]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
[74073.115010]  [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs]
[74073.116999]  [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs]
[74073.119243]  [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs]
[74073.121636]  [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs]
[74073.124229]  [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs]
[74073.126372]  [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs]
[74073.129371]  [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs]
[74073.131440]  [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs]
[74073.134303]  [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1
[74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
[74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
[74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
[74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
[74073.143911]  [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae
[74073.145787]  [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7
[74073.147452]  [<ffffffff811b60cd>] wb_workfn+0x194/0x37d
[74073.149084]  [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d
[74073.150726]  [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4
[74073.152694]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
[74073.154452]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
[74073.156138]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
[74073.157837]  [<ffffffff81072a81>] kthread+0xd5/0xdd
[74073.159339]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
[74073.161088]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
[74073.162680] INFO: lockdep is turned off.
[74073.163855] INFO: task fdm-stress:30264 blocked for more than 120 seconds.
[74073.181180]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
[74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74073.185296] fdm-stress      D    0 30264  29974 0x00000000
[74073.186810]  ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00
[74073.188998]  ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000
[74073.191070]  ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00
[74073.193286] Call Trace:
[74073.193990]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
[74073.195418]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
[74073.196796]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
[74073.198163]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
[74073.199621]  [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf
[74073.201100]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
[74073.202686]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
[74073.204051]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
[74073.205585]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
[74073.207123]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
[74073.208238]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
[74073.208871]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
[74073.209430]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
[74073.210101]  [<ffffffff8112b800>] lock_page+0x2f/0x32
[74073.210636]  [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153
[74073.211270]  [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs]
[74073.212166]  [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs]
[74073.213257]  [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221
[74073.214086]  [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600
[74073.214767]  [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d
[74073.215619]  [<ffffffff811a7953>] ? __fget+0x6b/0x77
[74073.216338]  [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79
[74073.217149]  [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad
[74073.218102]  [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14
[74073.218968]  [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa
[74073.219938] INFO: lockdep is turned off.

What happened was the following:

      CPU 1                                       CPU 2

                                             btrfs_dedupe_file_range()
                                               --> using same inode as source
                                                   and target
                                               --> src range is [768K, 1Mb[
                                               --> dst range is [0, 256K[
                                              btrfs_cmp_data_prepare()
                                               --> calls gather_extent_pages()
                                                   for range [768K, 1Mb[ and
                                                   locks all pages in that range

 do_writepages()
  btrfs_writepages()
   extent_writepages()
    extent_write_cache_pages()
     __extent_writepage()
      writepage_delalloc()
       find_lock_delalloc_range()
         --> finds range [0, 1Mb[
         lock_delalloc_pages()
          --> locks all pages in the
              range [0, 768K[
          --> tries to lock page at
              offset 768K
                --> deadlock

                                               --> calls gather_extent_pages()
                                                   to lock pages in the range
                                                   [0, 256K[
                                                    --> deadlock, task at CPU 1
                                                        already locked that
                                                        range and it's trying
                                                        to lock the range we
                                                        locked previously

So fix this by making sure that during a dedup we always lock first the
pages from the range with lower offset.
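
The ordering rule can be sketched in a few lines of Python. This is an
illustrative model only, not the kernel code: lock_order() is a
hypothetical helper, and ranges are modelled as half-open (start, end)
byte offsets.

```python
K = 1024


def lock_order(src_off, dst_off, length):
    """Return the source and destination ranges of a same-file dedup
    in the order their pages should be locked: lower offset first."""
    src = (src_off, src_off + length)
    dst = (dst_off, dst_off + length)
    # Tuples sort by their first element, i.e. by start offset.
    return sorted([src, dst])


# Ranges from the trace above: src is [768K, 1Mb[, dst is [0, 256K[.
for start, end in lock_order(768 * K, 0, 256 * K):
    print("lock pages in [%dK, %dK[" % (start // K, end // K))
```

Because writeback also locks pages in ascending offset order, both
tasks then acquire page locks in the same direction and neither can
hold a page the other needs while waiting for a lower one.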

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-02-22 15:55:02 -08:00
Jens Axboe
818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Jeff Mahoney
71367b3fa7 btrfs: use btrfs_debug instead of pr_debug in transaction abort
Commit e5d6b12fe1 (Btrfs: don't WARN() in btrfs_transaction_abort() for
IO errors) added a pr_debug call to be printed when a transaction is
aborted with -EIO instead of WARN.  btrfs_debug prints which file system
the message is associated with so let's use that instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
21e75ffe3c btrfs: btrfs_truncate_free_space_cache always allocates path
btrfs_truncate_free_space_cache always allocates a btrfs_path structure
but only uses it when the caller passes a block group.  Let's move the
allocation and free into the conditional.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
77ab86bf1c btrfs: free-space-cache, clean up unnecessary root arguments
The free space cache APIs accept a root but always use the tree root.

Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
5e00f1939f btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree.  Let's convert
to passing an fs_info and using the extent root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
0c9ab349c2 btrfs: flush_space always takes fs_info->fs_root
We don't need to pass a root to flush_space since it always uses
the fs_root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
Jeff Mahoney
87bde3cdfc btrfs: pass fs_info to (more) routines that are only called with extent_root
Outside of interactions with qgroups, the roots passed in extent-tree.c
are usually passed to ensure that we don't do refcounts on log trees or
to get the allocation profile for an allocation request.  Otherwise, it
operates on the extent root.  This patch converts some more routines in
extent-tree.c that are always called with the extent root to accept
an fs_info instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
Qu Wenruo
fb235dc06f btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
Just as Filipe pointed out, the most time consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents(), which both call
btrfs_find_all_roots() to get the old_roots and new_roots ulists.

What makes things worse is that we're calling the expensive
btrfs_find_all_roots() at transaction commit time with
TRANS_STATE_COMMIT_DOING, which blocks all incoming transactions.

Such behavior is necessary for the @new_roots search, as the current
btrfs_find_all_roots() can't do it correctly, so we call it just
before switching commit roots.

However, for the @old_roots search it's not necessary, as that search is
based on commit_root, so it will always be correct and we can move it
out of transaction commit.

This patch moves the @old_roots search part out of
commit_transaction(), so in theory we can halve the qgroup time
consumption at commit_transaction().

But please note that this won't speed up qgroup overall; the total time
consumption is still the same, it just reduces the performance stall.

Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
15b34517a6 btrfs: remove unused parameter from adjust_slots_upwards
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
0e8d931a82 btrfs: remove unused parameters from __btrfs_write_out_cache
Both unused after the call to update_cache_item has been moved to
__btrfs_wait_cache_io.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
7bf1a15912 btrfs: remove unused parameter from cleanup_write_cache_enospc
bitmap_list is unused since the io_ctl framework.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
d75eefdf96 btrfs: remove unused parameter from __add_inode_ref
Unused since the helper has been split, eb used in the caller.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
4a0ab9d711 btrfs: remove unused parameter from clone_copy_inline_extent
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
1a287cfea1 btrfs: remove unused parameters from btrfs_cmp_data
After the page locking has been reworked, we get all pages prepared via
cmp_pages.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
eeac44cb49 btrfs: remove unused parameter from __add_inline_refs
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
e5987e1319 btrfs: remove unused parameters from scrub_setup_wr_ctx
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
61d7e4cb11 btrfs: remove unused parameter from create_snapshot
The name parameters have never been used, as the name is passed via the
dentry.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
e4a4dce72e btrfs: remove unused parameter from init_first_rw_device
The 'device' used to be added in that function, but now it's done by the
caller.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
72b468c8da btrfs: remove unused parameter from __btrfs_alloc_chunk
We grab fs_info from other parameters.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
56e033a787 btrfs: remove unused parameter from btrfs_fill_super
Never used for anything meaningful since we have our own superblock
filler.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
4242b64a4c btrfs: remove unused parameter from extent_write_cache_pages
The 'tree' was used to call locking hook that does not exist anymore.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
df9f628e3d btrfs: remove unused parameter from add_pending_csums
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
3d4b9496e8 btrfs: remove unused parameter from update_nr_written
The logic has been updated in "Btrfs: make mapping->writeback_index
point to the last written page" (a91326679f) and page is not
needed anymore.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
c2df8bb43f btrfs: remove unused parameter from submit_extent_page
This used to hold the maximum number of pages to allocate, but this is now
limited by BIO_MAX_PAGES. The locals are now unused and removed as well.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
f1e3026192 btrfs: remove unused parameter from tree_move_next_or_upnext
Not needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
ab6a43e122 btrfs: remove unused parameter from tree_move_down
Never needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
3d3a126a81 btrfs: remove unused parameter from btrfs_check_super_valid
None of the checks need to know the ro/rw status as they're all not
changing the superblock. Moreover, we can access the sb flags directly
if we'd need to decide by the ro/rw status.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
8b74c03e3c btrfs: remove unused parameter from btrfs_prepare_extent_commit
Added but never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
7775c8184e btrfs: remove unused parameter from btrfs_subvolume_release_metadata
Unused since qgroup refactoring that split data and metadata accounting,
the btrfs_qgroup_free helper.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
66cb7ddbf2 btrfs: remove unused parameter from __push_leaf_left
Unused since long ago.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
1e47eef223 btrfs: remove unused parameter from __push_leaf_right
Unused since long ago.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
eece6a9cf6 btrfs: merge two superblock writing helpers
write_all_supers and write_ctree_super are almost equal, the parameter
'trans' is unused so we can drop it and have just one helper.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
b75f506243 btrfs: remove unused parameter from write_dev_supers
The barriers are handled by the caller.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
4961e2930f btrfs: remove unused parameter from split_item
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
7c302b49dd btrfs: remove unused parameter from clean_tree_block
Added but never needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
e27f62652b btrfs: remove unused parameter from check_async_write
Added but never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
cda79c545e btrfs: remove unused parameter from read_block_for_search
Never used in that function.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
6655bc3de1 btrfs: ulist: rename ulist_fini to ulist_release
Change the name so it matches the naming we already use eg. for
btrfs_path.

Suggested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
4ae8553c2d btrfs: remove pointless rcu protection from btrfs_qgroup_inherit
There was never a need for RCU protection around reading nodesize or other
fairly constant filesystem data.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
0b08e1f4f7 btrfs: qgroups: opencode qgroup_free helper
The helper name is not too helpful and is just wrapping a simple call.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
9ea6e2b548 btrfs: remove unnecessary mutex lock in qgroup_account_snapshot
The quota status used to be tracked as a variable, so the mutex was
needed (until "Btrfs: add a flags field to btrfs_fs_info" afcdd129e0).
Since the status is a bit modified atomically and we don't hold the
mutex beyond the check, we can drop it.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
81353d50f5 btrfs: check quota status earlier and don't do unnecessary frees
Status of quotas should be the first check in
btrfs_qgroup_account_extent and we can return immediately, no need to
do no-op ulist frees.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
53d3235995 btrfs: embed extent_changeset::range_changed to the structure
We can embed range_changed to the extent changeset to address following
problems:

- no need to allocate ulist dynamically, we also get rid of the GFP_NOFS
  for free
- fix lack of allocation failure checking in btrfs_qgroup_reserve_data

The stack consumption where extent_changeset is used slightly increases:

before: 16
after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40

Which is bearable.
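
The arithmetic can be checked with a small ctypes model. This is a
sketch under stated assumptions: a 64-bit layout with 8-byte pointers,
and a 32-byte stand-in for the ulist (the size quoted above) whose
real fields are not modelled. The class and field names are
illustrative, not the kernel definitions.

```python
import ctypes


class Ulist(ctypes.Structure):
    # Stand-in for the ulist: four 8-byte words, 32 bytes in total.
    _fields_ = [("words", ctypes.c_uint64 * 4)]


class ChangesetBefore(ctypes.Structure):
    # One 8-byte counter plus a pointer to a dynamically allocated
    # ulist, with the pointer modelled as an 8-byte integer.
    _fields_ = [
        ("bytes_changed", ctypes.c_uint64),
        ("range_changed", ctypes.c_uint64),
    ]


class ChangesetAfter(ctypes.Structure):
    # Same counter, but the ulist is embedded instead of pointed to.
    _fields_ = [
        ("bytes_changed", ctypes.c_uint64),
        ("range_changed", Ulist),
    ]


print(ctypes.sizeof(ChangesetBefore))  # 16
print(ctypes.sizeof(ChangesetAfter))   # 40
```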

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
9d03793386 btrfs: ulist: make the finalization function public
Make ulist_fini externally visible so the ulist API is complete.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
025db916aa btrfs: qgroups: make __del_qgroup_relation static
Internal helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
1d4805386e btrfs: make space cache inode readahead failure nonfatal
We do a readahead of the free space cache inode to speed things up but
the failure is not fatal, like in other readahead cases. Proper reads
would need to happen anyway and any errors would be caught there.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
6602caf149 btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation
Qgroup relations are added/deleted from ioctl, we hold the high level
qgroup lock, no deadlocks or recursion from the allocation possible
here.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
52bf8e7aea btrfs: use GFP_KERNEL in btrfs_quota_enable
We don't need to use GFP_NOFS here as this is called from ioctls and the
only lock held is the subvol_sem, which is of a high level and protects
creation/renames/deletion and is never held in the writeout paths.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
323b88f4ab btrfs: use GFP_KERNEL in btrfs_read_qgroup_config
The qgroup config is read during mount, we do not have to use NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
23269bf5ea btrfs: use GFP_KERNEL in create_snapshot
We don't need to use GFP_NOFS here as this is called from ioctls and the
only lock held is the subvol_sem, which is of a high level and protects
creation/renames/deletion and is never held in the writeout paths.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
1af4a0aaa5 Btrfs: specify a new ordered extent type for create_io_em
As 0 refers to an existing type BTRFS_ORDERED_IO_DONE, this specifies a
new type 'REGULAR' for regular IO.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
6f9994dbab Btrfs: create a helper to create em for IO
We have similar code to create and insert extent mappings around the IO
path; this merges it into a single helper.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
4136135b08 Btrfs: use helper to get used bytes of space_info
This uses a helper instead of open coding the used bytes calculation of
space_info everywhere.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
0c9b36e0d7 Btrfs: try to avoid acquiring free space ctl's lock
We don't need to take the lock if the block group has not been cached.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Qu Wenruo
6f6b643e44 btrfs: Better csum error message for data csum mismatch
The original csum error message only outputs inode number, offset, check
sum and expected check sum.

However, no root objectid is output, which sometimes makes debugging
quite painful in the multi-subvolume case (including relocation).

Also, the checksum output is decimal, which seldom makes sense for
users/developers and is hard to read most of the time.

This patch will add root objectid, which will be %lld for rootid larger
than LAST_FREE_OBJECTID, and hex csum output for better readability.
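
The formatting rule can be sketched as follows. This is an illustrative
Python model, not the kernel's message: it assumes LAST_FREE_OBJECTID
is -256 interpreted as an unsigned 64-bit value, and the exact wording
and field order of the message are hypothetical.

```python
LAST_FREE_OBJECTID = (1 << 64) - 256  # assumed: -256 as unsigned 64-bit


def format_root(rootid):
    """Print root ids above LAST_FREE_OBJECTID as signed values (the
    %lld case), so internal trees show up as small negative numbers
    instead of huge unsigned ones."""
    if rootid > LAST_FREE_OBJECTID:
        return str(rootid - (1 << 64))
    return str(rootid)


def csum_warning(rootid, ino, offset, csum, expected):
    """Build an example message with the root id and hex checksums."""
    return "csum failed root %s ino %d off %d csum 0x%08x expected csum 0x%08x" % (
        format_root(rootid), ino, offset, csum, expected)


print(csum_warning(5, 257, 4096, 0xDEADBEEF, 0x1))
print(format_root((1 << 64) - 9))
```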

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Takafumi Kubota
fe01aa6538 Btrfs: add another missing end_page_writeback on submit_extent_page failure
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.

__extent_writepage_io, which is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this causes a hang in filemap_fdatawait_range,
because it waits for the writeback bit to be cleared on the failed page.
So we have to call end_page_writeback to clear the writeback bit.

For reproducing the hang, we inject a fault like

   if (should_failtest()) { // I define should_failtest()
        bio = NULL;
   }
   else {
        bio = btrfs_bio_alloc(...);
   }

in submit_extent_page.

We should also check whether the page has the bit set before calling
end_page_writeback, to avoid conflicting with the other
end_page_writeback in bio_endio. Thus, we add PageWriteback checks not
only in __extent_writepage_io but also in write_one_eb, which lacks them.
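
The guard can be modelled in Python as below. This is a sketch, not
kernel code: the Page class and finish_failed_submit() are hypothetical
stand-ins, and the rule that clearing an already cleared writeback bit
is an error is modelled by raising an exception.

```python
class Page:
    """Minimal stand-in for a page with a writeback bit."""

    def __init__(self, writeback=False):
        self.writeback = writeback


def end_page_writeback(page):
    # Clearing the bit twice is treated as a bug in this model.
    if not page.writeback:
        raise RuntimeError("writeback bit already cleared")
    page.writeback = False


def finish_failed_submit(page):
    # Error path after a failed submission: only end writeback if the
    # bit is still set, since the bio completion path may have cleared
    # it already.
    if page.writeback:
        end_page_writeback(page)


page = Page(writeback=True)
finish_failed_submit(page)  # clears the bit
finish_failed_submit(page)  # no-op, the bit is already clear
print(page.writeback)
```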

Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:47 +01:00
David Sterba
66bbc1c0c0 btrfs: remove unused ulist members
Commit "btrfs: ulist: Add ulist_del() function" (d4b8040459)
removed some debugging code but left the structure definitions.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:47 +01:00
Liu Bo
76c0021db8 Btrfs: use helper to simplify lock/unlock pages
Since we have a helper to set page bits, let lock_delalloc_pages and
__unlock_for_delalloc use it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:47 +01:00
Liu Bo
da2c7009f6 btrfs: teach __process_pages_contig about PAGE_LOCK operation
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changes to the helper separated from the following patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:35 +01:00
Liu Bo
873695b301 Btrfs: create helper for processing bits on contiguous pages
This introduces a new helper which can be used to process pages bits.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:51:00 +01:00
Liu Bo
e4c3b2dcd1 Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
run_delalloc_nocow uses trans in two places that don't actually need
@trans.

For btrfs_lookup_file_extent, we search for file extents without COWing
anything, and for btrfs_cross_ref_exist, the only place where we need
@trans is dereferencing it in order to get running_transaction, which we
could easily get from the global fs_info.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:51:00 +01:00
Liu Bo
f72ad18e99 Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head
All we need is @delayed_refs. All callers already get it before calling
btrfs_find_delayed_ref_head, since the lock needs to be acquired first, so
there is no reason to dereference it again inside the function.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
Liu Bo
d07b85284f Btrfs: remove unused trans in read_block_for_search
@trans is not used at all; this removes it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
Liu Bo
bcf934894f Btrfs: cleanup unused cached_state in __extent_writepage_io
@cached_state is no longer required in __extent_writepage_io; also remove
the goto label.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
Jeff Mahoney
003d7c59e8 btrfs: allow unlink to exceed subvolume quota
Once a qgroup limit is exceeded, it's impossible to restore normal
operation to the subvolume without modifying the limit or removing
the subvolume.  This is a surprising situation for many users used
to the typical workflow with quotas on other file systems where it's
possible to remove files until the used space is back under the limit.

When we go to unlink a file and start the transaction, we'll hit
the qgroup limit while trying to reserve space for the items we'll
modify while removing the file.  We discussed last month how best
to handle this situation and agreed that there is no perfect solution.
The best principle-of-least-surprise solution is to handle it similarly
to how we already handle ENOSPC when unlinking, which is to allow
the operation to succeed with the expectation that it will ultimately
release space under most circumstances.

This patch modifies the transaction start path to select whether to
honor the qgroups limits.  btrfs_start_transaction_fallback_global_rsv
is the only caller that skips enforcement.  The reservation and tracking
still happens normally -- it just skips the enforcement step.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
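Note on 003d7c59e8: the idea of reserving and tracking space as usual while skipping only the limit check can be sketched in userspace C as below. This is a minimal illustration under stated assumptions, not the kernel code: `struct demo_qgroup`, `demo_qgroup_reserve` and the `enforce` parameter name are hypothetical, and returning -EDQUOT on an exceeded limit is an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified accounting state (not the kernel structure). */
struct demo_qgroup {
	uint64_t limit;    /* 0 means no limit is configured */
	uint64_t reserved; /* bytes currently reserved */
};

/*
 * Reserve @bytes against @qg. The reservation is always tracked; the
 * limit is only checked when @enforce is true, so a caller such as an
 * unlink path can pass false and still be accounted for.
 */
static int demo_qgroup_reserve(struct demo_qgroup *qg, uint64_t bytes,
			       bool enforce)
{
	if (enforce && qg->limit && qg->reserved + bytes > qg->limit)
		return -EDQUOT; /* assumed error code for this sketch */

	qg->reserved += bytes;
	return 0;
}
```

With a limit of 100 and 90 bytes already reserved, a 20-byte request fails when enforced and succeeds (and is counted) when not enforced.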
Liu Bo
9a9239acb4 Btrfs: fix wrong argument for btrfs_lookup_ordered_range
Commit "Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units"
(d0b7da88) did this, but btrfs_lookup_ordered_range expects a 'length'
rather than a 'page_end'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
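Note on 9a9239acb4: why passing an end offset where a length is expected matters can be shown with a small userspace sketch. Everything here is hypothetical (`demo_range_overlaps`, the 4096-byte page size, the example offsets); it only illustrates the class of bug the message describes, not the actual lookup implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range lookup: does [start, start + len) overlap the
 * extent [ext_start, ext_start + ext_len)? The second argument is a
 * length, not an end offset.
 */
static bool demo_range_overlaps(uint64_t start, uint64_t len,
				uint64_t ext_start, uint64_t ext_len)
{
	return start < ext_start + ext_len && ext_start < start + len;
}
```

For a page at offset 8192 (page_end = 12287) and an unrelated extent at 16384, passing the length 4096 correctly reports no overlap, while passing page_end as the length stretches the searched range to 20479 and reports a false match.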
Qu Wenruo
a7ceffbbbd btrfs: raid56: Remove unused variable in lock_stripe_add
Variable 'walk' in lock_stripe_add() is not used.  Remove it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
Omar Sandoval
fc4badd9fe Btrfs: refactor btrfs_extent_same() slightly
This was originally a prep patch for changing the behavior on len=0, but
we went another direction with that. This still makes the function
slightly easier to follow.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:58 +01:00
Omar Sandoval
310712b2f7 Btrfs: constify struct btrfs_{,disk_}key wherever possible
In a lot of places, it's unclear when it's safe to reuse a struct
btrfs_key after it has been passed to a helper function. Constify these
arguments wherever possible to make it obvious.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:58 +01:00
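Note on 310712b2f7: a const-qualified key argument documents that the helper will not modify the caller's key, so the key can be reused after the call. The sketch below assumes a simplified `struct demo_key` with objectid, type and offset fields and a hypothetical comparison helper; it is not the kernel definition.

```c
#include <stdint.h>

/* Simplified stand-in for a key (assumed layout, for illustration). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Both arguments are const: the compiler rejects any write through
 * them, so callers know their keys are unchanged after the call.
 */
static int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```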
Liu Bo
4aaedfb0b6 Btrfs: fix another race between truncate and lockless dio write
Dio writes can update i_size in btrfs_get_blocks_direct when they
write to an offset beyond EOF so that endio can update disk_i_size
correctly (because we don't update disk_i_size beyond i_size).

However, when truncating down a file, we first update i_size
and then wait for in-flight lockless dio reads/writes. Per the
above, i_size may have been changed by a dio write in the meantime,
so file extents don't get truncated.

Since lockless dio writes are always overwrites, i_size is not
supposed to be changed, so this adds a check to filter out this
case.

The race could be reproduced by fstests/generic/299 with patch
"Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly"
applied.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:58 +01:00
Liu Bo
62c821a8e2 Btrfs: clean up btrfs_ordered_update_i_size
Since we have a good helper entry_end, use it for ordered extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ whitespace reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:58 +01:00
Liu Bo
5416034f7a Btrfs: fix comment in btrfs_page_mkwrite
The comment about "page_mkwrite gets called every time the page is
dirtied" in btrfs_page_mkwrite is not correct; it only gets called the
first time the page gets dirtied after the page faults in.

However, we don't need to touch the code because it works well, although
the proper logic is to check if the delalloc bits have been set and, if so,
free the reserved space; if not, set the delalloc bits for the dirty page range.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:58 +01:00
Liu Bo
19fd2df5b1 Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly
btrfs_ordered_update_i_size can be called by truncate and endio, but
only endio takes ordered_extent which contains the completed IO.

While truncating down a file, if there are in-flight IOs,
btrfs_ordered_update_i_size in endio will set disk_i_size to
@orig_offset, which is zero.  If the truncate-down fails somehow, we try
to recover the in-memory i_size from this zeroed disk_i_size.

Fix it by only updating disk_i_size with @orig_offset when
btrfs_ordered_update_i_size is not called from endio while truncating
down, and by waiting for in-flight IOs to complete their work before
recovering the in-memory size.

Besides fixing the above issue, add an assertion for last_size to double
check we truncate down to the desired size.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:57 +01:00
David Sterba
f85b7379cd btrfs: fix over-80 lines introduced by previous cleanups
This goes as a separate patch because fixing that inside the patches
caused too many conflicts.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:57 +01:00
Nikolay Borisov
f329e31971 btrfs: Make count_inode_refs take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:57 +01:00
Nikolay Borisov
3628365823 btrfs: Make count_inode_extrefs take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:57 +01:00
Nikolay Borisov
a59108a73f btrfs: Make btrfs_log_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:57 +01:00
Nikolay Borisov
6d889a3b9e btrfs: Make log_inode_item take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:56 +01:00
Nikolay Borisov
94c91a1f39 btrfs: Make __add_inode_ref take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:56 +01:00
Nikolay Borisov
207e7d92aa btrfs: Make drop_one_dir_item take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:56 +01:00
Nikolay Borisov
4ec5934e43 btrfs: Make btrfs_unlink_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:56 +01:00
Nikolay Borisov
51cc0d3227 btrfs: Make log_new_dir_dentries take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:56 +01:00
Nikolay Borisov
dbf39ea48b btrfs: Make log_directory_changes take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:56 +01:00
Nikolay Borisov
684a5773f9 btrfs: Make log_dir_items take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
9d122629f1 btrfs: Make btrfs_log_changed_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
223466370c btrfs: Make btrfs_get_logged_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
a0308dd7e0 btrfs: Make btrfs_log_trailing_hole take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
1a93c36acd btrfs: Make btrfs_log_all_xattrs take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
44d70e194f btrfs: Make copy_items take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
4791c8f19c btrfs: Make btrfs_check_ref_name_override take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:55 +01:00
Nikolay Borisov
481b01c0d3 btrfs: Make logged_inode_size take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:54 +01:00
Nikolay Borisov
a491abb2e7 btrfs: Make btrfs_del_inode_ref take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:54 +01:00
Nikolay Borisov
49f34d1f96 btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:54 +01:00
Nikolay Borisov
9ca5fbfbb9 btrfs: Make btrfs_log_new_name take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:54 +01:00
Nikolay Borisov
0f8939b8ac btrfs: Make btrfs_inode_in_log take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:54 +01:00
Nikolay Borisov
436635571b btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:54 +01:00
Nikolay Borisov
4176bdbf2d btrfs: Make btrfs_record_unlink_dir take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:53 +01:00
Nikolay Borisov
ab1717b2ab btrfs: Make btrfs_must_commit_transaction take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:53 +01:00
Nikolay Borisov
f5cc7b80a6 btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:53 +01:00
Nikolay Borisov
5f4b32e94a btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:53 +01:00
Nikolay Borisov
aa79021fde btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:53 +01:00
Nikolay Borisov
f48d1cf59c btrfs: Make btrfs_remove_delayed_node take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:53 +01:00
Nikolay Borisov
4ccb5c7231 btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:52 +01:00
Nikolay Borisov
e07222c7d2 btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:52 +01:00
Nikolay Borisov
e67bbbb9d0 btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:52 +01:00
Nikolay Borisov
6f45d18568 btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:52 +01:00
Nikolay Borisov
fcabdd1ca5 btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:52 +01:00
Nikolay Borisov
e5517a7bff btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
Nikolay Borisov
340c6ca9fd btrfs: Make btrfs_get_delayed_node take btrfs_inode
This function is internal to btrfs and doesn't really deal with any
VFS members; as such it needn't take a struct inode reference but a
btrfs_inode.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
Nikolay Borisov
4a0cc7ca6c btrfs: Make btrfs_ino take a struct btrfs_inode
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs, it is first necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
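Note on 4a0cc7ca6c: converting a VFS inode to the filesystem's own inode with a BTRFS_I-style helper is a container_of lookup on an embedded member. The sketch below uses hypothetical, heavily simplified structures (`demo_vfs_inode`, `demo_btrfs_inode`) and a hypothetical `demo_btrfs_ino` that just returns a stored object id; the real structures and helper carry more state.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; field names are assumptions for this sketch. */
struct demo_vfs_inode {
	unsigned long i_ino;
};

struct demo_btrfs_inode {
	uint64_t objectid;
	struct demo_vfs_inode vfs_inode; /* embedded VFS inode */
};

/* BTRFS_I-style conversion: recover the container from the member. */
static struct demo_btrfs_inode *DEMO_BTRFS_I(struct demo_vfs_inode *inode)
{
	return (struct demo_btrfs_inode *)((char *)inode -
			offsetof(struct demo_btrfs_inode, vfs_inode));
}

/* After the change, the helper takes the filesystem's own inode type. */
static uint64_t demo_btrfs_ino(const struct demo_btrfs_inode *inode)
{
	return inode->objectid;
}
```

A caller that only holds a VFS inode would write `demo_btrfs_ino(DEMO_BTRFS_I(inode))`, which keeps the conversion at the boundary rather than inside every internal helper.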
David Sterba
823bb20ab4 btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE
The expression is open-coded in several places, this asks for a wrapper.
As we know the MAX_EXTENT fits in u32, we can use the appropriate
division helper. This cascades to the result type updates.

The compiler is clever enough to use shift instead of integer division, so
there's no change in the generated assembly.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
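Note on 823bb20ab4: a wrapper of this kind is a round-up division of a byte count by the maximum extent size. The userspace sketch below assumes a 128 MiB maximum extent size and uses hypothetical names (`DEMO_MAX_EXTENT_SIZE`, `demo_count_max_extents`); plain `/` stands in for the kernel's division helper.

```c
#include <stdint.h>

/* Assumed value for this sketch: 128 MiB, a power of two. */
#define DEMO_MAX_EXTENT_SIZE (128u * 1024 * 1024)

/*
 * Number of maximum-sized extents needed to cover @size bytes,
 * rounding up. Because the divisor is a constant power of two, a
 * compiler can emit a shift instead of a division.
 */
static inline uint32_t demo_count_max_extents(uint64_t size)
{
	return (uint32_t)((size + DEMO_MAX_EXTENT_SIZE - 1) /
			  DEMO_MAX_EXTENT_SIZE);
}
```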
David Sterba
95995dbbe6 btrfs: remove unused logic of limiting async delalloc pages
A proposed patch in https://marc.info/?l=linux-btrfs&m=147859791003837
pointed out bad limit threshold in cow_file_range_async, but it turned
out that the whole logic is not necessary and is done by writeback. We
agreed to remove it.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
Anand Jain
26d30f8529 btrfs: consolidate auto defrag kick off policies
As of now, writes smaller than 64k for non-compressed extents and 16k
for compressed extents inside EOF are considered candidates
for auto defrag. Put these checks together in one place.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
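Note on 26d30f8529: the policy described in the message can be sketched as one predicate. The function name, parameters, and the reading of "inside eof" as "the write ends before the current file size" are assumptions of this sketch, not the kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SMALL_WRITE_PLAIN      (64u * 1024) /* non-compressed */
#define DEMO_SMALL_WRITE_COMPRESSED (16u * 1024) /* compressed */

/*
 * Hypothetical consolidated check: a write is an auto defrag
 * candidate when it is smaller than the threshold for its extent
 * kind and ends inside the file (assumed meaning of "inside eof").
 */
static bool demo_should_defrag(uint64_t num_bytes, uint64_t write_end,
			       uint64_t i_size, bool compressed)
{
	uint64_t small_write = compressed ? DEMO_SMALL_WRITE_COMPRESSED
					  : DEMO_SMALL_WRITE_PLAIN;

	return num_bytes < small_write && write_end < i_size;
}
```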
Anand Jain
8c3e6b1f0c btrfs: btrfs_defrag_root() doesn't defrag extent root tree
Since btrfs_defrag_leaves() does not support extent_root, remove its
corresponding call. The user can use the file based defrag to defrag
extents as of now.

No change in behaviour as extent_root is explicitly skipped in
btrfs_defrag_leaves and this has never worked as expected.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Jeff Mahoney
fef394f75b btrfs: drop unused extent_op arg from btrfs_add_delayed_data_ref
btrfs_add_delayed_data_ref is always called with a NULL extent_op,
so let's drop the argument.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Colin Ian King
694a0dee9c btrfs: remove redundant inode null check
The check for a null inode is redundant since the function
is a callback for exportfs, which will itself crash if
dentry->d_inode or parent->d_inode is NULL.  Removing the
null check makes this consistent with other file systems.

Also remove the redundant null dir check.

Found with static analysis by CoverityScan, CID 1389472

Kudos to Jeff Mahoney for reviewing and explaining the error in
my original patch (most of this explanation went into the above
commit message) and David Sterba for pointing out that the dir
check is also redundant.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Seraphime Kirkovski
20c7bcec6f Btrfs: ACCESS_ONCE cleanup
This replaces the ACCESS_ONCE macro with the corresponding
READ_ONCE/WRITE_ONCE macros.

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
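Note on 20c7bcec6f: the shape of such a conversion can be illustrated with simplified userspace stand-ins. The macros below are assumptions for this sketch (volatile casts using the GCC/Clang `__typeof__` extension); the kernel's own definitions are more elaborate, and `shared_flag` and the accessor functions are hypothetical.

```c
/* Simplified userspace stand-ins, not the kernel definitions. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int shared_flag;

/* Before: ACCESS_ONCE(shared_flag) = v;  After: an explicit store. */
static void demo_set_flag(int v)
{
	WRITE_ONCE(shared_flag, v);
}

/* Before: return ACCESS_ONCE(shared_flag);  After: an explicit load. */
static int demo_get_flag(void)
{
	return READ_ONCE(shared_flag);
}
```

Splitting the single macro into a read form and a write form makes the direction of each access visible at the call site.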
Seraphime Kirkovski
50d0446e68 Btrfs: code cleanup min/max -> min_t/max_t
This cleans up the cases where the min/max macros were used with a cast
rather than using min_t/max_t directly.

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
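Note on 50d0446e68: the before/after pattern looks roughly like the sketch below. The macros are simplified userspace stand-ins written for this illustration (they evaluate their arguments more than once and do no type checking, unlike the kernel versions), and both functions are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-ins, not the kernel definitions. */
#define min(a, b)         ((a) < (b) ? (a) : (b))
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Before: a cast at the call site to make both operand types match. */
static uint64_t demo_clamp_old(uint32_t avail, uint64_t want)
{
	return min((uint64_t)avail, want);
}

/* After: the comparison type is stated once, in min_t itself. */
static uint64_t demo_clamp_new(uint32_t avail, uint64_t want)
{
	return min_t(uint64_t, avail, want);
}
```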
Geliang Tang
6b4df8b6c5 btrfs: use rb_entry() instead of container_of
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:49 +01:00
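Note on 6b4df8b6c5: rb_entry() is a thin alias for container_of(), so the change is about intent rather than behavior. The userspace sketch below re-creates both macros in simplified form; `struct demo_entry` and `demo_from_node` are hypothetical, and the `rb_node` here is a reduced stand-in for the kernel structure.

```c
#include <stddef.h>

/* Reduced stand-in for the kernel's rbtree node. */
struct rb_node {
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

/* Simplified container_of without the kernel's type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

/* Hypothetical structure that embeds an rbtree node. */
struct demo_entry {
	unsigned long long start;
	struct rb_node node;
};

/* rb_entry() makes it obvious that @n is an rbtree node. */
static struct demo_entry *demo_from_node(struct rb_node *n)
{
	return rb_entry(n, struct demo_entry, node);
}
```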