Commit Graph

7482 Commits

Author SHA1 Message Date
David Sterba
ef2fff64fd btrfs: rename helper macros for qgroup and aux data casts
The helpers are not meant to be generic, the name is misleading. Convert
them to static inlines for type checking.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:15 +01:00
David Sterba
5d9dbe617a btrfs: remove stale comment from btrfs_statfs
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:15 +01:00
David Sterba
926b92335a btrfs: remove unused headers, statfs.h
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Xiaoguang Wang
745699ef62 btrfs: remove useless comments
Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Adam Borowski
ebce0e01b9 btrfs: make block group flags in balance printks human-readable
They're not even documented anywhere, letting users with no recourse but
to RTFS.  It's no big burden to output the bitfield as words.

Also, display unknown flags as hex.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Omar Sandoval
8e2bd3b7fa Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
My QEMU VM was seeing inexplicable I/O errors that I tracked down to
errors coming from the qcow2 virtual drive in the host system. The qcow2
file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
Every once in awhile, pread() or pwrite() would return EEXIST, which
makes no sense. This turned out to be a bug in btrfs_get_extent().

Commit 8dff9c8534 ("Btrfs: deal with duplciates during extent_map
insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
two threads race on adding the same extent map to an inode's extent map
tree. However, if the added em is merged with an adjacent em in the
extent tree, then we'll end up with an existing extent that is not
identical to but instead encompasses the extent we tried to add. When we
call merge_extent_mapping() to find the nonoverlapping part of the new
em, the arithmetic overflows because there is no such thing. We then end
up trying to add a bogus em to the em_tree, which results in a EEXIST
that can bubble all the way up to userspace.

Fix it by extending the identical extent map special case.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Wang Xiaoguang
939659dfd3 btrfs: add necessary comments about tickets_id
Tickets_id's name may result in some misunderstandings,  it just indicates
the next ticket will be handled and is not stored per ticket.

Fixes: ce12965 ("btrfs: introduce tickets_id to determine whether
asynchronous metadata reclaim work makes progress")
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Wang Xiaoguang
dc1a90c6aa btrfs: cleanup: use already calculated value in btrfs_should_throttle_delayed_refs()
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Christoph Hellwig
cf8cddd38b btrfs: don't abuse REQ_OP_* flags for btrfs_map_block
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations.  But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags.  This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Filipe Manana
8d9eddad19 Btrfs: fix qgroup rescan worker initialization
We were setting the qgroup_rescan_running flag to true only after the
rescan worker started (which is a task run by a queue). So if a user
space task starts a rescan and immediately after asks to wait for the
rescan worker to finish, this second call might happen before the rescan
worker task starts running, in which case the rescan wait ioctl returns
immediatley, not waiting for the rescan worker to finish.

This was making the fstest btrfs/022 fail very often.

Fixes: d2c609b834 (btrfs: properly track when rescan worker is running)
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2016-11-25 18:06:50 +00:00
Filipe Manana
f177d73949 Btrfs: fix emptiness check for dirtied extent buffers at check_leaf()
We can not simply use the owner field from an extent buffer's header to
get the id of the respective tree when the extent buffer is from a
relocation tree. When we create the root for a relocation tree we leave
(on purpose) the owner field with the same value as the subvolume's tree
root (we do this at ctree.c:btrfs_copy_root()). So we must ignore extent
buffers from relocation trees, which have the BTRFS_HEADER_FLAG_RELOC
flag set, because otherwise we will always consider the extent buffer
as not being the root of the tree (the root of original subvolume tree
is always different from the root of the respective relocation tree).

This lead to assertion failures when running with the integrity checker
enabled (CONFIG_BTRFS_FS_CHECK_INTEGRITY=y) such as the following:

[  643.393409] BTRFS critical (device sdg): corrupt leaf, non-root leaf's nritems is 0: block=38506496, root=260, slot=0
[  643.397609] BTRFS info (device sdg): leaf 38506496 total ptrs 0 free space 3995
[  643.407075] assertion failed: 0, file: fs/btrfs/disk-io.c, line: 4078
[  643.408425] ------------[ cut here ]------------
[  643.409112] kernel BUG at fs/btrfs/ctree.h:3419!
[  643.409773] invalid opcode: 0000 [#1] PREEMPT SMP
[  643.410447] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq ppdev psmouse acpi_cpufreq parport_pc evdev parport tpm_tis tpm_tis_core pcspkr serio_raw i2c_piix4 sg tpm i2c_core button processor loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod virtio e1000 floppy
[  643.414356] CPU: 11 PID: 32726 Comm: btrfs Not tainted 4.8.0-rc8-btrfs-next-35+ #1
[  643.414356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  643.414356] task: ffff880145e95b00 task.stack: ffff88014826c000
[  643.414356] RIP: 0010:[<ffffffffa0352759>]  [<ffffffffa0352759>] assfail.constprop.41+0x1c/0x1e [btrfs]
[  643.414356] RSP: 0018:ffff88014826fa28  EFLAGS: 00010292
[  643.414356] RAX: 0000000000000039 RBX: ffff88014e2d7c38 RCX: 0000000000000001
[  643.414356] RDX: ffff88023f4d2f58 RSI: ffffffff81806c63 RDI: 00000000ffffffff
[  643.414356] RBP: ffff88014826fa28 R08: 0000000000000001 R09: 0000000000000000
[  643.414356] R10: ffff88014826f918 R11: ffffffff82f3c5ed R12: ffff880172910000
[  643.414356] R13: ffff880233992230 R14: ffff8801a68a3310 R15: fffffffffffffff8
[  643.414356] FS:  00007f9ca305e8c0(0000) GS:ffff88023f4c0000(0000) knlGS:0000000000000000
[  643.414356] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  643.414356] CR2: 00007f9ca3071000 CR3: 000000015d01b000 CR4: 00000000000006e0
[  643.414356] Stack:
[  643.414356]  ffff88014826fa50 ffffffffa02d655a 000000000000000a ffff88014e2d7c38
[  643.414356]  0000000000000000 ffff88014826faa8 ffffffffa02b72f3 ffff88014826fab8
[  643.414356]  00ffffffa03228e4 0000000000000000 0000000000000000 ffff8801bbd4e000
[  643.414356] Call Trace:
[  643.414356]  [<ffffffffa02d655a>] btrfs_mark_buffer_dirty+0xdf/0xe5 [btrfs]
[  643.414356]  [<ffffffffa02b72f3>] btrfs_copy_root+0x18a/0x1d1 [btrfs]
[  643.414356]  [<ffffffffa0322921>] create_reloc_root+0x72/0x1ba [btrfs]
[  643.414356]  [<ffffffffa03267c2>] btrfs_init_reloc_root+0x7b/0xa7 [btrfs]
[  643.414356]  [<ffffffffa02d9e44>] record_root_in_trans+0xdf/0xed [btrfs]
[  643.414356]  [<ffffffffa02db04e>] btrfs_record_root_in_trans+0x50/0x6a [btrfs]
[  643.414356]  [<ffffffffa030ad2b>] create_subvol+0x472/0x773 [btrfs]
[  643.414356]  [<ffffffffa030b406>] btrfs_mksubvol+0x3da/0x463 [btrfs]
[  643.414356]  [<ffffffffa030b406>] ? btrfs_mksubvol+0x3da/0x463 [btrfs]
[  643.414356]  [<ffffffff810781ac>] ? preempt_count_add+0x65/0x68
[  643.414356]  [<ffffffff811a6e97>] ? __mnt_want_write+0x62/0x77
[  643.414356]  [<ffffffffa030b55d>] btrfs_ioctl_snap_create_transid+0xce/0x187 [btrfs]
[  643.414356]  [<ffffffffa030b67d>] btrfs_ioctl_snap_create+0x67/0x81 [btrfs]
[  643.414356]  [<ffffffffa030ecfd>] btrfs_ioctl+0x508/0x20dd [btrfs]
[  643.414356]  [<ffffffff81293e39>] ? __this_cpu_preempt_check+0x13/0x15
[  643.414356]  [<ffffffff81155eca>] ? handle_mm_fault+0x976/0x9ab
[  643.414356]  [<ffffffff81091300>] ? arch_local_irq_save+0x9/0xc
[  643.414356]  [<ffffffff8119a2b0>] vfs_ioctl+0x18/0x34
[  643.414356]  [<ffffffff8119a8e8>] do_vfs_ioctl+0x581/0x600
[  643.414356]  [<ffffffff814b9552>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[  643.414356]  [<ffffffff81093fe9>] ? trace_hardirqs_on_caller+0x17b/0x197
[  643.414356]  [<ffffffff8119a9be>] SyS_ioctl+0x57/0x79
[  643.414356]  [<ffffffff814b9565>] entry_SYSCALL_64_fastpath+0x18/0xa8
[  643.414356]  [<ffffffff81091b08>] ? trace_hardirqs_off_caller+0x3f/0xaa
[  643.414356] Code: 89 83 88 00 00 00 31 c0 5b 41 5c 41 5d 5d c3 55 89 f1 48 c7 c2 98 bc 35 a0 48 89 fe 48 c7 c7 05 be 35 a0 48 89 e5 e8 13 46 dd e0 <0f> 0b 55 89 f1 48 c7 c2 9f d3 35 a0 48 89 fe 48 c7 c7 7a d5 35
[  643.414356] RIP  [<ffffffffa0352759>] assfail.constprop.41+0x1c/0x1e [btrfs]
[  643.414356]  RSP <ffff88014826fa28>
[  643.468267] ---[ end trace 6a1b3fb1a9d7d6e3 ]---

This can be easily reproduced by running xfstests with the integrity
checker enabled.

Fixes: 1ba98d086f (Btrfs: detect corruption when non-root leaf has zero item)
Cc: stable@vger.kernel.org  # 4.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-11-23 20:24:35 +00:00
Liu Bo
ef85b25e98 Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty
This can only happen with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y.

Commit 1ba98d0 ("Btrfs: detect corruption when non-root leaf has zero item")
assumes that a leaf is its root when leaf->bytenr == btrfs_root_bytenr(root),
however, we should not use btrfs_root_bytenr(root) since it's mainly got
updated during committing transaction.  So the check can fail when doing
COW on this leaf while it is a root.

This changes to use "if (leaf == btrfs_root_node(root))" instead, just like
how we check whether leaf is a root in __btrfs_cow_block().

Fixes: 1ba98d086f (Btrfs: detect corruption when non-root leaf has zero item)
Cc: stable@vger.kernel.org  # 4.8+
Reported-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
2016-11-23 20:23:20 +00:00
Filipe Manana
2a2a83de54 Btrfs: remove rb_node field from the delayed ref node structure
After the last big change in the delayed references code that was needed
for the last qgroups rework, the red black tree node field of struct
btrfs_delayed_ref_node is no longer used, so just remove it, this helps
us save some memory (since struct rb_node is 24 bytes on x86_64) for
these structures.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-11-19 13:39:18 +00:00
Filipe Manana
001895b313 Btrfs: remove unused code when creating and merging reloc trees
In commit 5bc7247ac4 (Btrfs: fix broken nocow after balance) we started
abusing the rtransid and otransid fields of root items from relocation
trees to fix some issues with nodatacow mode. However later in commit
ba8b028933 (Btrfs: do not reset last_snapshot after relocation) we
dropped the code that made use of those fields but did not remove
the code that sets those fields.

So just remove them to avoid confusion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-11-19 13:39:18 +00:00
Filipe Manana
054570a1dc Btrfs: fix relocation incorrectly dropping data references
During relocation of a data block group we create a relocation tree
for each fs/subvol tree by making a snapshot of each tree using
btrfs_copy_root() and the tree's commit root, and then setting the last
snapshot field for the fs/subvol tree's root to the value of the current
transaction id minus 1. However this can lead to relocation later
dropping references that it did not create if we have qgroups enabled,
leaving the filesystem in an inconsistent state that keeps aborting
transactions.

Lets consider the following example to explain the problem, which requires
qgroups to be enabled.

We are relocating data block group Y, we have a subvolume with id 258 that
has a root at level 1, that subvolume is used to store directory entries
for snapshots and we are currently at transaction 3404.

When committing transaction 3404, we have a pending snapshot and therefore
we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
in order to create its dentry at subvolume 258. This results in COWing
leaf A from root 258 in order to add the dentry. Note that leaf A
also contains file extent items referring to extents from some other
block group X (we are currently relocating block group Y). Later on, still
at create_pending_snapshot() we call qgroup_account_snapshot(), which
switches the commit root for root 258 when it calls switch_commit_roots(),
so now the COWed version of leaf A, lets call it leaf A', is accessible
from the commit root of tree 258. At the end of qgroup_account_snapshot(),
we call record_root_in_trans() with 258 as its argument, which results
in btrfs_init_reloc_root() being called, which in turn calls
relocation.c:create_reloc_root() in order to create a relocation tree
associated to root 258, which results in assigning the value of 3403
(which is the current transaction id minus 1 = 3404 - 1) to the
last_snapshot field of root 258. When creating the relocation tree root
at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
corresponding to the relocation tree's root, when we call btrfs_inc_ref()
against the COWed root (a copy of the commit root from tree 258), which
is at level 1. So at this point leaf A' has 2 references, one normal
reference corresponding to root 258 and one shared reference corresponding
to the root of the relocation tree.

Transaction 3404 finishes its commit and transaction 3405 is started by
relocation when calling merge_reloc_root() for the relocation tree
associated to root 258. In the meanwhile leaf A' is COWed again, in
response to some filesystem operation, when we are still at transaction
3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
call btrfs_block_can_be_shared() in order to figure out if other trees
refer to the leaf and if any such trees exists, add a full back reference
to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
because the following condition is false:

  btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)

which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with
only one reference, corresponding to the shared reference we created when
we called btrfs_copy_root() to create the relocation tree's root and
btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting
the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not
adding shared references for the extents from block group X that leaf A'
refers to with its file extent items.

Later, after merging the relocation root we do a call to to
btrfs_drop_snapshot() in order to delete the relocation tree. This ends
up calling do_walk_down() when path->slots[1] points to leaf A', which
results in calling btrfs_lookup_extent_info() to get the number of
references for leaf A', which is 1 at this time (only the shared reference
exists) and this value is stored at wc->refs[0]. After this walk_up_proc()
is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'.
Because the current level is 0 and wc->refs[0] is 1, it does call
btrfs_dec_ref() against leaf A', which results in removing the single
references that the extents from block group X have which are associated
to root 258 - the expectation was to have each of these extents with 2
references - one reference for root 258 and one shared reference related
to the root of the relocation tree, and so we would drop only the shared
reference (because leaf A' was supposed to have the flag
BTRFS_BLOCK_FLAG_FULL_BACKREF set).

This leaves the filesystem in an inconsistent state as we now have file
extent items in a subvolume tree that point to extents from block group X
without references in the extent tree. So later on when we try to decrement
the references for these extents, for example due to a file unlink operation,
truncate operation or overwriting ranges of a file, we fail because the
expected references do not exist in the extent tree.

This leads to warnings and transaction aborts like the following:

[  588.965795] ------------[ cut here ]------------
[  588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[  588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[  588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1
[  588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  588.965845]  0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000
[  588.965847]  0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000
[  588.965848]  ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48
[  588.965849] Call Trace:
[  588.965852]  [<ffffffff813af542>] dump_stack+0x63/0x81
[  588.965854]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
[  588.965855]  [<ffffffff81081f7d>] warn_slowpath_null+0x1d/0x20
[  588.965863]  [<ffffffffa0175042>] lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[  588.965865]  [<ffffffff81143220>] ? trace_clock_local+0x10/0x30
[  588.965867]  [<ffffffff8114c5df>] ? rb_reserve_next_event+0x6f/0x460
[  588.965875]  [<ffffffffa0175215>] insert_inline_extent_backref+0x55/0xd0 [btrfs]
[  588.965882]  [<ffffffffa017531f>] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs]
[  588.965890]  [<ffffffffa017acea>] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs]
[  588.965892]  [<ffffffff810cb046>] ? cpuacct_charge+0x86/0xa0
[  588.965900]  [<ffffffffa017e74f>] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs]
[  588.965908]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[  588.965918]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[  588.965928]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[  588.965930]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
[  588.965931]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
[  588.965932]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
[  588.965934]  [<ffffffff810a1659>] kthread+0xc9/0xe0
[  588.965936]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
[  588.965937]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
[  588.965938] ---[ end trace 34e5232c933a1749 ]---
[  588.966187] ------------[ cut here ]------------
[  588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[  588.966196] BTRFS: Transaction aborted (error -5)
[  588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[  588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G        W       4.7.3-3-default-fdm+ #1
[  588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  588.966217]  0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8
[  588.966219]  0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000
[  588.966220]  ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000
[  588.966221] Call Trace:
[  588.966223]  [<ffffffff813af542>] dump_stack+0x63/0x81
[  588.966224]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
[  588.966225]  [<ffffffff81081eff>] warn_slowpath_fmt+0x4f/0x60
[  588.966233]  [<ffffffffa017e93c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[  588.966241]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[  588.966250]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[  588.966259]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[  588.966260]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
[  588.966261]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
[  588.966263]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
[  588.966264]  [<ffffffff810a1659>] kthread+0xc9/0xe0
[  588.966265]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
[  588.966267]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
[  588.966268] ---[ end trace 34e5232c933a174a ]---
[  588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure
[  588.966270] BTRFS info (device sda2): forced readonly

This was happening often on openSUSE and SLE systems using btrfs as the
root filesystem (with its default layout where multiple subvolumes are
used) where balance happens in the background triggered by a cron job and
snapshots are automatically created before/after package installations,
upgrades and removals. The issue could be triggered simply by running the
following loop on the first system boot post installation:

  while true; do
     zypper -n in nfs-kernel-server
     zypper -n rm nfs-kernel-server
  done

(If we were fast enough and made that loop before the cron job triggered
a balance operation and the balance finished)

So fix by setting the last_snapshot field of the root to the value of the
generation of its commit root. Like this btrfs_block_can_be_shared()
behaves correctly for the case where the relocation root is created during
a transaction commit and for the case where it's created before a
transaction commit.

Fixes: 6426c7ad69 (btrfs: qgroup: Fix qgroup accounting when creating snapshot)
Cc: stable@vger.kernel.org  # 4.7+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-11-19 13:39:17 +00:00
Linus Torvalds
46d7cbb2c4 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.  We held off on these last week
  because I was focused on the memory corruption testing"

* 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix WARNING in btrfs_select_ref_head()
  Btrfs: remove some no-op casts
  btrfs: pass correct args to btrfs_async_run_delayed_refs()
  btrfs: make file clone aware of fatal signals
  btrfs: qgroup: Prevent qgroup->reserved from going subzero
  Btrfs: kill BUG_ON in do_relocation
2016-11-04 20:08:16 -07:00
Chris Mason
e3597e6090 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 2016-11-01 12:54:45 -07:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
67f055c798 btrfs: use op_is_sync to check for synchronous requests
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Linus Torvalds
f6167514c8 Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "My patch fixes the btrfs list_head abuse that we tracked down during
  Dave Jones' memory corruption investigation. With both Jens and my
  patches in place, I'm no longer able to trigger problems.

  Filipe is fixing a difficult old bug between snapshots, balance and
  send. Dave is cooking a few more for the next rc, but these are tested
  and ready"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix races on root_log_ctx lists
  btrfs: fix incremental send failure caused by balance
2016-10-28 10:07:35 -07:00
Christoph Hellwig
ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Chris Mason
570dd45042 btrfs: fix races on root_log_ctx lists
btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the
list because it knows all of the waiters are patiently waiting for the
commit to finish.

But, there's a small race where btrfs_sync_log can remove itself from
the list if it finds a log commit is already done.  Also, it uses
list_del_init() to remove itself from the list, but there's no way to
know if btrfs_remove_all_log_ctxs has already run, so we don't know for
sure if it is safe to call list_del_init().

This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and
just calls it with the proper locking.

This is part two of the corruption fixed by cbd60aa7cd.  I should have
done this in the first place, but convinced myself the optimizations were
safe.  A 12 hour run of dbench 2048 will eventually trigger a list debug
WARN_ON for the list_del_init() in btrfs_sync_log().

Fixes: d1433debe7
Reported-by: Dave Jones <davej@codemonkey.org.uk>
cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Chris Mason <clm@fb.com>
2016-10-27 10:42:20 -07:00
Wang Xiaoguang
9d1032cc49 btrfs: fix WARNING in btrfs_select_ref_head()
This issue was found when testing in-band dedupe enospc behaviour,
sometimes run_one_delayed_ref() may fail for enospc reason, then
__btrfs_run_delayed_refs()will return, but forget to add num_heads_read
back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in
btrfs_select_ref_head().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Dan Carpenter
9c894696f5 Btrfs: remove some no-op casts
We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate.  The cast doesn't
matter in this case, the code works as intended.  It causes a static
checker warning though so let's remove it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Wang Xiaoguang
dd4b857aab btrfs: pass correct args to btrfs_async_run_delayed_refs()
In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
swap the arg2 and arg3 wrongly, fix this.

This bug just impacts asynchronous delayed refs handle when we truncate inodes.
In delayed_ref_async_start(), there is such codes:

    trans = btrfs_join_transaction(async->root);
    if (trans->transid > async->transid)
        goto end;
    ret = btrfs_run_delayed_refs(trans, async->root, async->count);

From this codes, we can see that this just influence whether can we handle
delayed refs or the number of delayed refs to handle, this may impact
performance, but will not result in missing delayed refs, all delayed refs will
be handled in btrfs_commit_transaction().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Wang Xiaoguang
69ae5e4459 btrfs: make file clone aware of fatal signals
Indeed this just make the behavior similar to xfs when process has
fatal signals pending, and it'll make fstests/generic/298 happy.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Goldwyn Rodrigues
0b34c261e2 btrfs: qgroup: Prevent qgroup->reserved from going subzero
While free'ing qgroup->reserved resources, we much check if
the page has not been invalidated by a truncate operation
by checking if the page is still dirty before reducing the
qgroup resources. Resources in such a case are free'd when
the entire extent is released by delayed_ref.

This fixes a double accounting while releasing resources
in case of truncating a file, reproduced by the following testcase.

SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/mnt
mkfs.btrfs -f $SCRATCH_DEV
mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
cd $SCRATCH_MNT
btrfs quota enable $SCRATCH_MNT
btrfs subvolume create a
btrfs qgroup limit 500m a $SCRATCH_MNT
sync
for c in {1..15}; do
dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
done

sleep 10
sync
sleep 5

touch $SCRATCH_MNT/a/newfile

echo "Removing file"
rm $SCRATCH_MNT/a/file

Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:21 +02:00
Chris Mason
112a3edf4b Merge branch 'for-chris-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.9 2016-10-18 06:51:33 -07:00
Junjie Mao
14155cafea btrfs: assign error values to the correct bio structs
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Junjie Mao <junjie.mao@enight.me>
Acked-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-17 14:16:14 -07:00
Liu Bo
4547f4d8ff Btrfs: kill BUG_ON in do_relocation
While updating btree, we try to push items between sibling
nodes/leaves in order to keep height as low as possible.
But we don't memset the original places with zero when
pushing items so that we could end up leaving stale content
in nodes/leaves.  One may read the above stale content by
increasing btree blocks' @nritems.

One case I've come across is that in fs tree, a leaf has two
parent nodes, hence running balance ends up with processing
this leaf with two parent nodes, but it can only reach the
valid parent node through btrfs_search_slot, so it'd be like,

do_relocation
    for P in all parent nodes of block A:
        if !P->eb:
            btrfs_search_slot(key);   --> get path from P to A.
        if lowest:
            BUG_ON(A->bytenr != bytenr of A recorded in P);
        btrfs_cow_block(P, A);   --> change A's bytenr in P.

After btrfs_cow_block, P has the new bytenr of A, but with the
same @key, we get the same path again, and get panic by BUG_ON.

Note that this is only happening in a corrupted fs, for a
regular fs in which we have correct @nritems so that we won't
read stale content in any case.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-17 15:48:40 +02:00
Linus Torvalds
d3304cadb2 Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes from Omar and Dave Sterba for our new free space tree.

  This isn't heavily used yet, but as we move toward making it the new
  default we wanted to nail down an endian bug"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: tests: uninline member definitions in free_space_extent
  btrfs: tests: constify free space extent specs
  Btrfs: expand free space tree sanity tests to catch endianness bug
  Btrfs: fix extent buffer bitmap tests on big-endian systems
  Btrfs: catch invalid free space trees
  Btrfs: fix mount -o clear_cache,space_cache=v2
  Btrfs: fix free space tree bitmaps on big-endian systems
2016-10-14 17:44:56 -07:00
Chris Mason
d9ed71e545 Merge branch 'fst-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9
Signed-off-by: Chris Mason <clm@fb.com>
2016-10-12 13:16:00 -07:00
Filipe Manana
d5e84fd8d0 Btrfs: fix incremental send failure caused by balance
Commit 951555856b ("Btrfs: send, don't bug on inconsistent snapshots")
removed some BUG_ON() statements (replacing them with returning errors
to user space and logging error messages) when a snapshot is in an
inconsistent state due to failures to update a delayed inode item (ENOMEM
or ENOSPC) after adding/updating/deleting references, xattrs or file
extent items.

However there is a case, when no errors happen, where a file extent item
can be modified without having the corresponding inode item updated. This
case happens during balance under very specific timings, when relocation
is in the stage where it updates data pointers and a leaf that contains
file extent items is COWed. When that happens file extent items get their
disk_bytenr field updated to a new value that reflects the post relocation
logical address of the extent, without updating their respective inode
items (as there is nothing that needs to be updated on them). This is
performed at relocation.c:replace_file_extents() through
relocation.c:btrfs_reloc_cow_block().

So make an incremental send deal with this case and don't do any processing
for a file extent item that got its disk_bytenr field updated by relocation,
since the extent's data is the same as the one pointed by the file extent
item in the parent snapshot.

After the recent commit mentioned above this case resulted in EIO errors
returned to user space (and an error message logged to dmesg/syslog) when
doing an incremental send, while before it, it resulted in hitting a
BUG_ON leading to the following trace:

[  952.206705] ------------[ cut here ]------------
[  952.206714] kernel BUG at ../fs/btrfs/send.c:5653!
[  952.206719] Internal error: Oops - BUG: 0 [#1] SMP
[  952.209854] Modules linked in: st dm_mod nls_utf8 isofs fuse nf_log_ipv6 xt_pkttype xt_physdev br_netfilter nf_log_ipv4 nf_log_common xt_LOG xt_limit ebtable_filter ebtables af_packet bridge stp llc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c nls_iso8859_1 nls_cp437 vfat fat joydev aes_ce_blk ablk_helper cryptd snd_intel8x0 aes_ce_cipher snd_ac97_codec ac97_bus snd_pcm ghash_ce sha2_ce sha1_ce snd_timer snd virtio_net soundcore btrfs xor sr_mod cdrom hid_generic usbhid raid6_pq virtio_blk virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_mmio xhci_pci xhci_hcd usbcore usb_common virtio_pci virtio_ring virtio drm sg efivarfs
[  952.228333] Supported: Yes
[  952.228908] CPU: 0 PID: 12779 Comm: snapperd Not tainted 4.4.14-50-default #1
[  952.230329] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  952.231683] task: ffff800058e94100 ti: ffff8000d866c000 task.ti: ffff8000d866c000
[  952.233279] PC is at changed_cb+0x9f4/0xa48 [btrfs]
[  952.234375] LR is at changed_cb+0x58/0xa48 [btrfs]
[  952.236552] pc : [<ffff7ffffc39de7c>] lr : [<ffff7ffffc39d4e0>] pstate: 80000145
[  952.238049] sp : ffff8000d866fa20
[  952.238732] x29: ffff8000d866fa20 x28: 0000000000000019
[  952.239840] x27: 00000000000028d5 x26: 00000000000024a2
[  952.241008] x25: 0000000000000002 x24: ffff8000e66e92f0
[  952.242131] x23: ffff8000b8c76800 x22: ffff800092879140
[  952.243238] x21: 0000000000000002 x20: ffff8000d866fb78
[  952.244348] x19: ffff8000b8f8c200 x18: 0000000000002710
[  952.245607] x17: 0000ffff90d42480 x16: ffff800000237dc0
[  952.246719] x15: 0000ffff90de7510 x14: ab000c000a2faf08
[  952.247835] x13: 0000000000577c2b x12: ab000c000b696665
[  952.248981] x11: 2e65726f632f6966 x10: 652d34366d72612f
[  952.250101] x9 : 32627572672f746f x8 : ab000c00092f1671
[  952.251352] x7 : 8000000000577c2b x6 : ffff800053eadf45
[  952.252468] x5 : 0000000000000000 x4 : ffff80005e169494
[  952.253582] x3 : 0000000000000004 x2 : ffff8000d866fb78
[  952.254695] x1 : 000000000003e2a3 x0 : 000000000003e2a4
[  952.255803]
[  952.256150] Process snapperd (pid: 12779, stack limit = 0xffff8000d866c020)
[  952.257516] Stack: (0xffff8000d866fa20 to 0xffff8000d8670000)
[  952.258654] fa20: ffff8000d866fae0 ffff7ffffc308fc0 ffff800092879140 ffff8000e66e92f0
[  952.260219] fa40: 0000000000000035 ffff800055de6000 ffff8000b8c76800 ffff8000d866fb78
[  952.261745] fa60: 0000000000000002 00000000000024a2 00000000000028d5 0000000000000019
[  952.263269] fa80: ffff8000d866fae0 ffff7ffffc3090f0 ffff8000d866fae0 ffff7ffffc309128
[  952.264797] faa0: ffff800092879140 ffff8000e66e92f0 0000000000000035 ffff800055de6000
[  952.268261] fac0: ffff8000b8c76800 ffff8000d866fb78 0000000000000002 0000000000001000
[  952.269822] fae0: ffff8000d866fbc0 ffff7ffffc39ecfc ffff8000b8f8c200 ffff8000b8f8c368
[  952.271368] fb00: ffff8000b8f8c378 ffff800055de6000 0000000000000001 ffff8000ecb17500
[  952.272893] fb20: ffff8000b8c76800 ffff800092879140 ffff800062b6d000 ffff80007a9e2470
[  952.274420] fb40: ffff8000b8f8c208 0000000005784000 ffff8000580a8000 ffff8000b8f8c200
[  952.276088] fb60: ffff7ffffc39d488 00000002b8f8c368 0000000000000000 000000000003e2a4
[  952.280275] fb80: 000000000000006c ffff7ffffc39ec00 000000000003e2a4 000000000000006c
[  952.283219] fba0: ffff8000b8f8c300 0000000000000100 0000000000000001 ffff8000ecb17500
[  952.286166] fbc0: ffff8000d866fcd0 ffff7ffffc3643c0 ffff8000f8842700 0000ffff8ffe9278
[  952.289136] fbe0: 0000000040489426 ffff800055de6000 0000ffff8ffe9278 0000000040489426
[  952.292083] fc00: 000000000000011d 000000000000001d ffff80007a9e4598 ffff80007a9e43e8
[  952.294959] fc20: ffff8000b8c7693f 0000000000003b24 0000000000000019 ffff8000b8f8c218
[  952.301161] fc40: 00000001d866fc70 ffff8000b8c76800 0000000000000128 ffffffffffffff84
[  952.305749] fc60: ffff800058e941ff 0000000000003a58 ffff8000d866fcb0 ffff8000000f7390
[  952.308875] fc80: 000000000000012a 0000000000010290 ffff8000d866fc00 000000000000007b
[  952.311915] fca0: 0000000000010290 ffff800046c1b100 74732d7366727462 000001006d616572
[  952.314937] fcc0: ffff8000fffc4100 cb88537fdc8ba60e ffff8000d866fe10 ffff8000002499e8
[  952.318008] fce0: 0000000040489426 ffff8000f8842700 0000ffff8ffe9278 ffff80007a9e4598
[  952.321321] fd00: 0000ffff8ffe9278 0000000040489426 000000000000011d 000000000000001d
[  952.324280] fd20: ffff80000072c000 ffff8000d866c000 ffff8000d866fda0 ffff8000000e997c
[  952.327156] fd40: ffff8000fffc4180 00000000000031ed ffff8000fffc4180 ffff800046c1b7d4
[  952.329895] fd60: 0000000000000140 0000ffff907ea170 000000000000011d 00000000000000dc
[  952.334641] fd80: ffff80000072c000 ffff8000d866c000 0000000000000000 0000000000000002
[  952.338002] fda0: ffff8000d866fdd0 ffff8000000ebacc ffff800046c1b080 ffff800046c1b7d4
[  952.340724] fdc0: ffff8000d866fdf0 ffff8000000db67c 0000000000000040 ffff800000e69198
[  952.343415] fde0: 0000ffff8ffea790 00000000000031ed ffff8000d866fe20 ffff800000254000
[  952.346101] fe00: 000000000000001d 0000000000000004 ffff8000d866fe90 ffff800000249d3c
[  952.348980] fe20: ffff8000f8842700 0000000000000000 ffff8000f8842701 0000000000000008
[  952.351696] fe40: ffff8000d866fe70 0000000000000008 ffff8000d866fe90 ffff800000249cf8
[  952.354387] fe60: ffff8000f8842700 0000ffff8ffe9170 ffff8000f8842701 0000000000000008
[  952.357083] fe80: 0000ffff8ffe9278 ffff80008ff85500 0000ffff8ffe90c0 ffff800000085c84
[  952.359800] fea0: 0000000000000000 0000ffff8ffe9170 ffffffffffffffff 0000ffff90d473bc
[  952.365351] fec0: 0000000000000000 0000000000000015 0000000000000008 0000000040489426
[  952.369550] fee0: 0000ffff8ffe9278 0000ffff907ea790 0000ffff907ea170 0000ffff907ea790
[  952.372416] ff00: 0000ffff907ea170 0000000000000000 000000000000001d 0000000000000004
[  952.375223] ff20: 0000ffff90a32220 00000000003d0f00 0000ffff907ea0a0 0000ffff8ffe8f30
[  952.378099] ff40: 0000ffff9100f554 0000ffff91147000 0000ffff91117bc0 0000ffff90d473b0
[  952.381115] ff60: 0000ffff9100f620 0000ffff880069b0 0000ffff8ffe9170 0000ffff8ffe91a0
[  952.384003] ff80: 0000ffff8ffe9160 0000ffff8ffe9140 0000ffff88006990 0000ffff8ffe9278
[  952.386860] ffa0: 0000ffff88008a60 0000ffff8ffe9480 0000ffff88014ca0 0000ffff8ffe90c0
[  952.389654] ffc0: 0000ffff910be8e8 0000ffff8ffe90c0 0000ffff90d473bc 0000000000000000
[  952.410986] ffe0: 0000000000000008 000000000000001d 6e2079747265706f 72616d223d656d61
[  952.415497] Call trace:
[  952.417403] [<ffff7ffffc39de7c>] changed_cb+0x9f4/0xa48 [btrfs]
[  952.420023] [<ffff7ffffc308fc0>] btrfs_compare_trees+0x500/0x6b0 [btrfs]
[  952.422759] [<ffff7ffffc39ecfc>] btrfs_ioctl_send+0xb4c/0xe10 [btrfs]
[  952.425601] [<ffff7ffffc3643c0>] btrfs_ioctl+0x374/0x29a4 [btrfs]
[  952.428031] [<ffff8000002499e8>] do_vfs_ioctl+0x33c/0x600
[  952.430360] [<ffff800000249d3c>] SyS_ioctl+0x90/0xa4
[  952.432552] [<ffff800000085c84>] el0_svc_naked+0x38/0x3c
[  952.434803] Code: 2a1503e0 17fffdac b9404282 17ffff28 (d4210000)
[  952.437457] ---[ end trace 9afd7090c466cf15 ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-10-12 10:41:01 +01:00
Linus Torvalds
f29135b54b Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is a big variety of fixes and cleanups.

  Liu Bo continues to fixup fuzzer related problems, and some of Josef's
  cleanups are prep for his bigger extent buffer changes (slated for
  v4.10)"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
  Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
  Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
  Btrfs: don't BUG() during drop snapshot
  btrfs: fix btrfs_no_printk stub helper
  Btrfs: memset to avoid stale content in btree leaf
  btrfs: parent_start initialization cleanup
  btrfs: Remove already completed TODO comment
  btrfs: Do not reassign count in btrfs_run_delayed_refs
  btrfs: fix a possible umount deadlock
  Btrfs: fix memory leak in do_walk_down
  btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
  btrfs: convert send's verbose_printk to btrfs_debug
  btrfs: convert pr_* to btrfs_* where possible
  btrfs: convert printk(KERN_* to use pr_* calls
  btrfs: unsplit printed strings
  btrfs: clean the old superblocks before freeing the device
  Btrfs: kill BUG_ON in run_delayed_tree_ref
  Btrfs: don't leak reloc root nodes on error
  btrfs: squash lines for simple wrapper functions
  Btrfs: improve check_node to avoid reading corrupted nodes
  ...
2016-10-11 11:23:06 -07:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Chris Mason
19c4d2f994 Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
This reverts commit 5d8eb6fe51.

When we remove devices, we free the device structures.  Delaying
btfs_remove_chunk() ends up hitting a use-after-free on them.

Signed-off-by: Chris Mason <clm@fb.com>
2016-10-10 13:43:31 -07:00
Linus Torvalds
fed41f7d03 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice fixups from Al Viro:
 "A couple of fixups for interaction of pipe-backed iov_iter with
  O_DIRECT reads + constification of a couple of primitives in uio.h
  missed by previous rounds.

  Kudos to davej - his fuzzing has caught those bugs"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [btrfs] fix check_direct_IO() for non-iovec iterators
  constify iov_iter_count() and iter_is_iovec()
  fix ITER_PIPE interaction with direct_IO
2016-10-10 13:38:49 -07:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro
cd27e45504 [btrfs] fix check_direct_IO() for non-iovec iterators
looking for duplicate ->iov_base makes sense only for
iovec-backed iterators; for kvec-backed ones it's pointless,
for bvec-backed ones it's pointless and broken on 32bit (we
walk through an array of struct bio_vec accessing them as if
they were struct iovec; works by accident on 64bit, but on
32bit it'll blow up) and for pipe-backed ones it's pointless
and ends up oopsing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-10 13:58:16 -04:00
Al Viro
e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
Al Viro
f334bcd94b Merge remote-tracking branch 'ovl/misc' into work.misc 2016-10-08 11:00:01 -04:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Linus Torvalds
513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
David Sterba
0e6757859e btrfs: tests: uninline member definitions in free_space_extent
The recommended way is to put all members on separate lines.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:15 +02:00
David Sterba
d2d9ac6aae btrfs: tests: constify free space extent specs
We don't change the given extent ranges, mark them const to catch
accidental changes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:15 +02:00
Omar Sandoval
781e3bcf0e Btrfs: expand free space tree sanity tests to catch endianness bug
The free space tree format conversion functions were broken on
big-endian systems, but the sanity tests didn't catch it because all of
the operations were aligned to multiple words. This was meant to catch
any bugs in the extent buffer code's handling of high memory, but it
ended up hiding the endianness bug. Expand the tests to do both
sector-aligned and page-aligned operations.

Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval
9426ce754f Btrfs: fix extent buffer bitmap tests on big-endian systems
The in-memory bitmap code manipulates words and is therefore sensitive
to endianness, while the extent buffer bitmap code addresses bytes and
is byte-order agnostic. Because the byte addressing of the extent buffer
bitmaps is equivalent to a little-endian in-memory bitmap, the extent
buffer bitmap tests fail on big-endian systems.

34b3e6c92a ("Btrfs: self-tests: Fix extent buffer bitmap test fail on
BE system") worked around another endianness bug in the tests but missed
this one because ed9e4afdb0 ("Btrfs: self-tests: Execute page
straddling test only when nodesize < PAGE_SIZE") disables this part of
the test on ppc64. That change lost the original meaning of the test,
however. We really want to test that an equivalent series of operations
using the in-memory bitmap API and the extent buffer bitmap API produces
equivalent results.

To fix this, don't use memcmp_extent_buffer() or write_extent_buffer();
do everything bit-by-bit.

Reported-by: Anatoly Pugachev <matorola@gmail.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Tested-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval
6675df311d Btrfs: catch invalid free space trees
There are two separate issues that can lead to corrupted free space
trees.

1. The free space tree bitmaps had an endianness issue on big-endian
   systems which is fixed by an earlier patch in this series.
2. btrfs-progs before v4.7.3 modified filesystems without updating the
   free space tree.

To catch both of these issues at once, we need to force the free space
tree to be rebuilt. To do so, add a FREE_SPACE_TREE_VALID compat_ro bit.
If the bit isn't set, we know that it was either produced by a broken
big-endian kernel or may have been corrupted by btrfs-progs.

This also provides us with a way to add rudimentary read-write support
for the free space tree to btrfs-progs: it can just clear this bit and
have the kernel rebuild the free space tree.

Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval
f8d468a15c Btrfs: fix mount -o clear_cache,space_cache=v2
We moved the code for creating the free space tree the first time that
it's enabled, but didn't move the clearing code along with it. This
breaks my (undocumented) intention that `mount -o
clear_cache,space_cache=v2` would clear the free space tree and then
recreate it.

Fixes: 511711af91 ("btrfs: don't run delayed references while we are creating the free space tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval
2fe1d55134 Btrfs: fix free space tree bitmaps on big-endian systems
In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
bitmaps in memory and copy them directly to/from the extent buffers with
{read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
granularity, which is equivalent to a little-endian bitmap. This means
that on big-endian systems, the in-memory bitmaps will be written to
disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
memory.

Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Andreas Gruenbacher
2211d5ba5c posix_acl: xattr representation cleanups
Remove the unnecessary typedefs and the zero-length a_entries array in
struct posix_acl_xattr_header.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:52:00 -04:00
Deepa Dinamani
c2050a454c fs: Replace current_fs_time() with current_time()
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:22 -04:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Liu Bo
196e02490c Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
When we're not able to get enough space through splitting leaf,
we'd create a new sibling leaf instead, and it's possible that we return
 a zero-nritem sibling leaf and mark it dirty before it's in a consistent
state.  With CONFIG_BTRFS_FS_CHECK_INTEGRITY=y, the integrity check of
check_leaf will report panic due to this zero-nritem non-root leaf.

This removes the unnecessary btrfs_mark_buffer_dirty.

Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:50:44 +02:00
Josef Bacik
4867268c57 Btrfs: don't BUG() during drop snapshot
Really there's lots of things that can go wrong here, kill all the
BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO
instead.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ switched to btrfs_err, errors go to common label ]
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Arnd Bergmann
2fd57fcb16 btrfs: fix btrfs_no_printk stub helper
The addition of btrfs_no_printk() caused a build failure when
CONFIG_PRINTK is disabled:

fs/btrfs/send.c: In function 'send_rename':
fs/btrfs/ctree.h:3367:2: error: implicit declaration of function 'btrfs_no_printk' [-Werror=implicit-function-declaration]

This moves the helper outside of that #ifdef so it is always
defined, and changes the existing #ifdef to refer to that
helper as well for consistency.

Fixes: 47c57058ff2c ("btrfs: btrfs_debug should consume fs_info when DEBUG is not defined")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Liu Bo
851cd173f0 Btrfs: memset to avoid stale content in btree leaf
This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".

This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
0f5053eb90 btrfs: parent_start initialization cleanup
Code cleanup. parent_start is initialized multiple times when it is
not necessary to do so.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
6cea66e544 btrfs: Remove already completed TODO comment
Fixes: 7cf5b97650 ("btrfs: qgroup: Cleanup old inaccurate facilities")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
dd12d5b804 btrfs: Do not reassign count in btrfs_run_delayed_refs
Code cleanup. count is already (unsgined long)-1. That is the reason
run_all was set. Do not reassign it (unsigned long)-1.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Anand Jain
0ccd05285e btrfs: fix a possible umount deadlock
btrfs_show_devname() is using the device_list_mutex, sometimes
a call to blkdev_put() leads vfs calling into this func. So
call blkdev_put() outside of device_list_mutex, as of now.

[  983.284212] ======================================================
[  983.290401] [ INFO: possible circular locking dependency detected ]
[  983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted
[  983.302081] -------------------------------------------------------
[  983.308357] umount/21720 is trying to acquire lock:
[  983.313243]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff9128ec51>] blkdev_put+0x31/0x150
[  983.321264]
[  983.321264] but task is already holding lock:
[  983.327101]  (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs]
[  983.337839]
[  983.337839] which lock already depends on the new lock.
[  983.337839]
[  983.346024]
[  983.346024] the existing dependency chain (in reverse order) is:
[  983.353512]
-> #4 (&fs_devs->device_list_mutex){+.+...}:
[  983.359096]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.365143]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.371521]        [<ffffffffc02d8116>] btrfs_show_devname+0x36/0x1f0 [btrfs]
[  983.378710]        [<ffffffff9129523e>] show_vfsmnt+0x4e/0x150
[  983.384593]        [<ffffffff9126ffc7>] m_show+0x17/0x20
[  983.389957]        [<ffffffff91276405>] seq_read+0x2b5/0x3b0
[  983.395669]        [<ffffffff9124c808>] __vfs_read+0x28/0x100
[  983.401464]        [<ffffffff9124eb3b>] vfs_read+0xab/0x150
[  983.407080]        [<ffffffff9124ec32>] SyS_read+0x52/0xb0
[  983.412609]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.419617]
-> #3 (namespace_sem){++++++}:
[  983.424024]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.430074]        [<ffffffff918239e9>] down_write+0x49/0x80
[  983.435785]        [<ffffffff91272457>] lock_mount+0x67/0x1c0
[  983.441582]        [<ffffffff91272ab2>] do_add_mount+0x32/0xf0
[  983.447458]        [<ffffffff9127363a>] finish_automount+0x5a/0xc0
[  983.453682]        [<ffffffff91259513>] follow_managed+0x1b3/0x2a0
[  983.459912]        [<ffffffff9125b750>] lookup_fast+0x300/0x350
[  983.465875]        [<ffffffff9125d6e7>] path_openat+0x3a7/0xaa0
[  983.471846]        [<ffffffff9125ef75>] do_filp_open+0x85/0xe0
[  983.477731]        [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0
[  983.483702]        [<ffffffff9124c4de>] SyS_open+0x1e/0x20
[  983.489240]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.496254]
-> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}:
[  983.501798]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.507855]        [<ffffffff918239e9>] down_write+0x49/0x80
[  983.513558]        [<ffffffff91366237>] start_creating+0x87/0x100
[  983.519703]        [<ffffffff91366647>] debugfs_create_dir+0x17/0x100
[  983.526195]        [<ffffffff911df153>] bdi_register+0x93/0x210
[  983.532165]        [<ffffffff911df313>] bdi_register_owner+0x43/0x70
[  983.538570]        [<ffffffff914080fb>] device_add_disk+0x1fb/0x450
[  983.544888]        [<ffffffff91580226>] loop_add+0x1e6/0x290
[  983.550596]        [<ffffffff91fec358>] loop_init+0x10b/0x14f
[  983.556394]        [<ffffffff91002207>] do_one_initcall+0xa7/0x180
[  983.562618]        [<ffffffff91f932e0>] kernel_init_freeable+0x1cc/0x266
[  983.569370]        [<ffffffff918174be>] kernel_init+0xe/0x100
[  983.575166]        [<ffffffff9182620f>] ret_from_fork+0x1f/0x40
[  983.581131]
-> #1 (loop_index_mutex){+.+.+.}:
[  983.585801]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.591858]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.598256]        [<ffffffff9157ed3f>] lo_open+0x1f/0x60
[  983.603704]        [<ffffffff9128eec3>] __blkdev_get+0x123/0x400
[  983.609757]        [<ffffffff9128f4ea>] blkdev_get+0x34a/0x350
[  983.615639]        [<ffffffff9128f554>] blkdev_open+0x64/0x80
[  983.621428]        [<ffffffff9124aff6>] do_dentry_open+0x1c6/0x2d0
[  983.627651]        [<ffffffff9124c029>] vfs_open+0x69/0x80
[  983.633181]        [<ffffffff9125db74>] path_openat+0x834/0xaa0
[  983.639152]        [<ffffffff9125ef75>] do_filp_open+0x85/0xe0
[  983.645035]        [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0
[  983.650999]        [<ffffffff9124c4de>] SyS_open+0x1e/0x20
[  983.656535]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.663541]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[  983.668107]        [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0
[  983.674510]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.680561]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.686967]        [<ffffffff9128ec51>] blkdev_put+0x31/0x150
[  983.692761]        [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.699699]        [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs]
[  983.707178]        [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  983.714380]        [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs]
[  983.721061]        [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs]
[  983.727908]        [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100
[  983.734744]        [<ffffffff91250f56>] kill_anon_super+0x16/0x30
[  983.740888]        [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs]
[  983.747909]        [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80
[  983.754745]        [<ffffffff912515fd>] deactivate_super+0x5d/0x70
[  983.760977]        [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80
[  983.766773]        [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20
[  983.772738]        [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0
[  983.778708]        [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4
[  983.785373]        [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0
[  983.792212]        [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1
[  983.799225]
[  983.799225] other info that might help us debug this:
[  983.799225]
[  983.807291] Chain exists of:
  &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex

[  983.816521]  Possible unsafe locking scenario:
[  983.816521]
[  983.822489]        CPU0                    CPU1
[  983.827043]        ----                    ----
[  983.831599]   lock(&fs_devs->device_list_mutex);
[  983.836289]                                lock(namespace_sem);
[  983.842268]                                lock(&fs_devs->device_list_mutex);
[  983.849478]   lock(&bdev->bd_mutex);
[  983.853127]
[  983.853127]  *** DEADLOCK ***
[  983.853127]
[  983.859113] 3 locks held by umount/21720:
[  983.863145]  #0:  (&type->s_umount_key#35){++++..}, at: [<ffffffff912515f5>] deactivate_super+0x55/0x70
[  983.872713]  #1:  (uuid_mutex){+.+.+.}, at: [<ffffffffc033d8d3>] btrfs_close_devices+0x23/0xa0 [btrfs]
[  983.882206]  #2:  (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs]
[  983.893422]
[  983.893422] stack backtrace:
[  983.897824] CPU: 6 PID: 21720 Comm: umount Not tainted 4.8.0-rc5-ceph-00023-g1b39cec2 #1
[  983.905958] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
[  983.913492]  0000000000000000 ffff8c8a53c17a38 ffffffff91429521 ffffffff9260f4f0
[  983.921018]  ffffffff92642760 ffff8c8a53c17a88 ffffffff911b2b04 0000000000000050
[  983.928542]  ffffffff9237d620 ffff8c8a5294aee0 ffff8c8a5294aeb8 ffff8c8a5294aee0
[  983.936072] Call Trace:
[  983.938545]  [<ffffffff91429521>] dump_stack+0x85/0xc4
[  983.943715]  [<ffffffff911b2b04>] print_circular_bug+0x1fb/0x20c
[  983.949748]  [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0
[  983.955613]  [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.961123]  [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150
[  983.966550]  [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.972407]  [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150
[  983.977832]  [<ffffffff9128ec51>] blkdev_put+0x31/0x150
[  983.983101]  [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.989500]  [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs]
[  983.996415]  [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  984.003068]  [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs]
[  984.009189]  [<ffffffff9126cc5e>] ? evict_inodes+0x15e/0x170
[  984.014881]  [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs]
[  984.021176]  [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100
[  984.027476]  [<ffffffff91250f56>] kill_anon_super+0x16/0x30
[  984.033082]  [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs]
[  984.039548]  [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80
[  984.045839]  [<ffffffff912515fd>] deactivate_super+0x5d/0x70
[  984.051525]  [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80
[  984.056774]  [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20
[  984.062201]  [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0
[  984.067625]  [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4
[  984.073747]  [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0
[  984.080038]  [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1

Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Liu Bo
a958eab0ed Btrfs: fix memory leak in do_walk_down
The extent buffer 'next' needs to be free'd conditionally.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
c01f5f96f5 btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
We can hit unused variable warnings when btrfs_debug and friends are
just aliases for no_printk.  This is due to the fs_info not getting
consumed by the function call, which can happen if convenenience
variables are used.  This patch adds a new btrfs_no_printk static inline
that consumes the convenience variable and does nothing else.  It
silences the unused variable warning and has no impact on the generated
code:

$ size fs/btrfs/extent_io.o*
   text	   data	    bss	    dec	    hex	filename
  44072	    152	     32	  44256	   ace0	fs/btrfs/extent_io.o.btrfs_no_printk
  44072	    152	     32	  44256	   ace0	fs/btrfs/extent_io.o.no_printk

Fixes: 27a0dd61a5 (Btrfs: make btrfs_debug match pr_debug handling related to DEBUG)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
04ab956ee6 btrfs: convert send's verbose_printk to btrfs_debug
This was basically an open-coded, less flexible dynamic printk.  We can
just use btrfs_debug instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
ab8d0fc48d btrfs: convert pr_* to btrfs_* where possible
For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:04 +02:00
Jeff Mahoney
62e855771d btrfs: convert printk(KERN_* to use pr_* calls
This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Jeff Mahoney
5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Jeff Mahoney
cea67ab92d btrfs: clean the old superblocks before freeing the device
btrfs_rm_device frees the block device but then re-opens it using
the saved device name.  A race exists between the close and the
re-open that allows the block size to be changed.  The result
is getting stuck forever in the reclaim loop in __getblk_slow.

This patch moves the superblock cleanup before closing the block
device, which is also consistent with other callers.  We also don't
need a private copy of dev_name as the whole routine operates under
the uuid_mutex.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Liu Bo
02794222c4 Btrfs: kill BUG_ON in run_delayed_tree_ref
In a corrupted btrfs image, we can come across this BUG_ON and
get an unreponsive system, but if we return errors instead,
its caller can handle everything gracefully by aborting the current
transaction.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Josef Bacik
6bdf131fac Btrfs: don't leak reloc root nodes on error
We don't track the reloc roots in any sort of normal way, so the only way the
root/commit_root nodes get free'd is if the relocation finishes successfully and
the reloc root is deleted.  Fix this by free'ing them in free_reloc_roots.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Masahiro Yamada
e2c8990734 btrfs: squash lines for simple wrapper functions
Remove unneeded variables and assignments.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:38 +02:00
Liu Bo
6b722c1747 Btrfs: improve check_node to avoid reading corrupted nodes
We need to check items in a node to make sure that we're reading
a valid one, otherwise we could get various crashes while processing
delayed_refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:05:28 +02:00
Liu Bo
a42cbec9c6 Btrfs: add error handling for extent buffer in print tree
Somehow we missed btrfs_print_tree when last time we
updated error handling for read_extent_block().

This keeps us from getting a NULL pointer panic when
btrfs_print_tree's read_extent_block() fails.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:04:01 +02:00
Liu Bo
a43f7f8206 Btrfs: remove BUG_ON in start_transaction
Since we could get errors from the concurrent aborted transaction,
the check of this BUG_ON in start_transaction is not true any more.

Say, while flushing free space cache inode's dirty pages,
btrfs_finish_ordered_io
 -> btrfs_join_transaction_nolock
      (the transaction has been aborted.)
      -> BUG_ON(type == TRANS_JOIN_NOLOCK);

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:04:01 +02:00
Liu Bo
3eb548ee3a Btrfs: memset to avoid stale content in btree node block
During updating btree, we could push items between sibling
nodes/leaves, for leaves data sections starts reversely from
the end of the block while for nodes we only have key pairs
which are stored one by one from the start of the block.

So we could do try to push key pairs from one node to the next
node right in the tree, and after that, we update the node's
nritems to reflect the correct end while leaving the stale
content in the node.  One may intentionally corrupt the fs
image and access the stale content by bumping the nritems and
causes various crashes.

This takes the in-memory @nritems as the correct one and
gets to memset the unused part of a btree node.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:03:47 +02:00
Liu Bo
3561b9db70 Btrfs: return gracefully from balance if fs tree is corrupted
When relocating tree blocks, we firstly get block information from
back references in the extent tree, we then search fs tree to try to
find all parents of a block.

However, if fs tree is corrupted, eg. if there're some missing
items, we could come across these WARN_ONs and BUG_ONs.

This makes us print some error messages and return gracefully
from balance.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Josef Bacik
9c8e63db1d Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written
No reason to bug on in here, fs corruption could easily cause these things to
happen.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Josef Bacik
8436ea91a1 Btrfs: kill the start argument to read_extent_buffer_pages
Nobody uses this, it makes no sense to do partial reads of extent buffers.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Josef Bacik
afcdd129e0 Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags.  This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Qu Wenruo
ba8b04c1d4 btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset
Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.

This should reduce conflict of both patchset and the effort to rebase
them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Jeff Mahoney
897a41b116 btrfs: add dynamic debug support
We can re-use the dynamic debugging descriptor to make use of the dynamic
debugging mechanism but still use our own printk interface.

Defining the DEBUG macro works as it did before.  When it's defined,
all of the messages default to print.  We can also enable all debug
messages at boot or module-load time using the 'dyndbg' and
'btrfs.dyndbg' options.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Luis Henriques
2309e79650 btrfs: Fix warning "variable ‘gen’ set but not used"
Variable 'gen' in reada_for_search() is not used since commit 58dc4ce432
("btrfs: remove unused parameter from readahead_tree_block").  This patch
simply removes this variable.

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Luis Henriques
1f079fa2f8 btrfs: Fix warning "variable ‘blocksize’ set but not used"
Variable 'blocksize' in reada_walk_down() is not used since commit
d3e46fea1b ("btrfs: sink blocksize parameter to readahead_tree_block").
This patch simply removes this variable.

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Naohiro Aota
5d8eb6fe51 btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs
Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But
the work can be done by btrfs_delete_unused_bgs() (and it's better since it
trim the BG). Let's dedupe the code.

While btrfs_delete_unused_bgs() is already hitting the relocated BG, it
skip the BG since the BG has "ro" flag set (to keep balancing BG intact).
On the other hand, btrfs cannot drop "ro" flag here to prevent additional
writes. So this patch make use of "removed" flag.
btrfs_delete_unused_bgs() now detect the flag to distinguish whether a
read-only BG is relocating or not.

Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
49303381f1 Btrfs: bail out if block group has different mixed flag
Currently we allow inconsistence about mixed flag
 (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA).

We'd get ENOSPC if block group has mixed flag and btrfs doesn't.
If that happens, we have one space_info with mixed flag and another
space_info only with BTRFS_BLOCK_GROUP_METADATA, and
global_block_rsv.space_info points to the latter one, but all bytes
from block_group contributes to the mixed space_info, thus all the
allocation will fail with ENOSPC.

This adds a check for the above case.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ updated message ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
2571e73967 Btrfs: fix memory leak in reading btree blocks
So we can read a btree block via readahead or intentional read,
and we can end up with a memory leak when something happens as
follows,
1) readahead starts to read block A but does not wait for read
   completion,
2) btree_readpage_end_io_hook finds that block A is corrupted,
   and it needs to clear all block A's pages' uptodate bit.
3) meanwhile an intentional read kicks in and checks block A's
   pages' uptodate to decide which page needs to be read.
4) when some pages have the uptodate bit during 3)'s check so
   3) doesn't count them for eb->io_pages, but they are later
   cleared by 2) so we has to readpage on the page, we get
   the wrong eb->io_pages which results in a memory leak of
   this block.

This fixes the problem by firstly getting all pages's locking and
then checking pages' uptodate bit.

   t1(readahead)                              t2(readahead endio)                                       t3(the following read)
read_extent_buffer_pages                    end_bio_extent_readpage
  for pg in eb:                                for page 0,1,2 in eb:
      if pg is uptodate:                           btree_readpage_end_io_hook(pg)
          num_reads++                              if uptodate:
  eb->io_pages = num_reads                             SetPageUptodate(pg)              _______________
  for pg in eb:                                for page 3 in eb:                                     read_extent_buffer_pages
       if pg is NOT uptodate:                      btree_readpage_end_io_hook(pg)                       for pg in eb:
           __extent_read_full_page(pg)                 sanity check reports something wrong                 if pg is uptodate:
                                                       clear_extent_buffer_uptodate(eb)                         num_reads++
                                                           for pg in eb:                                eb->io_pages = num_reads
                                                               ClearPageUptodate(page)  _______________
                                                                                                        for pg in eb:
                                                                                                            if pg is NOT uptodate:
                                                                                                                __extent_read_full_page(pg)

So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
e46a28ca3d Btrfs: remove BUG() in raid56
This BUG() has been triggered by a fuzz testing image, which contains
an invalid chunk type, ie. a single stripe chunk has the raid6 type.

Btrfs can handle this gracefully by returning -EIO, so besides using
btrfs_warn to give us more debugging information rather than a single
BUG(), we can return error properly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Lu Fengqi
afce772e87 btrfs: fix check_shared for fiemap ioctl
Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.

First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
David Sterba
b0de6c4c81 btrfs: create example debugfs file only in debugging build
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Eric Sandeen
07f6a48043 btrfs: fix perms on demonstration debugfs interface
btrfs provides a helpful demonstration of how to export
a global variable via debugfs; however, it is unique among
other debugfs files in that it is world-writable, which causes
some concern to people who are not familiar with its purpose.

Fix it so that it is only user-writable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
c79a175175 Btrfs: fix memory leak of block group cache
While processing delayed refs, we may update block group's statistics
and attach it to cur_trans->dirty_bgs, and later writing dirty block
groups will process the list, which happens during
btrfs_commit_transaction().

For whatever reason, the transaction is aborted and dirty_bgs
is not processed in cleanup_transaction(), we end up with memory leak
of these dirty block group cache.

Since btrfs_start_dirty_block_groups() doesn't make it go to the commit
critical section, this also adds the cleanup work inside it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Linus Torvalds
b22734a550 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Josef fixed a problem when quotas are enabled with his latest ENOSPC
  rework, and Jeff added more checks into the subvol ioctls to avoid
  tripping up lookup_one_len"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: ensure that file descriptor used with subvol ioctls is a dir
  Btrfs: handle quota reserve failure properly
2016-09-23 13:39:37 -07:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
073931017b posix_acl: Clear SGID bit when setting file permissions
When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok().  Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2).  Fix that.

References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2016-09-22 10:55:32 +02:00
Jeff Mahoney
325c50e3ce btrfs: ensure that file descriptor used with subvol ioctls is a dir
If the subvol/snapshot create/destroy ioctls are passed a regular file
with execute permissions set, we'll eventually Oops while trying to do
inode->i_op->lookup via lookup_one_len.

This patch ensures that the file descriptor refers to a directory.

Fixes: cb8e70901d (Btrfs: Fix subvolume creation locking rules)
Fixes: 76dda93c6a (Btrfs: add snapshot/subvolume destroy ioctl)
Cc: <stable@vger.kernel.org> #v2.6.29+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21 17:22:16 -07:00
Josef Bacik
1e5ec2e709 Btrfs: handle quota reserve failure properly
btrfs/022 was spitting a warning for the case that we exceed the quota.  If we
fail to make our quota reservation we need to clean up our data space
reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21 17:22:16 -07:00
Miklos Szeredi
f031221001 btrfs: use filemap_check_errors()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Cc: Chris Mason <clm@fb.com>
2016-09-16 12:44:21 +02:00
Bart Van Assche
4382e33ad3 block, dm-crypt, btrfs: Introduce bio_flags()
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:48:27 -06:00
Linus Torvalds
f4a9c169c2 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I'm not proud of how long it took me to track down that one liner in
  btrfs_sync_log(), but the good news is the patches I was trying to
  blame for these problems were actually fine (sorry Filipe)"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
  btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
  btrfs: do not decrease bytes_may_use when replaying extents
2016-09-09 12:52:31 -07:00
Chris Mason
b7f3c7d345 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.8 2016-09-07 12:55:36 -07:00
Wang Xiaoguang
ce129655c9 btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
In btrfs_async_reclaim_metadata_space(), we use ticket's address to
determine whether asynchronous metadata reclaim work is making progress.

	ticket = list_first_entry(&space_info->tickets,
				  struct reserve_ticket, list);
	if (last_ticket == ticket) {
		flush_state++;
	} else {
		last_ticket = ticket;
		flush_state = FLUSH_DELAYED_ITEMS_NR;
		if (commit_cycles)
			commit_cycles--;
	}

But indeed it's wrong, we should not rely on local variable's address to
do this check, because addresses may be same. In my test environment, I
dd one 168MB file in a 256MB fs, found that for this file, every time
wait_reserve_ticket() called, local variable ticket's address is same,

For above codes, assume a previous ticket's address is addrA, last_ticket
is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and
wake up it, then another ticket is added, but with the same address addrA,
now last_ticket will be same to current ticket, then current ticket's flush
work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR,
which may result in some enospc issues(I have seen this in my test machine).

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-06 16:31:43 +02:00
Chris Mason
cbd60aa7cd Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
We use a btrfs_log_ctx structure to pass information into the
tree log commit, and get error values out.  It gets added to a per
log-transaction list which we walk when things go bad.

Commit d1433debe added an optimization to skip waiting for the log
commit, but didn't take root_log_ctx out of the list.  This
patch makes sure we remove things before exiting.

Signed-off-by: Chris Mason <clm@fb.com>
Fixes: d1433debe7
cc: stable@vger.kernel.org # 3.15+
2016-09-06 05:57:25 -07:00
Wang Xiaoguang
ed7a694839 btrfs: do not decrease bytes_may_use when replaying extents
When replaying extents, there is no need to update bytes_may_use
in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a
WARN_ON about bytes_may_use.

Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-05 17:40:41 +02:00
Linus Torvalds
4b30b6d126 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I'm still prepping a set of fixes for btrfs fsync, just nailing down a
  hard to trigger memory corruption.  For now, these are tested and ready."

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
  Btrfs: fix endless loop in balancing block groups
  Btrfs: kill invalid ASSERT() in process_all_refs()
2016-09-03 12:40:45 -07:00
Wang Xiaoguang
e0af24849e btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
If can_overcommit() in btrfs_calc_reclaim_metadata_size() returns true,
btrfs_async_reclaim_metadata_space() will not reclaim metadata space, just
return directly and also forget to wake up process which are waiting for
their tickets, so these processes will wait endlessly.

Fstests case generic/172 with mount option "-o compress=lzo" have revealed
this bug in my test machine. Here if we have tickets to handle, we must
handle them first.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-01 17:23:24 +02:00
Liu Bo
a9b1fc851d Btrfs: fix endless loop in balancing block groups
Qgroup function may overwrite the saved error 'err' with 0
in case quota is not enabled, and this ends up with a
endless loop in balance because we keep going back to balance
the same block group.

It really should use 'ret' instead.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-01 17:16:47 +02:00
Josef Bacik
3dc09ec895 Btrfs: kill invalid ASSERT() in process_all_refs()
Suppose you have the following tree in snap1 on a file system mounted with -o
inode_cache so that inode numbers are recycled

└── [    258]  a
    └── [    257]  b

and then you remove b, rename a to c, and then re-create b in c so you have the
following tree

└── [    258]  c
    └── [    257]  b

and then you try to do an incremental send you will hit

ASSERT(pending_move == 0);

in process_all_refs().  This is because we assume that any recycling of inodes
will not have a pending change in our path, which isn't the case.  This is the
case for the DELETE side, since we want to remove the old file using the old
path, but on the create side we could have a pending move and need to do the
normal pending rename dance.  So remove this ASSERT() and put a comment about
why we ignore pending_move.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-01 17:16:47 +02:00
Linus Torvalds
28687b935e Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few different fixes in here.  These range from
  enospc corners to fsync and quota fixes, and a few targeted at error
  handling for corrupt metadata/fuzzing"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix lockdep warning on deadlock against an inode's log mutex
  Btrfs: detect corruption when non-root leaf has zero item
  Btrfs: check btree node's nritems
  btrfs: don't create or leak aliased root while cleaning up orphans
  Btrfs: fix em leak in find_first_block_group
  btrfs: do not background blkdev_put()
  Btrfs: clarify do_chunk_alloc()'s return value
  btrfs: fix fsfreeze hang caused by delayed iputs deal
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
  btrfs: qgroup: Fix qgroup incorrectness caused by log replay
  btrfs: relocation: Fix leaking qgroups numbers on data extents
  btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
  btrfs: waiting on qgroup rescan should not always be interruptible
  btrfs: properly track when rescan worker is running
  btrfs: flush_space: treat return value of do_chunk_alloc properly
  Btrfs: add ASSERT for block group's memory leak
  btrfs: backref: Fix soft lockup in __merge_refs function
  Btrfs: fix memory leak of reloc_root
2016-08-26 20:22:01 -07:00
Filipe Manana
28a235931b Btrfs: fix lockdep warning on deadlock against an inode's log mutex
Commit 44f714dae5 ("Btrfs: improve performance on fsync against new
inode after rename/unlink"), which landed in 4.8-rc2, introduced a
possibility for a deadlock due to double locking of an inode's log mutex
by the same task, which lockdep reports with:

[23045.433975] =============================================
[23045.434748] [ INFO: possible recursive locking detected ]
[23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
[23045.436044] ---------------------------------------------
[23045.436044] xfs_io/3688 is trying to acquire lock:
[23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
               but task is already holding lock:
[23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
               other info that might help us debug this:
[23045.436044]  Possible unsafe locking scenario:

[23045.436044]        CPU0
[23045.436044]        ----
[23045.436044]   lock(&ei->log_mutex);
[23045.436044]   lock(&ei->log_mutex);
[23045.436044]
                *** DEADLOCK ***

[23045.436044]  May be due to missing lock nesting notation

[23045.436044] 3 locks held by xfs_io/3688:
[23045.436044]  #0:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs]
[23045.436044]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0
[23045.436044]  #2:  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
               stack backtrace:
[23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
[23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[23045.436044]  0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
[23045.436044]  ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
[23045.436044]  0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
[23045.436044] Call Trace:
[23045.436044]  [<ffffffff8127074d>] dump_stack+0x67/0x90
[23045.436044]  [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e
[23045.436044]  [<ffffffff8109155f>] ? mark_lock+0x24/0x201
[23045.436044]  [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74
[23045.436044]  [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3
[23045.436044]  [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
[23045.436044]  [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465
[23045.436044]  [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs]
[23045.436044]  [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs]
[23045.436044]  [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
[23045.436044]  [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
[23045.436044]  [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs]
[23045.436044]  [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e
[23045.436044]  [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e
[23045.436044]  [<ffffffff811acc79>] do_fsync+0x31/0x4a
[23045.436044]  [<ffffffff811ace99>] SyS_fsync+0x10/0x14
[23045.436044]  [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
[23045.436044]  [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa

An example reproducer for this is:

   $ mkfs.btrfs -f /dev/sdb
   $ mount /dev/sdb /mnt
   $ mkdir /mnt/dir
   $ touch /mnt/dir/foo
   $ sync
   $ mv /mnt/dir/foo /mnt/dir/bar
   $ touch /mnt/dir/foo
   $ xfs_io -c "fsync" /mnt/dir/bar

This is because while logging the inode of file bar we end up logging its
parent directory (since its inode has an unlink_trans field matching the
current transaction id due to the rename operation), which in turn logs
the inodes for all its new dentries, so that the new inode for the new
file named foo gets logged which in turn triggered another logging attempt
for the inode we are fsync'ing, since that inode had an old name that
corresponds to the name of the new inode.

So fix this by ensuring that when logging the inode for a new dentry that
has a name matching an old name of some other inode, we don't log again
the original inode that we are fsync'ing.

Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:32 -07:00
Liu Bo
1ba98d086f Btrfs: detect corruption when non-root leaf has zero item
Right now we treat leaf which has zero item as a valid one
because we could have an empty tree, that is, a root that is
also a leaf without any item, however, in the same case but
when the leaf is not a root, we can end up with hitting the
BUG_ON(1) in btrfs_extend_item() called by
setup_inline_extent_backref().

This makes us check the situation as a corruption if leaf is
not its own root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:31 -07:00
Liu Bo
053ab70f06 Btrfs: check btree node's nritems
When btree node (level = 1) has nritems which equals to zero,
we can end up with panic due to insert_ptr()'s

BUG_ON(slot > nritems);

where slot is 1 and nritems is 0, as copy_for_split() calls
insert_ptr(.., path->slots[1] + 1, ...);

A invalid value results in the whole mess, this adds the check
for btree's node nritems so that we stop reading block when
when something is wrong.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:30 -07:00
Jeff Mahoney
35bbb97fc8 btrfs: don't create or leak aliased root while cleaning up orphans
commit 909c3a22da (Btrfs: fix loading of orphan roots leading to BUG_ON)
avoids the BUG_ON but can add an aliased root to the dead_roots list or
leak the root.

Since we've already been loading roots into the radix tree, we should
use it before looking the root up on disk.

Cc: <stable@vger.kernel.org> # 4.5
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:29 -07:00
Josef Bacik
187ee58c62 Btrfs: fix em leak in find_first_block_group
We need to call free_extent_map() on the em we look up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:29 -07:00
Anand Jain
1423881941 btrfs: do not background blkdev_put()
At the end of unmount/dev-delete, if the device exclusive open is not
actually closed, then there might be a race with another program in
the userland who is trying to open the device in exclusive mode and
it may fail for eg:
      unmount /btrfs; fsck /dev/x
      btrfs dev del /dev/x /btrfs; fsck /dev/x
so here background blkdev_put() is not a choice

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:28 -07:00
Liu Bo
28b737f6ed Btrfs: clarify do_chunk_alloc()'s return value
Function start_transaction() can return ERR_PTR(1) when flush is
BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is

start_transaction (return ERR_PTR(1))
  -> btrfs_block_rsv_add (return 1)
     -> reserve_metadata_bytes (return 1)
        -> flush_space (return 1)
           -> do_chunk_alloc  (return 1)

With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.

Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).

The following patch fixes the above problem.
"btrfs: flush_space: treat return value of do_chunk_alloc properly"
https://patchwork.kernel.org/patch/7778651/

This add comments to clarify do_chunk_alloc()'s return value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:27 -07:00
Wang Xiaoguang
9e7cc91a6d btrfs: fix fsfreeze hang caused by delayed iputs deal
When running fstests generic/068, sometimes we got below deadlock:
  xfs_io          D ffff8800331dbb20     0  6697   6693 0x00000080
  ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
  ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
  ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
  Call Trace:
  [<ffffffff816a9045>] schedule+0x35/0x80
  [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
  [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
  [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
  [<ffffffff810d32b5>] percpu_down_read+0x35/0x50
  [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
  [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
  [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
  [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
  [<ffffffff81230a1a>] evict+0xba/0x1a0
  [<ffffffff812316b6>] iput+0x196/0x200
  [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
  [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
  [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
  [<ffffffff81218040>] freeze_super+0xf0/0x190
  [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
  [<ffffffff81229409>] SyS_ioctl+0x79/0x90
  [<ffffffff81003c12>] do_syscall_64+0x62/0x110
  [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25

>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:

         CPU1                                  |         CPU2
|-> freeze_super()                             |
    |-> sync_filesystem(sb);                   |
    |                                          |-> cleaner_kthread()
    |                                          |   |-> btrfs_delete_unused_bgs()
    |                                          |       |-> btrfs_remove_chunk()
    |                                          |           |-> btrfs_remove_block_group()
    |                                          |               |-> btrfs_add_delayed_iput()
    |                                          |
    |-> sb->s_writers.frozen = SB_FREEZE_FS;   |
    |-> sb_wait_write(sb, SB_FREEZE_FS);       |
    |   acquire SB_FREEZE_FS lock.             |
    |                                          |
    |-> btrfs_freeze()                         |
        |-> btrfs_commit_transaction()         |
            |-> btrfs_run_delayed_iputs()      |
            |   will handle delayed iputs,     |
            |   that means start_transaction() |
            |   will be called, which will try |
            |   to get SB_FREEZE_FS lock.      |

To fix this issue, introduce a "int fs_frozen" to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to btrfs_freeze ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:26 -07:00
Wang Xiaoguang
18513091af btrfs: update btrfs_space_info's bytes_may_use timely
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
	#!/bin/bash
	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
	dev=$(losetup --show -f fs.img)
	mkfs.btrfs -f -M $dev
	mkdir /tmp/mntpoint
	mount $dev /tmp/mntpoint
	cd /tmp/mntpoint
	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
    |-> btrfs_reserve_extent()
        |-> btrfs_add_reserved_bytes()
        |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
        |   change bytes_may_use, and bytes_reserved += 64M. Now
        |   bytes_may_use + bytes_reserved == 128M, which is greater
        |   than btrfs_space_info's total_bytes, false enospc occurs.
        |   Note, the bytes_may_use decrease operation will be done in
        |   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
                    CPU 1              |              CPU 2
                                       |
|-> cow_file_range()                   |-> __btrfs_buffered_write()
    |-> btrfs_reserve_extent()         |   |
    |                                  |   |
    |                                  |   |
    |    .....                         |   |-> btrfs_check_data_free_space()
    |                                  |
    |                                  |
    |-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.

Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:26 -07:00
Wang Xiaoguang
4824f1f412 btrfs: divide btrfs_update_reserved_bytes() into two functions
This patch divides btrfs_update_reserved_bytes() into
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and
next patch will extend btrfs_add_reserved_bytes()to fix some
false ENOSPC error, please see later patch for detailed info.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:25 -07:00
Wang Xiaoguang
dcb40c196f btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
which indeed are extent's bytenr. The correct value should be
cluster->[start|end] minus block group's start bytenr.

start bytenr   cluster->start
|              |     extent      |   extent   | ...| extent |
|----------------------------------------------------------------|
|                block group reloc_inode                         |

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:24 -07:00
Qu Wenruo
df2c95f33e btrfs: qgroup: Fix qgroup incorrectness caused by log replay
When doing log replay at mount time(after power loss), qgroup will leak
numbers of replayed data extents.

The cause is almost the same of balance.
So fix it by manually informing qgroup for owner changed extents.

The bug can be detected by btrfs/119 test case.

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:23 -07:00
Qu Wenruo
62b99540a1 btrfs: relocation: Fix leaking qgroups numbers on data extents
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
rework.

When balancing data extents, qgroup will leak all its numbers for
relocated data extents.

The relocation is done in the following steps for data extents:
1) Create data reloc tree and inode
2) Copy all data extents to data reloc tree
   And commit transaction
3) Create tree reloc tree(special snapshot) for any related subvolumes
4) Replace file extent in tree reloc tree with new extents in data reloc
   tree
   And commit transaction
5) Merge tree reloc tree with original fs, by swapping tree blocks

For 1)~4), since tree reloc tree and data reloc tree doesn't count to
qgroup, everything is OK.

But for 5), the swapping of tree blocks will only info qgroup to track
metadata extents.

If metadata extents contain file extents, qgroup number for file extents
will get lost, leading to corrupted qgroup accounting.

The fix is, before commit transaction of step 5), manually info qgroup to
track all file extents in data reloc tree.
Since at commit transaction time, the tree swapping is done, and qgroup
will account these data extents correctly.

Cc: Mark Fasheh <mfasheh@suse.de>
Reported-by: Mark Fasheh <mfasheh@suse.de>
Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:22 -07:00
Qu Wenruo
cb93b52cc0 btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
   Almost the same with original code.
   For delayed_ref usage, which has delayed refs locked.

   Change the return value type to int, since caller never needs the
   pointer, but only needs to know if they need to free the allocated
   memory.

2. btrfs_qgroup_insert_dirty_extent()
   The more encapsulated version.

   Will do the delayed_refs lock, memory allocation, quota enabled check
   and other things.

The original design is to keep exported functions to minimal, but since
more btrfs hacks exposed, like replacing path in balance, we need to
record dirty extents manually, so we have to add such functions.

Also, add comment for both functions, to info developers how to keep
qgroup correct when doing hacks.

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:21 -07:00
Jeff Mahoney
d06f23d6a9 btrfs: waiting on qgroup rescan should not always be interruptible
We wait on qgroup rescan completion in three places: file system
shutdown, the quota disable ioctl, and the rescan wait ioctl.  If the
user sends a signal while we're waiting, we continue happily along.  This
is expected behavior for the rescan wait ioctl.  It's racy in the shutdown
path but mostly works due to other unrelated synchronization points.
In the quota disable path, it Oopses the kernel pretty much immediately.

Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:20 -07:00
Jeff Mahoney
d2c609b834 btrfs: properly track when rescan worker is running
The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state.  The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused.  If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed.  When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.

This patch uses a separate flag to indicate when the worker is
running.  The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.

Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by; Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:19 -07:00
Alex Lyakas
eecba891d3 btrfs: flush_space: treat return value of do_chunk_alloc properly
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
        block_rsv_add_bytes(block_rsv, num_bytes, 0);
        return 0;
}

return ret;

So it will return -ENOSPC.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:18 -07:00
Liu Bo
f3bca8028b Btrfs: add ASSERT for block group's memory leak
This adds several ASSERT()' s to report memory leak of block group cache.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:17 -07:00
Qu Wenruo
d8422ba334 btrfs: backref: Fix soft lockup in __merge_refs function
When over 1000 file extents refers to one extent, find_parent_nodes()
will be obviously slow, due to the O(n^2)~O(n^3) loops inside
__merge_refs().

The following ftrace shows the cubic growth of execution time:

256 refs
 5) + 91.768 us   |  __add_keyed_refs.isra.12 [btrfs]();
 5)   1.447 us    |  __add_missing_keys.isra.13 [btrfs]();
 5) ! 114.544 us  |  __merge_refs [btrfs]();
 5) ! 136.399 us  |  __merge_refs [btrfs]();

512 refs
 6) ! 279.859 us  |  __add_keyed_refs.isra.12 [btrfs]();
 6)   3.164 us    |  __add_missing_keys.isra.13 [btrfs]();
 6) ! 442.498 us  |  __merge_refs [btrfs]();
 6) # 2091.073 us |  __merge_refs [btrfs]();

and 1024 refs
 7) ! 368.683 us  |  __add_keyed_refs.isra.12 [btrfs]();
 7)   4.810 us    |  __add_missing_keys.isra.13 [btrfs]();
 7) # 2043.428 us |  __merge_refs [btrfs]();
 7) * 18964.23 us |  __merge_refs [btrfs]();

And sort them into the following char:
(Unit: us)
------------------------------------------------------------------------
 Trace function        | 256 ref        | 512 refs      | 1024 refs    |
------------------------------------------------------------------------
 __add_keyed_refs      | 91             | 249           | 368          |
 __add_missing_keys    | 1              | 3             | 4            |
 __merge_refs 1st call | 114            | 442           | 2043         |
 __merge_refs 2nd call | 136            | 2091          | 18964        |
------------------------------------------------------------------------

We can see the that __add_keyed_refs() grows almost in linear behavior.
And __add_missing_keys() in this case doesn't change much or takes much
time.

While for the 1st __merge_refs() it's square growth
for the 2nd __merge_refs() call it's cubic growth.

It's no doubt that merge_refs() will take a long long time to execute if
the number of refs continues its grows.

So add a cond_resced() into the loop of __merge_refs().

Although this will solve the problem of soft lockup, we need to use the
new rb_tree based structure introduced by Lu Fengqi to really solve the
long execution time.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:16 -07:00
Liu Bo
1c1ea4f781 Btrfs: fix memory leak of reloc_root
When some critical errors occur and FS would be flipped into RO,
if we have an on-going balance, we can end up with a memory leak
of root->reloc_root since btrfs_drop_snapshots() bails out
without freeing reloc_root at the very early start.

However, we're not able to free reloc_root in btrfs_drop_snapshots()
because its caller, merge_reloc_roots(), still needs to access it to
cleanup reloc_root's rbtree.

This makes us free reloc_root when we're going to free fs/file roots.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:15 -07:00
Linus Torvalds
9512c47ec2 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes for btrfs send/recv and fsync from Filipe and Robbie Ko.

  Bonus points to Filipe for already having xfstests in place for many
  of these"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
  Btrfs: improve performance on fsync against new inode after rename/unlink
  Btrfs: be more precise on errors when getting an inode from disk
  Btrfs: send, don't bug on inconsistent snapshots
  Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
  Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
  Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
  Btrfs: incremental send, fix premature rmdir operations
  Btrfs: incremental send, fix invalid paths for rename operations
  Btrfs: send, add missing error check for calls to path_loop()
  Btrfs: send, fix failure to move directories with the same name around
  Btrfs: add missing check for writeback errors on fsync
2016-08-10 11:16:03 -07:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Linus Torvalds
fff648da96 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's the second round of block updates for this merge window.

  It's a mix of fixes for changes that went in previously in this round,
  and fixes in general.  This pull request contains:

   - Fixes for loop from Christoph

   - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

   - A blk-mq timeout fix, when on frozen queues.  From Gabriel.

   - Writeback fix from Jan, ensuring that __writeback_single_inode()
     does the right thing.

   - Fix for bio->bi_rw usage in f2fs from me.

   - Error path deadlock fix in blk-mq sysfs registration from me.

   - Floppy O_ACCMODE fix from Jiri.

   - Fix to the new bio op methods from Mike.

     One more followup will be coming here, ensuring that we don't
     propagate the block types outside of block.  That, and a rename of
     bio->bi_rw is coming right after -rc1 is cut.

   - Various little fixes"

* 'for-linus' of git://git.kernel.dk/linux-block:
  mm/block: convert rw_page users to bio op use
  loop: make do_req_filebacked more robust
  loop: don't try to use AIO for discards
  blk-mq: fix deadlock in blk_mq_register_disk() error path
  Include: blkdev: Removed duplicate 'struct request;' declaration.
  Fixup direct bi_rw modifiers
  block: fix bdi vs gendisk lifetime mismatch
  blk-mq: Allow timeouts to run while queue is freezing
  nbd: fix race in ioctl
  block: fix use-after-free in seq file
  f2fs: drop bio->bi_rw manual assignment
  block: add missing group association in bio-cloning functions
  blkcg: kill unused field nr_undestroyed_grps
  writeback: Write dirty times for WB_SYNC_ALL writeback
  floppy: fix open(O_ACCMODE) for ioctl-only open
2016-08-05 23:31:51 -04:00
Chris Mason
1083881654 Merge branch 'integration-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.8 2016-08-05 12:25:05 -07:00
Linus Torvalds
d58b0d980f Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of my btrfs pull, which is some cleanups and a batch
  of fixes.

  Most of the code here is from Jeff Mahoney, making the pointers we
  pass around internally more consistent and less confusing overall.  I
  noticed a small problem right before I sent this out yesterday, so I
  fixed it up and re-tested overnight"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
  Btrfs: fix __MAX_CSUM_ITEMS
  btrfs: btrfs_abort_transaction, drop root parameter
  btrfs: add btrfs_trans_handle->fs_info pointer
  btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
  btrfs: convert nodesize macros to static inlines
  btrfs: introduce BTRFS_MAX_ITEM_SIZE
  btrfs: cleanup, remove prototype for btrfs_find_root_ref
  btrfs: copy_to_sk drop unused root parameter
  btrfs: simpilify btrfs_subvol_inherit_props
  btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
  btrfs: tests, require fs_info for root
  btrfs: tests, move initialization into tests/
  btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
  btrfs: prefix fsid to all trace events
  btrfs: plumb fs_info into btrfs_work
  btrfs: remove obsolete part of comment in statfs
  btrfs: hide test-only member under ifdef
  btrfs: Ratelimit "no csum found" info message
  btrfs: Add ratelimit to btrfs printing
  Btrfs: fix unexpected balance crash due to BUG_ON
  ...
2016-08-04 19:56:16 -04:00
Shaun Tancheff
b571bc606e Fixup direct bi_rw modifiers
bi_rw should be using bio_set_op_attrs to set bi_rw.

Signed-off-by: Shaun Tancheff <shaun@tancheff.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Paolo Valente
20bd723ec6 block: add missing group association in bio-cloning functions
When a bio is cloned, the newly created bio must be associated with
the same blkcg as the original bio (if BLK_CGROUP is enabled). If
this operation is not performed, then the new bio is not associated
with any group, and the group of the current task is returned when
the group of the bio is requested.

Depending on the cloning frequency, this may cause a large
percentage of the bios belonging to a given group to be treated
as if belonging to other groups (in most cases as if belonging to
the root group). The expected group isolation may thereby be broken.

This commit adds the missing association in bio-cloning functions.

Fixes: da2f0f74cf ("Btrfs: add support for blkio controllers")
Cc: stable@vger.kernel.org # v4.3+

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Chris Mason
42049bf60d Btrfs: fix __MAX_CSUM_ITEMS
Jeff Mahoney's cleanup commit (14a1e067b4) wasn't correct for csums on
machines where the pagesize >= metadata blocksize.

This just reverts the relevant hunks to bring the old math back.

Signed-off-by: Chris Mason <clm@fb.com>
2016-08-03 14:08:37 -07:00
Filipe Manana
e657149933 Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
No longer used as of commit 5846a3c268 ("btrfs: qgroup: Fix a race in
delayed_ref which leads to abort trans").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-03 11:02:51 +01:00
Filipe Manana
44f714dae5 Btrfs: improve performance on fsync against new inode after rename/unlink
With commit 56f23fdbb6 ("Btrfs: fix file/data loss caused by fsync after
rename and new inode") we got simple fix for a functional issue when the
following sequence of actions is done:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  move/rename existing file A from directory D to directory E
  create a new file named A at directory D
  fsync the new file
  power fail

The solution was to simply detect such scenario and fallback to a full
transaction commit when we detect it. However this turned out to had a
significant impact on throughput (and a bit on latency too) for benchmarks
using the dbench tool, which simulates real workloads from smbd (Samba)
servers. For example on a test vm (with a debug kernel):

Unpatched:
Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms

Patched:
Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms

The patched results (this patch is applied) are similar to the results of
a kernel with the commit 56f23fdbb6 ("Btrfs: fix file/data loss caused
by fsync after rename and new inode") reverted.

This change avoids the fallback to a transaction commit and instead makes
sure all the names of the conflicting inode (the one that had a name in a
past transaction that matches the name of the new file in the same parent
directory) are logged so that at log replay time we don't lose neither the
new file nor the old file, and the old file gets the name it was renamed
to.

This also ends up avoiding a full transaction commit for a similar case
that involves an unlink instead of a rename of the old file:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  remove file A
  create a new file named A at directory D
  fsync the new file
  power fail

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:32:14 +01:00
Filipe Manana
67710892ec Btrfs: be more precise on errors when getting an inode from disk
When we attempt to read an inode from disk, we end up always returning an
-ESTALE error to the caller regardless of the actual failure reason, which
can be an out of memory problem (when allocating a path), some error found
when reading from the fs/subvolume btree (like a genuine IO error) or the
inode does not exists. So lets start returning the real error code to the
callers so that they don't treat all -ESTALE errors as meaning that the
inode does not exists (such as during orphan cleanup). This will also be
needed for a subsequent patch in the same series dealing with a special
fsync case.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:32:03 +01:00
Filipe Manana
951555856b Btrfs: send, don't bug on inconsistent snapshots
When doing an incremental send, if we find a new/modified/deleted extent,
reference or xattr without having previously processed the corresponding
inode item we end up exexuting a BUG_ON(). This is because whenever an
extent, xattr or reference is added, modified or deleted, we always expect
to have the corresponding inode item updated. However there are situations
where this will not happen due to transient -ENOMEM or -ENOSPC errors when
doing delayed inode updates.

For example, when punching holes we can succeed in deleting and modifying
(shrinking) extents but later fail to do the delayed inode update. So after
such failure we close our transaction handle and right after a snapshot of
the fs/subvol tree can be made and used later for a send operation. The
same thing can happen during truncate, link, unlink, and xattr related
operations.

So instead of executing a BUG_ON, make send return an -EIO error and print
an informative error message do dmesg/syslog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:31:41 +01:00
Filipe Manana
15b253eace Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
The caller of send_utimes() is supposed to be sure that the inode number
it passes to this function does actually exists in the send snapshot.
However due to logic/algorithm bugs (such as the one fixed by the patch
titled "Btrfs: send, fix invalid leaf accesses due to incorrect utimes
operations"), this might not be the case and when that happens it makes
send_utimes() access use an unrelated leaf item as the target inode item
or access beyond a leaf's boundaries (when the leaf is full and
path->slots[0] matches the number of items in the leaf).

So if the call to btrfs_search_slot() done by send_utimes() does not find
the inode item, just make sure send_utimes() returns -ENOENT and does not
silently accesses unrelated leaf items or does invalid leaf accesses, also
allowing us to easialy and deterministically catch such algorithmic/logic
bugs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:26:15 +01:00
Robbie Ko
764433a12e Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
During an incremental send, if we have delayed rename operations for inodes
that were children of directories which were removed in the send snapshot,
we can end up accessing incorrect items in a leaf or accessing beyond the
last item of the leaf due to issuing utimes operations for the removed
inodes. Consider the following example:

  Parent snapshot:
  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- c/                                                  (ino 262)
  |
  |--- b/                                                       (ino 258)
  |    |--- d/                                                  (ino 263)
  |
  |--- del/                                                     (ino 261)
        |--- x/                                                 (ino 259)
        |--- y/                                                 (ino 260)

  Send snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |
  |--- b/                                                       (ino 258)
  |
  |--- c/                                                       (ino 262)
  |    |--- y/                                                  (ino 260)
  |
  |--- d/                                                       (ino 263)
       |--- x/                                                  (ino 259)

1) When processing inodes 259 and 260, we end up delaying their rename
   operations because their parents, inodes 263 and 262 respectively, were
   not yet processed and therefore not yet renamed;

2) When processing inode 262, its rename operation is issued and right
   after the rename operation for inode 260 is issued. However right after
   issuing the rename operation for inode 260, at send.c:apply_dir_move(),
   we issue utimes operations for all current and past parents of inode
   260. This means we try to send a utimes operation for its old parent,
   inode 261 (deleted in the send snapshot), which does not cause any
   immediate and deterministic failure, because when the target inode is
   not found in the send snapshot, the send.c:send_utimes() function
   ignores it and uses the leaf region pointed to by path->slots[0],
   which can be any unrelated item (belonging to other inode) or it can
   be a region outside the leaf boundaries, if the leaf is full and
   path->slots[0] matches the number of items in the leaf. So we end
   up either successfully sending a utimes operation, which is fine
   and irrelevant because the old parent (inode 261) will end up being
   deleted later, or we end up doing an invalid memory access tha
   crashes the kernel.

So fix this by making apply_dir_move() issue utimes operations only for
parents that still exist in the send snapshot. In a separate patch we
will make send_utimes() return an error (-ENOENT) if the given inode
does not exists in the send snapshot.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and better organized]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:25:48 +01:00
Robbie Ko
443f9d266c Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
Under certain situations, when doing an incremental send, we can end up
not freeing orphan_dir_info structures as soon as they are no longer
needed. Instead we end up freeing them only after finishing the send
stream, which causes a warning to be emitted:

[282735.229200] ------------[ cut here ]------------
[282735.229968] WARNING: CPU: 9 PID: 10588 at fs/btrfs/send.c:6298 btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
[282735.231282] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[282735.237130] CPU: 9 PID: 10588 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
[282735.239309] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[282735.240160]  0000000000000000 ffff880224273ca8 ffffffff8126b42c 0000000000000000
[282735.240160]  0000000000000000 ffff880224273ce8 ffffffff81052b14 0000189a24273ac8
[282735.240160]  ffff8802210c9800 0000000000000000 0000000000000001 0000000000000000
[282735.240160] Call Trace:
[282735.240160]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[282735.240160]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[282735.240160]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[282735.240160]  [<ffffffffa03c99d5>] btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
[282735.240160]  [<ffffffffa0398358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[282735.240160]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[282735.240160]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[282735.240160]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[282735.240160]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[282735.240160]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[282735.240160]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[282735.240160]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[282735.240160]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
[282735.240160]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
[282735.256343] ---[ end trace a4539270c8056f93 ]---

Consider the following example:

  Parent snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- c/                                                  (ino 260)
  |
  |--- del/                                                     (ino 259)
        |--- tmp/                                               (ino 258)
        |--- x/                                                 (ino 261)
        |--- y/                                                 (ino 262)

  Send snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- x/                                                  (ino 261)
  |    |--- y/                                                  (ino 262)
  |
  |--- c/                                                       (ino 260)
       |--- tmp/                                                (ino 258)

1) When processing inode 258, we end up delaying its rename operation
   because it has an ancestor (in the send snapshot) that has a higher
   inode number (inode 260) which was also renamed in the send snapshot,
   therefore we delay the rename of inode 258 so that it happens after
   inode 260 is renamed;

2) When processing inode 259, we end up delaying its deletion (rmdir
   operation) because it has a child inode (258) that has its rename
   operation delayed. At this point we allocate an orphan_dir_info
   structure and tag inode 258 so that we later attempt to see if we
   can delete (rmdir) inode 259 once inode 258 is renamed;

3) When we process inode 260, after renaming it we finally do the rename
   operation for inode 258. Once we issue the rename operation for inode
   258 we notice that this inode was tagged so that we attempt to see
   if at this point we can delete (rmdir) inode 259. But at this point
   we can not still delete inode 259 because it has 2 children, inodes
   261 and 262, that were not yet processed and therefore not yet
   moved (renamed) away from inode 259. We end up not freeing the
   orphan_dir_info structure allocated in step 2;

4) We process inodes 261 and 262, and once we move/rename inode 262
   we issue the rmdir operation for inode 260;

5) We finish the send stream and notice that red black tree that
   contains orphan_dir_info structures is not empty, so we emit
   a warning and then free any orphan_dir_structures left.

So fix this by freeing an orphan_dir_info structure once we try to
apply a pending rename operation if we can not delete yet the tagged
directory.

A test case for fstests follows soon.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog to be more detailed and easier to understand]
2016-08-01 07:25:31 +01:00
Robbie Ko
99ea42ddb1 Btrfs: incremental send, fix premature rmdir operations
Under certain situations, an incremental send operation can contain
a rmdir operation that will make the receiving end fail when attempting
to execute it, because the target directory is not yet empty.

Consider the following example:

  Parent snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- c/                                                  (ino 260)
  |
  |--- del/                                                     (ino 259)
        |--- tmp/                                               (ino 258)
        |--- x/                                                 (ino 261)

  Send snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- x/                                                  (ino 261)
  |
  |--- c/                                                       (ino 260)
       |--- tmp/                                                (ino 258)

1) When processing inode 258, we delay its rename operation because inode
   260 is its new parent in the send snapshot and it was not yet renamed
   (since 260 > 258, that is, beyond the current progress);

2) When processing inode 259, we realize we can not yet send an rmdir
   operation (against inode 259) because inode 258 was still not yet
   renamed/moved away from inode 259. Therefore we update data structures
   so that after inode 258 is renamed, we try again to see if we can
   finally send an rmdir operation for inode 259;

3) When we process inode 260, we send a rename operation for it followed
   by a rename operation for inode 258. Once we send the rename operation
   for inode 258 we then check if we can finally issue an rmdir for its
   previous parent, inode 259, by calling the can_rmdir() function with
   a value of sctx->cur_ino + 1 (260 + 1 = 261) for its "progress"
   argument. This makes can_rmdir() return true (value 1) because even
   though there's still a child inode of inode 259 that was not yet
   renamed/moved, which is inode 261, the given value of progress (261)
   is not lower then 261 (that is, not lower than the inode number of
   some child of inode 259). So we end up sending a rmdir operation for
   inode 259 before its child inode 261 is processed and renamed.

So fix this by passing the correct progress value to the call to
can_rmdir() from within apply_dir_move() (where we issue delayed rename
operations), which should match stcx->cur_ino (the number of the inode
currently being processed) and not sctx->cur_ino + 1.

A test case for fstests follows soon.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed, clear and well formatted]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:25:12 +01:00
Filipe Manana
4122ea64f8 Btrfs: incremental send, fix invalid paths for rename operations
Example scenario:

  Parent snapshot:

  .                                                       (ino 277)
  |---- tmp/                                              (ino 278)
  |---- pre/                                              (ino 280)
  |      |---- wait_dir/                                  (ino 281)
  |
  |---- desc/                                             (ino 282)
  |---- ance/                                             (ino 283)
  |       |---- below_ance/                               (ino 279)
  |
  |---- other_dir/                                        (ino 284)

  Send snapshot:

  .                                                       (ino 277)
  |---- tmp/                                              (ino 278)
         |---- other_dir/                                 (ino 284)
                   |---- below_ance/                      (ino 279)
                   |            |---- pre/                (ino 280)
                   |
                   |---- wait_dir/                        (ino 281)
                              |---- desc/                 (ino 282)
                                      |---- ance/         (ino 283)

While computing the send stream the following steps happen:

1) While processing inode 279 we end up delaying its rename operation
   because its new parent in the send snapshot, inode 284, was not
   yet processed and therefore not yet renamed;

2) Later when processing inode 280 we end up renaming it immediately to
   "ance/below_once/pre" and not delay its rename operation because its
   new parent (inode 279 in the send snapshot) has its rename operation
   delayed and inode 280 is not an encestor of inode 279 (its parent in
   the send snapshot) in the parent snapshot;

3) When processing inode 281 we end up delaying its rename operation
   because its new parent in the send snapshot, inode 284, was not yet
   processed and therefore not yet renamed;

4) When processing inode 282 we do not delay its rename operation because
   its parent in the send snapshot, inode 281, already has its own rename
   operation delayed and our current inode (282) is not an ancestor of
   inode 281 in the parent snapshot. Therefore inode 282 is renamed to
   "ance/below_ance/pre/wait_dir";

5) When processing inode 283 we realize that we can rename it because one
   of its ancestors in the send snapshot, inode 281, has its rename
   operation delayed and inode 283 is not an ancestor of inode 281 in the
   parent snapshot. So a rename operation to rename inode 283 to
   "ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is
   invalid due to a missing path building loop that was undetected by
   the incremental send implementation, as inode 283 ends up getting
   included twice in the path (once with its path in the parent snapshot).
   Therefore its rename operation must wait before the ancestor inode 284
   is renamed.

Fix this by not terminating the rename dependency checks when we find an
ancestor, in the send snapshot, that has its rename operation delayed. So
that we continue doing the same checks if the current inode is not an
ancestor, in the parent snapshot, of an ancestor in the send snapshot we
are processing in the loop.

The problem and reproducer were reported by Robbie Ko, as part of a patch
titled "Btrfs: incremental send, avoid ancestor rename to descendant".
However the fix was unnecessarily complicated and can be addressed with
much less code and effort.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:24:45 +01:00
Filipe Manana
7969e77a73 Btrfs: send, add missing error check for calls to path_loop()
The function path_loop() can return a negative integer, signaling an
error, 0 if there's no path loop and 1 if there's a path loop. We were
treating any non zero values as meaning that a path loop exists. Fix
this by explicitly checking for errors and gracefully return them to
user space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:23:20 +01:00
Robbie Ko
801bec365e Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.

For example, consider the following scenario:

  Parent snapshot:

  .                   (ino 256)
  |---- d/            (ino 257)
  |     |--- p1/      (ino 258)
  |
  |---- p1/           (ino 259)

  Send snapshot:

  .                    (ino 256)
  |--- d/              (ino 257)
       |--- p1/        (ino 259)
             |--- p1/  (ino 258)

The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:

[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136]  [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104]  [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---

There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:

  Parent snapshot:

  .
  |---- a/                                                   (ino 262)
  |     |---- c/                                             (ino 268)
  |
  |---- d/                                                   (ino 263)
        |---- ance/                                          (ino 267)
                |---- e/                                     (ino 264)
                |---- f/                                     (ino 265)
                |---- ance/                                  (ino 266)

  Send snapshot:

  .
  |---- a/                                                   (ino 262)
  |---- c/                                                   (ino 268)
  |     |---- ance/                                          (ino 267)
  |
  |---- d/                                                   (ino 263)
  |     |---- ance/                                          (ino 266)
  |
  |---- f/                                                   (ino 265)
        |---- e/                                             (ino 264)

When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:

  ERROR: rename d/ance/ance -> d/ance failed: No such file or directory

So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.

A test case for fstests will follow soon.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
 comments]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:23:10 +01:00
Filipe Manana
0596a9048b Btrfs: add missing check for writeback errors on fsync
When we start an fsync we start ordered extents for all delalloc ranges.
However before attempting to log the inode, we only wait for those ordered
extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
is set in the inode's flags). This means that if an ordered extent
completes with an IO error before we check if we can skip logging the
inode, we will not catch and report the IO error to user space. This is
because on an IO error, when the ordered extent completes we do not
update the inode, so if the inode was not previously updated by the
current transaction we end up not logging it through calls to fsync and
therefore not check its mapping flags for the presence of IO errors.

Fix this by checking for errors in the flags of the inode's mapping when
we notice we can skip logging the inode.

This caused sporadic failures in the test generic/331 (which explicitly
tests for IO errors during an fsync call).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-08-01 07:21:13 +01:00
Linus Torvalds
ba929b6646 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is dedicated to Josef's enospc rework, which we've been
  testing for a few releases now.  It fixes some early enospc problems
  and is dramatically faster.

  This also includes an updated fix for the delalloc accounting that
  happens after a fault in copy_from_user.  My patch in v4.7 was almost
  but not quite enough"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix delalloc accounting after copy_from_user faults
  Btrfs: avoid deadlocks during reservations in btrfs_truncate_block
  Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes
  Btrfs: fill relocation block rsv after allocation
  Btrfs: always use trans->block_rsv for orphans
  Btrfs: change how we calculate the global block rsv
  Btrfs: use root when checking need_async_flush
  Btrfs: don't bother kicking async if there's nothing to reclaim
  Btrfs: fix release reserved extents trace points
  Btrfs: add fsid to some tracepoints
  Btrfs: add tracepoints for flush events
  Btrfs: fix delalloc reservation amount tracepoint
  Btrfs: trace pinned extents
  Btrfs: introduce ticketed enospc infrastructure
  Btrfs: add tracepoint for adding block groups
  Btrfs: warn_on for unaccounted spaces
  Btrfs: change delayed reservation fallback behavior
  Btrfs: always reserve metadata for delalloc extents
  Btrfs: fix callers of btrfs_block_rsv_migrate
  Btrfs: add bytes_readonly to the spaceinfo at once
2016-07-31 21:27:32 -04:00
Linus Torvalds
0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Michal Hocko
8a5c743e30 mm, memcg: use consistent gfp flags during readahead
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds
d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Jeff Mahoney
66642832f0 btrfs: btrfs_abort_transaction, drop root parameter
__btrfs_abort_transaction doesn't use its root parameter except to
obtain an fs_info pointer.  We can obtain that from trans->root->fs_info
for now and from trans->fs_info in a later patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:26 +02:00
Jeff Mahoney
64b6358072 btrfs: add btrfs_trans_handle->fs_info pointer
btrfs_trans_handle->root is documented as for use for confirming
that the root passed in to start the transaction is the same as the
one ending it.  It's used in several places when an fs_info pointer
is needed, so let's just add an fs_info pointer directly.  Eventually,
the root pointer can be removed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:26 +02:00
Jeff Mahoney
05f9a78012 btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
In btrfs_relocate_chunk, we get a transaction handle via
btrfs_start_trans_remove_block_group, which starts the transaction
using the extent root.  When we call btrfs_end_transaction, we're calling
it using the chunk root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:25 +02:00
Jeff Mahoney
1db1ff92b6 btrfs: convert nodesize macros to static inlines
This patch converts the macros used to calculate various node
size limits to static inlines.  That way we get type checking for free.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:25 +02:00
Jeff Mahoney
14a1e067b4 btrfs: introduce BTRFS_MAX_ITEM_SIZE
We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in
several places.  This introduces a BTRFS_MAX_ITEM_SIZE macro to do the
same.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:24 +02:00
Jeff Mahoney
0c83b62e22 btrfs: cleanup, remove prototype for btrfs_find_root_ref
The function isn't implemented anywhere.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:23 +02:00
Jeff Mahoney
df3975652f btrfs: copy_to_sk drop unused root parameter
The root parameter for copy_to_sk is not used at all.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:23 +02:00
Jeff Mahoney
bd6c57dda6 btrfs: simpilify btrfs_subvol_inherit_props
We just need a superblock, but we look it up using two different
roots depending on the call site.  Let's just use a superblock
pointer initialized at the outset.

This is mostly for Coccinelle not to choke on my root push up set.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:22 +02:00
Jeff Mahoney
f5ee5c9ac5 btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
Now that we have a dummy fs_info associated with each test that
uses a root, we don't need the DUMMY_ROOT bit anymore.  This lets
us make choices without needing an actual root like in e.g.
btrfs_find_create_tree_block.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:19 +02:00
Jeff Mahoney
7c0260ee09 btrfs: tests, require fs_info for root
This allows the upcoming patchset to push nodesize and sectorsize into
fs_info.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:18 +02:00
Jeff Mahoney
8632daae40 btrfs: tests, move initialization into tests/
We have all these stubs that only exist because they're called from
btrfs_run_sanity_tests, which is a static inside super.c.  Let's just
move it all into tests/btrfs-tests.c and only have one stub.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:17 +02:00
Jeff Mahoney
3cdde2240d btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs_test_opt and friends only use the root pointer to access
the fs_info.  Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:16 +02:00
Jeff Mahoney
bc074524e1 btrfs: prefix fsid to all trace events
When using trace events to debug a problem, it's impossible to determine
which file system generated a particular event.  This patch adds a
macro to prefix standard information to the head of a trace event.

The extent_state alloc/free events are all that's left without an
fs_info available.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:16 +02:00
Jeff Mahoney
cb001095ca btrfs: plumb fs_info into btrfs_work
In order to provide an fsid for trace events, we'll need a btrfs_fs_info
pointer.  The most lightweight way to do that for btrfs_work structures
is to associate it with the __btrfs_workqueue structure.  Each queued
btrfs_work structure has a workqueue associated with it, so that's
a natural fit.  It's a privately defined structures, so we add accessors
to retrieve the fs_info pointer.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:15 +02:00
David Sterba
9f8d49095b btrfs: remove obsolete part of comment in statfs
The mixed blockgroup reporting has been fixed by commit
ae02d1bd07
"btrfs: fix mixed block count of available space"

Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
David Sterba
05653ef386 btrfs: hide test-only member under ifdef
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Nikolay Borisov
aee133afcd btrfs: Ratelimit "no csum found" info message
Recently during a crash it became apparent that this particular message
can be printed so many times that it causes the softlockup detector to
trigger. Fix it by ratelimiting it.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Nikolay Borisov
35f4e5e6f1 btrfs: Add ratelimit to btrfs printing
This patch adds ratelimiting to all messages which are not using the _rl
version of the various printing APIs in btrfs. This is designed to be
used as a safety net, since a flood messages might cause the softlockup
detector to trigger. To reduce interference between different classes of
messages use a separate ratelimit state for every class of message.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
5a488b9d2c Btrfs: fix unexpected balance crash due to BUG_ON
Mounting a btrfs can resume previous balance operations asynchronously.
An user got a crash when one drive has some corrupt sectors.

Since balance can cancel itself in case of any error, we can gracefully
return errors to upper layers and let balance do the cancel job.

Reported-by: sash <master.b.at.raven@chefmail.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
0fd8c3dae1 Btrfs: fix panic in balance due to EIO
During build_backref_tree(), if we fail to read a btree node,
we can eventually run into BUG_ON(cache->nr_nodes) that we put
in backref_cache_cleanup(), meaning we have at least one
memory leak.

This frees the backref_node that we's allocated at the very
beginning of build_backref_tree().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
baf863b9c2 Btrfs: fix eb memory leak due to readpage failure
eb->io_pages is set in read_extent_buffer_pages().

In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.

When this eb's page (couldn't be the 1st page) fails to add itself to bio
due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio,
 and ends up with a memory leak eventually.

This lets __do_readpage propagate errors to callers and adds the
 'atomic_dec(&eb->io_pages)'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
f49070957f Btrfs: change BUG_ON()'s to ASSERT()'s in backref_cache_cleanup()
Since it is just an in-memory building of the backrefs of several
btree blocks, nothing is fatal other than memory leaks, so this
changes BUG_ON()'s to ASSERT()'s.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Wang Xiaoguang
39581a3a1a btrfs: fix free space calculation in dump_space_info()
In btrfs, btrfs_space_info's bytes_may_use is treated as fs used
space, as what we do in reserve_metadata_bytes() or
btrfs_alloc_data_chunk_ondemand(), so in dump_space_info(), when
calculating free space, we should also subtract btrfs_space_info's
bytes_may_use.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Chandan Rajendra
751bebbe0a Btrfs: subpage-blocksize: Rate limit scrub error message
btrfs/073 invokes scrub ioctl in a tight loop. In subpage-blocksize
scenario this results in a lot of "scrub: size assumption sectorsize !=
PAGE_SIZE " messages being printed on the console. To reduce the number
of such messages this commit uses btrfs_err_rl() instead of
btrfs_err().

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Wang Xiaoguang
dda3245eca btrfs: expand cow_file_range() to support in-band dedup and subpage-blocksize
Extract cow_file_range() new parameters for both in-band dedupe and
subpage sector size patchset.

This should make conflict of both patchset to minimal, and reduce the
effort needed to rebase them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
f5daf2c780 Btrfs: fix BUG_ON in btrfs_submit_compressed_write
This is similar to btrfs_submit_compressed_read(), if we fail after
bio is allocated, then we can use bio_endio() and errors are saved
 in bio->bi_error.  But please note that we don't return errors to
its caller because the caller assumes it won't call endio to cleanup
on error.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Anand Jain
e2bf6e89b4 btrfs: make sure device is synced before return
An inconsistent behavior due to stale reads from the
disk was reported

  mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html

This patch will make sure devices are synced before
return in the unmount thread.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Anand Jain
f448341af9 btrfs: reorg btrfs_close_one_device()
Moves closer to the caller and removes declaration

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Ashish Samant
c8bb0c8bd2 btrfs: Cleanup compress_file_range()
Remove unnecessary checks in compress_file_range().

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
[ minor coding style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
6f034ece34 Btrfs: cleanup BUG_ON in merge_bio
One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(),
thus this aims to stop anyone to panic the whole system by using
 their btrfs.

Since the error in merge_bio can only come from __btrfs_map_block()
when chunk tree mapping has something insane and __btrfs_map_block()
has already had printed the reason, we can just return errors in
merge_bio.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Nikolay Borisov
fba4b69771 btrfs: Fix slab accounting flags
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.

To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Salah Triki
a60617d0ae btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl()
size contains the value returned by posix_acl_from_xattr(), which
returns -ERANGE, -ENODATA, zero, or an integer greater than zero. So
replace -ENOENT by -ERANGE.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Nikolay Borisov
3d48d9810d btrfs: Handle uninitialised inode eviction
The code flow in btrfs_new_inode allows for btrfs_evict_inode to be
called with not fully initialised inode (e.g. ->root member not
being set). This can happen when btrfs_set_inode_index in
btrfs_new_inode fails, which in turn would call iput for the newly
allocated inode. This in turn leads to vfs calling into btrfs_evict_inode.
This leads to null pointer dereference. To handle this situation check whether
the passed inode has root set and just free it in case it doesn't.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
fb770ae414 Btrfs: fix read_node_slot to return errors
We use read_node_slot() to read btree node and it has two cases,
a) slot is out of range, which means 'no such entry'
b) we fail to read the block, due to checksum fails or corrupted
   content or not with uptodate flag.
But we're returning NULL in both cases, this makes it return -ENOENT
in case a) and return -EIO in case b), and this fixes its callers
as well as btrfs_search_forward() 's caller to catch the new errors.

The problem is reported by Peter Becker, and I can manage to
hit the same BUG_ON by mounting my fuzz image.

Reported-by: Peter Becker <floyd.net@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
876d2cf141 Btrfs: fix double free of fs root
I got this warning while mounting a btrfs image,

[ 3020.509606] ------------[ cut here ]------------
[ 3020.510107] WARNING: CPU: 3 PID: 5581 at lib/idr.c:1051 ida_remove+0xca/0x190
[ 3020.510853] ida_remove called for id=42 which is not allocated.
[ 3020.511466] Modules linked in:
[ 3020.511802] CPU: 3 PID: 5581 Comm: mount Not tainted 4.7.0-rc5+ #274
[ 3020.512438] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 3020.513385]  0000000000000286 0000000021295d86 ffff88006c66b8f0 ffffffff8182ba5a
[ 3020.514153]  0000000000000000 0000000000000009 ffff88006c66b930 ffffffff810e0ed7
[ 3020.514928]  0000041b00000000 ffffffff8289a8c0 ffff88007f437880 0000000000000000
[ 3020.515717] Call Trace:
[ 3020.515965]  [<ffffffff8182ba5a>] dump_stack+0xc9/0x13f
[ 3020.516487]  [<ffffffff810e0ed7>] __warn+0x147/0x160
[ 3020.517005]  [<ffffffff810e0f4f>] warn_slowpath_fmt+0x5f/0x80
[ 3020.517572]  [<ffffffff8182e6ca>] ida_remove+0xca/0x190
[ 3020.518075]  [<ffffffff813a2bcc>] free_anon_bdev+0x2c/0x60
[ 3020.518609]  [<ffffffff81657a9f>] free_fs_root+0x13f/0x160
[ 3020.519138]  [<ffffffff8165c679>] btrfs_get_fs_root+0x379/0x3d0
[ 3020.519710]  [<ffffffff81e6e975>] ? __mutex_unlock_slowpath+0x155/0x2c0
[ 3020.520366]  [<ffffffff816615b1>] open_ctree+0x2e91/0x3200
[ 3020.520965]  [<ffffffff8161ede2>] btrfs_mount+0x1322/0x15b0
[ 3020.521536]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
[ 3020.522167]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
[ 3020.522780]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
[ 3020.523305]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
[ 3020.523872]  [<ffffffff8161dee1>] btrfs_mount+0x421/0x15b0
[ 3020.524402]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
[ 3020.525045]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
[ 3020.525657]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
[ 3020.526289]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
[ 3020.526803]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
[ 3020.527365]  [<ffffffff813dc27a>] do_mount+0x41a/0x1770
[ 3020.527899]  [<ffffffff812e800d>] ? strndup_user+0x6d/0xc0
[ 3020.528447]  [<ffffffff812e7f68>] ? memdup_user+0x78/0xb0
[ 3020.528987]  [<ffffffff813ddad0>] SyS_mount+0x150/0x160
[ 3020.529493]  [<ffffffff81e72b7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

It turns out that we free fs root twice, btrfs_init_fs_root() calls
free_anon_bdev(root->anon_dev) and later then btrfs_get_fs_root() cals
free_fs_root which does another free_anon_bdev() and it ends up with the
above warning.

Instead of reset root->anon_dev to 0 after free_anon_bdev(), we can let
btrfs_init_fs_root() return directly since its callers have already done
the free job by calling free_fs_root().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
5e24e9af01 Btrfs: error out if generic_bin_search get invalid arguments
With btrfs-corrupt-block, one can set btree node/leaf's field, if
we assign a negative value to node/leaf, we can get various hangs,
eg. if extent_root's nritems is -2ULL, then we get stuck in
 btrfs_read_block_groups() because it has a while loop and
btrfs_search_slot() on extent_root will always return the first
 child.

This lets us know what's happening and returns a EINVAL to callers
instead of returning the first item.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
6fb37b756a Btrfs: check inconsistence between chunk and block group
With btrfs-corrupt-block, one can drop one chunk item and mounting
will end up with a panic in btrfs_full_stripe_len().

This doesn't not remove the BUG_ON, but instead checks it a bit
earlier when we find the block group item.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Wang Xiaoguang
c1fd5c30d1 btrfs: add missing bytes_readonly attribute file in sysfs
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Chris Mason
8b8b08cbfb Btrfs: fix delalloc accounting after copy_from_user faults
Commit 56244ef151 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.

Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.

This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did.  The
problem is accounting for the offset into the page/sector when we do a
partial copy.  This one just uses the dirty_sectors variable which
should already be updated properly.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v4.6+
2016-07-21 04:03:40 -07:00
Josef Bacik
bac357dcec Btrfs: avoid deadlocks during reservations in btrfs_truncate_block
The new enospc code makes it possible to deadlock if we don't use
FLUSH_LIMIT during reservations inside a transaction.  This enforces
the correct flush type to avoid both deadlocks and assertions

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2016-07-20 16:58:04 -07:00
Vincent Stehlé
df5c82a8dc Btrfs: fix comparison in __btrfs_map_block()
Add missing comparison to op in expression, which was forgotten when doing
the REQ_OP transition.

Fixes: b3d3fa5199 ("btrfs: update __btrfs_map_block for REQ_OP transition")
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-18 15:28:23 -06:00
Josef Bacik
8ca17f0f59 Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes
We used to allow you to set FLUSH_ALL and then just wouldn't do things like
commit transactions or wait on ordered extents if we noticed you were in a
transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
we've lost the ability to tell, and we could end up deadlocking.  So instead use
FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
we error out to preserve the previous behavior.  I've also added an ASSERT() to
catch anybody else who tries to do this.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
ac2fabac42 Btrfs: fill relocation block rsv after allocation
Since we set the reloc control before we've reserved our space for relocation we
could race with a root being dirtied and not actually have space to do our init
reloc root.  So once we've allocated it and set it up go ahead and make our
reservation before setting the relocate control, that way anybody who tries to
do the reloc root init has space to use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
40acc3eede Btrfs: always use trans->block_rsv for orphans
This is the case all the time anyway except for relocation which could be doing
a reloc root for a non ref counted root, in which case we'd end up with some
random block rsv rather than the one we have our reservation in.  If there isn't
enough space in the block rsv we are trying to steal from we'll BUG() because we
expect there to be space for the orphan to make its reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
ae2e472881 Btrfs: change how we calculate the global block rsv
Traditionally we've calculated the global block rsv by guessing how much of the
metadata used amount was the extent tree, and then taking the data size and
figuring out how large the csum tree would have to be to hold that much data.

This is imprecise and falls down on MIXED file systems as we can't trust the
data used amount.  This resulted in failures for xfstests generic/333 because it
creates lots of clones, which explodes out the extent tree.  Our global reserve
calculations were woefully inaccurate in this case which meant we got into a
situation where we did not have enough reserved to do our work.

We know we only use the global block rsv for the extent, csum, and root trees,
so just get the bytes used for these trees and use that as the basis of our
global reserve.  Since these are not reference counted trees the bytes_used
value will be accurate.  This fixed the transaction aborts seen with
generic/333.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
87241c2e68 Btrfs: use root when checking need_async_flush
Instead of doing fs_info->fs_root in need_async_flush, which may not be set
during recovery when mounting, just pass the root itself in, which makes more
sense as thats what btrfs_calc_reclaim_metadata_size takes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
d38b349c39 Btrfs: don't bother kicking async if there's nothing to reclaim
We do this check when we start the async reclaimer thread, might as well check
before we kick it off to save us some cycles.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
31bada7c4e Btrfs: fix release reserved extents trace points
We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
isn't quite right because we will go through and free that extent later when we
unpin, so it messes up apps that are accounting for the reservation space.  We
were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
only actually free the reservation instead of pinning the extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
f376df2b7d Btrfs: add tracepoints for flush events
We want to track when we're triggering flushing from our reservation code and
what flushing is being done when we start flushing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
f485c9ee32 Btrfs: fix delalloc reservation amount tracepoint
We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c51e7bb184 Btrfs: trace pinned extents
Pinned extents are an important metric to keep track of for enospc.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
957780eb27 Btrfs: introduce ticketed enospc infrastructure
Our enospc flushing sucks.  It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out.  So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation.  This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations.  Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread.  This async flusher thread does the same old
flushing we always did, just asynchronously.  As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything we need to avoid deadlocks.  These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c83f8effef Btrfs: add tracepoint for adding block groups
I'm writing a tool to visualize the enospc system inside btrfs, I need this
tracepoint in order to keep track of the block groups in the system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
d555b6c380 Btrfs: warn_on for unaccounted spaces
These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c48f49d63d Btrfs: change delayed reservation fallback behavior
We reserve space for the inode update when we first reserve space for writing to
a file.  However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents.  Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve.  The problem with this is we can easily
just return ENOSPC and fallback to updating the inode item directly.  In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through, however if we just fallback and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.

We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so we are more often than not only need one or two nodesize's worth of
reservations.  Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
48c3d480e4 Btrfs: always reserve metadata for delalloc extents
There are a few races in the metadata reservation stuff.  First we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it and after we've reserved the bytes.  So use the normal
btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know because we only check
before the reservation.  So instead make sure we are always reserving space for
the inode update, and then if we don't need it release those bytes afterward.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
25d609f86d Btrfs: fix callers of btrfs_block_rsv_migrate
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv.  This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size.  So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
e40edf2da4 Btrfs: add bytes_readonly to the spaceinfo at once
For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info.  This creates a tiny race where we
could over-reserve space because we haven't yet taken out the bytes_readonly
bit.  Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Linus Torvalds
da2f6aba4a Merge branch 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes part 2 from Chris Mason:
 "This has one patch from Omar to bring iterate_shared back to btrfs.

  We have a tree of work we queue up for directory items and it doesn't
  lend itself well to shared access.  While we're cleaning it up, Omar
  has changed things to use an exclusive lock when there are delayed
  items"

* 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
2016-06-25 08:53:38 -07:00
Linus Torvalds
b971712afc Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have a two part pull this time because one of the patches Dave
  Sterba collected needed to be against v4.7-rc2 or higher (we used
  rc4).  I try to make my for-linus-xx branch testable on top of the
  last major so we can hand fixes to people on the list more easily, so
  I've split this pull in two.

  This first part has some fixes and two performance improvements that
  we've been testing for some time.

  Josef's two performance fixes are most notable.  The transid tracking
  patch makes a big improvement on pretty much every workload"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: Force stripesize to the value of sectorsize
  btrfs: fix disk_i_size update bug when fallocate() fails
  Btrfs: fix error handling in map_private_extent_buffer
  Btrfs: fix error return code in btrfs_init_test_fs()
  Btrfs: don't do nocow check unless we have to
  btrfs: fix deadlock in delayed_ref_async_start
  Btrfs: track transid for delayed ref flushing
2016-06-25 08:42:31 -07:00
Omar Sandoval
02dbfc99b4 Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
Commit fe742fd4f9 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-25 06:20:10 -07:00
Chandan Rajendra
b7f67055d2 Btrfs: Force stripesize to the value of sectorsize
Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:42 -07:00
Wang Xiaoguang
c0d2f6104e btrfs: fix disk_i_size update bug when fallocate() fails
When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.

And below test case can reveal this bug:

dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs  -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint

echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
	i=$((i + 1))
	dst_offset=$((blocksize * i))
	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
		testfile > /dev/null
done
sync

truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit

In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:41 -07:00
Liu Bo
415b35a55b Btrfs: fix error handling in map_private_extent_buffer
map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
   than pagesize,
2. when it detects something insane.

The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.

Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:40 -07:00
Wei Yongjun
04e1b65af2 Btrfs: fix error return code in btrfs_init_test_fs()
Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:39 -07:00
Josef Bacik
c6887cd111 Btrfs: don't do nocow check unless we have to
Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
	--end_fsync=1

we get the worst case scenario where we have to fall back on to doing the check
anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:57:14 -07:00
Chris Mason
0f873eca82 btrfs: fix deadlock in delayed_ref_async_start
"Btrfs: track transid for delayed ref flushing" was deadlocking on
btrfs_attach_transaction because its not safe to call from the async
delayed ref start code.  This commit brings back btrfs_join_transaction
instead and checks for a blocked commit.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:54:18 -07:00
Josef Bacik
31b9655f43 Btrfs: track transid for delayed ref flushing
Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling.  This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore.  So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling.  If the transaction we
get is different then we can just bail out.  This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:54:18 -07:00
Linus Torvalds
4c6459f945 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The most user visible change here is a fix for our recent superblock
  validation checks that were causing problems on non-4k pagesized
  systems"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
  btrfs: remove build fixup for qgroup_account_snapshot
  btrfs: use new error message helper in qgroup_account_snapshot
  btrfs: avoid blocking open_ctree from cleaner_kthread
  Btrfs: don't BUG_ON() in btrfs_orphan_add
  btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
  Btrfs: check if extent buffer is aligned to sectorsize
  btrfs: Use correct format specifier
2016-06-18 05:57:59 -10:00
Chandan Rajendra
dd5c93111d Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
restricting stripesize to be equal to sectorsize would cause super block
validation to return an error on architectures where PAGE_SIZE is not
equal to 4096.

Hence as a workaround, this commit allows stripesize to be set to 4096
bytes.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:49 +02:00
David Sterba
89c5a5441d btrfs: remove build fixup for qgroup_account_snapshot
Introduced in 2c1984f244 ("btrfs: build fixup for
qgroup_account_snapshot") as temporary bisectability build fixup.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
David Sterba
f7af3934c2 btrfs: use new error message helper in qgroup_account_snapshot
We've renamed btrfs_std_error, this one is left from last merge.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Zygo Blaxell
90c711ab38 btrfs: avoid blocking open_ctree from cleaner_kthread
This fixes a problem introduced in commit 2f3165ecf1
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".

open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.

Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.

When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Josef Bacik
3b6571c180 Btrfs: don't BUG_ON() in btrfs_orphan_add
This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Jeff Mahoney
64c12921e1 btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor.  btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction.  trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.

In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion.  In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Liu Bo
c871b0f2fd Btrfs: check if extent buffer is aligned to sectorsize
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer().  An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Heinrich Schuchardt
16ff4b454f btrfs: Use correct format specifier
Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Linus Torvalds
3d0f0b6a55 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Has some fixes and some new self tests for btrfs.  The self tests are
  usually disabled in the .config file (unless you're doing btrfs dev
  work), and this bunch is meant to find problems with the 64K page size
  patches.

  Jeff has a patch to help people see if they are using the hardware
  assist crc32c module, which really helps us nail down problems when
  people ask why crcs are using so much CPU.

  Otherwise, it's small fixes"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
  Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
  Btrfs: self-tests: Use macros instead of constants and add missing newline
  Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
  Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
  btrfs: advertise which crc32c implementation is being used at module load
  Btrfs: add validadtion checks for chunk loading
  Btrfs: add more validation checks for superblock
  Btrfs: clear uptodate flags of pages in sys_array eb
  Btrfs: self-tests: Support non-4k page size
  Btrfs: Fix integer overflow when calculating bytes_per_bitmap
  Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
  Btrfs: end transaction if we abort when creating uuid root
  btrfs: Use __u64 in exported linux/btrfs.h.
2016-06-10 14:13:27 -07:00
Chris Mason
719da39a61 Merge branch 'misc-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7 2016-06-08 14:36:12 -07:00
Chris Mason
4c52990080 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7 2016-06-08 14:35:11 -07:00
Mike Christie
28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
6296b9604f block, drivers, fs: shrink bi_rw from long to int
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
81a75f6781 btrfs: use bio fields for op and flags
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
b3d3fa5199 btrfs: update __btrfs_map_block for REQ_OP transition
We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block.
It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS,
so this drops the bit tests.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
37226b2111 btrfs: use bio op accessors
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
1f7ad75b13 btrfs: have submit_one_bio users use bio op accessors
This patch has btrfs's submit_one_bio users set the bio op using
bio_set_op_attrs and get the op using bio_op.

The next patches will continue to convert btrfs,
so submit_bio_hook and merge_bio_hook
related code will be modified to take only the bio. I did
not do it in this patch to try and keep it smaller.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
8a4c1e42e0 direct-io: use bio set/get op accessors
This patch has the dio code use a REQ_OP for the op and rq_flag_bits
for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op
accssors.

It also begins to convert btrfs's dio_submit_t because of the dio
submit_io callout use. The next patches will completely convert
this code and the reset of the btrfs code paths.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
2a222ca992 fs: have submit_bh users pass in op and flags separately
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Feifei Xu
34b3e6c92a Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
In __test_eb_bitmaps(), we write random data to a bitmap. Then copy
the bitmap to another bitmap that resides inside an extent buffer.
Later we verify the values of corresponding bits in the bitmap and the
bitmap inside the extent buffer. However, extent_buffer_test_bit()
reads in byte granularity while test_bit() reads in unsigned long
granularity. Hence we end up comparing wrong bits on big-endian
systems such as ppc64. This commit fixes the issue by reading the
bitmap in byte granularity.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 17:17:12 +02:00
Feifei Xu
36b3dc05b4 Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
With 64K sectorsize, 1G sized block group cannot span across bitmaps.
To execute test_bitmaps() function, this commit allocates
"BITS_PER_BITMAP * sectorsize + PAGE_SIZE" sized block group.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 17:17:12 +02:00
Feifei Xu
ef9f2db365 Btrfs: self-tests: Use macros instead of constants and add missing newline
This commit replaces numerical constants with appropriate
preprocessor macros.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 17:17:12 +02:00
Feifei Xu
d94f43b4c6 Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
To test all possible sectorsizes, this commit adds a sectorsize
array. This commit executes the tests for all possible sectorsizes and
nodesizes.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 17:17:12 +02:00
Feifei Xu
ed9e4afdb0 Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
On ppc64, PAGE_SIZE is 64k which is same as BTRFS_MAX_METADATA_BLOCKSIZE.
In such a scenario, we will never be able to have an extent buffer
containing more than one page. Hence in such cases this commit does not
execute the page straddling tests.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 17:17:11 +02:00
Jeff Mahoney
5f9e1059d9 btrfs: advertise which crc32c implementation is being used at module load
Since several architectures support hardware-accelerated crc32c
calculation, it would be nice to confirm that btrfs is actually using it.

We can see an elevated use count for the module, but it doesn't actually
show who the users are.  This patch simply prints the name of the driver
after successfully initializing the shash.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ added a helper and used in module load-time message ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 14:08:28 +02:00
Liu Bo
e06cd3dd7c Btrfs: add validadtion checks for chunk loading
To prevent fuzzed filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.

Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 10:57:09 +02:00
Liu Bo
99e3ecfcb9 Btrfs: add more validation checks for superblock
This adds validation checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 10:41:53 +02:00
Liu Bo
d865177a5e Btrfs: clear uptodate flags of pages in sys_array eb
We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb.  As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.

If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will skip our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"

This adds the 'clear uptodate' part to force it to read from disk.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 10:14:40 +02:00
Linus Torvalds
b2d5ad8223 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The important part of this pull is Filipe's set of fixes for btrfs
  device replacement.  Filipe fixed a few issues seen on the list and a
  number he found on his own"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
  Btrfs: fix race between device replace and read repair
  Btrfs: fix race between device replace and discard
  Btrfs: fix race between device replace and chunk allocation
  Btrfs: fix race setting block group back to RW mode during device replace
  Btrfs: fix unprotected assignment of the left cursor for device replace
  Btrfs: fix race setting block group readonly during device replace
  Btrfs: fix race between device replace and block group removal
  Btrfs: fix race between readahead and device replace/removal
2016-06-04 11:56:28 -07:00
Chris Mason
8dff9c8534 Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map.  The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.

This actually works most of the time, except when there are extent maps
past the end of the inline extent.  rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.

Josef made a script to trigger with xfs_io:

	#!/bin/bash

	xfs_io -f -c "pwrite 0 1000" inline
	xfs_io -c "falloc -k 4k 1M" inline
	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
	xfs_io -c "fadvise -d 0 1000" inline
	cat inline

You'll get EIOs trying to read inline after this because add_extent_map
is returning EEXIST

Signed-off-by: Chris Mason <clm@fb.com>
2016-06-03 12:32:34 -07:00
Feifei Xu
b9ef22dedd Btrfs: self-tests: Support non-4k page size
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-02 19:23:14 +02:00
Feifei Xu
0ef6447a3d Btrfs: Fix integer overflow when calculating bytes_per_bitmap
On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to
fix integer overflow.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-02 19:22:49 +02:00
Feifei Xu
5473e0c426 Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
On a ppc64 machine using 64K as the block size, assume that the RB
tree at btrfs_free_space_ctl->free_space_offset contains following
two entries:

1. A bitmap entry having an offset value of 0 and having the bits
   corresponding to the address range [128M+512K, 128M+768K] set.
2. An extent entry corresponding to the address range
   [128M-256K, 128M-128K]

In such a scenario, test_check_exists() invoked for checking the
existence of address range [128M+768K, 256M] can lead to an
infinite loop as explained below:

- Checking for the extent entry fails.
- Checking for a bitmap entry results in the free space info in
  range [128M+512K, 128M+768K] beng returned.
- rb_prev(info) returns NULL because the bitmap entry starting from
  offset 0 comes first in the RB tree.
- current_node = bitmap node.
- while (current_node)
	tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/
	Since extent based free space entry's last address is smaller
	than the address being searched for (i.e. 128M+768K) we
	incorrectly again obtain the extent node as the "next right node"
	of the RB tree and thus end up looping infinitely.

This patch fixes the issue by checking the "tmp" variable which point
to the most recently searched free space node.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-02 19:22:34 +02:00
Josef Bacik
65d4f4c151 Btrfs: end transaction if we abort when creating uuid root
We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
otherwise we hang and make me super grumpy.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-01 00:32:42 +02:00
Filipe Manana
b5de8d0df8 Btrfs: fix race between device replace and read repair
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".

So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-31 01:00:03 +01:00
Filipe Manana
2999241daa Btrfs: fix race between device replace and discard
While we are finishing a device replace operation, we can make a discard
operation (fs mounted with -o discard) do an invalid memory access like
the one reported by the following trace:

[ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
[ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
[ 3206.388595] RIP: 0010:[<ffffffff8124d233>]  [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
[ 3206.388595] RSP: 0018:ffff880171b9bb80  EFLAGS: 00010246
[ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
[ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
[ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
[ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
[ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
[ 3206.388595] FS:  00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
[ 3206.388595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
[ 3206.388595] Stack:
[ 3206.388595]  0000000000004878 0000000000000000 0000000002400040 0000000000000000
[ 3206.388595]  0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
[ 3206.388595]  ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
[ 3206.388595] Call Trace:
[ 3206.388595]  [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595]  [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595]  [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
[ 3206.388595]  [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
[ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595]  [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
[ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595]  [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
[ 3206.388595]  [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
[ 3206.388595]  [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
[ 3206.388595]  [<ffffffff811a8417>] do_fsync+0x31/0x4a
[ 3206.388595]  [<ffffffff811a8637>] SyS_fsync+0x10/0x14
[ 3206.388595]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 3206.388595]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
[ 3206.388595]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa

This happens because when we call btrfs_map_block() from
btrfs_discard_extent() to get a btrfs_bio structure, the device replace
operation has not finished yet, but before we use the device of one of the
stripes from the returned btrfs_bio structure, the device object is freed.

This is illustrated by the following diagram.

            CPU 1                                                  CPU 2

 btrfs_dev_replace_start()

 (...)

 btrfs_dev_replace_finishing()

   btrfs_start_transaction()
   btrfs_commit_transaction()

   (...)

                                                            btrfs_sync_file()
                                                              btrfs_start_transaction()

                                                              (...)

                                                              btrfs_commit_transaction()
                                                                btrfs_finish_extent_commit()
                                                                  btrfs_discard_extent()
                                                                    btrfs_map_block()
                                                                      --> returns a struct btrfs_bio
                                                                          with a stripe that has a
                                                                          device field pointing to
                                                                          source device of the replace
                                                                          operation (the device that
                                                                          is being replaced)

   mutex_lock(&uuid_mutex)
   mutex_lock(&fs_info->fs_devices->device_list_mutex)
   mutex_lock(&fs_info->chunk_mutex)

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates the mapping tree and for each
         extent map that has a stripe pointing to
         the source device, it updates the stripe
         to point to the target device instead

   btrfs_rm_dev_replace_blocked()
     --> waits for fs_info->bio_counter to go down to 0

   btrfs_rm_dev_replace_remove_srcdev()
     --> removes source device from the list of devices

   mutex_unlock(&fs_info->chunk_mutex)
   mutex_unlock(&fs_info->fs_devices->device_list_mutex)
   mutex_unlock(&uuid_mutex)

   btrfs_rm_dev_replace_free_srcdev()
     --> frees the source device

                                                                    --> iterates over all stripes
                                                                        of the returned struct
                                                                        btrfs_bio
                                                                    --> for each stripe it
                                                                        dereferences its device
                                                                        pointer
                                                                        --> it ends up finding a
                                                                            pointer to the device
                                                                            used as the source
                                                                            device for the replace
                                                                            operation and that was
                                                                            already freed

So fix this by surrounding the call to btrfs_map_block(), and the code
that uses the returned struct btrfs_bio, with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
the finishing phase of the device replace operation blocks until the
the bio counter decreases to zero before it frees the source device.
This is the same approach we do at btrfs_map_bio() for example.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-31 00:59:44 +01:00
Filipe Manana
22ab04e814 Btrfs: fix race between device replace and chunk allocation
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.

However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.

For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:

  [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
  3Gb             4Gb                         5Gb

While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:

        CPU 1                                            CPU 2

                      <we are at transaction N>

  scrub_enumerate_chunks()
    --> searches the device tree for
        extents belonging to the source
        device using the device tree's
        commit root
    --> 1st iteration finds extent belonging to
        block group A

        --> sets block group A to RO mode
            (btrfs_inc_block_group_ro)

        --> sets cursor left to found_key.offset
            which is 3Gb

        --> scrub_chunk() starts
            copies all allocated extents from
            block group's A stripe at source
            device into target device

                                                           btrfs_alloc_chunk()
                                                             --> allocates device extent
                                                                 in the range [4Gb, 5Gb[
                                                                 from the source device for
                                                                 a new block group C

                                                           extent allocated from block
                                                           group C for a direct IO,
                                                           buffered write or btree node/leaf

                                                           extent is written to, perhaps
                                                           in response to a writepages()
                                                           call from the VM or directly
                                                           through direct IO

                                                           the write is made only against
                                                           the source device and not against
                                                           the target device because the
                                                           extent's offset is in the interval
                                                           [4Gb, 5Gb[ which is larger then
                                                           the value of cursor_left (3Gb)

        --> scrub_chunks() finishes

        --> updates left cursor from 3Gb to
            4Gb

        --> btrfs_dec_block_group_ro() sets
            block group A back to RW mode

                             <we are still at transaction N>

    --> 2nd iteration finds extent belonging to
        block group B - it did not find the new
        extent in the range [4Gb, 5Gb[ for block
        group C because we are using the device
        tree's commit root or even because the
        block group's items are not all yet
        inserted in the respective btrees, that is,
        the block group is still attached to some
        transaction handle's new_bgs list and
        btrfs_create_pending_block_groups() was
        not called yet against that transaction
        handle, so the device extent items were
        not yet inserted into the devices tree

                             <we are still at transaction N>

        --> so we end not copying anything from the newly
            allocated device extent from the source device
            to the target device

So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:26 +01:00
Filipe Manana
1a1a8b732c Btrfs: fix race setting block group back to RW mode during device replace
After it finishes processing a device extent, the device replace code sets
back the block group to RW mode and then after that it sets the left cursor
to match the logical end address of the block group, so that future writes
into extents belonging to the block group go both the source (old) and
target (new) devices. However from the moment we turn the block group
back to RW mode we have a short time window, that lasts until we update
the left cursor's value, where extents can be allocated from the block
group and written to, in which case they will not be copied/written to
the target (new) device. Fix this by updating the left cursor's value
before turning the block group back to RW mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:24 +01:00
Filipe Manana
81e87a736c Btrfs: fix unprotected assignment of the left cursor for device replace
We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:23 +01:00
Filipe Manana
f0e9b7d640 Btrfs: fix race setting block group readonly during device replace
When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode some concurrent task might have already
allocated an extent from it or decided it could perform a nocow write
into one of its extents, which can make the device replace process to
miss copying an extent since it uses the extent tree's commit root to
search for extents and only once it finishes searching for all extents
belonging to the block group it does set the left cursor to the logical
end address of the block group - this is a problem if the respective
ordered extents finish while we are searching for extents using the
extent tree's commit root and no transaction commit happens while we
are iterating the tree, since it's the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root of course).
Example:

          CPU 1                                            CPU 2

 btrfs_dev_replace_start()
   btrfs_scrub_dev()
     scrub_enumerate_chunks()
       --> finds device extent belonging
           to block group X

                               <transaction N starts>

                                                      starts buffered write
                                                      against some inode

                                                      writepages is run against
                                                      that inode forcing dellaloc
                                                      to run

                                                      btrfs_writepages()
                                                        extent_writepages()
                                                          extent_write_cache_pages()
                                                            __extent_writepage()
                                                              writepage_delalloc()
                                                                run_delalloc_range()
                                                                  cow_file_range()
                                                                    btrfs_reserve_extent()
                                                                      --> allocates an extent
                                                                          from block group X
                                                                          (which is not yet
                                                                           in RO mode)
                                                                    btrfs_add_ordered_extent()
                                                                      --> creates ordered extent Y
                                                        flush_epd_write_bio()
                                                          --> bio against the extent from
                                                              block group X is submitted

       btrfs_inc_block_group_ro(bg X)
         --> sets block group X to readonly

       scrub_chunk(bg X)
         scrub_stripe(device extent from srcdev)
           --> keeps searching for extent items
               belonging to the block group using
               the extent tree's commit root
           --> it never blocks due to
               fs_info->scrub_pause_req as no
               one tries to commit transaction N
           --> copies all extents found from the
               source device into the target device
           --> finishes search loop

                                                        bio completes

                                                        ordered extent Y completes
                                                        and creates delayed data
                                                        reference which will add an
                                                        extent item to the extent
                                                        tree when run (typically
                                                        at transaction commit time)

                                                          --> so the task doing the
                                                              scrub/device replace
                                                              at CPU 1 misses this
                                                              and does not copy this
                                                              extent into the new/target
                                                              device

       btrfs_dec_block_group_ro(bg X)
         --> turns block group X back to RW mode

       dev_replace->cursor_left is set to the
       logical end offset of block group X

So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:21 +01:00
Filipe Manana
57ba4cb85b Btrfs: fix race between device replace and block group removal
When it's finishing, the device replace code iterates all extent maps
representing block group and for each one that has a stripe that refers
to the source device, it replaces its device with the target device.
However when it replaces the source device with the target device it,
the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
only after its ID is changed to match the one from the source device.
This leads to races with the chunk removal code that can temporarly see
a device with an ID of 0ULL and then attempt to use that ID to remove
items from the device tree and fail, causing a transaction abort:

[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
[ 9238.594377] ------------[ cut here ]------------
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
oppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 9238.594421]  0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
[ 9238.594422]  0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
[ 9238.594423]  0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[ 9238.594430]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[ 9238.594432]  [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434]  [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
[ 9238.594450]  [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[ 9238.594464]  [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476]  [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489]  [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490]  [<ffffffff8106f403>] kthread+0xd4/0xdc
[ 9238.594494]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
[ 9238.594495]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace 183efbe50275f059 ]---

The sequence of steps leading to this is like the following:

              CPU 1                                           CPU 2

 btrfs_dev_replace_finishing()

   at this point
   dev_replace->tgtdev->devid ==
   BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                                     btrfs_delete_unused_bgs()
                                                       btrfs_remove_chunk()

                                                         looks up for the extent map
                                                         corresponding to the chunk

                                                         lock_chunks() (chunk_mutex)
                                                         check_system_chunk()
                                                         unlock_chunks() (chunk_mutex)

   locks fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree and
         replaces the device in every extent
         map's map->stripes[] with
         dev_replace->tgtdev, which still has
         an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)

                                                         iterates over all stripes from
                                                         the extent map

                                                           --> calls btrfs_free_dev_extent()
                                                               passing it the target device
                                                               that still has an ID of 0ULL

                                                           --> btrfs_free_dev_extent() fails
                                                             --> aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the value
   of srcdev->devid (which is necessarily > 0)

   frees the srcdev

   unlocks fs_info->chunk_mutex

So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit 50460e3718
("Btrfs: fix race when finishing dev replace leading to transaction abort").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:19 +01:00
Filipe Manana
ce7791ffee Btrfs: fix race between readahead and device replace/removal
The list of devices is protected by the device_list_mutex and the device
replace code, in its finishing phase correctly takes that mutex before
removing the source device from that list. However the readahead code was
iterating that list without acquiring the respective mutex leading to
crashes later on due to invalid memory accesses:

[125671.831036] general protection fault: 0000 [#1] PREEMPT SMP
[125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
processor ser
[125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G        W       4.6.0-rc7-btrfs-next-29+ #1
[125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
[125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000
[125671.834973] RIP: 0010:[<ffffffff81270479>]  [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105
[125671.834973] RSP: 0018:ffff8801ac91bc28  EFLAGS: 00010206
[125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000
[125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8
[125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000
[125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8
[125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff
[125671.834973] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
[125671.834973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0
[125671.834973] Stack:
[125671.834973]  0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000
[125671.834973]  0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000
[125671.834973]  ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0
[125671.834973] Call Trace:
[125671.834973]  [<ffffffff81270541>] radix_tree_lookup+0xd/0xf
[125671.834973]  [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
[125671.834973]  [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs]
[125671.834973]  [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs]
[125671.834973]  [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
[125671.834973]  [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs]
[125671.834973]  [<ffffffff81069691>] process_one_work+0x271/0x4e9
[125671.834973]  [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9
[125671.834973]  [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3
[125671.834973]  [<ffffffff8106f403>] kthread+0xd4/0xdc
[125671.834973]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
[125671.834973]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286

So fix this by taking the device_list_mutex in the readahead code. We
can't use here the lighter approach of using a rcu_read_lock() and
rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
because we end up doing calls to sleeping functions (kzalloc()) in the
respective code path.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:18 +01:00
Linus Torvalds
d102a56edb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Followups to the parallel lookup work:

   - update docs

   - restore killability of the places that used to take ->i_mutex
     killably now that we have down_write_killable() merged

   - Additionally, it turns out that I missed a prerequisite for
     security_d_instantiate() stuff - ->getxattr() wasn't the only thing
     that could be called before dentry is attached to inode; with smack
     we needed the same treatment applied to ->setxattr() as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->setxattr() to passing dentry and inode separately
  switch xattr_handler->set() to passing dentry and inode separately
  restore killability of old mutex_lock_killable(&inode->i_mutex) users
  add down_write_killable_nested()
  update D/f/directory-locking
2016-05-27 17:14:05 -07:00
Linus Torvalds
559b6d90a0 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "We have another round of fixes and a few cleanups.

  I have a fix for short returns from btrfs_copy_from_user, which
  finally nails down a very hard to find regression we added in v4.6.

  Dave is pushing around gfp parameters, mostly to cleanup internal apis
  and make it a little more consistent.

  The rest are smaller fixes, and one speelling fixup patch"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
  Btrfs: fix handling of faults from btrfs_copy_from_user
  btrfs: fix string and comment grammatical issues and typos
  btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
  Btrfs: fix unexpected return value of fiemap
  Btrfs: free sys_array eb as soon as possible
  btrfs: sink gfp parameter to convert_extent_bit
  btrfs: make state preallocation more speculative in __set_extent_bit
  btrfs: untangle gotos a bit in convert_extent_bit
  btrfs: untangle gotos a bit in __clear_extent_bit
  btrfs: untangle gotos a bit in __set_extent_bit
  btrfs: sink gfp parameter to set_record_extent_bits
  btrfs: sink gfp parameter to set_extent_new
  btrfs: sink gfp parameter to set_extent_defrag
  btrfs: sink gfp parameter to set_extent_delalloc
  btrfs: sink gfp parameter to clear_extent_dirty
  btrfs: sink gfp parameter to clear_record_extent_bits
  btrfs: sink gfp parameter to clear_extent_bits
  btrfs: sink gfp parameter to set_extent_bits
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: make find_workspace always succeed
  ...
2016-05-27 16:37:36 -07:00
Al Viro
5930122683 switch xattr_handler->set() to passing dentry and inode separately
preparation for similar switch in ->setxattr() (see the next commit for
rationale).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27 15:39:43 -04:00
Chris Mason
56244ef151 Btrfs: fix handling of faults from btrfs_copy_from_user
When btrfs_copy_from_user isn't able to copy all of the pages, we need
to adjust our accounting to reflect the work that was actually done.

Commit 2e78c927d7 changed around the decisions a little and we ended up
skipping the accounting adjustments some of the time.  This commit makes
sure that when we don't copy anything at all, we still hop into
the adjustments, and switches to release_bytes instead of write_bytes,
since write_bytes isn't aligned.

The accounting errors led to warnings during btrfs_destroy_inode:

[   70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
[   70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
[   70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G        W 4.6.0-rc6_00062_g2997da1-dirty #23
[   70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[   70.847542]  0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
[   70.847543]  0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
[   70.847547]  ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
[   70.847547] Call Trace:
[   70.847550]  [<ffffffff8149d5e9>] dump_stack+0x51/0x78
[   70.847551]  [<ffffffff8107bdfd>] __warn+0xfd/0x120
[   70.847553]  [<ffffffff8107be3d>] warn_slowpath_null+0x1d/0x20
[   70.847555]  [<ffffffff8139c9e3>] btrfs_destroy_inode+0x2b3/0x2c0
[   70.847556]  [<ffffffff812003a1>] ? __destroy_inode+0x71/0x140
[   70.847558]  [<ffffffff812004b3>] destroy_inode+0x43/0x70
[   70.847559]  [<ffffffff810b7b5f>] ? wake_up_bit+0x2f/0x40
[   70.847560]  [<ffffffff81200c68>] evict+0x148/0x1d0
[   70.847562]  [<ffffffff81398ade>] ? start_transaction+0x3de/0x460
[   70.847564]  [<ffffffff81200d49>] dispose_list+0x59/0x80
[   70.847565]  [<ffffffff81201ba0>] evict_inodes+0x180/0x190
[   70.847566]  [<ffffffff812191ff>] ? __sync_filesystem+0x3f/0x50
[   70.847568]  [<ffffffff811e95f8>] generic_shutdown_super+0x48/0x100
[   70.847569]  [<ffffffff810b75c0>] ? woken_wake_function+0x20/0x20
[   70.847571]  [<ffffffff811e9796>] kill_anon_super+0x16/0x30
[   70.847573]  [<ffffffff81365cde>] btrfs_kill_super+0x1e/0x130
[   70.847574]  [<ffffffff811e99be>] deactivate_locked_super+0x4e/0x90
[   70.847576]  [<ffffffff811e9e61>] deactivate_super+0x51/0x70
[   70.847577]  [<ffffffff8120536f>] cleanup_mnt+0x3f/0x80
[   70.847579]  [<ffffffff81205402>] __cleanup_mnt+0x12/0x20
[   70.847581]  [<ffffffff81098358>] task_work_run+0x68/0xa0
[   70.847582]  [<ffffffff810022b6>] exit_to_usermode_loop+0xd6/0xe0
[   70.847583]  [<ffffffff81002e1d>] do_syscall_64+0xbd/0x170
[   70.847586]  [<ffffffff817d4dbc>] entry_SYSCALL64_slow_path+0x25/0x25

This is the test program I used to force short returns from
btrfs_copy_from_user

void *dontneed(void *arg)
{
	char *p = arg;
	int ret;

	while(1) {
		ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
		if (ret) {
			perror("madvise");
			exit(1);
		}
	}
}

int main(int ac, char **av) {
	int ret;
	int fd;
	char *filename;
	unsigned long offset;
	char *buf;
	int i;
	pthread_t tid;

	if (ac != 2) {
		fprintf(stderr, "usage: dammitdave filename\n");
		exit(1);
	}

	buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memset(buf, 'a', BUFSIZE);
	filename = av[1];

	ret = pthread_create(&tid, NULL, dontneed, buf);
	if (ret) {
		fprintf(stderr, "error %d from pthread_create\n", ret);
		exit(1);
	}

	ret = pthread_detach(tid);
	if (ret) {
		fprintf(stderr, "pthread detach failed %d\n", ret);
		exit(1);
	}

	while (1) {
		fd = open(filename, O_RDWR | O_CREAT, 0600);
		if (fd < 0) {
			perror("open");
			exit(1);
		}

		for (i = 0; i < ROUNDS; i++) {
			int this_write = BUFSIZE;

			offset = rand() % MAXSIZE;
			ret = pwrite(fd, buf, this_write, offset);
			if (ret < 0) {
				perror("pwrite");
				exit(1);
			} else if (ret != this_write) {
				fprintf(stderr, "short write to %s offset %lu ret %d\n",
					filename, offset, ret);
				exit(1);
			}
			if (i == ROUNDS - 1) {
				ret = sync_file_range(fd, offset, 4096,
				    SYNC_FILE_RANGE_WRITE);
				if (ret < 0) {
					perror("sync_file_range");
					exit(1);
				}
			}
		}
		ret = ftruncate(fd, 0);
		if (ret < 0) {
			perror("ftruncate");
			exit(1);
		}
		ret = close(fd);
		if (ret) {
			perror("close");
			exit(1);
		}
		ret = unlink(filename);
		if (ret) {
			perror("unlink");
			exit(1);
		}

	}
	return 0;
}

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Jones <dsj@fb.com>
Fixes: 2e78c927d7
cc: stable@vger.kernel.org # v4.6
Signed-off-by: Chris Mason <clm@fb.com>
2016-05-26 13:23:59 -07:00
Al Viro
002354112f restore killability of old mutex_lock_killable(&inode->i_mutex) users
The ones that are taking it exclusive, that is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-26 00:13:25 -04:00
David Sterba
42f31734eb Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
Nicholas D Steeves
0132761017 btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 22:35:14 +02:00
Zhao Lei
f1fee6534d btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
We usually call btrfs_put_bbio() when btrfs_map_block() failed,
btrfs_put_bbio() works right whether bbio is a valid value, or NULL.

But there is a exception, in some case, btrfs_map_block() will return
fail without touching *bbio(keeping its original value), and if bbio
was not initialized yet, invalid memory accessing will happened.

Above case is in scrub_missing_raid56_pages(), and similar case in
scrub_raid56_parity().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 22:15:21 +02:00
Liu Bo
2d324f59f3 Btrfs: fix unexpected return value of fiemap
btrfs's fiemap is supposed to return 0 on success and return < 0 on
error. however, ret becomes 1 after looking up the last file extent:

  btrfs_lookup_file_extent ->
    btrfs_search_slot(..., ins_len=0, cow=0)

and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.

This may confuse applications using ioctl(FIEL_IOC_FIEMAP).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 19:53:54 +02:00
Liu Bo
1c8b5b6e8b Btrfs: free sys_array eb as soon as possible
While reading sys_chunk_array in superblock, btrfs creates a temporary
extent buffer.  Since we don't use it after finishing reading
 sys_chunk_array, we don't need to keep it in memory.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 19:53:51 +02:00
Linus Torvalds
07be1337b9 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has our merge window series of cleanups and fixes.  These target
  a wide range of issues, but do include some important fixes for
  qgroups, O_DIRECT, and fsync handling.  Jeff Mahoney moved around a
  few definitions to make them easier for userland to consume.

  Also whiteout support is included now that issues with overlayfs have
  been cleared up.

  I have one more fix pending for page faults during btrfs_copy_from_user,
  but I wanted to get this bulk out the door first"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits)
  btrfs: fix memory leak during RAID 5/6 device replacement
  Btrfs: add semaphore to synchronize direct IO writes with fsync
  Btrfs: fix race between block group relocation and nocow writes
  Btrfs: fix race between fsync and direct IO writes for prealloc extents
  Btrfs: fix number of transaction units for renames with whiteout
  Btrfs: pin logs earlier when doing a rename exchange operation
  Btrfs: unpin logs if rename exchange operation fails
  Btrfs: fix inode leak on failure to setup whiteout inode in rename
  btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
  Btrfs: pin log earlier when renaming
  Btrfs: unpin log if rename operation fails
  Btrfs: don't do unnecessary delalloc flushes when relocating
  Btrfs: don't wait for unrelated IO to finish before relocation
  Btrfs: fix empty symlink after creating symlink and fsync parent dir
  Btrfs: fix for incorrect directory entries after fsync log replay
  btrfs: build fixup for qgroup_account_snapshot
  btrfs: qgroup: Fix qgroup accounting when creating snapshot
  Btrfs: fix fspath error deallocation
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: make find_workspace always succeed
  ...
2016-05-21 10:49:22 -07:00
Andy Shevchenko
8da4b8c48e lib/uuid.c: move generate_random_uuid() to uuid.c
Let's gather the UUID related functions under one hood.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Linus Torvalds
e34df3344d Merge branch 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel lookup fixups from Al Viro:
 "Fix for xfs parallel readdir (turns out the cxfs exposure was not
  enough to catch all problems), and a reversion of btrfs back to
  ->iterate() until the fs/btrfs/delayed-inode.c gets fixed"

* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xfs: concurrent readdir hangs on data buffer locks
  Revert "btrfs: switch to ->iterate_shared()"
2016-05-18 10:28:45 -07:00
Al Viro
fe742fd4f9 Revert "btrfs: switch to ->iterate_shared()"
This reverts commit 972b241f84.
Quoth Chris:
	didn't take the delayed inode stuff into account
	it got an rbtree of items and it pulls things out
	so in shared mode, its hugely racey
	sorry, lets revert and fix it for real inside of btrfs

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18 13:19:17 -04:00
Linus Torvalds
ba5a2655c2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull remaining vfs xattr work from Al Viro:
 "The rest of work.xattr (non-cifs conversions)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: Switch to generic xattr handlers
  ubifs: Switch to generic xattr handlers
  jfs: Switch to generic xattr handlers
  jfs: Clean up xattr name mapping
  gfs2: Switch to generic xattr handlers
  ceph: kill __ceph_removexattr()
  ceph: Switch to generic xattr handlers
  ceph: Get rid of d_find_alias in ceph_set_acl
2016-05-18 10:08:45 -07:00
Andreas Gruenbacher
e0d46f5c6e btrfs: Switch to generic xattr handlers
The btrfs_{set,remove}xattr inode operations check for a read-only root
(btrfs_root_readonly) before calling into generic_{set,remove}xattr.  If
this check is moved into __btrfs_setxattr, we can get rid of
btrfs_{set,remove}xattr.

This patch applies to mainline, I would like to keep it together with
the other xattr cleanups if possible, though.  Could you please review?

Thanks,
Andreas

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-17 19:17:09 -04:00
Linus Torvalds
c2e7b20705 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro:
 "More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: use RWF_SYNC
  fs: add RWF_DSYNC aand RWF_SYNC
  ceph: use generic_write_sync
  fs: simplify the generic_write_sync prototype
  fs: add IOCB_SYNC and IOCB_DSYNC
  direct-io: remove the offset argument to dio_complete
  direct-io: eliminate the offset argument to ->direct_IO
  xfs: eliminate the pos variable in xfs_file_dio_aio_write
  filemap: remove the pos argument to generic_file_direct_write
  filemap: remove pos variables in generic_file_read_iter
2016-05-17 15:05:23 -07:00
Chris Mason
c315ef8d9d Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7
Signed-off-by: Chris Mason <clm@fb.com>
2016-05-17 14:43:19 -07:00
David Sterba
680834ca0a Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516
# Conflicts:
#	include/uapi/linux/btrfs.h
2016-05-16 15:46:29 +02:00
David Sterba
36fac9e9ff Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516 2016-05-16 15:46:26 +02:00
David Sterba
5ef64a3e75 Merge branch 'cleanups-4.7' into for-chris-4.7-20160516 2016-05-16 15:46:24 +02:00
Scott Talbert
4673272f43 btrfs: fix memory leak during RAID 5/6 device replacement
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never
freed anywhere.

Signed-off-by: Scott Talbert <scott.talbert@hgst.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-16 10:17:58 +02:00
Filipe Manana
5f9a8a51d8 Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex
is not held) introduced in commit 38851cc19a ("Btrfs: implement unlocked
dio write"), we started having races between such writes with concurrent
fsync operations that use the fast fsync path. These races were addressed
in the patches titled "Btrfs: fix race between fsync and lockless direct
IO writes" and "Btrfs: fix race between fsync and direct IO writes for
prealloc extents". The races happened because the direct IO path, like
every other write path, does create extent maps followed by the
corresponding ordered extents while the fast fsync path collected first
ordered extents and then it collected extent maps. This made it possible
to log file extent items (based on the collected extent maps) without
waiting for the corresponding ordered extents to complete (get their IO
done). The two fixes mentioned before added a solution that consists of
making the direct IO path create first the ordered extents and then the
extent maps, while the fsync path attempts to collect any new ordered
extents once it collects the extent maps. This was simple and did not
require adding any synchonization primitive to any data structure (struct
btrfs_inode for example) but it makes things more fragile for future
development endeavours and adds an exceptional approach compared to the
other write paths.

This change adds a read-write semaphore to the btrfs inode structure and
makes the direct IO path create the extent maps and the ordered extents
while holding read access on that semaphore, while the fast fsync path
collects extent maps and ordered extents while holding write access on
that semaphore. The logic for direct IO write path is encapsulated in a
new helper function that is used both for cow and nocow direct IO writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:36 +01:00
Filipe Manana
f78c436c39 Btrfs: fix race between block group relocation and nocow writes
Relocation of a block group waits for all existing tasks flushing
dellaloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:

           CPU 1                                     CPU 2

 btrfs_relocate_block_group(bg X)

                                               direct IO write starts against
                                               an extent in block group X
                                               using nocow mode (inode has the
                                               nodatacow flag or the write is
                                               for a prealloc extent)

                                               btrfs_direct_IO()
                                                 btrfs_get_blocks_direct()
                                                   --> can_nocow_extent() returns 1

   btrfs_inc_block_group_ro(bg X)
     --> turns block group into RO mode

   btrfs_wait_ordered_roots()
     --> returns and does not know about
         the DIO write happening at CPU 2
         (the task there has not created
          yet an ordered extent)

   relocate_block_group(bg X)
     --> rc->stage == MOVE_DATA_EXTENTS

     find_next_extent()
       --> returns extent that the DIO
           write is going to write to

     relocate_data_extent()

       relocate_file_extent_cluster()

         --> reads the extent from disk into
             pages belonging to the relocation
             inode and dirties them

                                                   --> creates DIO ordered extent

                                                 btrfs_submit_direct()
                                                   --> submits bio against a location
                                                       on disk obtained from an extent
                                                       map before the relocation started

   btrfs_wait_ordered_range()
     --> writes all the pages read before
         to disk (belonging to the
         relocation inode)

   relocation finishes

                                                 bio completes and wrote new data
                                                 to the old location of the block
                                                 group

So fix this by tracking the number of nocow writers for a block group and
make sure relocation waits for that number to go down to 0 before starting
to move the extents.

The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchonization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that fixed only races with direct IO writes that do cow.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:34 +01:00
Filipe Manana
0b901916a0 Btrfs: fix race between fsync and direct IO writes for prealloc extents
When we do a direct IO write against a preallocated extent (fallocate)
that does not go beyond the i_size of the inode, we do the write operation
without holding the inode's i_mutex (an optimization that landed in
commit 38851cc19a ("Btrfs: implement unlocked dio write")). This allows
for a very tiny time window where a race can happen with a concurrent
fsync using the fast code path, as the direct IO write path creates first
a new extent map (no longer flagged as a prealloc extent) and then it
creates the ordered extent, while the fast fsync path first collects
ordered extents and then it collects extent maps. This allows for the
possibility of the fast fsync path to collect the new extent map without
collecting the new ordered extent, and therefore logging an extent item
based on the extent map without waiting for the ordered extent to be
created and complete. This can result in a situation where after a log
replay we end up with an extent not marked anymore as prealloc but it was
only partially written (or not written at all), exposing random, stale or
garbage data corresponding to the unwritten pages and without any
checksums in the csum tree covering the extent's range.

This is an extension of what was done in commit de0ee0edb2 ("Btrfs: fix
race between fsync and lockless direct IO writes").

So fix this by creating first the ordered extent and then the extent
map, so that this way if the fast fsync patch collects the new extent
map it also collects the corresponding ordered extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:32 +01:00
Filipe Manana
5062af35c3 Btrfs: fix number of transaction units for renames with whiteout
When we do a rename with the whiteout flag, we need to create the whiteout
inode, which in the worst case requires 5 transaction units (1 inode item,
1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
number of transaction units from 11 to 16 if the whiteout flag is set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:30 +01:00
Filipe Manana
376e5a57bf Btrfs: pin logs earlier when doing a rename exchange operation
The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
which had a race fixed by my previous patch titled "Btrfs: pin log earlier
when renaming", and so it suffers from the same problem.

We pin the logs of the affected roots after we insert the new inode
references, leaving a time window where concurrent tasks logging the
inodes can end up logging both the new and old references, resulting
in log trees that when replayed can turn the metadata into inconsistent
states. This behaviour was added to btrfs_rename() in 2009 without any
explanation about why not pinning the logs earlier, just leaving a
comment about the posibility for the race. As of today it's perfectly
safe and sane to pin the logs before we start doing any of the steps
involved in the rename operation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:28 +01:00
Filipe Manana
86e8aa0e77 Btrfs: unpin logs if rename exchange operation fails
If rename exchange operations fail at some point after we pinned any of
the logs, we end up aborting the current transaction but never unpin the
logs, which leaves concurrent tasks that are trying to sync the logs (as
part of an fsync request from user space) blocked forever and preventing
the filesystem from being unmountable.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:26 +01:00
Filipe Manana
c990161888 Btrfs: fix inode leak on failure to setup whiteout inode in rename
If we failed to fully setup the whiteout inode during a rename operation
with the whiteout flag, we ended up leaking the inode, not decrementing
its link count nor removing all its items from the fs/subvol tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:23 +01:00
Dan Fuhry
cdd1fedf82 btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
behavior in the renameat2() syscall. This behavior is primarily used by
overlayfs. This patch adds support for these flags to btrfs, enabling it to
be used as a fully functional upper layer for overlayfs.

RENAME_EXCHANGE support was written by Davide Italiano originally
submitted on 2 April 2015.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Dan Fuhry <dfuhry@datto.com>
[ remove unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:21 +01:00
Filipe Manana
c4aba95454 Btrfs: pin log earlier when renaming
We were pinning the log right after the first step in the rename operation
(inserting inode ref for the new name in the destination directory)
instead of doing it before. This behaviour was introduced in 2009 for some
reason that was not mentioned neither on the changelog nor any comment,
with the drawback of a small time window where concurrent log writers can
end up logging the new inode reference for the inode we are renaming while
the rename operation is in progress (so that we can end up with a log
containing both the new and old references). As of today there's no reason
to not pin the log before that first step anymore, so just fix this.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:19 +01:00
Filipe Manana
3dc9e8f767 Btrfs: unpin log if rename operation fails
If rename operations fail at some point after we pinned the log, we end
up aborting the current transaction but never unpin the log, which leaves
concurrent tasks that are trying to sync the log (as part of an fsync
request from user space) blocked forever and preventing the filesystem
from being unmountable.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:18 +01:00
Filipe Manana
9cfa3e34e2 Btrfs: don't do unnecessary delalloc flushes when relocating
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).

These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).

So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.

This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:

        CPU 1                                       CPU 2

 btrfs_relocate_block_group(bg X)

                                           starts direct IO write,
                                           target inode currently has no
                                           ordered extents ongoing nor
                                           dirty pages (delalloc regions),
                                           therefore the root for our inode
                                           is not in the list
                                           fs_info->ordered_roots

                                           btrfs_direct_IO()
                                             __blockdev_direct_IO()
                                               btrfs_get_blocks_direct()
                                                 btrfs_lock_extent_direct()
                                                   locks range in the io tree
                                                 btrfs_new_extent_direct()
                                                   btrfs_reserve_extent()
                                                     --> extent allocated
                                                         from bg X

   btrfs_inc_block_group_ro(bg X)

   btrfs_start_delalloc_roots()
     __start_delalloc_inodes()
       --> does nothing, no dealloc ranges
           in the inode's io tree so the
           inode's root is not in the list
           fs_info->delalloc_roots

   btrfs_wait_ordered_roots()
     --> does not find the inode's root in the
         list fs_info->ordered_roots

     --> ends up not waiting for the direct IO
         write started by the task at CPU 2

   relocate_block_group(rc->stage ==
     MOVE_DATA_EXTENTS)

     prepare_to_relocate()
       btrfs_commit_transaction()

     iterates the extent tree, using its
     commit root and moves extents into new
     locations

                                                   btrfs_add_ordered_extent_dio()
                                                     --> now a ordered extent is
                                                         created and added to the
                                                         list root->ordered_extents
                                                         and the root added to the
                                                         list fs_info->ordered_roots
                                                     --> this is too late and the
                                                         task at CPU 1 already
                                                         started the relocation

     btrfs_commit_transaction()

                                                   btrfs_finish_ordered_io()
                                                     btrfs_alloc_reserved_file_extent()
                                                       --> adds delayed data reference
                                                           for the extent allocated
                                                           from bg X

   relocate_block_group(rc->stage ==
     UPDATE_DATA_PTRS)

     prepare_to_relocate()
       btrfs_commit_transaction()
         --> delayed refs are run, so an extent
             item for the allocated extent from
             bg X is added to extent tree
         --> commit roots are switched, so the
             next scan in the extent tree will
             see the extent item

     sees the extent in the extent tree

When this happens the relocation produces the following warning when it
finishes:

[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---

This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:

[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431]  RSP <ffff8802046878f0>
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13 01:59:16 +01:00
Filipe Manana
578def7c50 Btrfs: don't wait for unrelated IO to finish before relocation
Before the relocation process of a block group starts, it sets the block
group to readonly mode, then flushes all delalloc writes and then finally
it waits for all ordered extents to complete. This last step includes
waiting for ordered extents destinated at extents allocated in other block
groups, making us waste unecessary time.

So improve this by waiting only for ordered extents that fall into the
block group's range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13 01:59:14 +01:00
Filipe Manana
3f9749f6e9 Btrfs: fix empty symlink after creating symlink and fsync parent dir
If we create a symlink, fsync its parent directory, crash/power fail and
mount the filesystem, we end up with an empty symlink, which not only is
useless it's also not allowed in linux (the man page symlink(2) is well
explicit about that).  So we just need to make sure to fully log an inode
if it's a symlink, to ensure its inline extent gets logged, ensuring the
same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.

Example reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/testdir
  $ sync
  $ ln -s /mnt/foo /mnt/testdir/bar
  $ xfs_io -c fsync /mnt/testdir
  <power fail>
  $ mount /dev/sdb /mnt
  $ readlink /mnt/testdir/bar
  <empty string>

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:12 +01:00
Filipe Manana
657ed1aa48 Btrfs: fix for incorrect directory entries after fsync log replay
If we move a directory to a new parent and later log that parent and don't
explicitly log the old parent, when we replay the log we can end up with
entries for the moved directory in both the old and new parent directories.
Besides being ilegal to have directories with multiple hard links in linux,
it also resulted in the leaving the inode item with a link count of 1.
A similar issue also happens if we move a regular file - after the log tree
is replayed the file has a link in both the old and new parent directories,
when it should be only at the new directory.

Sample reproducer:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/x
  $ mkdir /mnt/y
  $ touch /mnt/x/foo
  $ mkdir /mnt/y/z
  $ sync
  $ ln /mnt/x/foo /mnt/x/bar
  $ mv /mnt/y/z /mnt/x/z
  < power fail >
  $ mount /dev/sdc /mnt
  $ ls -1Ri /mnt
  /mnt:
  257 x
  258 y

  /mnt/x:
  259 bar
  259 foo
  260 z

  /mnt/x/z:

  /mnt/y:
  260 z

  /mnt/y/z:

  $ umount /dev/sdc
  $ btrfs check /dev/sdc
  Checking filesystem on /dev/sdc
  UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 260 errors 2000, link count wrong
        unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
        unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
  (...)

Attempting to remove the directory becomes impossible:

  $ mount /dev/sdc /mnt
  $ rmdir /mnt/y/z
  $ ls -lh /mnt/y
  ls: cannot access /mnt/y/z: No such file or directory
  total 0
  d????????? ? ? ? ?            ? z
  $ rmdir /mnt/x/z
  rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
  $ ls -lh /mnt/x
  ls: cannot access /mnt/x/z: Stale file handle
  total 0
  -rw-r--r-- 2 root root 0 Apr  6 18:06 bar
  -rw-r--r-- 2 root root 0 Apr  6 18:06 foo
  d????????? ? ?    ?    ?            ? z

So make sure that on rename we set the last_unlink_trans value for our
inode, even if it's a directory, to the value of the current transaction's
ID and that if the new parent directory is logged that we fallback to a
transaction commit.

A test case for fstests is being submitted as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:11 +01:00
David Sterba
2c1984f244 btrfs: build fixup for qgroup_account_snapshot
The macro btrfs_std_error got renamed to btrfs_handle_fs_error in an
independent branch for the same merge target (4.7). To make the code
compilable for bisectability reasons, add a temporary stub.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-12 11:05:03 +02:00
Qu Wenruo
6426c7ad69 btrfs: qgroup: Fix qgroup accounting when creating snapshot
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
inside btrfs_commit_transaction() just be commit_cowonly_roots().

However there is a exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() but no any commit root switch.

In case of creating a snapshot whose parent root is itself (create a
snapshot of fs tree), it will corrupt qgroup by the following trace:
(skipped unrelated data)
======
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======

The problem here is in first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But at second qgroup_account_extent(), its reference got decreased, but
between these two qgroup_account_extent(), there is no switch roots.
This leads to the same nr_old_roots, and this extent just got ignored by
qgroup, which means this extent is wrongly accounted.

Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with needed preparation.

Mark: I added a check at the top of qgroup_account_snapshot() to skip this
code if qgroups are turned off. xfstest btrfs/122 exposes this problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-12 10:47:31 +02:00
Vincent Stehlé
72928f2476 Btrfs: fix fspath error deallocation
Make sure to deallocate fspath with vfree() in case of error in
init_ipath().

fspath is allocated with vmalloc() in init_data_container() since
commit 425d17a290 ("Btrfs: use larger limit for translation of logical to
inode").

Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 16:22:26 +02:00
David Sterba
523567168d btrfs: make find_workspace warn if there are no workspaces
Be verbose if there are no workspaces at all, ie. the module init time
preallocation failed.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:16 +02:00
David Sterba
e721e49dd1 btrfs: make find_workspace always succeed
With just one preallocated workspace we can guarantee forward progress
even if there's no memory available for new workspaces. The cost is more
waiting but we also get rid of several error paths.

On average, there will be several idle workspaces, so the waiting
penalty won't be so bad.

In the worst case, all cpus will compete for one workspace until there's
some memory. Attempts to allocate a new one are done each time the
waiters are woken up.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:13 +02:00
David Sterba
f77dd0d6b2 btrfs: preallocate compression workspaces
Preallocate one workspace for each compression type so we can guarantee
forward progress in the worst case. A failure cannot be a hard error as
we might not use compression at all on the filesystem. If we can't
allocate the workspaces later when need them, it might actually
deadlock, but in such situation the system has effectively not enough
memory to operate properly.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:11 +02:00
David Sterba
6ac10a6ac2 btrfs: rename and document compression workspace members
The names are confusing, pick more fitting names and add comments.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:08 +02:00
David Sterba
e1860a7724 btrfs: GFP_NOFS does not GFP_HIGHMEM
Masking HIGHMEM out of NOFS does not make sense.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:44:21 +02:00
David Sterba
05135f597a btrfs: switch to common message helpers in open_ctree, adjust messages
Currently we lack the identification of the filesystem in most if not
all mount messages, done via printk/pr_* functions. We can use the
btrfs_* helpers in open_ctree, as the fs_info <-> sb link is established
at the beginning of the function.

The messages have been updated at the same time to be more consistent:

* dropped sb->s_id, as it's not available via btrfs_*
* added %d for return code where appropriate
* wording changed
* %Lx replaced by %llx

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:43:44 +02:00
Al Viro
972b241f84 btrfs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:42:19 -04:00
Adam Borowski
8eb0dfdbda btrfs: fix int32 overflow in shrink_delalloc().
UBSAN: Undefined behaviour in fs/btrfs/extent-tree.c:4623:21
signed integer overflow:
10808 * 262144 cannot be represented in type 'int [8]'

If 8192<=items<16384, we request a writeback of an insane number of pages
which is benign (everything will be written).  But if items>=16384, the
space reservation won't be enough.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-09 11:51:19 +02:00
Zygo Blaxell
2f3165ecf1 btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes
During a mount, we start the cleaner kthread first because the transaction
kthread wants to wake up the cleaner kthread.  We start the transaction
kthread next because everything in btrfs wants transactions.  We do reloc
recovery in the thread that was doing the original mount call once the
transaction kthread is running.  This means that the cleaner kthread
could already be running when reloc recovery happens (e.g. if a snapshot
delete was started before a crash).

Relocation does not play well with the cleaner kthread, so a mutex was
added in commit 5f3164813b "Btrfs: fix
race between balance recovery and root deletion" to prevent both from
being active at the same time.

If the cleaner kthread is already holding the mutex by the time we get
to btrfs_recover_relocation, the mount will be blocked until at least
one deleted subvolume is cleaned (possibly more if the mount process
doesn't get the lock right away).  During this time (which could be an
arbitrarily long time on a large/slow filesystem), the mount process is
stuck and the filesystem is unnecessarily inaccessible.

Fix this by locking cleaner_mutex before we start cleaner_kthread, and
unlocking the mutex after mount no longer requires it.  This ensures
that the mounting process will not be blocked by the cleaner kthread.
The cleaner kthread is already prepared for mutex contention and will
just go to sleep until the mutex is available.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
58d7bbf81f btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
7ab19625a9 btrfs: add write protection to SET_FEATURES ioctl
Perform the want_write check if we get far enough to do any writes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain
48b3b9d401 btrfs: fix lock dep warning move scratch super outside of chunk_mutex
Move scratch super outside of the chunk lock to avoid below
lockdep warning. The better place to scratch super is in
the function btrfs_rm_dev_replace_free_srcdev() just before
free_device, which is outside of the chunk lock as well.

To reproduce:
  (fresh boot)
  mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
  mount /dev/sdc /btrfs
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  (get devmgt from https://github.com/asj/devmgt.git)
  devmgt detach /dev/sde
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  sync
  btrfs replace start -Brf 3 /dev/sdf /btrfs <--
  devmgt attach host7

======================================================
[ INFO: possible circular locking dependency detected ]
4.6.0-rc2asj+ #1 Not tainted
---------------------------------------------------

btrfs/2174 is trying to acquire lock:
(sb_writers){.+.+.+}, at:
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0

but task is already holding lock:
(&fs_info->chunk_mutex){+.+.+.}, at:
[<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]

which lock already depends on the new lock.

Chain exists of:
sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex
Possible unsafe locking scenario:
CPU0				CPU1
----				----
lock(&fs_info->chunk_mutex);
				lock(&fs_devs->device_list_mutex);
				lock(&fs_info->chunk_mutex);
lock(sb_writers);

*** DEADLOCK ***

-> #0 (sb_writers){.+.+.+}:
[<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0
[<ffffffff810e707e>] lock_acquire+0xbe/0x210
[<ffffffff810df49a>] percpu_down_read+0x4a/0xa0
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0
[<ffffffff81265534>] mnt_want_write+0x24/0x50
[<ffffffff812508a2>] path_openat+0x952/0x1190
[<ffffffff81252451>] do_filp_open+0x91/0x100
[<ffffffff8123f5cc>] file_open_name+0xfc/0x140
[<ffffffff8123f643>] filp_open+0x33/0x60
[<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs]
[<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
[<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
[<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
[<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Ashish Samant
2473114981 btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK.
page_index should be checked with the same to trigger BUG_ON().

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Josef Bacik
e042d1ec44 Btrfs: remove BUG_ON()'s in btrfs_map_block
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
to not be assholes and panic at any possible chance things are all fucked up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ removed type casts ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Liu Bo
3d8da67817 Btrfs: fix divide error upon chunk's stripe_len
The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
 __btrfs_map_block().

This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
ee17fc8005 btrfs: sysfs: protect reading label by lock
If the label setting ioctl races with sysfs label handler, we could get
mixed result in the output, part old part new. We should either get the
old or new label. The chances to hit this race are low.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
66ac9fe7ba btrfs: add check to sysfs handler of label
Add a sanity check for the fs_info as we will dereference it, similar to
what the 'store features' handler does.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
ee6111386a btrfs: add read-only check to sysfs handler of features
We don't want to trigger the change on a read-only filesystem, similar
to what the label handler does.

Signed-off-by: David Sterba <dsterba@suse.cz>
2016-05-06 15:22:49 +02:00
David Sterba
e6c11f9a46 btrfs: reuse existing variable in scrub_stripe, reduce stack usage
The key variable occupies 17 bytes, the key_start is used once, we can
simply reuse existing 'key' for that purpose. As the key is not a simple
type, compiler doest not do it on itself.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
49a3c4d9b6 btrfs: use dynamic allocation for root item in create_subvol
The size of root item is more than 400 bytes, which is quite a lot of
stack space. As we do IO from inside the subvolume ioctls, we should
keep the stack usage low in case the filesystem is on top of other
layers (NFS, device mapper, iscsi, etc).

Reviewed-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
153519559a btrfs: clone: use vmalloc only as fallback for nodesize bufer
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
2f91306a37 btrfs: send: use vmalloc only as fallback for clone_sources_tmp
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
c03d01f340 btrfs: send: use vmalloc only as fallback for clone_roots
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
e55d1153db btrfs: send: use temporary variable to store allocation size
We're going to use the argument multiple times later.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
eb5b75fe2e btrfs: send: use vmalloc only as fallback for read_buf
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
6ff48ce06b btrfs: send: use vmalloc only as fallback for send_buf
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain
779bf3fefa btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch  + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.

Reported issue:

[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861]  (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907]  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940]        [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945]        [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948]        [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951]        [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955]        [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959]        [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962]        [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974]        [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977]        [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979]        [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982]        [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985]        [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001]        [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007]        [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010]        [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018]        [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026]        [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028]        [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031]        [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035]        [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040]        [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043]        [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073]        [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099]        [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123]        [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150]        [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175]        [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199]        [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222]        [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225]        [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229]        [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233]   sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234]  Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234]        CPU0                    CPU1
[ 5375.719235]        ----                    ----
[ 5375.719236]   lock(&fs_devs->device_list_mutex);
[ 5375.719238]                                lock(namespace_sem);
[ 5375.719239]                                lock(&fs_devs->device_list_mutex);
[ 5375.719241]   lock(sb_writers);
[ 5375.719241]
[ 5375.719241]  *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266]  #0:  (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293]  #1:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319]  #2:  (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343]  #3:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352]  0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354]  ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357]  ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366]  [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369]  [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376]  [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383]  [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387]  [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389]  [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393]  [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420]  [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423]  [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426]  [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430]  [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433]  [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436]  [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462]  [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485]  [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506]  [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530]  [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554]  [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576]  [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598]  [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621]  [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641]  [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Dan Carpenter
f5ecec3ce2 btrfs: send: silence an integer overflow warning
The "sizeof(*arg->clone_sources) * arg->clone_sources_count" expression
can overflow.  It causes several static checker warnings.  It's all
under CAP_SYS_ADMIN so it's not that serious but lets silence the
warnings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Luis de Bethencourt
41b34accb2 btrfs: avoid overflowing f_bfree
Since mixed block groups accounting isn't byte-accurate and f_bree is an
unsigned integer, it could overflow. Avoid this.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Suggested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Luis de Bethencourt
ae02d1bd07 btrfs: fix mixed block count of available space
Metadata for mixed block is already accounted in total data and should not
be counted as part of the free metadata space.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=114281
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Austin S. Hemmelgarn
88be159c90 btrfs: allow balancing to dup with multi-device
Currently, we don't allow the user to try and rebalance to a dup profile
on a multi-device filesystem.  In most cases, this is a perfectly sensible
restriction as raid1 uses the same amount of space and provides better
protection.

However, when reshaping a multi-device filesystem down to a single device
filesystem, this requires the user to convert metadata and system chunks
to single profile before deleting devices, and then convert again to dup,
which leaves a period of time where metadata integrity is reduced.

This patch removes the single-device-only restriction from converting to
dup profile to remove this potential data integrity reduction.

Signed-off-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
2355ac8495 btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 14:58:00 +02:00
David Sterba
58409edd2d btrfs: kill unused writepage_io_hook callback
It seems to be long time unused, since 2008 and
6885f308b5 ("Btrfs: Misc 2.6.25 updates").

Propagating the removal touches some code but has no functional effect.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 14:57:57 +02:00
Anand Jain
88acff64c6 btrfs: cleanup assigning next active device with a check
Creates helper fucntion as needed by the device delete
and replace operations. Also now it checks if the next
device being assigned is an active device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 10:41:08 +02:00
Anand Jain
8ed01abe7d btrfs: s_bdev is not null after missing replace
Yauhen reported in the ML that s_bdev is null at mount, and
s_bdev gets updated to some device when missing device is
replaced, as because bdev is null for missing device, things
gets matched up. Fix this by checking if s_bdev is set. I
didn't want to completely remove updating s_bdev because
the future multi device support at vfs layer may need it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 09:52:44 +02:00
Al Viro
9902af79c0 parallel lookups: actual switch to rwsem
ta-da!

The main issue is the lack of down_write_killable(), so the places
like readdir.c switched to plain inode_lock(); once killable
variants of rwsem primitives appear, that'll be dealt with.

lockdep side also might need more work

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:28 -04:00
Al Viro
84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Anand Jain
ad8403df05 btrfs: pass the right error code to the btrfs_std_error
Also drop the newline from the message.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-02 17:50:13 +02:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
1af5bb491f filemap: remove the pos argument to generic_file_direct_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
David Sterba
210aa27768 btrfs: sink gfp parameter to convert_extent_bit
Single caller passes GFP_NOFS. We can get rid of the
gfpflags_allow_blocking checks as NOFS can block but does not recurse to
filesystem through reclaim.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 13:48:14 +02:00
David Sterba
059f791c6b btrfs: make state preallocation more speculative in __set_extent_bit
Similar to __clear_extent_bit, do not fail if the state preallocation
fails as we might not need it. One less BUG_ON.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
03bf538770 btrfs: untangle gotos a bit in convert_extent_bit
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
7ab5cb2a9e btrfs: untangle gotos a bit in __clear_extent_bit
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
b5a4ba14e0 btrfs: untangle gotos a bit in __set_extent_bit
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
2c53b912ae btrfs: sink gfp parameter to set_record_extent_bits
Single caller passes GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
3744dbeb70 btrfs: sink gfp parameter to set_extent_new
Single caller passes GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
018ed4f788 btrfs: sink gfp parameter to set_extent_defrag
Single caller passes GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
7cd8c7527c btrfs: sink gfp parameter to set_extent_delalloc
Callers pass GFP_NOFS and tests pass GFP_KERNEL, but using NOFS there
does not hurt. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
af6f8f604d btrfs: sink gfp parameter to clear_extent_dirty
Callers pass GFP_NOFS. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
f734c44a1b btrfs: sink gfp parameter to clear_record_extent_bits
Callers pass GFP_NOFS. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
91166212e0 btrfs: sink gfp parameter to clear_extent_bits
Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
ceeb0ae7bf btrfs: sink gfp parameter to set_extent_bits
All callers pass GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
Jeff Mahoney
db6711600e btrfs: uapi/linux/btrfs_tree.h migration, item types and defines
The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly
to userspace.  In order to decode them, full type information is required.

Create a new header, btrfs_tree to contain these since most users won't
need them.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
33ca913349 btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args
struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
04cd01dffb btrfs: uapi/linux/btrfs.h migration, move balance flags
The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags
and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
18db9ac644 btrfs: uapi/linux/btrfs.h migration, move feature flags
The compat/compat_ro/incompat feature flags are used by the feature set/get
ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
83288b60bf btrfs: uapi/linux/btrfs.h migration, qgroup limit flags
The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which
fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
d4ae133b2d btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE
BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and
BTRFS_IOC_SET_FSLABEL ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Anand Jain
b5255456c5 btrfs: refactor btrfs_dev_replace_start for reuse
A refactor patch, and avoids user input verification in the
btrfs_dev_replace_start(), and so this function can be reused.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
fc23c246d7 btrfs: use fs_info directly
Local variable fs_info, contains root->fs_info, use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
735654ea91 btrfs: rename flags for vol args v2
Rename BTRFS_DEVICE_BY_ID so it's more descriptive that we specify the
device by id, it'll be part of the public API. The mask of supported
flags is also renamed, only for internal use.

The error code for unknown flags is EOPNOTSUPP, fixed.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
5c5c0df05d btrfs: rename btrfs_find_device_by_user_input
For clarity how we are going to find the device, let's call it a device
specifier, devspec for short. Also rename the arguments that are a
leftover from previous function purpose.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
418775a22b btrfs: use existing device constraints table btrfs_raid_array
We should avoid duplicating the device constraints, let's use the
btrfs_raid_array in btrfs_check_raid_min_devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
621292bae6 btrfs: introduce raid-type to error-code table, for minimum device constraint
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
3cc31a0d5b btrfs: pass number of devices to btrfs_check_raid_min_devices
Before this patch, btrfs_check_raid_min_devices would do an off-by-one
check of the constraints and not the miminmum check, as its name
suggests. This is not a problem if the only caller is device remove, but
would be confusing for others.

Add an argument with the exact number and let the caller(s) decide if
this needs any adjustments, like when device replace is running.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
f47ab2588e btrfs: rename __check_raid_min_devices
Underscores are for special functions, use the full prefix for better
stacktrace recognition.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
02feae3c55 btrfs: optimize check for stale device
Optimize check for stale device to only be checked when there is device
added or changed. If there is no update to the device, there is no need
to call btrfs_free_stale_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
6b526ed70c btrfs: introduce device delete by devid
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct
btrfs_ioctl_vol_args_v2 to carry devid as an user argument.

The patch won't delete the old ioctl interface and so kernel remains
backward compatible with user land progs.

Test case/script:
echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
mount /dev/sdd /btrfs
dmsetup suspend bad_disk
echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
dmsetup resume bad_disk
echo "bad disk failed. now deleting/replacing"
btrfs dev del  3  /btrfs
echo $?
btrfs fi show /btrfs
umount /btrfs
btrfs-show-super /dev/sdd | egrep num_device
dmsetup remove bad_disk
wipefs -a /dev/sdf

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Martin <m_btrfs@ml1.co.uk>
[ adjust messages, s/disk/device/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
42b6742715 btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()
With the previous patches now the btrfs_scratch_superblocks() is ready to
be used in btrfs_rm_device() so use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
b3d1b1532f btrfs: enhance btrfs_find_device_by_user_input() to check device path
The operation of device replace and device delete follows same steps upto
some depth with in btrfs kernel, however they don't share codes. This
enhancement will help replace and delete to share codes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
24fc572fe4 btrfs: make use of btrfs_find_device_by_user_input()
btrfs_rm_device() has a section of the code which can be replaced
btrfs_find_device_by_user_input()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
24e0474b59 btrfs: create helper btrfs_find_device_by_user_input()
The patch renames btrfs_dev_replace_find_srcdev() to
btrfs_find_device_by_user_input() and moves it to volumes.c, so that
delete device can use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
bd45ffbcb1 btrfs: clean up and optimize __check_raid_min_device()
__check_raid_min_device() which was pealed from btrfs_rm_device()
maintianed its original code to show the block move. This patch cleans up
__check_raid_min_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
f1fa7f2642 btrfs: create helper function __check_raid_min_devices()
move a section of btrfs_rm_device() code to check for min number of the
devices into the function __check_raid_min_devices()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
6cf86a006b btrfs: create a helper function to read the disk super
A part of code from btrfs_scan_one_device() is moved to a new function
btrfs_read_disk_super(), so that former function looks cleaner. (In this
process it also moves the code which ensures null terminating label). So
this creates easy opportunity to merge various duplicate codes on read
disk super. Earlier attempt to merge duplicate codes highlighted that
there were some issues for which there are duplicate codes (to read disk
super), however it was not clear what was the issue. So until we figure
that out, its better to keep them in a separate functions.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:04 +02:00
Liu Bo
cf25ce518e Btrfs: do not create empty block group if we have allocated data
Now we force to create empty block group to keep data profile alive,
however, in the below example, we eventually get an empty block group
while we're trying to get more space for other types (metadata/system),

- Before,
block group "A": size=2G, used=1.2G
block group "B": size=2G, used=512M

- After "btrfs balance start -dusage=50 mount_point",
block group "A": size=2G, used=(1.2+0.5)G
block group "C": size=2G, used=0

Since there is no data in block group C, it won't be deleted
automatically and we have to get the unused 2G until the next mount.

Balance itself just moves data and doesn't remove data, so it's safe
to not create such a empty block group if we already have data
 allocated in other block groups.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Chandan Rajendra
a2af23b7d7 Btrfs: __btrfs_buffered_write: Pass valid file offset when releasing delalloc space
The delalloc reserved space is calculated in terms of number of bytes
used by an integral number of blocks. This is done by rounding down the
value of 'pos' to the nearest multiple of sectorsize.

The file offset value held by 'pos' variable may not be aligned to
sectorsize and hence when passing it as an argument to
btrfs_delalloc_release_space(), we may end up releasing larger delalloc
space than we originally had reserved.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Liu Bo
894b36e35a Btrfs: cleanup error handling in extent_write_cached_pages
Now that we bail out immediately if ->writepage() returns an error,
we don't need an extra error to retain the error code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Liu Bo
a91326679f Btrfs: make mapping->writeback_index point to the last written page
If sequential writer is writing in the middle of the page and it just redirties
the last written page by continuing from it.

In the above case this can end up with seeking back to that firstly redirtied
page after writing all the pages at the end of file because btrfs updates
mapping->writeback_index to 1 past the current one.

For non-cow filesystems, the cost is only about extra seek, while for cow
filesystems such as btrfs, it means unnecessary fragments.

To avoid it, we just need to continue writeback from the last written page.

This also updates btrfs to behave like what write_cache_pages() does, ie, bail
 out immediately if there is an error in writepage().

<Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html>

Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Luke Dashjr
4c63c2454e btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.

Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:40:27 +02:00
Luis de Bethencourt
180e4d4700 btrfs: fix typos in comments
Correct a typo in the chunk_mutex name to make it grepable.

Since it is better to fix several typos at once, fixing the 2 more in the
same file.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Geert Uytterhoeven
6719afdcf8 Btrfs: Refactor btrfs_lock_cluster() to kill compiler warning
fs/btrfs/extent-tree.c: In function ‘btrfs_lock_cluster’:
fs/btrfs/extent-tree.c:6399: warning: ‘used_bg’ may be used uninitialized in this function

  - Replace "again: ... goto again;" by standard C "while (1) { ... }",
  - Move block not processed during the first iteration of the loop to the
    end of the loop, which allows to kill the "locked" variable,

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-and-Tested-by: Miao Xie <miaox@cn.fujitsu.com>
[ the compilation warning has been fixed by other patch, now we want to
  clean up the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
0713d90c75 btrfs: remove save_error_info()
Actually save_error_info() sets the FS state to error and nothing else.
Further the word save doesn't induce caffeine when compared to the word
set in what actually it does.

So to make it better understandable move save_error_info() code to its
only consumer itself.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Satoru Takeuchi
13f48dc909 btrfs: Simplify conditions about compress while mapping btrfs flags to inode flags
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
c5f4ccb2f7 btrfs: move error handling code together in ctree.h
Looks like we added the incompatible defines in between the error
handling defines in the file ctree.h. Now group them back.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
2351d743f6 btrfs: remove unused function btrfs_assert()
Apparently looks like ASSERT does the same intended job,
as intended btrfs_assert().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
34d9700702 btrfs: rename btrfs_std_error to btrfs_handle_fs_error
btrfs_std_error() handles errors, puts FS into readonly mode
(as of now). So its good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Al Viro
b296821a7c xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 20:48:24 -04:00
Al Viro
fc64005c93 don't bother with ->d_inode->i_sb - it's always equal to ->d_sb
... and neither can ever be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 17:11:51 -04:00
Linus Torvalds
839a3f7657 Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are bug fixes, including a really old fsync bug, and a few trace
  points to help us track down problems in the quota code"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file/data loss caused by fsync after rename and new inode
  btrfs: Reset IO error counters before start of device replacing
  btrfs: Add qgroup tracing
  Btrfs: don't use src fd for printk
  btrfs: fallback to vmalloc in btrfs_compare_tree
  btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
  btrfs: Output more info for enospc_debug mount option
  Btrfs: fix invalid reference in replace_path
  Btrfs: Improve FL_KEEP_SIZE handling in fallocate
2016-04-09 10:41:34 -07:00
Linus Torvalds
93061f390f These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems.  These have been
 reviewed by Al and the respective file system mtinainers and are going
 through the ext4 tree for convenience.
 
 This also has a few ext4 encryption bug fixes that were discovered in
 Android testing (yes, we will need to get these sync'ed up with the
 fs/crypto code; I'll take care of that).  It also has some bug fixes
 and a change to ignore the legacy quota options to allow for xfstests
 regression testing of ext4's internal quota feature and to be more
 consistent with how xfs handles this case.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
 OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
 mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
 9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
 Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
 P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
 =/HHR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "These changes contains a fix for overlayfs interacting with some
  (badly behaved) dentry code in various file systems.  These have been
  reviewed by Al and the respective file system mtinainers and are going
  through the ext4 tree for convenience.

  This also has a few ext4 encryption bug fixes that were discovered in
  Android testing (yes, we will need to get these sync'ed up with the
  fs/crypto code; I'll take care of that).  It also has some bug fixes
  and a change to ignore the legacy quota options to allow for xfstests
  regression testing of ext4's internal quota feature and to be more
  consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: ignore quota mount options if the quota feature is enabled
  ext4 crypto: fix some error handling
  ext4: avoid calling dquot_get_next_id() if quota is not enabled
  ext4: retry block allocation for failed DIO and DAX writes
  ext4: add lockdep annotations for i_data_sem
  ext4: allow readdir()'s of large empty directories to be interrupted
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  ext4 crypto: use dget_parent() in ext4_d_revalidate()
  ext4: use file_dentry()
  ext4: use dget_parent() in ext4_file_open()
  nfs: use file_dentry()
  fs: add file_dentry()
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07 17:22:20 -07:00
Filipe Manana
56f23fdbb6 Btrfs: fix file/data loss caused by fsync after rename and new inode
If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.

Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:

   # Scenario 1

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir -p /mnt/a/x
   echo "hello" > /mnt/a/x/foo
   echo "world" > /mnt/a/x/bar
   sync
   mv /mnt/a/x /mnt/a/y
   mkdir /mnt/a/x
   xfs_io -c fsync /mnt/a/x
   <power failure happens>

   The next time the fs is mounted, log tree replay happens and
   the directory "y" does not exist nor do the files "foo" and
   "bar" exist anywhere (neither in "y" nor in "x", nor the root
   nor anywhere).

   # Scenario 2

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/a
   echo "hello" > /mnt/a/foo
   sync
   mv /mnt/a/foo /mnt/a/bar
   echo "world" > /mnt/a/foo
   xfs_io -c fsync /mnt/a/foo
   <power failure happens>

   The next time the fs is mounted, log tree replay happens and the
   file "bar" does not exists anymore. A file with the name "foo"
   exists and it matches the second file we created.

Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/testdir
   btrfs subvolume snapshot /mnt /mnt/testdir/snap
   btrfs subvolume delete /mnt/testdir/snap
   rmdir /mnt/testdir
   mkdir /mnt/testdir
   xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
   <power failure>

   The next time the fs is mounted the log replay procedure fails because
   it attempts to delete the snapshot entry (which has dir item key type
   of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
   resulting in the following error that causes mount to fail:

   [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
   [52174.512570] ------------[ cut here ]------------
   [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
   [52174.514681] BTRFS: Transaction aborted (error -2)
   [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
   [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G        W       4.5.0-rc6-btrfs-next-27+ #1
   [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
   [52174.524053]  0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
   [52174.524053]  0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
   [52174.524053]  00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
   [52174.524053] Call Trace:
   [52174.524053]  [<ffffffff81264e93>] dump_stack+0x67/0x90
   [52174.524053]  [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
   [52174.524053]  [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
   [52174.524053]  [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
   [52174.524053]  [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
   [52174.524053]  [<ffffffff8118f5e9>] ? iput+0xb0/0x284
   [52174.524053]  [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
   [52174.524053]  [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
   [52174.524053]  [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
   [52174.524053]  [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
   [52174.524053]  [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
   [52174.524053]  [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
   [52174.524053]  [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
   [52174.524053]  [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
   [52174.524053]  [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
   [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
   [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
   [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
   [52174.524053]  [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
   [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
   [52174.524053]  [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
   [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
   [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
   [52174.524053]  [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
   [52174.524053]  [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
   [52174.524053]  [<ffffffff81195c65>] SyS_mount+0x77/0x9f
   [52174.524053]  [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
   [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---

Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).

Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:

  * fstests: generic test for fsync after renaming directory
    https://patchwork.kernel.org/patch/8694281/

  * fstests: generic test for fsync after renaming file
    https://patchwork.kernel.org/patch/8694301/

  * fstests: add btrfs test for fsync after snapshot deletion
    https://patchwork.kernel.org/patch/8670671/

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-04-06 17:01:44 -07:00
Linus Torvalds
4a2d057e4f Merge branch 'PAGE_CACHE_SIZE-removal'
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
 "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
  ago with promise that one day it will be possible to implement page
  cache with bigger chunks than PAGE_SIZE.

  This promise never materialized.  And unlikely will.

  Let's stop pretending that pages in page cache are special.  They are
  not.

  The first patch with most changes has been done with coccinelle.  The
  second is manual fixups on top.

  The third patch removes macros definition"

[ I was planning to apply this just before rc2, but then I spaced out,
  so here it is right _after_ rc2 instead.

  As Kirill suggested as a possibility, I could have decided to only
  merge the first two patches, and leave the old interfaces for
  compatibility, but I'd rather get it all done and any out-of-tree
  modules and patches can trivially do the converstion while still also
  working with older kernels, so there is little reason to try to
  maintain the redundant legacy model.    - Linus ]

* PAGE_CACHE_SIZE-removal:
  mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
  mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
  mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04 10:50:24 -07:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Yauhen Kharuzhy
7ccefb98ce btrfs: Reset IO error counters before start of device replacing
If device replace entry was found on disk at mounting and its num_write_errors
stats counter has non-NULL value, then replace operation will never be
finished and -EIO error will be reported by btrfs_scrub_dev() because
this counter is never reset.

 # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
 # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
 Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs
 # btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
 ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error

Reset num_write_errors and num_uncorrectable_read_errors counters in the
dev_replace structure before start of replacing.

Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Mark Fasheh
0f5dcf8de9 btrfs: Add qgroup tracing
This patch adds tracepoints to the qgroup code on both the reporting side
(insert_dirty_extents) and the accounting side. Taken together it allows us
to see what qgroup operations have happened, and what their result was.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Josef Bacik
c79b471330 Btrfs: don't use src fd for printk
The fd we pass in may not be on a btrfs file system, so don't try to do
BTRFS_I() on it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
David Sterba
8f282f71ea btrfs: fallback to vmalloc in btrfs_compare_tree
The allocation of node could fail if the memory is too fragmented for a
given node size, practically observed with 64k.

http://article.gmane.org/gmane.comp.file-systems.btrfs/54689

Reported-and-tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Mark Fasheh
918c2ee103 btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
create_pending_snapshot() will go readonly on _any_ error return from
btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by
just making a snapshot and asking it to inherit from an invalid qgroup. For
example:

$ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo

Will cause a transaction abort.

Fix this by only throwing errors in btrfs_qgroup_inherit() when we know
going readonly is acceptable.

The following xfstests test case reproduces this bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
  	cd /
  	rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # remove previous $seqres.full before test
  rm -f $seqres.full

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs
  _scratch_mount
  _run_btrfs_util_prog quota enable $SCRATCH_MNT
  # The qgroup '1/10' does not exist and should be silently ignored
  _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT $SCRATCH_MNT/snap1

  _scratch_unmount

  echo "Silence is golden"

  status=0
  exit

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Qu Wenruo
0305bc2793 btrfs: Output more info for enospc_debug mount option
As one user in mail list report reproducible balance ENOSPC error, it's
better to add more debug info for enospc_debug mount option.

Reported-by: Marc Haber <mh+linux-btrfs@zugschlus.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Liu Bo
264813acb1 Btrfs: fix invalid reference in replace_path
Dan Carpenter's static checker has found this error, it's introduced by
commit 64c043de46
("Btrfs: fix up read_tree_block to return proper error")

It's really supposed to 'break' the loop on error like others.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:18 +02:00
Davide Italiano
2a162ce932 Btrfs: Improve FL_KEEP_SIZE handling in fallocate
- We call inode_size_ok() only if FL_KEEP_SIZE isn't specified.
- As an optimisation we can skip the call if (off + len)
  isn't greater than the current size of the file. This operation
  is called under the lock so the less work we do, the better.
- If we call inode_size_ok() pass to it the correct value rather
  than a more conservative estimation.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:25:28 +02:00
Linus Torvalds
82d2a348bb Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has a few fixes Dave Sterba had queued up.  These are all pretty
  small, but since they were tested I decided against waiting for more"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: transaction_kthread() is not freezable
  btrfs: cleaner_kthread() doesn't need explicit freeze
  btrfs: do not write corrupted metadata blocks to disk
  btrfs: csum_tree_block: return proper errno value
2016-04-01 18:08:34 -05:00
Andreas Gruenbacher
b8a7a3a667 posix_acl: Inode acl caching fixes
When get_acl() is called for an inode whose ACL is not cached yet, the
get_acl inode operation is called to fetch the ACL from the filesystem.
The inode operation is responsible for updating the cached acl with
set_cached_acl().  This is done without locking at the VFS level, so
another task can call set_cached_acl() or forget_cached_acl() before the
get_acl inode operation gets to calling set_cached_acl(), and then
get_acl's call to set_cached_acl() results in caching an outdate ACL.

Prevent this from happening by setting the cached ACL pointer to a
task-specific sentinel value before calling the get_acl inode operation.
Move the responsibility for updating the cached ACL from the get_acl
inode operations to get_acl().  There, only set the cached ACL if the
sentinel value hasn't changed.

The sentinel values are chosen to have odd values.  Likewise, the value
of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
an even value (ACLs are aligned in memory).  This allows to distinguish
uncached ACLs values from ACL objects.

In addition, switch from guarding inode->i_acl and inode->i_default_acl
upates by the inode->i_lock spinlock to using xchg() and cmpxchg().

Filesystems that do not want ACLs returned from their get_acl inode
operations to be cached must call forget_cached_acl() to prevent the VFS
from doing so.

(Patch written by Al Viro and Andreas Gruenbacher.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-31 00:30:15 -04:00
Filipe Manana
de17e793b1 btrfs: fix crash/invalid memory access on fsync when using overlayfs
If the lower or upper directory of an overlayfs mount belong to a btrfs
file system and we fsync the file through the overlayfs' merged directory
we ended up accessing an inode that didn't belong to btrfs as if it were
a btrfs inode at btrfs_sync_file() resulting in a crash like the following:

[ 7782.588845] BUG: unable to handle kernel NULL pointer dereference at 0000000000000544
[ 7782.590624] IP: [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs]
[ 7782.591931] PGD 4d954067 PUD 1e878067 PMD 0
[ 7782.592016] Oops: 0002 [#6] PREEMPT SMP DEBUG_PAGEALLOC
[ 7782.592016] Modules linked in: btrfs overlay ppdev crc32c_generic evdev xor raid6_pq psmouse pcspkr sg serio_raw acpi_cpufreq parport_pc parport tpm_tis i2c_piix4 tpm i2c_core processor button loop autofs4 ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[ 7782.592016] CPU: 10 PID: 16437 Comm: xfs_io Tainted: G      D         4.5.0-rc6-btrfs-next-26+ #1
[ 7782.592016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7782.592016] task: ffff88001b8d40c0 ti: ffff880137488000 task.ti: ffff880137488000
[ 7782.592016] RIP: 0010:[<ffffffffa030b7ab>]  [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs]
[ 7782.592016] RSP: 0018:ffff88013748be40  EFLAGS: 00010286
[ 7782.592016] RAX: 0000000080000000 RBX: ffff880133b30c88 RCX: 0000000000000001
[ 7782.592016] RDX: 0000000000000001 RSI: ffffffff8148fec0 RDI: 00000000ffffffff
[ 7782.592016] RBP: ffff88013748bec0 R08: 0000000000000001 R09: 0000000000000000
[ 7782.624248] R10: ffff88013748be40 R11: 0000000000000246 R12: 0000000000000000
[ 7782.624248] R13: 0000000000000000 R14: 00000000009305a0 R15: ffff880015e3be40
[ 7782.624248] FS:  00007fa83b9cb700(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
[ 7782.624248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7782.624248] CR2: 0000000000000544 CR3: 00000001fa652000 CR4: 00000000000006e0
[ 7782.624248] Stack:
[ 7782.624248]  ffffffff8108b5cc ffff88013748bec0 0000000000000246 ffff8800b005ded0
[ 7782.624248]  ffff880133b30d60 8000000000000000 7fffffffffffffff 0000000000000246
[ 7782.624248]  0000000000000246 ffffffff81074f9b ffffffff8104357c ffff880015e3be40
[ 7782.624248] Call Trace:
[ 7782.624248]  [<ffffffff8108b5cc>] ? arch_local_irq_save+0x9/0xc
[ 7782.624248]  [<ffffffff81074f9b>] ? ___might_sleep+0xce/0x217
[ 7782.624248]  [<ffffffff8104357c>] ? __do_page_fault+0x3c0/0x43a
[ 7782.624248]  [<ffffffff811a2351>] vfs_fsync_range+0x8c/0x9e
[ 7782.624248]  [<ffffffff811a237f>] vfs_fsync+0x1c/0x1e
[ 7782.624248]  [<ffffffff811a24d6>] do_fsync+0x31/0x4a
[ 7782.624248]  [<ffffffff811a2700>] SyS_fsync+0x10/0x14
[ 7782.624248]  [<ffffffff81493617>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7782.624248] Code: 85 c0 0f 85 e2 02 00 00 48 8b 45 b0 31 f6 4c 29 e8 48 ff c0 48 89 45 a8 48 8d 83 d8 00 00 00 48 89 c7 48 89 45 a0 e8 fc 43 18 e1 <f0> 41 ff 84 24 44 05 00 00 48 8b 83 58 ff ff ff 48 c1 e8 07 83
[ 7782.624248] RIP  [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs]
[ 7782.624248]  RSP <ffff88013748be40>
[ 7782.624248] CR2: 0000000000000544
[ 7782.661994] ---[ end trace 721e14960eb939bc ]---

This started happening since commit 4bacc9c923 (overlayfs: Make f_path
always point to the overlay and f_inode to the underlay) and even though
after this change we could still access the btrfs inode through
struct file->f_mapping->host or struct file->f_inode, we would end up
resulting in more similar issues later on at check_parent_dirs_for_sync()
because the dentry we got (from struct file->f_path.dentry) was from
overlayfs and not from btrfs, that is, we had no way of getting the dentry
that belonged to btrfs (we always got the dentry that belonged to
overlayfs).

The new patch from Miklos Szeredi, titled "vfs: add file_dentry()" and
recently submitted to linux-fsdevel, adds a file_dentry() API that allows
us to get the btrfs dentry from the input file and therefore being able
to fsync when the upper and lower directories belong to btrfs filesystems.

This issue has been reported several times by users in the mailing list
and bugzilla. A test case for xfstests is being submitted as well.

Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101951
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109791
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
2016-03-30 19:03:13 -04:00
Chris Mason
232cad8413 Merge branch 'misc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 2016-03-24 17:36:13 -07:00
Jiri Kosina
ce63f891e1 btrfs: transaction_kthread() is not freezable
transaction_kthread() is calling try_to_freeze(), but that's just an
expeinsive no-op given the fact that the thread is not marked freezable.

After removing this, disk-io.c is now independent on freezer API.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-22 10:08:47 +01:00
Jiri Kosina
838fe18877 btrfs: cleaner_kthread() doesn't need explicit freeze
cleaner_kthread() is not marked freezable, and therefore calling
try_to_freeze() in its context is a pointless no-op.

In addition to that, as has been clearly demonstrated by 80ad623edd
("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"), it's perfectly
valid / legal for cleaner_kthread() to stay scheduled out in an arbitrary
place during suspend (in that particular example that was waiting for
reading of extent pages), so there is no need to leave any traces of
freezer in this kthread.

Fixes: 80ad623edd ("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Fixes: 6962491321 ("btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-22 10:08:47 +01:00
Alex Lyakas
0f805531da btrfs: do not write corrupted metadata blocks to disk
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-22 10:08:12 +01:00
Alex Lyakas
8bd98f0e6b btrfs: csum_tree_block: return proper errno value
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-22 10:07:43 +01:00
Linus Torvalds
968f3e374f Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "We have a good sized cleanup of our internal read ahead code, and the
  first series of commits from Chandan to enable PAGE_SIZE > sectorsize

  Otherwise, it's a normal series of cleanups and fixes, with many
  thanks to Dave Sterba for doing most of the patch wrangling this time"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (82 commits)
  btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums
  btrfs: Fix misspellings in comments.
  btrfs: Print Warning only if ENOSPC_DEBUG is enabled
  btrfs: scrub: silence an uninitialized variable warning
  btrfs: move btrfs_compression_type to compression.h
  btrfs: rename btrfs_print_info to btrfs_print_mod_info
  Btrfs: Show a warning message if one of objectid reaches its highest value
  Documentation: btrfs: remove usage specific information
  btrfs: use kbasename in btrfsic_mount
  Btrfs: do not collect ordered extents when logging that inode exists
  Btrfs: fix race when checking if we can skip fsync'ing an inode
  Btrfs: fix listxattrs not listing all xattrs packed in the same item
  Btrfs: fix deadlock between direct IO reads and buffered writes
  Btrfs: fix extent_same allowing destination offset beyond i_size
  Btrfs: fix file loss on log replay after renaming a file and fsync
  Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
  Btrfs: fix lockdep deadlock warning due to dev_replace
  btrfs: drop unused argument in btrfs_ioctl_get_supported_features
  btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
  btrfs: change max_inline default to 2048
  ...
2016-03-21 18:12:42 -07:00
Chris Mason
389f239c53 btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums
Commit c40a3d38af (Btrfs: Compute and look up csums based on
sectorsized blocks) changes around how we walk the bios while looking up
crcs.  There's an inner loop that is jumping to the next bvec based on
sectors and before it derefs the next bvec, it needs to make sure we're
still in the bio.

In this case, the outer loop would have decided to stop moving forward
too, and the bvec deref is never actually used for anything.  But
CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio.

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2016-03-21 07:25:44 -07:00
Matthew Wilcox
c28f242063 btrfs: use radix_tree_iter_retry()
Even though this is a 'can't happen' situation, use the new
radix_tree_iter_retry() pattern to eliminate a goto.

[akpm@linux-foundation.org: fix btrfs build]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Adam Buchbinder
bb7ab3b92e btrfs: Fix misspellings in comments.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-14 15:05:02 +01:00
Ashish Samant
2e3fcb1ccd btrfs: Print Warning only if ENOSPC_DEBUG is enabled
Dont print warning for ENOSPC error unless ENOSPC_DEBUG is enabled. Use
btrfs_debug if it is enabled.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
[ preserve the WARN_ON ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-14 14:59:54 +01:00
Dan Carpenter
07c9a8e077 btrfs: scrub: silence an uninitialized variable warning
It's basically harmless if "ref_level" isn't initialized since it's only
used for an error message, but it causes a static checker warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-11 17:21:59 +01:00
Anand Jain
ebb8765b2d btrfs: move btrfs_compression_type to compression.h
So that its better organized.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-11 17:12:46 +01:00
Anand Jain
8ae1af3cd1 btrfs: rename btrfs_print_info to btrfs_print_mod_info
So that it indicates what it does.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-11 17:12:46 +01:00
Satoru Takeuchi
3c1d84b71e Btrfs: Show a warning message if one of objectid reaches its highest value
It's better to show a warning message for the exceptional case
that one of objectid (in most case, inode number) reaches its
highest value. For example, if inode cache is off and this event
happens, we can't create any file even if there are not so many files.
This message ease detecting such problem.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-11 17:12:35 +01:00
Rasmus Villemoes
02def69fae btrfs: use kbasename in btrfsic_mount
This is more readable.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-11 16:55:52 +01:00
Ingo Molnar
ec87e1cf7d Linux 4.5-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW3LO0AAoJEHm+PkMAQRiGhewIAIVHA1+qSSXEHTFeuLRuYpiz
 +ptQUIjPJdakWm/XqOnwSG8SWUuD4XL6ysfNmLSZIdqXYBAPpAuwT1UA2FZhz0dN
 soZxMNleAvzHWRDFLqwjVdOVlTxS6CTTdEQNzi+3R0ZCADllsRcuj/GBIY+M8cr6
 LvxK8BnhDU+Au3gZQjaujTMO7fKG6gOq4wKz/U7RIG37A6rwW577kEfLg4ZgFwt9
 RVjsky5mrX9+4l3QFtox9ZC383P/0VZ6+vXwN2QH1/joDK4EvA8pCwsGTyjRJiqi
 fArHbS+mHyAtbPWJmDbVlQ5dkZJAqRgtWBydjQYoC16S4Bwdce2/FbhBiTgEQAo=
 =sqln
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc7' into x86/asm, to pick up SMAP fix

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-07 09:27:30 +01:00
Linus Torvalds
2cdcb2b5b5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "Filipe nailed down a problem where tree log replay would do some work
  that orphan code wasn't expecting to be done yet, leading to BUG_ON"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix loading of orphan roots leading to BUG_ON
2016-03-04 17:31:32 -08:00
Filipe Manana
909c3a22da Btrfs: fix loading of orphan roots leading to BUG_ON
When looking for orphan roots during mount we can end up hitting a
BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is
replayed and qgroups are enabled. This is because after a log tree is
replayed, a transaction commit is made, which triggers qgroup extent
accounting which in turn does backref walking which ends up reading and
inserting all roots in the radix tree fs_info->fs_root_radix, including
orphan roots (deleted snapshots). So after the log tree is replayed, when
finding orphan roots we hit the BUG_ON with the following trace:

[118209.182438] ------------[ cut here ]------------
[118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
[118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G        W       4.5.0-rc5-btrfs-next-24+ #1
[118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
[118209.186318] RIP: 0010:[<ffffffffa04237d7>]  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP: 0018:ffff8800af34faa8  EFLAGS: 00010246
[118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
[118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
[118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
[118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
[118209.186318] FS:  00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[118209.186318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
[118209.186318] Stack:
[118209.186318]  fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
[118209.186318]  fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
[118209.186318]  0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
[118209.186318] Call Trace:
[118209.186318]  [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
[118209.186318]  [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
[118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318]  [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
[118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318]  [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
[118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318]  [<ffffffff81195637>] do_mount+0x8a6/0x9e8
[118209.186318]  [<ffffffff8119598d>] SyS_mount+0x77/0x9f
[118209.186318]  [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b
4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
[118209.186318] RIP  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318]  RSP <ffff8800af34faa8>
[118209.230735] ---[ end trace 83938f987d85d477 ]---

So fix this by not treating the error -EEXIST, returned when attempting
to insert a root already inserted by the backref walking code, as an error.

The following test case for xfstests reproduces the bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  _run_btrfs_util_prog quota enable $SCRATCH_MNT

  # Create 2 directories with one file in one of them.
  # We use these just to trigger a transaction commit later, moving the file from
  # directory a to directory b and doing an fsync against directory a.
  mkdir $SCRATCH_MNT/a
  mkdir $SCRATCH_MNT/b
  touch $SCRATCH_MNT/a/f
  sync

  # Create our test file with 2 4K extents.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Create a snapshot and delete it. This doesn't really delete the snapshot
  # immediately, just makes it inaccessible and invisible to user space, the
  # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
  # which is woke up at the next transaction commit.
  # A root orphan item is inserted into the tree of tree roots, so that if a
  # power failure happens before the dedicated kernel thread does the snapshot
  # deletion, the next time the filesystem is mounted it resumes the snapshot
  # deletion.
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap

  # Now overwrite half of the extents we wrote before. Because we made a snapshpot
  # before, which isn't really deleted yet (since no transaction commit happened
  # after we did the snapshot delete request), the non overwritten extents get
  # referenced twice, once by the default subvolume and once by the snapshot.
  $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Now move file f from directory a to directory b and fsync directory a.
  # The fsync on the directory a triggers a transaction commit (because a file
  # was moved from it to another directory) and the file fsync leaves a log tree
  # with file extent items to replay.
  mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  # Now simulate a power failure and mount the filesystem to replay the log tree.
  # After the log tree was replayed, we used to hit a BUG_ON() when processing
  # the root orphan item for the deleted snapshot. This is because when processing
  # an orphan root the code expected to be the first code inserting the root into
  # the fs_info->fs_root_radix radix tree, while in reallity it was the second
  # caller attempting to do it - the first caller was the transaction commit that
  # took place after replaying the log tree, when updating the qgroup counters.
  _flakey_drop_and_remount

  echo "File digest before after failure:"
  # Must match what he got before the power failure.
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  _unmount_flakey
  status=0
  exit

Fixes: 2d9e977610 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
Cc: stable@vger.kernel.org  # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-03 15:28:59 -08:00
Filipe Manana
5e33a2bd7c Btrfs: do not collect ordered extents when logging that inode exists
When logging that an inode exists, for example as part of a directory
fsync operation, we were collecting any ordered extents for the inode but
we ended up doing nothing with them except tagging them as processed, by
setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
collecting and processing them. This created a time window where a second
fsync against the inode, using the fast path, ended up not logging the
checksums for the new extents but it logged the extents since they were
part of the list of modified extents. This happened because the ordered
extents were not collected and checksums were not yet added to the csum
tree - the ordered extents have not gone through btrfs_finish_ordered_io()
yet (which is where we add them to the csum tree by calling
inode.c:add_pending_csums()).

So fix this by not collecting an inode's ordered extents if we are logging
it with the LOG_INODE_EXISTS mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:47 -08:00
Filipe Manana
affc0ff902 Btrfs: fix race when checking if we can skip fsync'ing an inode
If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
returns false, it's possible that we had an ordered extent in progress
(btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
last_trans field was not greater than the id of the last committed
transaction, but shortly after, before we checked if there were any
ongoing ordered extents, the ordered extent had just completed and
removed itself from the inode's ordered tree, in which case we end up not
logging the inode, losing some data if a power failure or crash happens
after the fsync handler returns and before the transaction is committed.

Fix this by checking first if there are any ongoing ordered extents
before comparing the inode's last_trans with the id of the last committed
transaction - when it completes, an ordered extent always updates the
inode's last_trans before it removes itself from the inode's ordered
tree (at btrfs_finish_ordered_io()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:44 -08:00
Filipe Manana
daac7ba61a Btrfs: fix listxattrs not listing all xattrs packed in the same item
In the listxattrs handler, we were not listing all the xattrs that are
packed in the same btree item, which happens when multiple xattrs have
a name that when crc32c hashed produce the same checksum value.

Fix this by processing them all.

The following test case for xfstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/attr

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_attrs

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a few xattrs. The first 3 xattrs have a name
  # that when given as input to a crc32c function result in the same checksum.
  # This made btrfs list only one of the xattrs through listxattrs system call
  # (because it packs xattrs with the same name checksum into the same btree
  # item).
  touch $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile

  # Now call getfattr with --dump, which calls the listxattrs system call.
  # It should list all the xattrs we have set before.
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:41 -08:00
Filipe Manana
ade770294d Btrfs: fix deadlock between direct IO reads and buffered writes
While running a test with a mix of buffered IO and direct IO against
the same files I hit a deadlock reported by the following trace:

[11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
[11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
[11642.149771]     0 15282      2 0x00000000
[11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
[11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
[11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
[11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
[11642.161403] Call Trace:
[11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
[11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
[11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
[11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
[11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.226295] 2 locks held by kworker/u32:3/15282:
[11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
[11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
[11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
[11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
[11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
[11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
[11642.243715] Call Trace:
[11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
[11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
[11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
[11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
[11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
[11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
[11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.282604] 3 locks held by kworker/u32:8/15289:
[11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
[11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
[11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
[11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
[11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
[11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
[11642.298433] Call Trace:
[11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
[11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
[11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
[11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
[11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
[11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
[11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
[11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
[11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.310403] 3 locks held by fdm-stress/26848:
[11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
[11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
[11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
[11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
[11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
[11642.325096] Call Trace:
[11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
[11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
[11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
[11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
[11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
[11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
[11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
[11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
[11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
[11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
[11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
[11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
[11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
[11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
[11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
[11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
[11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
[11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
[11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
[11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.350222] 4 locks held by fdm-stress/26849:
[11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
[11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
[11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
[11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
[11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
[11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
[11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
[11642.373430] Call Trace:
[11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
[11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
[11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
[11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
[11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
[11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
[11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
[11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
[11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
[11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
[11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
[11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
[11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
[11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
[11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
[11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
[11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.416572] 1 lock held by fdm-stress/26850:
[11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
[11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
[11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
[11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
[11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
[11642.426591] Call Trace:
[11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
[11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
[11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
[11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
[11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
[11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
[11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
[11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
[11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
[11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.441533] 3 locks held by fdm-stress/26851:
[11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
[11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
[11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]

This happened because under specific timings the path for direct IO reads
can deadlock with concurrent buffered writes. The diagram below shows how
this happens for an example file that has the following layout:

     [  extent A  ]  [  extent B  ]  [ ....
     0K              4K              8K

     CPU 1                                               CPU 2                             CPU 3

DIO read against range
 [0K, 8K[ starts

btrfs_direct_IO()
  --> calls btrfs_get_blocks_direct()
      which finds the extent map for the
      extent A and leaves the range
      [0K, 4K[ locked in the inode's
      io tree

                                                   buffered write against
                                                   range [4K, 8K[ starts

                                                   __btrfs_buffered_write()
                                                     --> dirties page at 4K

                                                                                     a user space
                                                                                     task calls sync
                                                                                     for e.g or
                                                                                     writepages() is
                                                                                     invoked by mm

                                                                                     writepages()
                                                                                       run_delalloc_range()
                                                                                         cow_file_range()
                                                                                           --> ordered extent X
                                                                                               for the buffered
                                                                                               write is created
                                                                                               and
                                                                                               writeback starts

  --> calls btrfs_get_blocks_direct()
      again, without submitting first
      a bio for reading extent A, and
      finds the extent map for extent B

  --> calls lock_extent_direct()

      --> locks range [4K, 8K[
      --> finds ordered extent X
          covering range [4K, 8K[
      --> unlocks range [4K, 8K[

                                                  buffered write against
                                                  range [0K, 8K[ starts

                                                  __btrfs_buffered_write()
                                                    prepare_pages()
                                                      --> locks pages with
                                                          offsets 0 and 4K
                                                    lock_and_cleanup_extent_if_need()
                                                      --> blocks attempting to
                                                          lock range [0K, 8K[ in
                                                          the inode's io tree,
                                                          because the range [0, 4K[
                                                          is already locked by the
                                                          direct IO task at CPU 1

      --> calls
          btrfs_start_ordered_extent(oe X)

          btrfs_start_ordered_extent(oe X)

            --> At this point writeback for ordered
                extent X has not finished yet

            filemap_fdatawrite_range()
              btrfs_writepages()
                extent_writepages()
                  extent_write_cache_pages()
                    --> finds page with offset 0
                        with the writeback tag
                        (and not dirty)
                    --> tries to lock it
                         --> deadlock, task at CPU 2
                             has the page locked and
                             is blocked on the io range
                             [0, 4K[ that was locked
                             earlier by this task

So fix this by falling back to a buffered read in the direct IO read path
when an ordered extent for a buffered write is found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:37 -08:00
Filipe Manana
f4dfe68710 Btrfs: fix extent_same allowing destination offset beyond i_size
When using the same file as the source and destination for a dedup
(extent_same ioctl) operation we were allowing it to dedup to a
destination offset beyond the file's size, which doesn't make sense and
it's not allowed for the case where the source and destination files are
not the same file. This made de deduplication operation successful only
when the source range corresponded to a hole, a prealloc extent or an
extent with all bytes having a value of 0x00. This was also leaving a
file hole (between i_size and destination offset) without the
corresponding file extent items, which can be reproduced with the
following steps for example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt/sdi

  $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
  wrote 404990/404990 bytes at offset 304457
  395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)

  $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
  Deduping 2 total files
  (28672, 24576): /mnt/sdi/foobar
  (929792, 24576): /mnt/sdi/foobar
  1 files asked to be deduped
  i: 0, status: 0, bytes_deduped: 24576
  24576 total bytes deduped in this operation

  $ umount /mnt/sdi
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 257 errors 100, file extent discount
  Found file extent holes:
          start: 712704, len: 217088
  found 540673 bytes used err is 1
  total csum bytes: 400
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 123675
  file data blocks allocated: 671744
    referenced 671744
  btrfs-progs v4.2.3

So fix this by not allowing the destination to go beyond the file's size,
just as we do for the same where the source and destination files are not
the same.

A test for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:33 -08:00
Filipe Manana
2be63d5ce9 Btrfs: fix file loss on log replay after renaming a file and fsync
We have two cases where we end up deleting a file at log replay time
when we should not. For this to happen the file must have been renamed
and a directory inode must have been fsynced/logged.

Two examples that exercise these two cases are listed below.

  Case 1)

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir -p /mnt/a/b
  $ mkdir /mnt/c
  $ touch /mnt/a/b/foo
  $ sync
  $ mv /mnt/a/b/foo /mnt/c/
  # Create file bar just to make sure the fsync on directory a/ does
  # something and it's not a no-op.
  $ touch /mnt/a/bar
  $ xfs_io -c "fsync" /mnt/a
  < power fail / crash >

  The next time the filesystem is mounted, the log replay procedure
  deletes file foo.

  Case 2)

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/foo
  $ ln /mnt/a/foo /mnt/b/foo_link
  $ touch /mnt/b/bar
  $ sync
  $ unlink /mnt/b/foo_link
  $ mv /mnt/b/bar /mnt/c/
  $ xfs_io -c "fsync" /mnt/a/foo
  < power fail / crash >

  The next time the filesystem is mounted, the log replay procedure
  deletes file bar.

The reason why the files are deleted is because when we log inodes
other then the fsync target inode, we ignore their last_unlink_trans
value and leave the log without enough information to later replay the
rename operations. So we need to look at the last_unlink_trans values
and fallback to a transaction commit if they are greater than the
id of the last committed transaction.

So fix this by looking at the last_unlink_trans values and fallback to
transaction commits when needed. Also, when logging other inodes (for
case 1 we logged descendants of the fsync target inode while for case 2
we logged ascendants) we need to care about concurrent tasks updating
the last_unlink_trans of inodes we are logging (which was already an
existing problem in check_parent_dirs_for_sync()). Since we can not
acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
deadlocks with other concurrent operations that acquire the i_mutex of
2 inodes (other fsyncs or renames for example), we need to serialize on
the log_mutex of the inode we are logging. A task setting a new value for
an inode's last_unlink_trans must acquire the inode's log_mutex and it
must do this update before doing the actual unlink operation (which is
already the case except when deleting a snapshot). Conversely the task
logging the inode must first log the inode and then check the inode's
last_unlink_trans value while holding its log_mutex, as if its value is
not greater then the id of the last committed transaction it means it
logged a safe state of the inode's items, while if its value is not
smaller then the id of the last committed transaction it means the inode
state it has logged might not be safe (the concurrent task might have
just updated last_unlink_trans but hasn't done yet the unlink operation)
and therefore a transaction commit must be done.

Test cases for xfstests follow in separate patches.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:29 -08:00
Filipe Manana
1ec9a1ae1e Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.

For example:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
  $ btrfs subvolume delete /mnt/testdir/snap
  $ xfs_io -c "fsync" /mnt/testdir
  < crash / power failure and reboot >
  $ mount /dev/sdc /mnt
  mount: mount(2) failed: No such file or directory

And in dmesg/syslog we get the following message and trace:

[192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[192066.363010] ------------[ cut here ]------------
[192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
[192066.367250] BTRFS: Transaction aborted (error -2)
[192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G        W       4.4.0-rc6-btrfs-next-20+ #1
[192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[192066.380889]  0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
[192066.382561]  ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
[192066.384191]  ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
[192066.385827] Call Trace:
[192066.386373]  [<ffffffff81257570>] dump_stack+0x4e/0x79
[192066.387387]  [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
[192066.388429]  [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.389236]  [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
[192066.389884]  [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.390621]  [<ffffffff81184b55>] ? iput+0xb0/0x266
[192066.391200]  [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[192066.391930]  [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
[192066.392715]  [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
[192066.393510]  [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
[192066.394241]  [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
[192066.394958]  [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
[192066.395628]  [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
[192066.396790]  [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
[192066.397891]  [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
[192066.398897]  [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
[192066.399823]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.400739]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.401700]  [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.402482]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.403930]  [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
[192066.404831]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.405726]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.406621]  [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.407401]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.408247]  [<ffffffff8118ae36>] do_mount+0x893/0x9d2
[192066.409047]  [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
[192066.409842]  [<ffffffff8118b187>] SyS_mount+0x75/0xa1
[192066.410621]  [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
[192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
[192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
[192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
[192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
[192066.444613] BTRFS: open_ctree failed

This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.

So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Some alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree
roots, amongst other steps.

A test case for xfstests that triggers the issue follows.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create a snapshot at the root of our filesystem (mount point path), delete it,
  # fsync the mount point path, crash and mount to replay the log. This should
  # succeed and after the filesystem is mounted the snapshot should not be visible
  # anymore.
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
  _flakey_drop_and_remount
  [ -e $SCRATCH_MNT/snap1 ] && \
      echo "Snapshot snap1 still exists after log replay"

  # Similar scenario as above, but this time the snapshot is created inside a
  # directory and not directly under the root (mount point path).
  mkdir $SCRATCH_MNT/testdir
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
  _flakey_drop_and_remount
  [ -e $SCRATCH_MNT/testdir/snap2 ] && \
      echo "Snapshot snap2 still exists after log replay"

  _unmount_flakey

  echo "Silence is golden"
  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:25 -08:00
Chris Mason
c05c5ee5ea Btrfs patchsets for 4.6
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJW0GSnAAoJEMVl1fnXbVg75qAP/0xbZPJtvTgRMSRnARtFJ28w
 vCsxqY+AatNJDuEpg2My/vscZvAVXGcTWjnM8NkXMMKN+oags47QN4qD0cuNv2kI
 JWcz7Ppt3GY6lcQbTj/Ce6N8RPRCNGsU7vxev+sKZ+jjXn+vuc+wKXnyJgaL1qcN
 XhcP2MccrXTVVJXLbGMFoaJXWWfd2i9uJ2MplmjFP7HQi5zP+5t/dsVaAQbc1dqx
 2TqgTJkUEPQqK8geAKom5wdLTmpLSgMWvg1m4lkYpDO89Fi+hFAKeeuJZvNutxVa
 hA0QLrLyZmr4tbZhM1of35Kl7N1uwCzOd8u6xsxurB12bibz67RbQpK+fazlCjKa
 wZJvJV+N3gqgCusLHlXYX0YalQxpWRQiKkjzpMy3Pq4K4soLrw20tQOnnBFhLR1y
 ZwqmZUN33lhFNCIWqLS4BLqDG+Z7Sf2aGhFtspMDjSUJe9gLbIpvH9sW6CexJI2r
 FnxTaVZ08uY0ky1dvZcRDR6zDDbVUpoQKWmwdZpxoEO1eLKjD01VsMOw5zlAaxdc
 a5SxKMVt0Gq56oTPgp0MuLHJr20pxx03yr+yl69VM8R1dAG/y61Dq5DwiFNQ8+J6
 jrX+eVYGBgTNYw/UGb14UPwVjQFFEs/vouphy6MmOVvNz+YZI6thN1uScB0vw7BV
 p/oFts5Fo0ipJgaBzGu4
 =CRdD
 -----END PGP SIGNATURE-----

Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6

Btrfs patchsets for 4.6
2016-03-01 08:13:56 -08:00
David Sterba
f5bc27c71a Merge branch 'dev/control-ioctl' into for-chris-4.6 2016-02-26 15:38:34 +01:00
David Sterba
fa695b01bc Merge branch 'misc-4.6' into for-chris-4.6
# Conflicts:
#	fs/btrfs/file.c
2016-02-26 15:38:34 +01:00
David Sterba
f004fae0cf Merge branch 'cleanups-4.6' into for-chris-4.6 2016-02-26 15:38:33 +01:00
David Sterba
675d276b32 Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6 2016-02-26 15:38:32 +01:00
David Sterba
e9ddd77a31 Merge branch 'foreign/josef/space-updates' into for-chris-4.6 2016-02-26 15:38:31 +01:00
David Sterba
ff7db6e05a Merge branch 'foreign/zhaolei/reada' into for-chris-4.6 2016-02-26 15:38:30 +01:00
David Sterba
23c1a966f2 Merge branch 'foreign/qu/norecovery-v7' into for-chris-4.6 2016-02-26 15:38:30 +01:00
David Sterba
67d605fec1 Merge branch 'dev/rename-keys' into for-chris-4.6 2016-02-26 15:38:29 +01:00
David Sterba
e22b3d1fbe Merge branch 'dev/gfp-flags' into for-chris-4.6 2016-02-26 15:38:28 +01:00
David Sterba
5f1b5664d9 Merge branch 'chandan/prep-subpage-blocksize' into for-chris-4.6
# Conflicts:
#	fs/btrfs/file.c
2016-02-26 15:38:28 +01:00
Liu Bo
73beece9ca Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,

[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1226.652955]  Possible interrupt unsafe locking scenario:

[ 1226.652955]        CPU0                    CPU1
[ 1226.652955]        ----                    ----
[ 1226.652955]   lock(&fs_info->dev_replace.lock);
[ 1226.652955]                                local_irq_disable();
[ 1226.652955]                                lock(&delayed_node->mutex);
[ 1226.652955]                                lock(&found->groups_sem);
[ 1226.652955]   <Interrupt>
[ 1226.652955]     lock(&delayed_node->mutex);
[ 1226.652955]
 *** DEADLOCK ***

Commit 084b6e7c76 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.

The above lock chain comes from
btrfs_commit_transaction
  ->btrfs_run_delayed_items
    ...
    ->__btrfs_update_delayed_inode
      ...
      ->__btrfs_cow_block
         ...
         ->find_free_extent
            ->cache_block_group
              ->load_free_space_cache
                ->btrfs_readpages
                  ->submit_one_bio
                    ...
                    ->__btrfs_map_block
                      ->btrfs_dev_replace_lock

However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to

[ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700

delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.

To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.

With this, btrfs/011 no more produces warnings in dmesg.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 13:10:10 +01:00
David Sterba
d5131b658c btrfs: drop unused argument in btrfs_ioctl_get_supported_features
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:56:35 +01:00
David Sterba
c5868f8362 btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
The control device is accessible when no filesystem is mounted and we
may want to query features supported by the module. This is already
possible using the sysfs files, this ioctl is for parity and
convenience.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:56:21 +01:00
David Sterba
f7e98a7fff btrfs: change max_inline default to 2048
The current practical default is ~4k on x86_64 (the logic is more complex,
simplified for brevity), the inlined files land in the metadata group and
thus consume space that could be needed for the real metadata.

The inlining brings some usability surprises:

1) total space consumption measured on various filesystems and btrfs
   with DUP metadata was quite visible because of the duplicated data
   within metadata

2) inlined data may exhaust the metadata, which are more precious in case
   the entire device space is allocated to chunks (ie. balance cannot
   make the space more compact)

3) performance suffers a bit as the inlined blocks are duplicate and
   stored far away on the device.

Proposed fix: set the default to 2048

This fixes namely 1), the total filesysystem space consumption will be on
par with other filesystems.

Partially fixes 2), more data are pushed to the data block groups.

The characteristics of 3) are based on actual small file size
distribution.

The change is independent of the metadata blockgroup type (though it's
most visible with DUP) or system page size as these parameters are not
trival to find out, compared to file size.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:55:27 +01:00
David Sterba
11ea474f74 btrfs: remove error message from search ioctl for nonexistent tree
Let's remove the error message that appears when the tree_id is not
present. This can happen with the quota tree and has been observed in
practice. The applications are supposed to handle -ENOENT and we don't
need to report that in the system log as it's not a fatal error.

Reported-by: Vlastimil Babka <vbabka@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:54:48 +01:00
Arnd Bergmann
f827ba9a64 btrfs: avoid uninitialized variable warning
With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:

fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196ae7 ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:42:46 +01:00
Linus Torvalds
ce6b71432d Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "My for-linus-4.5 branch has a btrfs DIO error passing fix.

  I know how much you love DIO, so I'm going to suggest against reading
  it.  We'll follow up with a patch to drop the error arg from
  dio_end_io in the next merge window."

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix direct IO requests not reporting IO error to user space
2016-02-19 13:40:42 -08:00
Kinglong Mee
aa66b0bb08 btrfs: fix memory leak of fs_info in block group cache
When starting up linux with btrfs filesystem, I got many memory leak
messages by kmemleak as,

unreferenced object 0xffff880066882000 (size 4096):
  comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811d09aa>] kmem_cache_alloc_trace+0xea/0x1e0
    [<ffffffffa03620fb>] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs]
    [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
    [<ffffffffa0360aa9>] btrfs_test_free_space_cache+0x39/0xed0 [btrfs]
    [<ffffffffa03b5a74>] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs]
    [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
    [<ffffffff811765aa>] do_init_module+0x5e/0x1e9
    [<ffffffff810fec09>] load_module+0x20a9/0x2690
    [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
    [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800573f8000 (size 10256):
  comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8119ca6e>] kmalloc_order+0x5e/0x70
    [<ffffffff8119caa4>] kmalloc_order_trace+0x24/0x90
    [<ffffffffa03620b3>] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs]
    [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
    [<ffffffffa036603d>] run_test+0xfd/0x320 [btrfs]
    [<ffffffffa0366f34>] btrfs_test_free_space_tree+0x94/0xee [btrfs]
    [<ffffffffa03b5aab>] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs]
    [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
    [<ffffffff811765aa>] do_init_module+0x5e/0x1e9
    [<ffffffff810fec09>] load_module+0x20a9/0x2690
    [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
    [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

This patch lets btrfs using fs_info stored in btrfs_root for
block group cache directly without allocating a new one.

Fixes: d0bd456074 ("Btrfs: add fragment=* debug mount option")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 13:28:24 +01:00
Zhao Lei
4da2e26a2a btrfs: Continue write in case of can_not_nocow
btrfs failed in xfstests btrfs/080 with -o nodatacow.

Can be reproduced by following script:
  DEV=/dev/vdg
  MNT=/mnt/tmp

  umount $DEV &>/dev/null
  mkfs.btrfs -f $DEV
  mount -o nodatacow $DEV $MNT

  dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
  btrfs subvolume snapshot -r $MNT $MNT/test_snap &
  wait
  --
  We can see dd failed on NO_SPACE.

Reason:
  __btrfs_buffered_write should run cow write when no_cow impossible,
  and current code is designed with above logic.
  But check_can_nocow() have 2 type of return value(0 and <0) on
  can_not_no_cow, and current code only continue write on first case,
  the second case happened in doing subvolume.

Fix:
  Continue write when check_can_nocow() return 0 and <0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
2016-02-18 13:18:06 +01:00
Kinglong Mee
5598e9005a btrfs: drop null testing before destroy functions
Cleanup.

kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Sudip Mukherjee
89771cc98c btrfs: fix build warning
We were getting build warning about:
fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used
	uninitialized in this function

It is not a valid warning as used_bg is never used uninitilized since
locked is initially false so we can never be in the section where
'used_bg' is used. But gcc is not able to understand that and we can
initialize it while declaring to silence the warning.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
David Sterba
47dc196ae7 btrfs: use proper type for failrec in extent_state
We use the private member of extent_state to store the failrec and play
pointless pointer games.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Deepa Dinamani
04b285f35e btrfs: Replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Dave Jones
8f682f6955 btrfs: remove open-coded swap() in backref.c:__merge_refs
The kernel provides a swap() that does the same thing as this code.

Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:45:55 +01:00
Byongho Lee
ac1407ba24 btrfs: remove redundant error check
While running btrfs_mksubvol(), d_really_is_positive() is called twice.
First in btrfs_mksubvol() and second inside btrfs_may_create().  So I
remove the first one.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:35:27 +01:00
Byongho Lee
0138b6fe8f btrfs: simplify expression in btrfs_calc_trans_metadata_size()
Simplify expression in btrfs_calc_trans_metadata_size().

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:33:17 +01:00
Josef Bacik
baee879064 Btrfs: check reserved when deciding to background flush
We will sometimes start background flushing the various enospc related things
(delayed nodes, delalloc, etc) if we are getting close to reserving all of our
available space.  We don't want to do this however when we are actually using
this space as it causes unneeded thrashing.  We currently try to do this by
checking bytes_used >= thresh, but bytes_used is only part of the equation, we
need to use bytes_reserved as well as this represents space that is very likely
to become bytes_used in the future.

My tracing tool will keep count of the number of times we kick off the async
flusher, the following are counts for the entire run of generic/027

		No Patch	Patch
avg: 		5385		5009
median:		5500		4916

We skewed lower than the average with my patch and higher than the average with
the patch, overall it cuts the flushing from anywhere from 5-10%, which in the
case of actual ENOSPC is quite helpful.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:29:43 +01:00
Josef Bacik
88d3a5aaf6 Btrfs: add transaction space reservation tracepoints
There are a few places where we add to trans->bytes_reserved but don't have the
corresponding trace point.  With these added my tool no longer sees transaction
leaks.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:41 +01:00
Josef Bacik
dc95f7bfc5 Btrfs: fix truncate_space_check
truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to
multiply by nodesize so we get an actual byte count.  We need a tracepoint here
so that we have the matching reserve for the release that will come later.  Also
add a comment to make clear what the intent of truncate_space_check is.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:24 +01:00
Josef Bacik
fb4b10e5d5 Btrfs: change how we update the global block rsv
I'm writing a tool to visualize the enospc system in order to help debug enospc
bugs and I found weird data and ran it down to when we update the global block
rsv.  We add all of the remaining free space to the block rsv, do a trace event,
then remove the extra and do another trace event.  This makes my visualization
look silly and is unintuitive code as well.  Fix this stuff to only add the
amount we are missing, or free the amount we are missing.  This is less clean to
read but more explicit in what it is doing, as well as only emitting events for
values that make sense.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:21:48 +01:00
Zhao Lei
7aff8cf4a6 btrfs: reada: ignore creating reada_extent for a non-existent device
For a non-existent device, old code bypasses adding it in dev's reada
queue.

And to solve problem of unfinished waitting in raid5/6,
commit 5fbc7c59fd ("Btrfs: fix unfinished readahead thread for
raid5/6 degraded mounting")
adding an exception for the first stripe, in short, the first
stripe will always be processed whether the device exists or not.

Actually we have a better way for the above request: just bypass
creation of the reada_extent for non-existent device, it will make
code simple and effective.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
4fe7a0e138 btrfs: reada: avoid undone reada extents in btrfs_reada_wait
Reada background works is not designed to finish all jobs
completely, it will break in following case:
1: When a device reaches workload limit (MAX_IN_FLIGHT)
2: Total reads reach max limit (10000)
3: All devices don't have queued more jobs, often happened in DUP case

And if all background works exit with remaining jobs,
btrfs_reada_wait() will wait indefinetelly.

Above problem is rarely happened in old code, because:
1: Every work queues 2x new works
   So many works reduced chances of undone jobs.
2: One work will continue 10000 times loop in case of no-jobs
   It reduced no-thread window time.

But after we fixed above case, the "undone reada extents" frequently
happened.

Fix:
 Check to ensure we have at least one thread if there are undone jobs
 in btrfs_reada_wait().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
2fefd5583f btrfs: reada: limit max works count
Reada creates 2 works for each level of tree recursively.

In case of a tree having many levels, the number of created works
is 2^level_of_tree.
Actually we don't need so many works in parallel, this patch limits
max works to BTRFS_MAX_MIRRORS * 2.

The per-fs works_counter will be also used for btrfs_reada_wait() to
check is there are background workers.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
895a11b868 btrfs: reada: simplify dev->reada_in_flight processing
No need to decrease dev->reada_in_flight in __readahead_hook()'s
internal and reada_extent_put().
reada_extent_put() have no chance to decrease dev->reada_in_flight
in free operation, because reada_extent have additional refcnt when
scheduled to a dev.

We can put inc and dec operation for dev->reada_in_flight to one
place instead to make logic simple and safe, and move useless
reada_extent->scheduled_for to a bool flag instead.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
8afd6841e1 btrfs: reada: Fix a debug code typo
Remove one copy of loop to fix the typo of iterate zones.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
57f16e0826 btrfs: reada: Jump into cleanup in direct way for __readahead_hook()
Current code set nritems to 0 to make for_loop useless to bypass it,
and set generation's value which is not necessary.
Jump into cleanup directly is better choise.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
02873e4325 btrfs: reada: Use fs_info instead of root in __readahead_hook's argument
What __readahead_hook() need exactly is fs_info, no need to convert
fs_info to root in caller and convert back in __readahead_hook()

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
6e39dbe8b9 btrfs: reada: Pass reada_extent into __readahead_hook directly
reada_start_machine_dev() already have reada_extent pointer, pass
it into __readahead_hook() directly instead of search radix_tree
will make code run faster.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
b257cf5006 btrfs: reada: move reada_extent_put to place after __readahead_hook()
We can't release reada_extent earlier than __readahead_hook(), because
__readahead_hook() still need to use it, it is necessary to hode a refcnt
to avoid it be freed.

Actually it is not a problem after my patch named:
  Avoid many times of empty loop
It make reada_extent in above line include at least one reada_extctl,
which keeps additional one refcnt for reada_extent.

But we still need this patch to make the code in pretty logic.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
1e7970c0f3 btrfs: reada: Remove level argument in severial functions
level is not used in severial functions, remove them from arguments,
and remove relative code for get its value.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
3194502118 btrfs: reada: bypass adding extent when all zone failed
When failed adding all dev_zones for a reada_extent, the extent
will have no chance to be selected to run, and keep in memory
for ever.

We should bypass this extent to avoid above case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
6a159d2ae4 btrfs: reada: add all reachable mirrors into reada device list
If some device is not reachable, we should bypass and continus addingb
next, instead of break on bad device.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
a3f7fde243 btrfs: reada: Move is_need_to_readahead contition earlier
Move is_need_to_readahead contition earlier to avoid useless loop
to get relative data for readahead.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:10 +01:00
Ingo Molnar
3a2f2ac9b9 Merge branch 'x86/urgent' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 09:28:03 +01:00
Zhao Lei
97d5f0e63d btrfs: reada: Avoid many times of empty loop
We can see following loop(10000 times) in trace_log:
 [   75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [   75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

 [   75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [   75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

 ...(10000 times)

 [  124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [  124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [  124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [  124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

Reason:
 If more than one user trigger reada in same extent, the first task
 finished setting of reada data struct and call reada_start_machine()
 to start, and the second task only add a ref_count but have not
 add reada_extctl struct completely, the reada_extent can not finished
 all jobs, and will be selected in __reada_start_machine() for 10000
 times(total times in __reada_start_machine()).

Fix:
 For a reada_extent without job, we don't need to run it, just return
 0 to let caller break.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Zhao Lei
8e9aa51f54 btrfs: reada: Add missed segment checking in reada_find_zone
In rechecking zone-in-tree, we still need to check zone include
our logical address.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Zhao Lei
c37f49c7ef btrfs: reada: reduce additional fs_info->reada_lock in reada_find_zone
We can avoid additional locking-acquirment and one pair of
kref_get/put by combine two condition.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Zhao Lei
503785306d btrfs: reada: Fix in-segment calculation for reada
reada_zone->end is end pos of segment:
 end = start + cache->key.offset - 1;

So we need to use "<=" in condition to judge is a pos in the
segment.

The problem happened rearly, because logical pos rarely pointed
to last 4k of a blockgroup, but we need to fix it to make code
right in logic.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Filipe Manana
1636d1d77e Btrfs: fix direct IO requests not reporting IO error to user space
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:

   dio_end_io(dio_bio, bio->bi_error);

It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().

This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.

Cc: stable@vger.kernel.org  # 4.3+
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-16 03:41:26 +00:00
Linus Torvalds
27c9d772e5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has a few fixes from Filipe, along with a readdir fix from Dave
  that we've been testing for some time"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: properly set the termination value of ctx->pos in readdir
  Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
  Btrfs: remove no longer used function extent_read_full_page_nolock()
  Btrfs: fix page reading in extent_same ioctl leading to csum errors
  Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
2016-02-12 09:21:28 -08:00
Qu Wenruo
fed8f166eb btrfs: Introduce new mount option alias for nologreplay
Introduce new mount option alias "norecovery" for nologreplay, to keep
"norecovery" behavior the same with other filesystems.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-12 15:14:49 +01:00
Qu Wenruo
96da09192c btrfs: Introduce new mount option to disable tree log replay
Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.

Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-12 15:14:49 +01:00
Qu Wenruo
8dcddfa048 btrfs: Introduce new mount option usebackuproot to replace recovery
Current "recovery" mount option will only try to use backup root.
However the word "recovery" is too generic and may be confusing for some
users.

Here introduce a new and more specific mount option, "usebackuproot" to
replace "recovery" mount option.
"Recovery" will be kept for compatibility reason, but will be
deprecated.

Also, since "usebackuproot" will only affect mount behavior and after
open_ctree() it has nothing to do with the filesystem, so clear the flag
after mount succeeded.

This provides the basis for later unified "norecovery" mount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
  docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-12 15:14:14 +01:00
David Sterba
9f07e1d76e btrfs: teach print_leaf about temporary item subtypes
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
585a3d0d23 btrfs: teach print_leaf about permanent item subtypes
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
242e2956e4 btrfs: switch dev stats item to the permanent item key
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
50c2d5abe6 btrfs: introduce key type for persistent permanent items
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree.

Similar to the temporary items, we'll introduce a new name for an
existing key value and use the objectid for further extension.  The
victim is the BTRFS_DEV_STATS_KEY (248).

The device stats are an example of a permanent item.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
c479cb4f14 btrfs: switch balance item to the temporary item key
No visible change.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
0bbbccb17f btrfs: introduce key type for persistent temporary items
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree. We'll introduce a new
name for an existing key value and use the objectid for further
extension.  The victim is the BTRFS_BALANCE_ITEM_KEY (248).

The nature of the balance status item is a good example of the temporary
item. It exists from beginning of the balance, keeps the status until it
finishes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
bc4ef7592f btrfs: properly set the termination value of ctx->pos in readdir
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.

There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.

We can get to that situation like that:

* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
  bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX

Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.

The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:

 Overflow: e
 Overflow: 7fffffff
 Overflow: 7fffffffffffffff
 PAX: size overflow detected in function btrfs_real_readdir
   fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
   context: dir_context;
 CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
  ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
  ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
  ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
 Call Trace:
  [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
  [<ffffffff811cb706>] report_size_overflow+0x36/0x40
  [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
  [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
  [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
  [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
 Overflow: 1a
  [<ffffffff811db070>] ? iterate_dir+0x150/0x150
  [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83

The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.

The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.

References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-02-11 07:01:59 -08:00
David Sterba
66722f7c05 btrfs: switch to kcalloc in btrfs_cmp_data_prepare
Kcalloc is functionally equivalent and does overflow checks.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
fd95ef56b1 btrfs: extent same: use GFP_KERNEL for page array allocations
We can safely use GFP_KERNEL in the functions called from the ioctl
handlers. Here we can allocate up to 32k so less pressure to the
allocator could help.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
78f2c9e6db btrfs: device add and remove: use GFP_KERNEL
We can safely use GFP_KERNEL in the functions called from the ioctl
handlers.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
49e350a491 btrfs: readdir: use GFP_KERNEL
Readdir is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
32fc932e30 btrfs: fallocate: use GFP_KERNEL
Fallocate is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
74e4d82757 btrfs: let callers of btrfs_alloc_root pass gfp flags
We don't need to use GFP_NOFS in all contexts, eg. during mount or for
dummy root tree, but we might for the the log tree creation.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
58c4e17384 btrfs: scrub: use GFP_KERNEL on the submission path
Scrub is not on the critical writeback path we don't need to use
GFP_NOFS for all allocations. The failures are handled and stats passed
back to userspace.

Let's use GFP_KERNEL on the paths where everything is ok, ie. setup the
global structures and the IO submission paths.

Functions that do the repair and fixups still use GFP_NOFS as we might
want to skip any other filesystem activity if we encounter an error.
This could turn out to be unnecessary, but requires more review compared
to the easy cases in this patch.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
ed0244faf5 btrfs: reada: use GFP_KERNEL everywhere
The readahead framework is not on the critical writeback path we don't
need to use GFP_NOFS for allocations. All error paths are handled and
the readahead failures are not fatal. The actual users (scrub,
dev-replace) will trigger reads if the blocks are not found in cache.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
e780b0d1c1 btrfs: send: use GFP_KERNEL everywhere
The send operation is not on the critical writeback path we don't need
to use GFP_NOFS for allocations. All error paths are handled and the
whole operation is restartable.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
Filipe Manana
0c0fe3b0fa Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
While doing some tests I ran into an hang on an extent buffer's rwlock
that produced the following trace:

[39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
[39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
[39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800016] irq event stamp: 0
[39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
[39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
[39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
[39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
[39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
[39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
[39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
[39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
[39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
[39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
[39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
[39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800016] Stack:
[39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
[39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
[39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
[39389.800016] Call Trace:
[39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
[39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
[39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
[39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
[39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
[39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
[39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
[39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
[39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
[39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
[39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800012] irq event stamp: 0
[39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
[39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
[39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
[39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
[39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
[39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
[39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
[39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
[39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
[39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
[39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
[39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
[39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800012] Stack:
[39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
[39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
[39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
[39389.800012] Call Trace:
[39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
[39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
[39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
[39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
[39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
[39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
[39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
[39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
[39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
[39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
[39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
[39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
[39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
[39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
[39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
[39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
[39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
[39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00

This happens because in the code path executed by the inode_paths ioctl we
end up nesting two calls to read lock a leaf's rwlock when after the first
call to read_lock() and before the second call to read_lock(), another
task (running the delayed items as part of a transaction commit) has
already called write_lock() against the leaf's rwlock. This situation is
illustrated by the following diagram:

         Task A                       Task B

  btrfs_ref_to_path()               btrfs_commit_transaction()
    read_lock(&eb->lock);

                                      btrfs_run_delayed_items()
                                        __btrfs_commit_inode_delayed_items()
                                          __btrfs_update_delayed_inode()
                                            btrfs_lookup_inode()

                                              write_lock(&eb->lock);
                                                --> task waits for lock

    read_lock(&eb->lock);
    --> makes this task hang
        forever (and task B too
	of course)

So fix this by avoiding doing the nested read lock, which is easily
avoidable. This issue does not happen if task B calls write_lock() after
task A does the second call to read_lock(), however there does not seem
to exist anything in the documentation that mentions what is the expected
behaviour for recursive locking of rwlocks (leaving the idea that doing
so is not a good usage of rwlocks).

Also, as a side effect necessary for this fix, make sure we do not
needlessly read lock extent buffers when the input path has skip_locking
set (used when called from send).

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-05 02:26:25 +00:00
Filipe Manana
7f042a8370 Btrfs: remove no longer used function extent_read_full_page_nolock()
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-03 19:27:10 +00:00
Filipe Manana
3131400230 Btrfs: fix page reading in extent_same ioctl leading to csum errors
In the extent_same ioctl, we were grabbing the pages (locked) and
attempting to read them without bothering about any concurrent IO
against them. That is, we were not checking for any ongoing ordered
extents nor waiting for them to complete, which leads to a race where
the extent_same() code gets a checksum verification error when it
reads the pages, producing a message like the following in dmesg
and making the operation fail to user space with -ENOMEM:

[18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868

Fix this by using btrfs_readpage() for reading the pages instead of
extent_read_full_page_nolock(), which waits for any concurrent ordered
extents to complete and locks the io range. Also do better error handling
and don't treat all failures as -ENOMEM, as that's clearly misleasing,
becoming identical to the checks and operation of prepare_uptodate_page().

The use of extent_read_full_page_nolock() was required before
commit f441460202 ("btrfs: fix deadlock with extent-same and readpage"),
as we had the range locked in an inode's io tree before attempting to
read the pages.

Fixes: f441460202 ("btrfs: fix deadlock with extent-same and readpage")
Cc: stable@vger.kernel.org   # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-03 19:27:10 +00:00
Filipe Manana
e0bd70c67b Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
In the extent_same ioctl we are getting the pages for the source and
target ranges and unlocking them immediately after, which is incorrect
because later we attempt to map them (with kmap_atomic) and access their
contents at btrfs_cmp_data(). When we do such access the pages might have
been relocated or removed from memory, which leads to an invalid memory
access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
which produces a trace like the following:

186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
unloaded: btrfs]
[186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G        W       4.4.0-rc6-btrfs-next-18+ #1
[186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
[186736.681319] RIP: 0010:[<ffffffff81264d00>]  [<ffffffff81264d00>] memcmp+0xb/0x22
[186736.681319] RSP: 0018:ffff880362287d70  EFLAGS: 00010287
[186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
[186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
[186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
[186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
[186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
[186736.681319] FS:  00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
[186736.681319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
[186736.681319] Stack:
[186736.681319]  ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
[186736.681319]  0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
[186736.681319]  00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
[186736.681319] Call Trace:
[186736.681319]  [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
[186736.681319]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[186736.681319]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[186736.681319]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[186736.681319]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[186736.681319]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[186736.681319]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0

(gdb) list *(btrfs_ioctl+0x24cb)
0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
2967                    dst_addr = kmap_atomic(dst_page);
2968
2969                    flush_dcache_page(src_page);
2970                    flush_dcache_page(dst_page);
2971
2972                    if (memcmp(addr, dst_addr, cmp_len))
2973                            ret = BTRFS_SAME_DATA_DIFFERS;
2974
2975                    kunmap_atomic(addr);
2976                    kunmap_atomic(dst_addr);

So fix this by making sure we keep the pages locked and respect the same
locking order as everywhere else: get and lock the pages first and then
lock the range in the inode's io tree (like for example at
__btrfs_buffered_write() and extent_readpages()). If an ordered extent
is found after locking the range in the io tree, unlock the range,
unlock the pages, wait for the ordered extent to complete and repeat the
entire locking process until no overlapping ordered extents are found.

Cc: stable@vger.kernel.org   # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-03 19:27:09 +00:00
Chandan Rajendra
65bfa65807 Btrfs: btrfs_ioctl_clone: Truncate complete page after performing clone operation
In subpagesize-blocksize scenario, the "destination offset" argument passed to
the btrfs_ioctl_clone() can be aligned to sectorsize but may not be
necessarily aligned to the machine's page size. In such cases,
truncate_inode_pages_range() ends up zeroing out the partial page and future
read operations will return incorrect data. Hence this commit explicitly
rounds down the "destination offset" to the machine's page size.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
27772b68f6 Btrfs: Clean pte corresponding to page straddling i_size
When extending a file by either "truncate up" or by writing beyond i_size, the
page which had i_size needs to be marked "read only" so that future writes to
the page via mmap interface causes btrfs_page_mkwrite() to be invoked. If not,
a write performed after extending the file via the mmap interface will find
the page to be writaeable and continue writing to the page without invoking
btrfs_page_mkwrite() i.e. we end up writing to a file without reserving disk
space.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
5a2834f808 Btrfs: Fix block size returned to user space
btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining block size
from inode->i_blkbits), just remove the statement from btrfs_getattr.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
0c29ba993e Btrfs: Limit inline extents to root->sectorsize
cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
5f4dc8fc83 Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
In subpagesize-blocksize scenario, map_length can be less than the length of a
bio vector. Such a condition may cause btrfs_submit_direct_hook() to submit a
zero length bio. Fix this by comparing map_length against block size rather
than with bv_len.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
298cfd3683 Btrfs: Use (eb->start, seq) as search key for tree modification log
In subpagesize-blocksize a page can map multiple extent buffers and hence
using (page index, seq) as the search key is incorrect. For example, searching
through tree modification log tree can return an entry associated with the
first extent buffer mapped by the page (if such an entry exists), when we are
actually searching for entries associated with extent buffers that are mapped
at position 2 or more in the page.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
dbfdb6d1b3 Btrfs: Search for all ordered extents that could span across a page
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
d0b7da88f6 Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
In subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
9703fefe0b Btrfs: fallocate: Work with sectorsized blocks
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
2dabb32484 Btrfs: Direct I/O read: Work on sectorsized blocks
The direct I/O read's endio and corresponding repair functions work on
page sized blocks. This commit adds the ability for direct I/O read to work on
subpagesized blocks.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:23:47 +01:00
Chandan Rajendra
c40a3d38af Btrfs: Compute and look up csums based on sectorsized blocks
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
look up code to work with sectorsize units.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:23:47 +01:00
Chandan Rajendra
2e78c927d7 Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this by doing reservation/releases in block size units.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:23:47 +01:00
Borislav Petkov
bc696ca05f x86/cpufeature: Replace the old static_cpu_has() with safe variant
So the old one didn't work properly before alternatives had run.
And it was supposed to provide an optimized JMP because the
assumption was that the offset it is jumping to is within a
signed byte and thus a two-byte JMP.

So I did an x86_64 allyesconfig build and dumped all possible
sites where static_cpu_has() was used. The optimization amounted
to all in all 12(!) places where static_cpu_has() had generated
a 2-byte JMP. Which has saved us a whopping 36 bytes!

This clearly is not worth the trouble so we can remove it. The
only place where the optimization might count - in __switch_to()
- we will handle differently. But that's not subject of this
patch.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 11:22:18 +01:00
Linus Torvalds
d3f71ae711 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Dave had a small collection of fixes to the new free space tree code,
  one of which was keeping our sysfs files more up to date with feature
  bits as different things get enabled (lzo, raid5/6, etc).

  I should have kept the sysfs stuff for rc3, since we always manage to
  trip over something.  This time it was GFP_KERNEL from somewhere that
  is NOFS only.  Instead of rebasing it out I've put a revert in, and
  we'll fix it properly for rc3.

  Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
  use-after-free in our tracepoints that Dave Jones reported"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "btrfs: synchronize incompat feature bits with sysfs files"
  btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
  btrfs: sysfs: check initialization state before updating features
  Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
  btrfs: async-thread: Fix a use-after-free error for trace
  Btrfs: fix race between fsync and lockless direct IO writes
  btrfs: add free space tree to the cow-only list
  btrfs: add free space tree to lockdep classes
  btrfs: tweak free space tree bitmap allocation
  btrfs: tests: switch to GFP_KERNEL
  btrfs: synchronize incompat feature bits with sysfs files
  btrfs: sysfs: introduce helper for syncing bits with sysfs files
  btrfs: sysfs: add free-space-tree bit attribute
  btrfs: sysfs: fix typo in compat_ro attribute definition
2016-01-29 15:46:49 -08:00
Chris Mason
e410e34fad Revert "btrfs: synchronize incompat feature bits with sysfs files"
This reverts commit 14e46e0495.

This ends up doing sysfs operations from deep in balance (where we
should be GFP_NOFS) and under heavy balance load, we're making races
against sysfs internals.

Revert it for now while we figure things out.

Signed-off-by: Chris Mason <clm@fb.com>
2016-01-29 08:19:37 -08:00
Chris Mason
e1c0ebad3f btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
This was copied incorrectly from the __vmalloc call.

Signed-off-by: Chris Mason <clm@fb.com>
2016-01-27 07:05:49 -08:00
Chris Mason
d32a4e3434 Merge branch 'dev/fst-followup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-27 05:48:23 -08:00
David Sterba
bf6092066f btrfs: sysfs: check initialization state before updating features
If the mount phase is not finished, we can't update the sysfs files.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-27 05:40:10 -08:00
David Sterba
80ad623edd Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
This reverts commit 6962491321. The
cleaner thread can block freezing when there's a snapshot cleaning in
progress and the other threads get suspended first. From the logs
provided by Martin we're waiting for reading extent pages:

kernel: PM: Syncing filesystems ... done.
kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
kernel: Freezing remaining freezable tasks ...
kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
kernel: btrfs-cleaner   D ffff88033dd13bc0     0   152      2 0x00000000
kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
kernel: Call Trace:
kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16

As this affects a released kernel (4.4) we need a minimal fix for
stable kernels.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361
Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
CC: stable@vger.kernel.org # 4.4
CC: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25 16:50:27 -08:00
Qu Wenruo
0a95b85137 btrfs: async-thread: Fix a use-after-free error for trace
Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
So no one use use that pointer after queue_work().

Fix the user-after-free bug by move the trace line before queue_work().

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25 16:50:26 -08:00
Filipe Manana
de0ee0edb2 Btrfs: fix race between fsync and lockless direct IO writes
An fsync, using the fast path, can race with a concurrent lockless direct
IO write and end up logging a file extent item that points to an extent
that wasn't written to yet. This is because the fast fsync path collects
ordered extents into a local list and then collects all the new extent
maps to log file extent items based on them, while the direct IO write
path creates the new extent map before it creates the corresponding
ordered extent (and submitting the respective bio(s)).

So fix this by making the direct IO write path create ordered extents
before the extent maps and make the fast fsync path collect any new
ordered extents after it collects the extent maps.
Note that making the fsync handler call inode_dio_wait() (after acquiring
the inode's i_mutex) would not work and lead to a deadlock when doing
AIO, as through AIO we end up in a path where the fsync handler is called
(through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
before the inode's dio counter is decremented (inode_dio_wait() waits
for this counter to have a value of zero).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25 16:50:26 -08:00
Chris Mason
6b5aa88c86 Merge branch 'fix/fst-sysfs' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25 16:43:13 -08:00
David Sterba
3e4c5efbb3 btrfs: add free space tree to the cow-only list
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-25 16:48:07 +01:00
David Sterba
6b20e0ad2e btrfs: add free space tree to lockdep classes
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-25 16:48:06 +01:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
2101ae4289 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "These are mostly fixes that we've been testing, but also we grabbed
  and tested a few small cleanups that had been on the list for a while.

  Zhao Lei's patchset also fixes some early ENOSPC buglets"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (21 commits)
  btrfs: raid56: Use raid_write_end_io for scrub
  btrfs: Remove unnecessary ClearPageUptodate for raid56
  btrfs: use rbio->nr_pages to reduce calculation
  btrfs: Use unified stripe_page's index calculation
  btrfs: Fix calculation of rbio->dbitmap's size calculation
  btrfs: Fix no_space in write and rm loop
  btrfs: merge functions for wait snapshot creation
  btrfs: delete unused argument in btrfs_copy_from_user
  btrfs: Use direct way to determine raid56 write/recover mode
  btrfs: Small cleanup for get index_srcdev loop
  btrfs: Enhance chunk validation check
  btrfs: Enhance super validation check
  Btrfs: fix deadlock running delayed iputs at transaction commit time
  Btrfs: fix typo in log message when starting a balance
  btrfs: remove duplicate const specifier
  btrfs: initialize the seq counter in struct btrfs_device
  Btrfs: clean up an error code in btrfs_init_space_info()
  btrfs: fix iterator with update error in backref.c
  Btrfs: fix output of compression message in btrfs_parse_options()
  Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
  ...
2016-01-22 11:49:21 -08:00
David Sterba
79b134a22b btrfs: tweak free space tree bitmap allocation
The requested bitmap size varies, observed numbers were < 4K up to 16K.
Using vmalloc unconditionally would be too heavy, we'll try contiguous
allocations first and fall back to vmalloc if there's no contig memory.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-22 17:16:18 +01:00
David Sterba
8cce83ba50 btrfs: tests: switch to GFP_KERNEL
There's no reason to do GFP_NOFS in tests, it's not data-heavy and
memory allocation failures would affect only developers or testers.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-22 10:28:24 +01:00
David Sterba
14e46e0495 btrfs: synchronize incompat feature bits with sysfs files
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount (eg. the
LZO compression).

Synchronize the feature bits with the sysfs files representing them
right after we set/clear them.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21 18:54:41 +01:00
David Sterba
444e751698 btrfs: sysfs: introduce helper for syncing bits with sysfs files
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount. We're
going to sync them and need a helper to do that.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21 18:50:40 +01:00
David Sterba
3b5bb73bd8 btrfs: sysfs: add free-space-tree bit attribute
The incompat bit representing the newly added free space tree feature is
missing. Right now it will be listed only among features supported by
the module, not per-fs.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21 18:36:46 +01:00
David Sterba
ba2d084055 btrfs: sysfs: fix typo in compat_ro attribute definition
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-20 19:07:04 +01:00
Zhao Lei
a6111d11b8 btrfs: raid56: Use raid_write_end_io for scrub
No need to create additional end_io function for scrub, it increased
code size and introduced some un-unified lines, as:
raid_write_parity_end_io():
        int err = bio->bi_error;
        if (bio->bi_error)
raid_write_end_io():
        int err = bio->bi_error;
        if (err)

This patch combines them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:18 -08:00
Zhao Lei
748f4ef4c6 btrfs: Remove unnecessary ClearPageUptodate for raid56
PageUptodate flag already initialized to 0 for new page,
no need to set it again.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:17 -08:00
Zhao Lei
915e22903c btrfs: use rbio->nr_pages to reduce calculation
We can use rbio->stripe_npages to reduce unnecessary calculation in
many code place.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:16 -08:00
Zhao Lei
b7178a5f03 btrfs: Use unified stripe_page's index calculation
We are using different index calculation method for stripe_page in
current code:
1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index
2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index
3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index
...

They can get same result when stripe_len align to PAGE_CACHE_SIZE,
this is why current code can work, intruduce and use a common function
for calculation is a better choose.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:16 -08:00
Zhao Lei
bfca9a6d4b btrfs: Fix calculation of rbio->dbitmap's size calculation
Current code is trying to calculate rbio->dbitmap's size to make it
align to sizeof(long), but implement haven't achived this object,
it is align to sizeof(char) instead.
This patch fixed above calculation, and use sizeof(long) instead of
fixed "8" to increate compatibility.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:15 -08:00
Zhao Lei
e1746e8381 btrfs: Fix no_space in write and rm loop
I see no_space in v4.4-rc1 again in xfstests generic/102.
It happened randomly in some node only.
(one of 4 phy-node, and a kvm with non-virtio block driver)

By bisect, we can found the first-bad is:
 commit bdced438ac ("block: setup bi_phys_segments after splitting")'
But above patch only triggered the bug by making bio operation
faster(or slower).

Main reason is in our space_allocating code, we need to commit
page writeback before wait it complish, this patch fixed above
bug.

BTW, there is another reason for generic/102 fail, caused by
disable default mixed-blockgroup, I'll fix it in xfstests.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:14 -08:00
Zhao Lei
0bc19f9031 btrfs: merge functions for wait snapshot creation
wait_for_snapshot_creation() is in same group with oher two:
 btrfs_start_write_no_snapshoting()
 btrfs_end_write_no_snapshoting()

Rename wait_for_snapshot_creation() and move it into same place
with other two.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:13 -08:00
Zhao Lei
ee22f0c4ec btrfs: delete unused argument in btrfs_copy_from_user
size_t write_bytes is not necessary for btrfs_copy_from_user(),
delete it.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:13 -08:00
Zhao Lei
ad1ba2a0c4 btrfs: Use direct way to determine raid56 write/recover mode
Old code used bbio->raid_map to determine whether in raid56
write/recover operation, because we didn't't have bbio->map_type.

Now we have direct way for this condition, rid of using
the function-relative data, and make the code more readable.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:43:45 -08:00
Zhao Lei
94a97dfeb6 btrfs: Small cleanup for get index_srcdev loop
1: Adjust condition in loop to make less TAB
2: Move btrfs_put_bbio()'s line for combine, and makes logic clean.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:43:40 -08:00
Qu Wenruo
f04b772bfc btrfs: Enhance chunk validation check
Enhance chunk validation:
1) Num_stripes
   We already have such check but it's only in super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can found some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be power of 2.

5) Chunk type
   Any bit out of TYPE_MAS | PROFILE_MASK is invalid.

With all these much restrict rules, several fuzzed image reported in
mail list should no longer cause kernel panic.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:41 -08:00
Qu Wenruo
319e4d0661 btrfs: Enhance super validation check
Enhance btrfs_check_super_valid() function by the following points:
1) Restrict sector/node size check
   Not the old max/min valid check, but also check if it's a power of 2.
   So some bogus number like 12K node size won't pass now.

2) Super flag check
   For now, there is still some inconsistency between kernel and
   btrfs-progs super flags.
   And considering btrfs-progs may add new flags for super block, this
   check will only output warning.

3) Better root alignment check
   Now root bytenr is checked against sector size.

4) Move some check into btrfs_check_super_valid().
   Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
   And magic number check.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:41 -08:00
Filipe Manana
c2d6cb1636 Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:

[  886.399989] =============================================
[  886.400871] [ INFO: possible recursive locking detected ]
[  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[  886.402384] ---------------------------------------------
[  886.403182] fio/8277 is trying to acquire lock:
[  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.403568]
[  886.403568] but task is already holding lock:
[  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.403568]
[  886.403568] other info that might help us debug this:
[  886.403568]  Possible unsafe locking scenario:
[  886.403568]
[  886.403568]        CPU0
[  886.403568]        ----
[  886.403568]   lock(&fs_info->delayed_iput_sem);
[  886.403568]   lock(&fs_info->delayed_iput_sem);
[  886.403568]
[  886.403568]  *** DEADLOCK ***
[  886.403568]
[  886.403568]  May be due to missing lock nesting notation
[  886.403568]
[  886.403568] 3 locks held by fio/8277:
[  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.403568]
[  886.403568] stack backtrace:
[  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[  886.403568] Call Trace:
[  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
[  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
[  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
[  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
[  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
[  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
[  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
[  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
[  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
[ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
[ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires multiple times a
semaphore it should invoke down_read_nested() with a different lockdep
class for each level of recursion.

Fix this by simplifying the implementation and use a mutex instead that
is acquired by the cleaner kthread before it runs the delayed iputs
instead of always acquiring a semaphore before delayed references are
run from anywhere.

Fixes: d7c151717a (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org   # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:41 -08:00
Filipe Manana
fedc00455c Btrfs: fix typo in log message when starting a balance
The recent change titled "Btrfs: Check metadata redundancy on balance"
(already in linux-next) left a typo in a message for users:
metatdata -> metadata.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:40 -08:00
Chris Mason
326f784281 Merge branch 'misc-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-19 18:21:30 -08:00
Chris Mason
acc308556c Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-19 18:21:00 -08:00
Colin Ian King
fb75d857a3 btrfs: remove duplicate const specifier
duplicate const is redundant so remove it

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-19 10:33:56 +01:00
Linus Torvalds
c1a198d923 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has our usual assortment of fixes and cleanups, but the biggest
  change included is Omar Sandoval's free space tree.  It's not the
  default yet, mounting -o space_cache=v2 enables it and sets a readonly
  compat bit.  The tree can actually be deleted and regenerated if there
  are any problems, but it has held up really well in testing so far.

  For very large filesystems (30T+) our existing free space caching code
  can end up taking a huge amount of time during commits.  The new tree
  based code is faster and less work overall to update as the commit
  progresses.

  Omar worked on this during the summer and we'll hammer on it in
  production here at FB over the next few months"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
  Btrfs: fix fitrim discarding device area reserved for boot loader's use
  Btrfs: Check metadata redundancy on balance
  btrfs: statfs: report zero available if metadata are exhausted
  btrfs: preallocate path for snapshot creation at ioctl time
  btrfs: allocate root item at snapshot ioctl time
  btrfs: do an allocation earlier during snapshot creation
  btrfs: use smaller type for btrfs_path locks
  btrfs: use smaller type for btrfs_path lowest_level
  btrfs: use smaller type for btrfs_path reada
  btrfs: cleanup, use enum values for btrfs_path reada
  btrfs: constify static arrays
  btrfs: constify remaining structs with function pointers
  btrfs tests: replace whole ops structure for free space tests
  btrfs: use list_for_each_entry* in backref.c
  btrfs: use list_for_each_entry_safe in free-space-cache.c
  btrfs: use list_for_each_entry* in check-integrity.c
  Btrfs: use linux/sizes.h to represent constants
  btrfs: cleanup, remove stray return statements
  btrfs: zero out delayed node upon allocation
  btrfs: pass proper enum type to start_transaction()
  ...
2016-01-18 12:44:40 -08:00
Sebastian Andrzej Siewior
546bed6312 btrfs: initialize the seq counter in struct btrfs_device
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<80307cec>] (dump_stack)
| [<80070e98>] (__lock_acquire)
| [<8007184c>] (lock_acquire)
| [<80287800>] (btrfs_ioctl)
| [<8012a8d4>] (do_vfs_ioctl)
| [<8012ac14>] (SyS_ioctl)

so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.

Fixes: 7cc8e58d53 ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:28:43 +01:00
Dan Carpenter
0dc924c5f2 Btrfs: clean up an error code in btrfs_init_space_info()
If we return 1 here, then the caller treats it as an error and returns
-EINVAL.  It causes a static checker warning to treat positive returns
as an error.

Fixes: 1aba86d67f ('Btrfs: fix easily get into ENOSPC in mixed case')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:27:28 +01:00
Geliang Tang
8e217858ee btrfs: fix iterator with update error in backref.c
Fix the following error:

fs/btrfs/backref.c:565:1-20: iterator with update on line 577

Fixes: a7ca422('btrfs: use list_for_each_entry* in backref.c')
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:27:18 +01:00
Tsutomu Itoh
b7c47bbb2d Btrfs: fix output of compression message in btrfs_parse_options()
The compression message might not be correctly output.
Fix it.

[[before fix]]

# mount -o compress /dev/sdb3 /test3
[  996.874264] BTRFS info (device sdb3): disk space caching is enabled
[  996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

[[after fix]]

# mount -o compress /dev/sdb3 /test3
[  401.021753] BTRFS info (device sdb3): use zlib compression
[  401.021758] BTRFS info (device sdb3): disk space caching is enabled
[  401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress-force /dev/sdb3 /test3
[  439.824624] BTRFS info (device sdb3): force zlib compression
[  439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress /dev/sdb3 /test3
[  459.918430] BTRFS info (device sdb3): use zlib compression
[  459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:25:36 +01:00
Chandan Rajendra
f32e48e925 Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
The following call trace is seen when btrfs/031 test is executed in a loop,

[  158.661848] ------------[ cut here ]------------
[  158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
[  158.664102] BTRFS: Transaction aborted (error -2)
[  158.664774] Modules linked in:
[  158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2
[  158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  158.667392]  ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
[  158.668515]  ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
[  158.669647]  ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
[  158.670769] Call Trace:
[  158.671153]  [<ffffffff81431fc8>] dump_stack+0x44/0x5c
[  158.671884]  [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
[  158.672769]  [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
[  158.673620]  [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
[  158.674440]  [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
[  158.675376]  [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
[  158.676235]  [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
[  158.677268]  [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
[  158.678183]  [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
[  158.678975]  [<ffffffff81144b8e>] ? vma_merge+0xee/0x300
[  158.679751]  [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
[  158.680599]  [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
[  158.681686]  [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
[  158.682581]  [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
[  158.683399]  [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
[  158.684297]  [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
[  158.685051]  [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[  158.685958] ---[ end trace 4b63312de5a2cb76 ]---
[  158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
[  158.709508] BTRFS info (device loop0): forced readonly
[  158.737113] BTRFS info (device loop0): disk space caching is enabled
[  158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
[  158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30

This occurs because,

Mount filesystem
Create subvol with ID 257
Unmount filesystem
Mount filesystem
Delete subvol with ID 257
  btrfs_drop_snapshot()
    Add root corresponding to subvol 257 into
    btrfs_transaction->dropped_roots list
Create new subvol (i.e. create_subvol())
  257 is returned as the next free objectid
  btrfs_read_fs_root_no_name()
    Finds the btrfs_root instance corresponding to the old subvol with ID 257
    in btrfs_fs_info->fs_roots_radix.
    Returns error since btrfs_root_item->refs has the value of 0.

To fix the issue the commit initializes tree root's and subvolume root's
highest_objectid when loading the roots from disk.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:25:02 +01:00
Jeff Mahoney
95617d6932 btrfs: cleanup, stop casting for extent_map->lookup everywhere
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:22:28 +01:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Linus Torvalds
33caf82acf Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of stuff.  That probably should've been 5 or 6 separate
  branches, but by the time I'd realized how large and mixed that bag
  had become it had been too close to -final to play with rebasing.

  Some fs/namei.c cleanups there, memdup_user_nul() introduction and
  switching open-coded instances, burying long-dead code, whack-a-mole
  of various kinds, several new helpers for ->llseek(), assorted
  cleanups and fixes from various people, etc.

  One piece probably deserves special mention - Neil's
  lookup_one_len_unlocked().  Similar to lookup_one_len(), but gets
  called without ->i_mutex and tries to avoid ever taking it.  That, of
  course, means that it's not useful for any directory modifications,
  but things like getting inode attributes in nfds readdirplus are fine
  with that.  I really should've asked for moratorium on lookup-related
  changes this cycle, but since I hadn't done that early enough...  I
  *am* asking for that for the coming cycle, though - I'm going to try
  and get conversion of i_mutex to rwsem with ->lookup() done under lock
  taken shared.

  There will be a patch closer to the end of the window, along the lines
  of the one Linus had posted last May - mechanical conversion of
  ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
  inode_is_locked()/inode_lock_nested().  To quote Linus back then:

    -----
    |    This is an automated patch using
    |
    |        sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    |        sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    |        sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[     ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    |        sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    |        sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    |    with a very few manual fixups
    -----

  I'm going to send that once the ->i_mutex-affecting stuff in -next
  gets mostly merged (or when Linus says he's about to stop taking
  merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  nfsd: don't hold i_mutex over userspace upcalls
  fs:affs:Replace time_t with time64_t
  fs/9p: use fscache mutex rather than spinlock
  proc: add a reschedule point in proc_readfd_common()
  logfs: constify logfs_block_ops structures
  fcntl: allow to set O_DIRECT flag on pipe
  fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
  fs: xattr: Use kvfree()
  [s390] page_to_phys() always returns a multiple of PAGE_SIZE
  nbd: use ->compat_ioctl()
  fs: use block_device name vsprintf helper
  lib/vsprintf: add %*pg format specifier
  fs: use gendisk->disk_name where possible
  poll: plug an unused argument to do_poll
  amdkfd: don't open-code memdup_user()
  cdrom: don't open-code memdup_user()
  rsxx: don't open-code memdup_user()
  mtip32xx: don't open-code memdup_user()
  [um] mconsole: don't open-code memdup_user_nul()
  [um] hostaudio: don't open-code memdup_user()
  ...
2016-01-12 17:11:47 -08:00
Linus Torvalds
fce205e9da Merge branch 'work.copy_file_range' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs copy_file_range updates from Al Viro:
 "Several series around copy_file_range/CLONE"

* 'work.copy_file_range' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: use new dedupe data function pointer
  vfs: hoist the btrfs deduplication ioctl to the vfs
  vfs: wire up compat ioctl for CLONE/CLONE_RANGE
  cifs: avoid unused variable and label
  nfsd: implement the NFSv4.2 CLONE operation
  nfsd: Pass filehandle to nfs4_preprocess_stateid_op()
  vfs: pull btrfs clone API to vfs layer
  locks: new locks_mandatory_area calling convention
  vfs: Add vfs_copy_file_range() support for pagecache copies
  btrfs: add .copy_file_range file operation
  x86: add sys_copy_file_range to syscall tables
  vfs: add copy_file_range syscall and vfs helper
2016-01-12 16:30:34 -08:00
Linus Torvalds
67c707e451 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "The main changes in this cycle were:

   - code patching and cpu_has cleanups (Borislav Petkov)

   - paravirt cleanups (Juergen Gross)

   - TSC cleanup (Thomas Gleixner)

   - ptrace cleanup (Chen Gang)"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arch/x86/kernel/ptrace.c: Remove unused arg_offs_table
  x86/mm: Align macro defines
  x86/cpu: Provide a config option to disable static_cpu_has
  x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros
  x86/cpufeature: Cleanup get_cpu_cap()
  x86/cpufeature: Move some of the scattered feature bits to x86_capability
  x86/paravirt: Remove paravirt ops pmd_update[_defer] and pte_update_defer
  x86/paravirt: Remove unused pv_apic_ops structure
  x86/tsc: Remove unused tsc_pre_init() hook
  x86: Remove unused function cpu_has_ht_siblings()
  x86/paravirt: Kill some unused patching functions
2016-01-11 16:26:03 -08:00
Linus Torvalds
ddf1d6238d Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "Andreas' xattr cleanup series.

  It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xattr handlers: Simplify list operation
  ocfs2: Replace list xattr handler operations
  nfs: Move call to security_inode_listsecurity into nfs_listxattr
  xfs: Change how listxattr generates synthetic attributes
  tmpfs: listxattr should include POSIX ACL xattrs
  tmpfs: Use xattr handler infrastructure
  btrfs: Use xattr handler infrastructure
  vfs: Distinguish between full xattr names and proper prefixes
  posix acls: Remove duplicate xattr name definitions
  gfs2: Remove gfs2_xattr_acl_chmod
  vfs: Remove vfs_xattr_cmp
2016-01-11 13:32:10 -08:00
Linus Torvalds
32fb378437 Merge branch 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs RCU symlink updates from Al Viro:
 "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
  even if the symlink is not an embedded one.

  No changes since the mailbomb on Jan 1"

* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->get_link() to delayed_call, kill ->put_link()
  kill free_page_put_link()
  teach nfs_get_link() to work in RCU mode
  teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
  teach shmem_get_link() to work in RCU mode
  teach page_get_link() to work in RCU mode
  replace ->follow_link() with new method that could stay in RCU mode
  don't put symlink bodies in pagecache into highmem
  namei: page_getlink() and page_follow_link_light() are the same thing
  ufs: get rid of ->setattr() for symlinks
  udf: don't duplicate page_symlink_inode_operations
  logfs: don't duplicate page_symlink_inode_operations
  switch befs long symlinks to page_symlink_operations
2016-01-11 13:13:23 -08:00
Chris Mason
988f1f576d Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11 08:39:28 -08:00
Chris Mason
b28cf57246 Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11 06:08:37 -08:00
Chris Mason
a3058101c1 Merge branch 'misc-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-11 05:59:32 -08:00
Filipe Manana
8cdc7c5b00 Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).

Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.

A reproducer test case for xfstests follows.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1

  # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
  # reserved for a boot loader to use (GRUB for example) and btrfs should never
  # use them - neither for allocating metadata/data nor should trim/discard them.
  # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
  $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
  $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io

  # Now mount the filesystem and perform a fitrim against it.
  _scratch_mount
  _require_batched_discard $SCRATCH_MNT
  $FSTRIM_PROG $SCRATCH_MNT

  # Now unmount the filesystem and verify the content of the ranges was not
  # modified (no trim/discard happened on them).
  _scratch_unmount
  echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
  od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
  od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV

  status=0
  exit

Reported-by: Vincent Petry  <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-07 21:16:03 +00:00
Sam Tygier
ee592d0771 Btrfs: Check metadata redundancy on balance
When converting a filesystem via balance check that metadata mode
is at least as redundant as the data mode. For example give warning
when:
-dconvert=raid1 -mconvert=single

Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:20:56 +01:00
David Sterba
ca8a51b3a9 btrfs: statfs: report zero available if metadata are exhausted
There is one ENOSPC case that's very confusing. There's Available
greater than zero but no file operation succeds (besides removing
files). This happens when the metadata are exhausted and there's no
possibility to allocate another chunk.

In this scenario it's normal that there's still some space in the data
chunk and the calculation in df reflects that in the Avail value.

To at least give some clue about the ENOSPC situation, let statfs report
zero value in Avail, even if there's still data space available.

Current:
  /dev/sdb1             4.0G  3.3G  719M  83% /mnt/test

New:
  /dev/sdb1             4.0G  3.3G     0 100% /mnt/test

We calculate the remaining metadata space minus global reserve. If this
is (supposedly) smaller than zero, there's no space. But this does not
hold in practice, the exhausted state happens where's still some
positive delta. So we apply some guesswork and compare the delta to a 4M
threshold. (Practically observed delta was 2M.)

We probably cannot calculate the exact threshold value because this
depends on the internal reservations requested by various operations, so
some operations that consume a few metadata will succeed even if the
Avail is zero. But this is better than the other way around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:20:55 +01:00
David Sterba
8546b57051 btrfs: preallocate path for snapshot creation at ioctl time
We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:20:55 +01:00
David Sterba
b0c0ea6338 btrfs: allocate root item at snapshot ioctl time
The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:20:54 +01:00
David Sterba
a1ee736268 btrfs: do an allocation earlier during snapshot creation
We can allocate pending_snapshot earlier and do not have to do cleanup
in case of failure.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:20:54 +01:00
David Sterba
4fb72bf2e9 btrfs: use smaller type for btrfs_path locks
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:

* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:

   text    data     bss     dec     hex filename
 938731   43670   23144 1005545   f57e9 fs/btrfs/btrfs.ko.before
 938203   43670   23144 1005017   f55d9 fs/btrfs/btrfs.ko.after

(and the generated assembly does not change much)

The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.

Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:17 +01:00
David Sterba
7853f15b2a btrfs: use smaller type for btrfs_path lowest_level
The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:17 +01:00
David Sterba
dccabfad20 btrfs: use smaller type for btrfs_path reada
The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:16 +01:00
David Sterba
e4058b54d1 btrfs: cleanup, use enum values for btrfs_path reada
Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942
"Btrfs: do less aggressive btree readahead" (2009-01-22).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:15 +01:00
David Sterba
4d4ab6d6bc btrfs: constify static arrays
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:15 +01:00
David Sterba
20e5506baf btrfs: constify remaining structs with function pointers
* struct extent_io_ops
* struct btrfs_free_space_op

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:14 +01:00
David Sterba
28f0779a3f btrfs tests: replace whole ops structure for free space tests
Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:14 +01:00
Geliang Tang
a7ca42256d btrfs: use list_for_each_entry* in backref.c
Use list_for_each_entry*() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:42:46 +01:00
Geliang Tang
7ae1681e12 btrfs: use list_for_each_entry_safe in free-space-cache.c
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:39:09 +01:00
Geliang Tang
b69f2bef48 btrfs: use list_for_each_entry* in check-integrity.c
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:38:42 +01:00
Byongho Lee
ee22184b53 Btrfs: use linux/sizes.h to represent constants
We use many constants to represent size and offset value.  And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'.  However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:38:02 +01:00
David Sterba
7928d672ff btrfs: cleanup, remove stray return statements
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:30:52 +01:00
Alexandru Moise
352dd9c8d3 btrfs: zero out delayed node upon allocation
It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:30:17 +01:00
Alexandru Moise
575a75d6fa btrfs: pass proper enum type to start_transaction()
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:30:00 +01:00
Alexandru Moise
9780c4976f btrfs: switch __btrfs_fs_incompat return type from int to bool
Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:29:20 +01:00
Byongho Lee
e40da0e58a btrfs: remove unused inode argument from uncompress_inline()
The inode argument is never used from the beginning, so remove it.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:29:02 +01:00
David Sterba
100d57025c btrfs: don't use slab cache for struct btrfs_delalloc_work
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.

The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.

The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
0de270fa83 btrfs: drop duplicate prefix from scrub workqueues
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
93a3d46780 btrfs: verbose error when we find an unexpected item in sys_array
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
f5cdedd73f btrfs: handle invalid num_stripes in sys_array
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.

A crafted or corrupted image crashes at mount time:

BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
 637af890 60062489 602aeb2e 604192ba
 60387961 00000011 637af8a0 6038a835
 637af9c0 6038776b 634ef32b 00000000
Call Trace:
 [<6001c86d>] show_stack+0xfe/0x15b
 [<6038a835>] dump_stack+0x2a/0x2c
 [<6038776b>] panic+0x13e/0x2b3
 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
 [<601cfbbe>] open_ctree+0x192d/0x27af
 [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
 [<600bc9a7>] mount_fs+0x11/0xf3
 [<600d5167>] vfs_kern_mount+0x75/0x11a
 [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
 [<600bc9a7>] mount_fs+0x11/0xf3
 [<600d5167>] vfs_kern_mount+0x75/0x11a
 [<600d710b>] do_mount+0xa35/0xbc9
 [<600d7557>] SyS_mount+0x95/0xc8
 [<6001e884>] handle_syscall+0x6b/0x8e

Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org	# 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
35b3ad50ba btrfs: better packing of btrfs_delayed_extent_op
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.

struct btrfs_delayed_extent_op {
	struct btrfs_disk_key      key;                  /*     0    17 */
	u8                         level;                /*    17     1 */
	bool                       update_key;           /*    18     1 */
	bool                       update_flags;         /*    19     1 */
	bool                       is_data;              /*    20     1 */

	/* XXX 3 bytes hole, try to pack */

	u64                        flags_to_set;         /*    24     8 */

	/* size: 32, cachelines: 1, members: 6 */
	/* sum members: 29, holes: 1, sum holes: 3 */
	/* last cacheline: 32 bytes */
};

The final size is 32 bytes which gives +26 object per slab page.

   text	   data	    bss	    dec	    hex	filename
 938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
 938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
8089fe62c6 btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.

The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Zhao Lei
c5ca87819d btrfs: Support convert to -d dup for btrfs-convert
Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.

This patch remove limitation of above case.

Tested by following script:
(combination of dup conversion with fsck):

export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'

do_dup_test()
{
    local m_from="$1"
    local d_from="$2"
    local m_to="$3"
    local d_to="$4"

    echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"

    umount "$TEST_DIR" &>/dev/null
    ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
    mount "$TEST_DEV" "$TEST_DIR" || return 1

    cp -a /sbin/* "$TEST_DIR"

    [[ "$m_from" != "$m_to" ]] && {
        ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
    }

    [[ "$d_from" != "$d_to" ]] && {
	local opt=()
	[[ "$d_to" == single ]] && opt+=("-f")
        ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
    }

    umount "$TEST_DIR" || return 1
    ./btrfsck "$TEST_DEV" || return 1
    echo

    return 0
}

test_all()
{
    for m_from in single dup; do
    for d_from in single dup; do
    for m_to in single dup; do
    for d_to in single dup; do
    do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
    done
    done
    done
    done
}

test_all

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Josef Bacik
be7bd73084 Btrfs: igrab inode in writepage
We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode.  We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages.  If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic.  Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Anand Jain
b2acdddfad Btrfs: add missing brelse when superblock checksum fails
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:53 +01:00
Filipe Manana
271dba4521 Btrfs: fix transaction handle leak on failure to create hard link
If we failed to create a hard link we were not always releasing the
the transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-01-06 22:52:38 +00:00
Dmitry Monakhov
a1c6f05733 fs: use block_device name vsprintf helper
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-06 13:03:18 -05:00
Darrick J. Wong
2b3909f8a7 btrfs: use new dedupe data function pointer
Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-01 02:36:40 -05:00
Filipe Manana
9269d12b2d Btrfs: fix number of transaction units required to create symlink
We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31 18:18:40 +00:00
Filipe Manana
d50866d00f Btrfs: don't leave dangling dentry if symlink creation failed
When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.

Example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt
  $ touch /mnt/foo
  $ ln -s /mnt/foo /mnt/bar
  ln: failed to create symbolic link ‘bar’: Cannot allocate memory
  $ umount /mnt
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
	unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
  found 131073 bytes used err is 1
  total csum bytes: 0
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 124305
  file data blocks allocated: 262144
   referenced 262144
  btrfs-progs v4.2.3

So fix this by adding the directory index entries as the very last
step of symlink creation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31 18:10:56 +00:00
Filipe Manana
a879719b8c Btrfs: send, don't BUG_ON() when an empty symlink is found
When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
later can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).

So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.

Reported-by: Stephen R. van den Berg <srb@cuci.nl>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31 18:08:20 +00:00
Al Viro
fceef393a5 switch ->get_link() to delayed_call, kill ->put_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-30 13:01:03 -05:00
Filipe Manana
2bc0bb5fe7 Btrfs: fix race between free space endio workers and space cache writeout
While running a stress test I ran into the following trace/transaction
abort:

[471626.672243] ------------[ cut here ]------------
[471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[471626.675492] BTRFS: Transaction aborted (error -2)
[471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
[471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[471626.691901]  0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
[471626.695009]  ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
[471626.697490]  ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
[471626.699201] Call Trace:
[471626.699804]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[471626.701049]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[471626.702542]  [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.704326]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[471626.705636]  [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[471626.707048]  [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.708616]  [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
[471626.709950]  [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
[471626.711286]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[471626.712611]  [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[471626.715610]  [<ffffffff811962a2>] ? SyS_tee+0x226/0x226
[471626.716718]  [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
[471626.717672]  [<ffffffff8116fc01>] iterate_supers+0x75/0xc2
[471626.718800]  [<ffffffff8119669a>] sys_sync+0x52/0x80
[471626.719990]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[471626.721835] ---[ end trace baf57f43d76693f4 ]---
[471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry

This is a very rare situation and it happened due to a race between a free
space endio worker and writing the space caches for dirty block groups at
a transaction's commit critical section. The steps leading to this are:

1) A task calls btrfs_commit_transaction() and starts the writeout of the
   space caches for all currently dirty block groups (i.e. it calls
   btrfs_start_dirty_block_groups());

2) The previous step starts writeback for space caches;

3) When the writeback finishes it queues jobs for free space endio work
   queue (fs_info->endio_freespace_worker) that execute
   btrfs_finish_ordered_io();

4) The task committing the transaction sets the transaction's state
   to TRANS_STATE_COMMIT_DOING and shortly after calls
   btrfs_write_dirty_block_groups();

5) A free space endio job joins the transaction, through
   btrfs_join_transaction_nolock(), and updates a free space inode item
   in the root tree through btrfs_update_inode_fallback();

6) Updating the free space inode item resulted in COWing one or more
   nodes/leaves of the root tree, and that resulted in creating a new
   metadata block group, which gets added to the transaction's list
   of dirty block groups (this is a very rare case);

7) The free space endio job has not released yet its transaction handle
   at this point, so the new metadata block group was not yet fully
   created (didn't go through btrfs_create_pending_block_groups() yet);

8) The transaction commit task sees the new metadata block group in
   the transaction's list of dirty block groups and processes it.
   When it attempts to update the block group's block group item in
   the extent tree, through write_one_cache_group(), it isn't able
   to find it and aborts the transaction with error -ENOENT - this
   is because the free space endio job hasn't yet released its
   transaction handle (which calls btrfs_create_pending_block_groups())
   and therefore the block group item was not yet added to the extent
   tree.

Fix this waiting for free space endio jobs if we fail to find a block
group item in the extent tree and then retry once updating the block
group item.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-30 16:08:13 +00:00
Chris Mason
511711af91 btrfs: don't run delayed references while we are creating the free space tree
This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.

Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.

Signed-off-by: Chris Mason <clm@fb.com>
2015-12-30 07:52:35 -08:00
Chris Mason
b4570aa994 btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.
Merging in the free space tree deleted a variable needed when
CONFIG_BTRFS_DEBUG=y

Signed-off-by: Chris Mason <clm@fb.com>
2015-12-30 07:37:26 -08:00
Chris Mason
140e639f1a btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc
map->num_stripes really can't be zero, but just in case.

Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23 13:30:51 -08:00
Chris Mason
f0f76413d3 Merge branch 'freespace-4.5' into for-linus-4.5 2015-12-23 13:29:09 -08:00
Chris Mason
a53fe25769 Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5 2015-12-23 13:28:35 -08:00
Chris Mason
bb9d687618 Merge branch 'dev/simplify-set-bit' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23 13:17:42 -08:00
Chris Mason
13d5d15d63 Merge branch 'dev/gfp-flags' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2015-12-23 13:11:27 -08:00
Chris Mason
afa427cf9d Merge branch 'cleanup/misc-simplify' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2015-12-23 13:10:26 -08:00
Filipe Manana
e44081ef61 Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups
We call btrfs_write_dirty_block_groups() in the critical section of a
transaction's commit, when no other tasks can join the transaction and
add more block groups to the transaction's list of dirty block groups,
so we not taking the dirty block groups spinlock when checking for the
list's emptyness, grabbing its first element or deleting elements from
it.

However there's a special and rare case where we can have a concurrent
task adding elements to this list. We trigger writeback for space
caches before at btrfs_start_dirty_block_groups() and in past iterations
of the loop at btrfs_write_dirty_block_groups(), this means that when
the writeback finishes (which happens asynchronously) it creates a
task for the endio free space work queue that executes
btrfs_finish_ordered_io() - this function is able to join the transaction,
through btrfs_join_transaction_nolock(), and update the free space cache's
inode item in the root tree, which can result in COWing nodes of this tree
and therefore allocation of a new block group can happen, which gets added
to the transaction's list of dirty block groups while the transaction
commit task is operating on it concurrently.

So fix this by taking the dirty block groups spinlock before doing
operations on the dirty block groups list at
btrfs_write_dirty_block_groups().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-21 17:51:22 +00:00
Borislav Petkov
362f924b64 x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros
Those are stupid and code should use static_cpu_has_safe() or
boot_cpu_has() instead. Kill the least used and unused ones.

The remaining ones need more careful inspection before a conversion can
happen. On the TODO.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de
Cc: David Sterba <dsterba@suse.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19 11:49:55 +01:00
Linus Torvalds
fc315e3e5c Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "A couple of small fixes"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: check prepare_uptodate_page() error code earlier
  Btrfs: check for empty bitmap list in setup_cluster_bitmaps
  btrfs: fix misleading warning when space cache failed to load
  Btrfs: fix transaction handle leak in balance
  Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
2015-12-18 15:35:08 -08:00
Chris Mason
f7d3d2f99e Merge branch 'freespace-tree' into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-18 11:11:10 -08:00
Filipe Manana
0376374a98 Btrfs: fix locking bugs when defragging leaves
When running fstests btrfs/070, with a higher number of fsstress
operations, I ran frequently into two different locking bugs when
defragging directories.

The first bug produced the following traces:

[133860.229792] ------------[ cut here ]------------
[133860.251062] WARNING: CPU: 2 PID: 26057 at fs/btrfs/locking.c:46 btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]()
[133860.253576] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport
[133860.282566] CPU: 2 PID: 26057 Comm: btrfs Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[133860.284393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[133860.286827]  0000000000000000 ffff880207697b78 ffffffff812566f4 0000000000000000
[133860.288341]  ffff880207697bb0 ffffffff8104d0a6 ffffffffa052d4c1 ffff880178f60e00
[133860.294219]  ffff880178f60e00 0000000000000000 00000000000000f6 ffff880207697bc0
[133860.295831] Call Trace:
[133860.306518]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[133860.307473]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[133860.308619]  [<ffffffffa052d4c1>] ? btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]
[133860.310068]  [<ffffffff8104d172>] warn_slowpath_null+0x1a/0x1c
[133860.312552]  [<ffffffffa052d4c1>] btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]
[133860.314630]  [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs]
[133860.323596]  [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs]
[133860.325233]  [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs]
[133860.332427]  [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs]
[133860.337259]  [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs]
[133860.340147]  [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[133860.344833]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[133860.346343]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.353248]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.354242]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[133860.355232]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[133860.356237]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[133860.358587]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[133860.360195]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[133860.361380]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[133860.363578]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[133860.366217] ---[ end trace 2cadb2f653437e49 ]---
[133860.367399] ------------[ cut here ]------------
[133860.368162] kernel BUG at fs/btrfs/locking.c:307!
[133860.369430] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[133860.370205] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport
[133860.370205] CPU: 2 PID: 26057 Comm: btrfs Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[133860.370205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[133860.370205] task: ffff8800aec6db40 ti: ffff880207694000 task.ti: ffff880207694000
[133860.370205] RIP: 0010:[<ffffffffa052d466>]  [<ffffffffa052d466>] btrfs_assert_tree_locked+0x10/0x14 [btrfs]
[133860.370205] RSP: 0018:ffff880207697bc0  EFLAGS: 00010246
[133860.370205] RAX: 0000000000000000 RBX: ffff880178f60e00 RCX: 0000000000000000
[133860.370205] RDX: ffff88023ec4fb50 RSI: 00000000ffffffff RDI: ffff880178f60e00
[133860.370205] RBP: ffff880207697bc0 R08: 0000000000000001 R09: 0000000000000000
[133860.370205] R10: 0000160000000000 R11: ffffffff81651000 R12: ffff880178f60e00
[133860.370205] R13: 0000000000000000 R14: 00000000000000f6 R15: ffff8801ff409000
[133860.370205] FS:  00007f763efd48c0(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[133860.370205] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[133860.370205] CR2: 0000000002158048 CR3: 000000003fd6c000 CR4: 00000000000006e0
[133860.370205] Stack:
[133860.370205]  ffff880207697bd8 ffffffffa052d4d0 0000000000000000 ffff880207697be8
[133860.370205]  ffffffffa04d5787 ffff880207697c80 ffffffffa04d99cb ffff8801ff409590
[133860.370205]  ffff880207697ca8 000000f507697c80 ffff880183c11bb8 0000000000000000
[133860.370205] Call Trace:
[133860.370205]  [<ffffffffa052d4d0>] btrfs_set_lock_blocking_rw+0x66/0xbd [btrfs]
[133860.370205]  [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs]
[133860.370205]  [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs]
[133860.370205]  [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs]
[133860.370205]  [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs]
[133860.370205]  [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs]
[133860.370205]  [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[133860.370205]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[133860.370205]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.370205]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.370205]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[133860.370205]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[133860.370205]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[133860.370205]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[133860.370205]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[133860.370205]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[133860.370205]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f

This bug happened because we assumed that by setting keep_locks to 1 in
our search path, our path after a call to btrfs_search_slot() would have
all nodes locked, which is not always true because unlock_up() (called by
btrfs_search_slot()) will unlock a node in a path if the slot of the node
below it doesn't point to the last item or beyond the last item. For
example, when the tree has a heigth of 2 and path->slots[0] has a value
smaller than btrfs_header_nritems(path->nodes[0]) - 1, the node at level 2
will be unlocked (also because lowest_unlock is set to 1 due to the fact
that the value passed as ins_len to btrfs_search_slot is 0).
This resulted in btrfs_find_next_key(), called before btrfs_realloc_node(),
to release out path and call again btrfs_search_slot(), but this time with
the cow parameter set to 0, meaning the resulting path got only read locks.
Therefore when we called btrfs_realloc_node(), with path->nodes[1] having
a read lock, it resulted in the warning and BUG_ON when calling
btrfs_set_lock_blocking() against the node, as that function expects the
node to have a write lock.

The second bug happened often when the first bug didn't happen, and made
us hang and hitting the following warning at fs/btrfs/locking.c:

   251  void btrfs_tree_lock(struct extent_buffer *eb)
   252  {
   253          WARN_ON(eb->lock_owner == current->pid);

This happened because the tree search we made at btrfs_defrag_leaves()
before calling btrfs_find_next_key() locked a leaf and all the other
nodes in the path, so btrfs_find_next_key() had no need to release the
path and make a new search (with path->lowest_level set to 1). This
made btrfs_realloc_node() attempt to write lock the same leaf again,
resulting in a hang/deadlock.

So fix these issues by calling btrfs_find_next_key() after calling
btrfs_realloc_node() and setting the search path's lowest_level to 1
to avoid the hang/deadlock when attempting to write lock the leaves
at btrfs_realloc_node().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-18 02:51:32 +00:00
Omar Sandoval
70f6d82ec7 Btrfs: add free space tree mount option
Now we can finally hook up everything so we can actually use free space
tree. The free space tree is enabled by passing the space_cache=v2 mount
option. On the first mount with the this option set, the free space tree
will be created and the FREE_SPACE_TREE read-only compat bit will be
set. Any time the filesystem is mounted from then on, we must use the
free space tree. The clear_cache option will also clear the free space
tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:47 -08:00
Omar Sandoval
1e144fb8f4 Btrfs: wire up the free space tree to the extent tree
The free space tree is updated in tandem with the extent tree. There are
only a handful of places where we need to hook in:

1. Block group creation
2. Block group deletion
3. Delayed refs (extent creation and deletion)
4. Block group caching

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:47 -08:00
Omar Sandoval
7c55ee0c4a Btrfs: add free space tree sanity tests
This tests the operations on the free space tree trying to excercise all
of the main cases for both formats. Between this and xfstests, the free
space tree should have pretty good coverage.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:47 -08:00
Omar Sandoval
a5ed918285 Btrfs: implement the free space B-tree
The free space cache has turned out to be a scalability bottleneck on
large, busy filesystems. When the cache for a lot of block groups needs
to be written out, we can get extremely long commit times; if this
happens in the critical section, things are especially bad because we
block new transactions from happening.

The main problem with the free space cache is that it has to be written
out in its entirety and is managed in an ad hoc fashion. Using a B-tree
to store free space fixes this: updates can be done as needed and we get
all of the benefits of using a B-tree: checksumming, RAID handling,
well-understood behavior.

With the free space tree, we get commit times that are about the same as
the no cache case with load times slower than the free space cache case
but still much faster than the no cache case. Free space is represented
with extents until it becomes more space-efficient to use bitmaps,
giving us similar space overhead to the free space cache.

The operations on the free space tree are: adding and removing free
space, handling the creation and deletion of block groups, and loading
the free space for a block group. We can also create the free space tree
by walking the extent tree and clear the free space tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:47 -08:00
Omar Sandoval
208acb8c72 Btrfs: introduce the free space B-tree on-disk format
The on-disk format for the free space tree is straightforward. Each
block group is represented in the free space tree by a free space info
item that stores accounting information: whether the free space for this
block group is stored as bitmaps or extents and how many extents of free
space exist for this block group (regardless of which format is being
used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
with no corresponding item, and bitmaps instead have the
FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
array of bytes.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:46 -08:00
Omar Sandoval
73fa48b674 Btrfs: refactor caching_thread()
We're also going to load the free space tree from caching_thread(), so
we should refactor some of the common code.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:46 -08:00
Omar Sandoval
1abfbcdf56 Btrfs: add helpers for read-only compat bits
We're finally going to add one of these for the free space tree, so
let's add the same nice helpers that we have for the incompat bits.
While we're add it, also add helpers to clear the bits.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:46 -08:00
Omar Sandoval
0f3312295d Btrfs: add extent buffer bitmap sanity tests
Sanity test the extent buffer bitmap operations (test, set, and clear)
against the equivalent standard kernel operations.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:46 -08:00
Omar Sandoval
3e1e8bb770 Btrfs: add extent buffer bitmap operations
These are going to be used for the free space tree bitmap items.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17 12:16:46 -08:00
Filipe Manana
f28a492878 Btrfs: fix leaking of ordered extents after direct IO write error
When doing a direct IO write, __blockdev_direct_IO() can call the
btrfs_get_blocks_direct() callback one or more times before it calls the
btrfs_submit_direct() callback. However it can fail after calling the
first callback and before calling the second callback, which is a problem
because the first one creates ordered extents and the second one is the
one that submits bios that cover the ordered extents created by the first
one. That means the ordered extents will never complete nor have any of
the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
subsequent operations (such as other direct IO writes, buffered writes or
hole punching) that lock the same IO range and lookup for ordered extents
in the range to hang forever waiting for those ordered extents because
they can not complete ever, since no bio was submitted.

Fix this by tracking a range of created ordered extents that don't have
yet corresponding bios submitted and completing the ordered extents in
the range if __blockdev_direct_IO() fails with an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17 10:59:51 +00:00
Filipe Manana
b850ae1427 Btrfs: fix deadlock between direct IO write and defrag/readpages
If readpages() (triggered by defrag or buffered reads) is called while a
direct IO write is in progress, we have a small time window where we can
deadlock, resulting in traces like the following being generated:

[84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
[84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
[84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
[84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
[84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
[84723.232085] Call Trace:
[84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
[84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
[84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
[84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
[84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
[84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
[84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
[84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
[84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
[84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
[84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
[84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
[84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
[84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
[84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
[84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
[84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
[84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
[84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
[84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
[84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
[84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
[84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
[84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
[84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
[84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
[84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
[84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
[84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
[84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[84723.279595] INFO: lockdep is turned off.
[84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
[84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
[84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
[84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
[84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
[84723.388620] Call Trace:
[84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
[84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
[84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
[84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
[84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
[84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
[84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
[84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
[84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
[84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
[84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
[84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
[84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
[84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f

Consider the following logical and physical file layout:

logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                4K                    8K                    16K

physical:   ... 12853248              12857344              1103101952   ...
                                      (= 12853248 + 4K)

Extents A and B are physically adjacent. The following diagram shows a
sequence of events that lead to the deadlock when we attempt to do a
direct IO write against the file range [4K, 16K[ and a defrag is triggered
simultaneously.

           CPU 1                                               CPU 2

 btrfs_direct_IO()

   btrfs_get_blocks_direct()
     creates ordered extent A, covering
     the 4k prealloc extent A (range [4K, 8K[)

                                                    btrfs_defrag_file()
                                                      page_cache_sync_readahead([0K, 1M[)
                                                        btrfs_readpages()
                                                          extent_readpages()

                                                            locks all pages in the file
                                                            range [0K, 128K[ through calls
                                                            to add_to_page_cache_lru()

                                                            __do_contiguous_readpages()

                                                               finds ordered extent A

                                                               waits for it to complete

   btrfs_get_blocks_direct() called again

     lock_extent_direct(range [8K, 16K[)

       finds a page in range [8K, 16K[ through
       btrfs_page_exists_in_range()

       invalidate_inode_pages2_range([8K, 16K[)

         --> tries to lock pages that are already
             locked by the task at CPU 2

         --> our task, running __blockdev_direct_IO(),
             hangs waiting to lock the pages and the
             submit bio callback, btrfs_submit_direct(),
             ends up never being called, resulting in the
             ordered extent A never completing (because a
             corresponding bio is never submitted) and
             CPU 2 will wait for it forever while holding
             the pages locked
              ---> deadlock!

Fix this by removing the page invalidation approach when attempting to
lock the range for IO from the callback btrfs_get_blocks_direct() and
falling back buffered IO. This was a rare case anyway and well behaved
applications do not mix concurrent direct IO writes with buffered reads
anyway, being a concurrent defrag the only normal case that could lead
to the deadlock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17 10:59:50 +00:00
Filipe Manana
14543774bd Btrfs: fix error path when failing to submit bio for direct IO write
Commit 61de718fce ("Btrfs: fix memory corruption on failure to submit
bio for direct IO") fixed problems with the error handling code after we
fail to submit a bio for direct IO. However there were 2 problems that it
did not address when the failure is due to memory allocation failures for
direct IO writes:

1) We considered that there could be only one ordered extent for the whole
   IO range, which is not always true, as we can have multiple;

2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
   which can make other tasks running btrfs_wait_logged_extents() hang
   forever, since they wait for that bit to be set. The general assumption
   is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
   and it precedes setting the bit BTRFS_ORDERED_COMPLETE.

Fix these issues by moving part of the btrfs_endio_direct_write() handler
into a new helper function and having that new helper function called when
we fail to allocate memory to submit the bio (and its private object) for
a direct IO write.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17 10:59:49 +00:00
Filipe Manana
7785a663c4 Btrfs: fix memory leaks after transaction is aborted
When a transaction is aborted, or its commit fails before writing the new
superblock and calling btrfs_finish_extent_commit(), we leak reference
counts on the block groups attached to the transaction's delete_bgs list,
because btrfs_finish_extent_commit() is never called for those two cases.
Fix this by dropping their references at btrfs_put_transaction(), which
is called when transactions are aborted (by making the transaction kthread
commit the transaction) or if their commits fail.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17 10:59:48 +00:00
Filipe Manana
50460e3718 Btrfs: fix race when finishing dev replace leading to transaction abort
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:

[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---

This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:

         CPU 1                                    CPU 2

 btrfs_dev_replace_finishing()

   at this point
    dev_replace->tgtdev->devid ==
    BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                               btrfs_fallocate()
                                                 btrfs_alloc_data_chunk_ondemand()
                                                   btrfs_join_transaction()
                                                     --> starts a new transaction
                                                   do_chunk_alloc()
                                                     lock fs_info->chunk_mutex
                                                       btrfs_alloc_chunk()
                                                         --> creates extent map for
                                                             the new chunk with
                                                             em->bdev->map->stripes[i]->dev->devid
                                                             == X (X > 0)
                                                         --> extent map is added to
                                                             fs_info->mapping_tree
                                                         --> initial phase of bg A
                                                             allocation completes
                                                     unlock fs_info->chunk_mutex

   lock fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree and
         replaces the device in every extent
         map's map->stripes[] with
         dev_replace->tgtdev, which still has
         an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)

                                                   btrfs_end_transaction()
                                                     btrfs_create_pending_block_groups()
                                                       --> starts final phase of
                                                           bg A creation (update device,
                                                           extent, and chunk trees, etc)
                                                       btrfs_finish_chunk_alloc()

                                                         btrfs_update_device()
                                                           --> attempts to update a device
                                                               item with ID == 0ULL
                                                               (BTRFS_DEV_REPLACE_DEVID)
                                                               which is the current ID of
                                                               bg A's
                                                               em->bdev->map->stripes[i]->dev->devid
                                                           --> doesn't find such item
                                                               returns -ENOENT
                                                           --> the device id should have been X
                                                               and not 0ULL

                                                       got -ENOENT from
                                                       btrfs_finish_chunk_alloc()
                                                       and aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the value
   of srcdev->devid, which is X (and X > 0)

   frees the srcdev

   unlock fs_info->chunk_mutex

So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.

This happened while running fstest btrfs/071.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17 10:59:46 +00:00
Chris Mason
1d3a5a82fe Merge branch 'for-chris-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4 2015-12-15 09:09:59 -08:00
Chris Mason
bb1591b4ea Btrfs: check prepare_uptodate_page() error code earlier
prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page.  But if the first call returns an error,
our page will be unlocked and its not safe to call it again.

This bug goes all the way back to 2011, and it's not something commonly
hit.

While we're here, add a more explicit check for the page being truncated
away.  The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-15 09:09:38 -08:00
Chris Mason
1b9b922a3a Btrfs: check for empty bitmap list in setup_cluster_bitmaps
Dave Jones found a warning from kasan in setup_cluster_bitmaps()

==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping:          (null)
index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G        W
4.4.0-rc3-backup-debug+ #1
 ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
 0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
 ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
 [<ffffffffa680a43e>] dump_stack+0x4b/0x6d
 [<ffffffffa62638d1>] kasan_report_error+0x501/0x520
 [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
 [<ffffffffa6263948>] kasan_report+0x58/0x60
 [<ffffffffa6814b00>] ? rb_last+0x10/0x40
 [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
 [<ffffffffa6262ead>] __asan_load8+0x5d/0x70
 [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
 [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
 [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
 [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
 [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
 [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520

Andrey noticed this was because we were doing list_first_entry on a list
that might be empty.  Rework the tests a bit so we don't do that.

Signed-off-by: Chris Mason <clm@fb.com>
Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by:  Dave Jones <dsj@fb.com>
2015-12-15 09:09:33 -08:00
Holger Hoffstätte
94356889c4 btrfs: fix misleading warning when space cache failed to load
When an inconsistent space cache is detected during loading we log a
warning that users frequently mistake as instruction to invalidate the
cache manually, even though this is not required. Fix the message to
indicate that the cache will be rebuilt automatically.

Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Acked-by: Filipe Manana <fdmanana@suse.com>
2015-12-10 11:38:08 +00:00
Filipe Manana
8a7d656f3d Btrfs: fix transaction handle leak in balance
If we fail to allocate a new data chunk, we were jumping to the error path
without release the transaction handle we got before. Fix this by always
releasing it before doing the jump.

Fixes: 2c9fe83552 ("btrfs: Fix lost-data-profile caused by balance bg")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-10 11:23:24 +00:00
Filipe Manana
348a0013d5 Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
filesysten with "-o discard":

 10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 10264  {
 (...)
 10405                  if (trimming) {
 10406                          WARN_ON(!list_empty(&block_group->bg_list));
 10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
 10408                          list_move(&block_group->bg_list,
 10409                                    &trans->transaction->deleted_bgs);
 10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
 10411                          btrfs_get_block_group(block_group);
 10412                  }
 (...)

This happens because scrub can now add back the block group to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).

The following diagram illustrates how this happens:

            CPU 1                                     CPU 2

 cleaner_kthread()
   btrfs_delete_unused_bgs()

     sees bg X in list
      fs_info->unused_bgs

     deletes bg X from list
      fs_info->unused_bgs

                                            scrub_enumerate_chunks()

                                              searches device tree using
                                              its commit root

                                              finds device extent for
                                              block group X

                                              gets block group X from the tree
                                              fs_info->block_group_cache_tree
                                              (via btrfs_lookup_block_group())

                                              sets bg X to RO (again)

                                              scrub_chunk(bg X)

                                              sets bg X back to RW mode

                                              adds bg X to the list
                                              fs_info->unused_bgs again,
                                              since it's still unused and
                                              currently not in that list

     sets bg X to RO mode

     btrfs_remove_chunk(bg X)

     --> discard is enabled and bg X
         is in the fs_info->unused_bgs
         list again so the warning is
         triggered
     --> we move it from that list into
         the transaction's delete_bgs
         list, but we can have another
         task currently manipulating
         the first list (fs_info->unused_bgs)

Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes it safe and there's not much worry for more lock contention, as this
lock is seldom used and only the cleaner kthread adds elements to the list
of deleted block groups. The warning goes away too, as this was previously
an impossible case (and would have been better a BUG_ON/ASSERT) but it's
not impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-10 11:22:38 +00:00
Al Viro
6b2553918d replace ->follow_link() with new method that could stay in RCU mode
new method: ->get_link(); replacement of ->follow_link().  The differences
are:
	* inode and dentry are passed separately
	* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
	* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted.  Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode.  That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:54 -05:00
Al Viro
21fc61c73c don't put symlink bodies in pagecache into highmem
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases.  page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:36 -05:00
Christoph Hellwig
04b38d6012 vfs: pull btrfs clone API to vfs layer
The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development.  To avoid growth of various slightly incompatible
implementations, add one to the VFS.  Note that clones are different from
file copies in several ways:

 - they are atomic vs other writers
 - they support whole file clones
 - they support 64-bit legth clones
 - they do not allow partial success (aka short writes)
 - clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure.  The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-07 23:11:33 -05:00
David Sterba
35de6db28f btrfs: make set_range_writeback return void
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
f631157276 btrfs: make extent_range_redirty_for_io return void
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
bd1fa4f0b0 btrfs: make extent_range_clear_dirty_for_io return void
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
b5227c075b btrfs: make end_extent_writepage return void
Does not return any errors, nor anything from the callgraph.  The branch
in end_bio_extent_writepage has been skipped since
5fd0204355 ("Btrfs: finish ordered extents in their own thread").

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
a9d93e1778 btrfs: make extent_clear_unlock_delalloc return void
Does not return any errors, nor anything from the callgraph.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
69ba39272c btrfs: make clear_extent_buffer_uptodate return void
Does not return any errors, nor anything from the callgraph.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
09c25a8cda btrfs: make set_extent_buffer_uptodate return void
Does not return any errors, nor anything from the callgraph.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
4db8c528cd btrfs: remove a trivial helper btrfs_set_buffer_uptodate
Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
Andreas Gruenbacher
9172abbcd3 btrfs: Use xattr handler infrastructure
Use the VFS xattr handler infrastructure and get rid of similar code in
the filesystem.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:34:14 -05:00
Andreas Gruenbacher
97d7929922 posix acls: Remove duplicate xattr name definitions
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT}
and replace them with the definitions in <include/uapi/linux/xattr.h>.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:25:17 -05:00
David Sterba
39a27ec100 btrfs: use GFP_KERNEL for xattr and acl allocations
We don't have to use GFP_NOFS in context of ACL or XATTR actions, not
possible to loop through the allocator and it's safe to fail with
ENOMEM.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:03:44 +01:00
David Sterba
61dd5ae65b btrfs: use GFP_KERNEL for allocations of workqueues
We don't have to use GFP_NOFS to allocate workqueue structures, this is
done from mount context or potentially scrub start context, safe to fail
in both cases.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:03:43 +01:00
David Sterba
8d2db7855e btrfs: use GFP_KERNEL for allocations in ioctl handlers
We don't have to use GFP_NOFS in the ioctl handlers because there's no
risk of looping through the allocators back to the filesystem. This
patch covers only allocations that are directly in the ioctl handlers.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:03:43 +01:00
David Sterba
3042460136 btrfs: remove wait from struct btrfs_delalloc_work
The value is 0 and never changes, we can propagate the value, remove
wait and the implied dead code.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:02:21 +01:00
David Sterba
651d494a67 btrfs: sink parameter wait to btrfs_alloc_delalloc_work
There's only one caller and single value, we can propagate it down to
the callee and remove the unused parameter.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:02:21 +01:00
David Sterba
87ad58c5f0 btrfs: make btrfs_close_one_device static
Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:02:21 +01:00
David Sterba
cd716d8fea btrfs: make lock_extent static inline
One call less reduces stack usage, code slightly reduced as well.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 14:44:59 +01:00
David Sterba
ff13db41f1 btrfs: drop unused parameter from lock_extent_bits
We've always passed 0. Stack usage will slightly decrease.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 14:30:40 +01:00
David Sterba
e83b1d91f8 btrfs: make clear_extent_bit helpers static inline
The funcions just wrap the clear_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
decreases:

   text	   data	    bss	    dec	    hex	filename
 938667	  43670	  23144	1005481	  f57a9	fs/btrfs/btrfs.ko.before
 939651	  43670	  23144	1006465	  f5b81	fs/btrfs/btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 14:17:30 +01:00
David Sterba
c63179556a btrfs: make set_extent_bit helpers static inline
The funcions just wrap the set_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
increases:

   text	   data	    bss	    dec	    hex	filename
 938427	  43670	  23144	1005241	  f56b9	fs/btrfs/btrfs.ko.before
 938667	  43670	  23144	1005481	  f57a9	fs/btrfs/btrfs.ko

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 14:08:11 +01:00
Zach Brown
3db11b2eec btrfs: add .copy_file_range file operation
This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int,
                 Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-01 14:00:54 -05:00
Linus Torvalds
80e0c505b2 Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has Mark Fasheh's patches to fix quota accounting during subvol
  deletion, which we've been working on for a while now.  The patch is
  pretty small but it's a key fix.

  Otherwise it's a random assortment"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix balance range usage filters in 4.4-rc
  btrfs: qgroup: account shared subtree during snapshot delete
  Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
  btrfs: qgroup: fix quota disable during rescan
  Btrfs: fix race between cleaner kthread and space cache writeout
  Btrfs: fix scrub preventing unused block groups from being deleted
  Btrfs: fix race between scrub and block group deletion
  btrfs: fix rcu warning during device replace
  btrfs: Continue replace when set_block_ro failed
  btrfs: fix clashing number of the enhanced balance usage filter
  Btrfs: fix the number of transaction units needed to remove a block group
  Btrfs: use global reserve when deleting unused block group after ENOSPC
  Btrfs: tests: checking for NULL instead of IS_ERR()
  btrfs: fix signed overflows in btrfs_sync_file
2015-11-27 15:45:45 -08:00
Holger Hoffstätte
dba72cb30b btrfs: fix balance range usage filters in 4.4-rc
There's a regression in 4.4-rc since commit bc3094673f
(btrfs: extend balance filter usage to take minimum and maximum) in that
existing (non-ranged) balance with -dusage=x no longer works; all chunks
are skipped.

After staring at the code for a while and wondering why a non-ranged
balance would even need min and max thresholds (..which then were not
set correctly, leading to the bug) I realized that the only problem
was the fact that the filter functions were named wrong, thanks to
patching copypasta. Simply renaming both functions lets the existing
btrfs-progs call balance with -dusage=x and now the non-ranged filter
function is invoked, properly using only a single chunk limit.

Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:27:33 -08:00
Mark Fasheh
82bd101b52 btrfs: qgroup: account shared subtree during snapshot delete
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:27:33 -08:00
Josef Bacik
2d9e977610 Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
The backref code will look up the fs_root we're trying to resolve our indirect
refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the ref is 0.  This isn't helpful for the qgroup stuff with snapshot delete
as it won't be able to search down the snapshot we are deleting, which will
cause us to miss roots.  So use btrfs_get_fs_root and send false for check_ref
so we can always get the root we're looking for.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:22:08 -08:00
Justin Maggard
967ef5131e btrfs: qgroup: fix quota disable during rescan
There's a race condition that leads to a NULL pointer dereference if you
disable quotas while a quota rescan is running.  To fix this, we just need
to wait for the quota rescan worker to actually exit before tearing down
the quota structures.

Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:22:08 -08:00
Filipe Manana
036a9348dc Btrfs: fix race between cleaner kthread and space cache writeout
When a block group becomes unused and the cleaner kthread is currently
running, we can end up getting the current transaction aborted with error
-ENOENT when we try to commit the transaction, leading to the following
trace:

  [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
  [59779.272594] BTRFS: Transaction aborted (error -2)
  (...)
  [59779.291137] Call Trace:
  [59779.291621]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
  [59779.292543]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
  [59779.293435]  [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
  [59779.295000]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
  [59779.296138]  [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
  [59779.297663]  [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
  [59779.299141]  [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs]
  [59779.300359]  [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs]
  [59779.301805]  [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
  [59779.302893]  [<ffffffff81196634>] sync_filesystem+0x7f/0x93
  (...)
  [59779.318186] ---[ end trace 577e2daff90da33a ]---

The following diagram illustrates a sequence of steps leading to this
problem:

       CPU 1                                             CPU 2

                           <at transaction N>

                                                        adds bg A to list
                                                        fs_info->unused_bgs

                                                        adds bg B to list
                                                        fs_info->unused_bgs

                           <transaction kthread
                            commits transaction N
                            and wakes up the
                            cleaner kthread>

  cleaner kthread
    delete_unused_bgs()

      sees bg A in list
      fs_info->unused_bgs

      btrfs_start_transaction()

                           <transaction N + 1 starts>

      deletes bg A

                                                        update_block_group(bg C)

                                                          --> adds bg C to list
                                                              fs_info->unused_bgs

      deletes bg B

      sees bg C in the list
      fs_info->unused_bgs

      btrfs_remove_chunk(bg C)
        btrfs_remove_block_group(bg C)

          --> checks if the block group
              is in a dirty list, and
              because it isn't now, it
              does nothing

          --> the block group item
              is deleted from the
              extent tree

                                                          --> adds bg C to list
                                                              transaction->dirty_bgs

                                                         some task calls
                                                         btrfs_commit_transaction(t N + 1)
                                                           commit_cowonly_roots()
                                                             btrfs_write_dirty_block_groups()
                                                               --> sees bg C in cur_trans->dirty_bgs
                                                               --> calls write_one_cache_group()
                                                                   which returns -ENOENT because
                                                                   it did not find the block group
                                                                   item in the extent tree
                                                               --> transaction aborte with -ENOENT
                                                                   because write_one_cache_group()
                                                                   returned that error

So fix this by adding a block group to the list of dirty block groups
before adding it to the list of unused block groups.

This happened on a stress test using fsstress plus concurrent calls to
fallocate 20G and truncate (releasing part of the space allocated with
fallocate).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:22:08 -08:00
Filipe Manana
758f2dfcf8 Btrfs: fix scrub preventing unused block groups from being deleted
Currently scrub can race with the cleaner kthread when the later attempts
to delete an unused block group, and the result is preventing the cleaner
kthread from ever deleting later the block group - unless the block group
becomes used and unused again. The following diagram illustrates that
race:

              CPU 1                                 CPU 2

 cleaner kthread
   btrfs_delete_unused_bgs()

     gets block group X from
     fs_info->unused_bgs and
     removes it from that list

                                             scrub_enumerate_chunks()

                                               searches device tree using
                                               its commit root

                                               finds device extent for
                                               block group X

                                               gets block group X from the tree
                                               fs_info->block_group_cache_tree
                                               (via btrfs_lookup_block_group())

                                               sets bg X to RO

     sees the block group is
     already RO and therefore
     doesn't delete it nor adds
     it back to unused list

So fix this by making scrub add the block group again to the list of
unused block groups if the block group is still unused when it finished
scrubbing it and it hasn't been removed already.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:22:08 -08:00
Filipe Manana
020d5b7366 Btrfs: fix race between scrub and block group deletion
Scrub can race with the cleaner kthread deleting block groups that are
unused (and with relocation too) leading to a failure with error -EINVAL
that gets returned to user space.

The following diagram illustrates how it happens:

              CPU 1                                 CPU 2

 cleaner kthread
   btrfs_delete_unused_bgs()

     gets block group X from
     fs_info->unused_bgs

     sets block group to RO

       btrfs_remove_chunk(bg X)

         deletes device extents

                                         scrub_enumerate_chunks()

                                           searches device tree using
                                           its commit root

                                           finds device extent for
                                           block group X

                                           gets block group X from the tree
                                           fs_info->block_group_cache_tree
                                           (via btrfs_lookup_block_group())

                                           sets bg X to RO (again)

          btrfs_remove_block_group(bg X)

            deletes block group from
            fs_info->block_group_cache_tree

            removes extent map from
            fs_info->mapping_tree

                                               scrub_chunk(offset X)

                                                 searches fs_info->mapping_tree
                                                 for extent map starting at
                                                 offset X

                                                    --> doesn't find any such
                                                        extent map
                                                    --> returns -EINVAL and scrub
                                                        errors out to userspace
                                                        with -EINVAL

Fix this by dealing with an extent map lookup failure as an indicator of
block group deletion.
Issue reproduced with fstest btrfs/071.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:51 -08:00
David Sterba
31388ab2ed btrfs: fix rcu warning during device replace
The test btrfs/011 triggers a rcu warning
Reviewed-by: Anand Jain <anand.jain@oracle.com>

===============================
[ INFO: suspicious RCU usage. ]
4.4.0-rc1-default+ #286 Tainted: G        W
-------------------------------
fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
4 locks held by btrfs/28786:

0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
2:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]

stack backtrace:
CPU: 0 PID: 28786 Comm: btrfs Tainted: G        W       4.4.0-rc1-default+ #286
Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
Call Trace:
[<ffffffff8141d47b>] dump_stack+0x4f/0x74
[<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
[<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
[<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
[<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
[<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
[<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
[<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
[<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
[<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
[<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
[<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
[<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
[<ffffffff811e3336>] ? __fget_light+0x86/0xb0
[<ffffffff811e3369>] ? __fdget+0x9/0x20
[<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
[<ffffffff811d7483>] SyS_ioctl+0x53/0x80
[<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

This is because of unprotected use of rcu_dereference in
btrfs_scratch_superblocks. We can't add rcu locks around the whole
function because we read the superblock.

The fix will use the rcu string buffer directly without the rcu locking.
Thi is safe as the device will not go away in the meantime. We're
holding the device list mutexes.

Restructuring the code to narrow down the rcu section turned out to be
impossible, we need to call filp_open (through update_dev_time) on the
buffer and this could call kmalloc/__might_sleep. We could call kstrdup
with GFP_ATOMIC but it's not absolutely necessary.

Fixes: 12b1c2637b (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:51 -08:00
Zhaolei
76a8efa171 btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
  DEV_LIST="/dev/vdd /dev/vde"
  DEV_REPLACE="/dev/vdf"

  do_test()
  {
      local mkfs_opt="$1"
      local size="$2"

      dmesg -c >/dev/null
      umount $SCRATCH_MNT &>/dev/null

      echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
      mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
      mount "${DEV_LIST[0]}" $SCRATCH_MNT

      echo -n "Writing big files"
      dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
      for ((i = 1; i <= size; i++)); do
          echo -n .
          /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
      done
      echo

      echo Start replace
      btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
          dmesg
          return 1
      }
      return 0
  }

  # Set size to value near fs size
  # for example, 1897 can trigger this bug in 2.6G device.
  #
  ./do_test "-d raid1 -m raid1" 1897

System will report replace fail with following warning in dmesg:
 [  134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
 [  135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
 [  135.543505] ------------[ cut here ]------------
 [  135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
 [  135.545276] Modules linked in:
 [  135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
 [  135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
 [  135.547798]  ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
 [  135.548774]  ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
 [  135.549774]  ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
 [  135.550758] Call Trace:
 [  135.551086]  [<ffffffff817fe7b5>] dump_stack+0x44/0x55
 [  135.551737]  [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
 [  135.552487]  [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
 [  135.553211]  [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
 [  135.554051]  [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
 [  135.554722]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
 [  135.555506]  [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
 [  135.556304]  [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
 [  135.557009]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
 [  135.557855]  [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
 [  135.558669]  [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
 [  135.559374]  [<ffffffff81202124>] SyS_ioctl+0x74/0x80
 [  135.559987]  [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
 [  135.560842] ---[ end trace 2a5c1fc3205abbdd ]---

Reason:
 When big data writen to fs, the whole free space will be allocated
 for data chunk.
 And operation as scrub need to set_block_ro(), and when there is
 only one metadata chunk in system(or other metadata chunks
 are all full), the function will try to allocate a new chunk,
 and failed because no space in device.

Fix:
 When set_block_ro failed for metadata chunk, it is not a problem
 because scrub_lock paused commit_trancaction in same time, and
 metadata are always cowed, so the on-the-fly writepages will not
 write data into same place with scrub/replace.
 Let replace continue in this case is no problem.

Tested by above script, and xfstests/011, plus 100 times xfstests/070.

Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>

Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:51 -08:00
David Sterba
da02c68989 btrfs: fix clashing number of the enhanced balance usage filter
I've accidentally picked an already used number for the enhanced usage
filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with
BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase,
no backward compatibility issues.

Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Filipe Manana
7fd01182d1 Btrfs: fix the number of transaction units needed to remove a block group
We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 + N units, where N corresponds to the
number of stripes. We were accounting only for the addition of the orphan
item (for the block group's free space cache inode) but we were not
accounting that we need to delete one block group item from the extent
tree, one free space item from the tree of tree roots and N device extent
items from the device tree.

While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible heigth (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing all the items and adding the orphan item.

However after adding the orphan item, writepages() can by called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because these items are located in
multiple trees and one of them might have a higher heigth and require
node/leaf splits at many levels, exhausting all the reserved space before
removing all the items and adding the orphan.

In the kernel 4.0 release, commit 3d84be7991 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:

    http://www.spinics.net/lists/linux-btrfs/msg46070.html

So fix this by reserving all the necessary units of metadata.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: 3d84be7991 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Filipe Manana
8eab77ff16 Btrfs: use global reserve when deleting unused block group after ENOSPC
It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Dan Carpenter
89b6c8d1e4 Btrfs: tests: checking for NULL instead of IS_ERR()
btrfs_alloc_dummy_root() return an error pointer on failure, it never
returns NULL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
David Sterba
9dcbeed4d7 btrfs: fix signed overflows in btrfs_sync_file
The calculation of range length in btrfs_sync_file leads to signed
overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.

https://forums.grsecurity.net/viewtopic.php?f=1&t=4284

The fsync call passes 0 and LLONG_MAX, the range length does not fit to
loff_t and overflows, but the value is converted to u64 so it silently
works as expected.

The minimal fix is a typecast to u64, switching functions to take
(start, end) instead of (start, len) would be more intrusive.

Coccinelle script found that there's one more opencoded calculation of
the length.

<smpl>
@@
loff_t start, end;
@@
* end - start
</smpl>

CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Linus Torvalds
e75cdf9898 Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes and cleanups from Chris Mason:
 "Some of this got cherry-picked from a github repo this week, but I
  verified the patches.

  We have three small scrub cleanups and a collection of fixes"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: Use fs_info directly in btrfs_delete_unused_bgs
  btrfs: Fix lost-data-profile caused by balance bg
  btrfs: Fix lost-data-profile caused by auto removing bg
  btrfs: Remove len argument from scrub_find_csum
  btrfs: Reduce unnecessary arguments in scrub_recheck_block
  btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
  btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
  btrfs: scrub: setup all fields for sblock_to_check
  btrfs: scrub: set error stats when tree block spanning stripes
  Btrfs: fix race when listing an inode's xattrs
  Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
  Btrfs: fix race leading to incorrect item deletion when dropping extents
  Btrfs: fix sleeping inside atomic context in qgroup rescan worker
  Btrfs: fix race waiting for qgroup rescan worker
  btrfs: qgroup: exit the rescan worker during umount
  Btrfs: fix extent accounting for partial direct IO writes
2015-11-13 16:30:29 -08:00
Zhao Lei
d5f2e33b92 btrfs: Use fs_info directly in btrfs_delete_unused_bgs
No need to use root->fs_info in btrfs_delete_unused_bgs(),
use fs_info directly instead.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:24 -08:00
Zhao Lei
2c9fe83552 btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
 (In integration-4.3 branch)

 TEST_DEV=(/dev/vdg /dev/vdh)
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 btrfs balance start -dusage=0 $TEST_DIR
 btrfs filesystem usage $TEST_DIR

 dd if=/dev/zero of="$TEST_DIR"/file count=100
 btrfs filesystem usage $TEST_DIR

Result:
 We can see "no data chunk" in first "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
    ...
 Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
    /dev/vdg      122.88MiB
    /dev/vdh      122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
    /dev/vdg        4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
    /dev/vdg        8.00MiB
    /dev/vdh        8.00MiB
 Unallocated:
    /dev/vdg        1.06GiB
    /dev/vdh        1.07GiB

 And "data chunks changed from raid1 to single" in second
 "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
    ...
 Data,single: Size:256.00MiB, Used:0.00B
    /dev/vdh      256.00MiB
 Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
    /dev/vdg      122.88MiB
    /dev/vdh      122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
    /dev/vdg        4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
    /dev/vdg        8.00MiB
    /dev/vdh        8.00MiB
 Unallocated:
    /dev/vdg        1.06GiB
    /dev/vdh      841.92MiB

Reason:
 btrfs balance delete last data chunk in case of no data in
 the filesystem, then we can see "no data chunk" by "fi usage"
 command.

 And when we do write operation to fs, the only available data
 profile is 0x0, result is all new chunks are allocated single type.

Fix:
 Allocate a data chunk explicitly to ensure we don't lose the
 raid profile for data.

Test:
 Test by above script, and confirmed the logic by debug output.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:20 -08:00
Zhao Lei
aefbe9a633 btrfs: Fix lost-data-profile caused by auto removing bg
Reproduce:
 (In integration-4.3 branch)

 TEST_DEV=(/dev/vdg /dev/vdh)
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 umount "$TEST_DEV"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 btrfs filesystem usage $TEST_DIR

We can see the data chunk changed from raid1 to single:
 # btrfs filesystem usage $TEST_DIR
 Data,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 #

Reason:
 When a empty filesystem mount with -o nospace_cache, the last
 data blockgroup will be auto-removed in umount.

 Then if we mount it again, there is no data chunk in the
 filesystem, so the only available data profile is 0x0, result
 is all new chunks are created as single type.

Fix:
 Don't auto-delete last blockgroup for a raid type.

Test:
 Test by above script, and confirmed the logic by debug output.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:16 -08:00
Zhao Lei
3b5753ec23 btrfs: Remove len argument from scrub_find_csum
It is useless.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:13 -08:00
Zhao Lei
affe4a5ae1 btrfs: Reduce unnecessary arguments in scrub_recheck_block
We don't need pass so many arguments for recheck sblock now,
this patch cleans them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:10 -08:00
Zhao Lei
ba7cf9882b btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
We can use existing scrub_checksum_data() and scrub_checksum_tree_block()
for scrub_recheck_block_checksum(), instead of write duplicated code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:06 -08:00
Zhao Lei
772d233f5d btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
We should reset sblock->xxx_error stats before calling
scrub_recheck_block_checksum().

Current code run correctly because all sblock are allocated by
k[cz]alloc(), and the error stats are not got changed.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:03 -08:00
Zhao Lei
4734b7ed79 btrfs: scrub: setup all fields for sblock_to_check
scrub_setup_recheck_block() isn't setup all necessary fields for
sblock_to_check because history reason.

So current code need more arguments in severial functions,
and more local variables, just to passing these lacked values to
necessary place.

This patch setup above fields to sblock_to_check in
scrub_setup_recheck_block(), for:
1: more cleanup for function arg, local variable
2: to make sblock_to_check complete, then we can use sblock_to_check
   without concern about some uninitialized member.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:00 -08:00
Zhao Lei
9799d2c32b btrfs: scrub: set error stats when tree block spanning stripes
It is better to show error stats to user when we found tree block
spanning stripes.

On a btrfs created by old version of btrfs-convert:
Before patch:
  # btrfs scrub start -B /dev/vdh
  scrub done for 8b342d35-2904-41ab-b3cb-2f929709cf47
          scrub started at Tue Aug 25 21:19:09 2015 and finished after 00:00:00
          total bytes scrubbed: 53.54MiB with 0 errors
  # dmesg
  ...
  [  128.711434] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
  [  128.712744] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
  ...

After patch:
  # btrfs scrub start -B /dev/vdh
  scrub done for ff7f844b-7a4e-4b1a-88a9-8252ab25be1b
          scrub started at Tue Aug 25 21:42:29 2015 and finished after 00:00:00
          total bytes scrubbed: 53.60MiB with 2 errors
          error details:
          corrected errors: 0, uncorrectable errors: 2, unverified errors: 0
  ERROR: There are uncorrectable errors.
  # dmesg
  ...omit...
  #

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:26:57 -08:00
Yaowei Bai
7cac0a8599 fs/btrfs/inode.c: remove unnecessary new_valid_dev() check
new_valid_dev() always returns 1, so the !new_valid_dev() check is not
needed.  Remove it.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09 15:11:24 -08:00
Filipe Manana
f1cd1f0b7d Btrfs: fix race when listing an inode's xattrs
When listing a inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS does it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.

If we have the following leafs:

               Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
           slot N - 2         slot N - 1              slot 0

The race illustrated by the following sequence diagram is possible:

       CPU 1                                               CPU 2

  btrfs_listxattr()

    searches for key (257 XATTR_ITEM 0)

    gets path with path->nodes[0] == leaf X
    and path->slots[0] == N

    because path->slots[0] is >=
    btrfs_header_nritems(leaf X), it calls
    btrfs_next_leaf()

    btrfs_next_leaf()
      releases the path

                                                   adds key (257 INODE_REF 666)
                                                   to the end of leaf X (slot N),
                                                   and leaf X now has N + 1 items

      searches for the key (257 INODE_REF 256),
      with path->keep_locks == 1, because that
      is the last key it saw in leaf X before
      releasing the path

      ends up at leaf X again and it verifies
      that the key (257 INODE_REF 256) is no
      longer the last key in leaf X, so it
      returns with path->nodes[0] == leaf X
      and path->slots[0] == N, pointing to
      the new item with key (257 INODE_REF 666)

    btrfs_listxattr's loop iteration sees that
    the type of the key pointed by the path is
    different from the type BTRFS_XATTR_ITEM_KEY
    and so it breaks the loop and stops looking
    for more xattr items
      --> the application doesn't get any xattr
          listed for our inode

So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-09 18:34:40 +00:00
Filipe Manana
1d512cb77b Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.

This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).

The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:

             Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
              slot N - 2         slot N - 1              slot 0

 (Note the implicit hole for inode 257 regarding the [0, 8K[ range)

       CPU 1                                         CPU 2

 run_dealloc_nocow()
   btrfs_lookup_file_extent()
     --> searches for a key with value
         (257 EXTENT_DATA 4096) in the
         fs/subvol tree
     --> returns us a path with
         path->nodes[0] == leaf X and
         path->slots[0] == N

   because path->slots[0] is >=
   btrfs_header_nritems(leaf X), it
   calls btrfs_next_leaf()

   btrfs_next_leaf()
     --> releases the path

                                              hard link added to our inode,
                                              with key (257 INODE_REF 500)
                                              added to the end of leaf X,
                                              so leaf X now has N + 1 keys

     --> searches for the key
         (257 INODE_REF 256), because
         it was the last key in leaf X
         before it released the path,
         with path->keep_locks set to 1

     --> ends up at leaf X again and
         it verifies that the key
         (257 INODE_REF 256) is no longer
         the last key in the leaf, so it
         returns with path->nodes[0] ==
         leaf X and path->slots[0] == N,
         pointing to the new item with
         key (257 INODE_REF 500)

   the loop iteration of run_dealloc_nocow()
   does not break out the loop and continues
   because the key referenced in the path
   at path->nodes[0] and path->slots[0] is
   for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
   and its offset (500) is less then our delalloc
   range's end (8192)

   the item pointed by the path, an inode reference item,
   is (incorrectly) interpreted as a file extent item and
   we get an invalid extent type, leading to the BUG_ON(1):

   if (extent_type == BTRFS_FILE_EXTENT_REG ||
      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
       (...)
   } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
       (...)
   } else {
       BUG_ON(1)
   }

The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.

So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-09 11:29:14 +00:00
Filipe Manana
aeafbf8486 Btrfs: fix race leading to incorrect item deletion when dropping extents
While running a stress test I got the following warning triggered:

  [191627.672810] ------------[ cut here ]------------
  [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
  (...)
  [191627.701485] Call Trace:
  [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
  [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
  [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
  [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
  [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
  [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
  [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
  [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
  [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
  [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
  [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
  [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
  [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
  [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
  [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
  [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
  [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
  [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
  [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
  [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
  [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
  [191627.744206] ---[ end trace bbfddacb7aaada8d ]---

  $ cat -n fs/btrfs/file.c
  691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
  (...)
  758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
  759                  if (key.objectid > ino ||
  760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
  761                          break;
  762
  763                  fi = btrfs_item_ptr(leaf, path->slots[0],
  764                                      struct btrfs_file_extent_item);
  765                  extent_type = btrfs_file_extent_type(leaf, fi);
  766
  767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
  768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
  (...)
  774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
  (...)
  778                  } else {
  779                          WARN_ON(1);
  780                          extent_end = search_start;
  781                  }
  (...)

This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:

               Leaf X (has N items)                    Leaf Y

[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
          slot N - 2         slot N - 1              slot 0

Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:

          CPU 1                                       CPU 2

  btrfs_finish_ordered_io()
    insert_reserved_file_extent()
      __btrfs_drop_extents()
         Searches for the key
          (257 EXTENT_DATA 4096) through
          btrfs_lookup_file_extent()

         Key not found and we get a path where
         path->nodes[0] == leaf X and
         path->slots[0] == N

         Because path->slots[0] is >=
         btrfs_header_nritems(leaf X), we call
         btrfs_next_leaf()

         btrfs_next_leaf() releases the path

                                                  inserts key
                                                  (257 INODE_REF 4096)
                                                  at the end of leaf X,
                                                  leaf X now has N + 1 keys,
                                                  and the new key is at
                                                  slot N

         btrfs_next_leaf() searches for
         key (257 INODE_REF 256), with
         path->keep_locks set to 1,
         because it was the last key it
         saw in leaf X

           finds it in leaf X again and
           notices it's no longer the last
           key of the leaf, so it returns 0
           with path->nodes[0] == leaf X and
           path->slots[0] == N (which is now
           < btrfs_header_nritems(leaf X)),
           pointing to the new key
           (257 INODE_REF 4096)

         __btrfs_drop_extents() casts the
         item at path->nodes[0], slot
         path->slots[0], to a struct
         btrfs_file_extent_item - it does
         not skip keys for the target
         inode with a type less than
         BTRFS_EXTENT_DATA_KEY
         (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)

         sees a bogus value for the type
         field triggering the WARN_ON in
         the trace shown above, and sets
         extent_end = search_start (4096)

         does the if-then-else logic to
         fixup 0 length extent items created
         by a past bug from hole punching:

           if (extent_end == key.offset &&
               extent_end >= search_start)
               goto delete_extent_item;

         that evaluates to true and it ends
         up deleting the key pointed to by
         path->slots[0], (257 INODE_REF 4096),
         from leaf X

The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).

So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-08 21:51:28 +00:00
Linus Torvalds
ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Michal Hocko
c62d25556b mm, fs: introduce mapping_gfp_constraint()
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track.  This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
27eb427bdc Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "We have a lot of subvolume quota improvements in here, along with big
  piles of cleanups from Dave Sterba and Anand Jain and others.

  Josef pitched in a batch of allocator fixes based on production use
  here at FB.  We found that mount -o ssd_spread greatly improved our
  performance on hardware raid5/6, but it exposed some CPU bottlenecks
  in the allocator.  These patches make a huge difference"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (100 commits)
  Btrfs: fix hole punching when using the no-holes feature
  Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
  btrfs: Fix a data space underflow warning
  btrfs: qgroup: Fix a rebase bug which will cause qgroup double free
  btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans
  btrfs: clear PF_NOFREEZE in cleaner_kthread()
  btrfs: qgroup: Don't copy extent buffer to do qgroup rescan
  btrfs: add balance filters limits, stripes and usage to supported mask
  btrfs: extend balance filter usage to take minimum and maximum
  btrfs: add balance filter for stripes
  btrfs: extend balance filter limit to take minimum and maximum
  btrfs: fix use after free iterating extrefs
  btrfs: check unsupported filters in balance arguments
  Btrfs: fix regression running delayed references when using qgroups
  Btrfs: fix regression when running delayed references
  Btrfs: don't do extra bitmap search in one bit case
  Btrfs: keep track of largest extent in bitmaps
  Btrfs: don't keep trying to build clusters if we are fragmented
  Btrfs: cut down on loops through the allocator
  Btrfs: don't continue setting up space cache when enospc
  ...
2015-11-06 17:17:13 -08:00
Filipe Manana
3b2ba7b31d Btrfs: fix sleeping inside atomic context in qgroup rescan worker
We are holding a btree path with spinning locks and then we attempt to
clone an extent buffer, which calls kmem_cache_alloc() and this function
can sleep, causing the following trace to be reported on a debug kernel:

[107118.218536] BUG: sleeping function called from invalid context at mm/slab.c:2871
[107118.224110] in_atomic(): 1, irqs_disabled(): 0, pid: 19148, name: kworker/u32:3
[107118.226120] INFO: lockdep is turned off.
[107118.226843] Preemption disabled at:[<ffffffffa05ffa22>] btrfs_clear_lock_blocking_rw+0x96/0xea [btrfs]

[107118.229175] CPU: 3 PID: 19148 Comm: kworker/u32:3 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[107118.231326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[107118.233687] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs]
[107118.236835]  0000000000000000 ffff880424bf3b78 ffffffff812566f4 0000000000000000
[107118.238369]  ffff880424bf3ba0 ffffffff81070664 ffffffff817f1cd5 0000000000000b37
[107118.239769]  0000000000000000 ffff880424bf3bc8 ffffffff8107070a 0000000000008850
[107118.241244] Call Trace:
[107118.241729]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[107118.242602]  [<ffffffff81070664>] ___might_sleep+0x23a/0x241
[107118.243586]  [<ffffffff8107070a>] __might_sleep+0x9f/0xa6
[107118.244532]  [<ffffffff8115af70>] cache_alloc_debugcheck_before+0x25/0x36
[107118.245939]  [<ffffffff8115d52b>] kmem_cache_alloc+0x50/0x215
[107118.246930]  [<ffffffffa05e627e>] __alloc_extent_buffer+0x2a/0x11f [btrfs]
[107118.248121]  [<ffffffffa05ecb1a>] btrfs_clone_extent_buffer+0x3d/0xdd [btrfs]
[107118.249451]  [<ffffffffa06239ea>] btrfs_qgroup_rescan_worker+0x16d/0x434 [btrfs]
[107118.250755]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[107118.251754]  [<ffffffffa05f7952>] normal_work_helper+0x14c/0x32a [btrfs]
[107118.252899]  [<ffffffffa05f7952>] ? normal_work_helper+0x14c/0x32a [btrfs]
[107118.254195]  [<ffffffffa05f7c82>] btrfs_qgroup_rescan_helper+0x12/0x14 [btrfs]
[107118.255436]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[107118.263690]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
[107118.264888]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[107118.267413]  [<ffffffff8106904d>] kthread+0xef/0xf7
[107118.268417]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[107118.269505]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[107118.270491]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24

So just use blocking locks for our path to solve this.
This fixes the patch titled:
  "btrfs: qgroup: Don't copy extent buffer to do qgroup rescan"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05 11:02:22 +00:00
Filipe Manana
190631f1c8 Btrfs: fix race waiting for qgroup rescan worker
We were initializing the completion (fs_info->qgroup_rescan_completion)
object after releasing the qgroup rescan lock, which gives a small time
window for a rescan waiter to not actually wait for the rescan worker
to finish. Example:

         CPU 1                                                     CPU 2

 fs_info->qgroup_rescan_completion->done is 0

 btrfs_qgroup_rescan_worker()
   complete_all(&fs_info->qgroup_rescan_completion)
     sets fs_info->qgroup_rescan_completion->done
     to UINT_MAX / 2

 ... do some other stuff ....

 qgroup_rescan_init()
   mutex_lock(&fs_info->qgroup_rescan_lock)
   set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
     in fs_info->qgroup_flags
   mutex_unlock(&fs_info->qgroup_rescan_lock)

                                                       btrfs_qgroup_wait_for_completion()
                                                         mutex_lock(&fs_info->qgroup_rescan_lock)
                                                         sees flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
                                                           in fs_info->qgroup_flags
                                                         mutex_unlock(&fs_info->qgroup_rescan_lock)

                                                         wait_for_completion_interruptible(
                                                           &fs_info->qgroup_rescan_completion)

                                                           fs_info->qgroup_rescan_completion->done
                                                           is > 0 so it returns immediately

  init_completion(&fs_info->qgroup_rescan_completion)
    sets fs_info->qgroup_rescan_completion->done to 0

So fix this by initializing the completion object while holding the mutex
fs_info->qgroup_rescan_lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05 10:32:21 +00:00
Justin Maggard
7343dd61fd btrfs: qgroup: exit the rescan worker during umount
I was hitting a consistent NULL pointer dereference during shutdown that
showed the trace running through end_workqueue_bio().  I traced it back to
the endio_meta_workers workqueue being poked after it had already been
destroyed.

Eventually I found that the root cause was a qgroup rescan that was still
in progress while we were stopping all the btrfs workers.

Currently we explicitly pause balance and scrub operations in
close_ctree(), but we do nothing to stop the qgroup rescan.  We should
probably be doing the same for qgroup rescan, but that's a much larger
change.  This small change is good enough to allow me to unmount without
crashing.

Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-11-05 10:32:20 +00:00
Filipe Manana
9c9464cc92 Btrfs: fix extent accounting for partial direct IO writes
When doing a write using direct IO we can end up not doing the whole write
operation using the direct IO path, in that case we fallback to a buffered
write to do the remaining IO. This happens for example if the range we are
writing to contains a compressed extent.
When we do a partial write and fallback to buffered IO, due to the
existence of a compressed extent for example, we end up not adjusting the
outstanding extents counter of our inode which ends up getting decremented
twice, once by the DIO ordered extent for the partial write and once again
by btrfs_direct_IO(), resulting in an arithmetic underflow at
extent-tree.c:drop_outstanding_extent(). For example if we have:

  extents        [ prealloc extent ] [ compressed extent ]
  offsets        A        B          C       D           E

and at the moment our inode's outstanding extents counter is 0, if we do a
direct IO write against the range [B, D[ (which has a length smaller than
128Mb), we end up bumping our inode's outstanding extents counter to 1, we
create a DIO ordered extent for the range [B, C[ and then fallback to a
buffered write for the range [C, D[. The direct IO handler
(inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
1, leaving it with a value of 0, through a call to
btrfs_delalloc_release_space() and then shortly after the DIO ordered
extent finishes and calls btrfs_delalloc_release_metadata() which ends
up to attempt to decrement the inode's outstanding extents counter by 1,
resulting in an assertion failure at drop_outstanding_extent() because
the operation would result in an arithmetic underflow (0 - 1). This
produces the following trace:

  [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
  [125471.338844] ------------[ cut here ]------------
  [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
  [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
  [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
  [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
  [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
  [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
  [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
  [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
  [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
  [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
  [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
  [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
  [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
  [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
  [125471.340745] Stack:
  [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
  [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
  [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
  [125471.340745] Call Trace:
  [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
  [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
  [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
  [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
  [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
  [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
  [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
  [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
  [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
  [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
  [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
  [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
  [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
  [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
  [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
  [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
  [125471.340745]  RSP <ffff88040a11bc78>
  [125471.539620] ---[ end trace 144259f7838b4aa4 ]---

So fix this by ensuring we adjust the outstanding extents counter when we
do the fallback just like we do for the case where the whole write can be
done through the direct IO path.

We were also adjusting the outstanding extents counter by a constant value
of 1, which is incorrect because we were ignorning that we account extents
in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.

The following test case for fstests reproduces this issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_xfs_io_command "falloc"

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create a compressed extent covering the range [700K, 800K[.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Create prealloc extent covering the range [600K, 700K[.
  $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo

  # Write 80K of data to the range [640K, 720K[ using direct IO. This
  # range covers both the prealloc extent and the compressed extent.
  # Because there's a compressed extent in the range we are writing to,
  # the DIO write code path ends up only writing the first 60k of data,
  # which goes to the prealloc extent, and then falls back to buffered IO
  # for writing the remaining 20K of data - because that remaining data
  # maps to a file range containing a compressed extent.
  # When falling back to buffered IO, we used to trigger an assertion when
  # releasing reserved space due to bad accounting of the inode's
  # outstanding extents counter, which was set to 1 but we ended up
  # decrementing it by 1 twice, once through the ordered extent for the
  # 60K of data we wrote using direct IO, and once through the main direct
  # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
  # wrote less than 80K of data (60K).
  $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Now similar test as above but for very large write operations. This
  # triggers special cases for an inode's outstanding extents accounting,
  # as internally btrfs logically splits extents into 128Mb units.
  $XFS_IO_PROG -f -s \
      -c "pwrite -S 0xaa -b 128M 258M 128M" \
      -c "falloc 0 258M" \
      $SCRATCH_MNT/bar | _filter_xfs_io
  $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
      | _filter_xfs_io

  # Now verify the file contents are correct and that they are the same
  # even after unmounting and mounting the fs again (or evicting the page
  # cache).
  #
  # For file foo, all bytes in the range [0, 640K[ must have a value of
  # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
  # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
  #
  # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
  # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
  # bytes in the range [259M, 386M[ must have a value of 0xaa.
  #
  echo "File digests before remounting the file system:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch
  md5sum $SCRATCH_MNT/bar | _filter_scratch
  _scratch_remount
  echo "File digests after remounting the file system:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch
  md5sum $SCRATCH_MNT/bar | _filter_scratch

  status=0
  exit

Fixes: e1cbbfa5f5 ("Btrfs: fix outstanding_extents accounting in DIO")
Fixes: 3e05bde8c3 ("Btrfs: only adjust outstanding_extents when we do a short write")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05 10:32:19 +00:00
Filipe Manana
2959a32a85 Btrfs: fix hole punching when using the no-holes feature
When we are using the no-holes feature, if we punch a hole into a file
range that already contains a hole which overlaps the range we are passing
to fallocate(), we end up removing the extent map that represents the
existing hole without adding a new one. This happens because with the
no-holes feature we do not have explicit extent items to represent holes
and therefore the call to __btrfs_drop_extents(), made from
btrfs_punch_hole(), returns an end offset to the variable drop_end that
is smaller than the end of the range passed to fallocate(), while it
drops all existing extent maps in that range.
Normally having a missing extent map is not a problem, for example for
a readpages() operation we just end up building the extent map by
looking at the fs/subvol tree for a matching extent item (or a lack of
one for implicit holes). However for an fsync that uses the fast path,
which needs to look at the list of modified extent maps, this means
the fsync will not record information about the complete hole we had
before the fallocate() call into the log tree, resulting in a file with
content/layout that does not match what we had neither before nor after
the hole punch operation.

The following test case for fstests reproduces the issue. It fails without
this change because we get a file with a different digest after the fsync
log replay and also with a different extent/hole layout.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
     _cleanup_flakey
     rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/punch
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_xfs_io_command "fpunch"
  _require_xfs_io_command "fiemap"
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  # This test was motivated by an issue found in btrfs when the btrfs
  # no-holes feature is enabled (introduced in kernel 3.14). So enable
  # the feature if the fs being tested is btrfs.
  if [ $FSTYP == "btrfs" ]; then
      _require_btrfs_fs_feature "no_holes"
      _require_btrfs_mkfs_feature "no-holes"
      MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
  fi

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create out test file with some data and then fsync it.
  # We do the fsync only to make sure the last fsync we do in this test
  # triggers the fast code path of btrfs' fsync implementation, a
  # condition necessary to trigger the bug btrfs had.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
                  -c "fsync"                  \
                  $SCRATCH_MNT/foobar | _filter_xfs_io

  # Now punch a hole against the range [96K, 128K[.
  $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar

  # Punch another hole against a range that overlaps the previous range
  # and ends beyond eof.
  $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar

  # Punch another hole against a range that overlaps the first range
  # ([96K, 128K[) and ends at eof.
  $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar

  # Fsync our file. We want to verify that, after a power failure and
  # mounting the filesystem again, the file content reflects all the hole
  # punch operations.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  echo "Fiemap before power failure:"
  $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap

  # Silently drop all writes and umount to simulate a crash/power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  # Must match the same digest we got before the power failure.
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  echo "Fiemap after log replay:"
  # Must match the same extent listing we got before the power failure.
  $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap

  _unmount_flakey

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03 07:44:20 -08:00
chandan
13a0db5a53 Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
and nodesize set to 64k), the following call trace is observed,

WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
Modules linked in:
CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54
task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d)
MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
CFAR: c0000000004a3b30 SOFTE: 1
GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
Call Trace:
[c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
[c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
[c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
[c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
[c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
[c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
[c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
[c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
[c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
[c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
[c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
[c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
[c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
[c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
[c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
[c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
[c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
[c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
[c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
[c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
[c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
[c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
[c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
[c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
[c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
Instruction dump:
fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c

The above call trace is seen even on x86_64; albeit very rarely and that too
with nodesize set to 64k and with nospace_cache mount option being used.

The reason for the above call trace is,
btrfs_remove_chunk
  check_system_chunk
    Allocate chunk if required
  For each physical stripe on underlying device,
    btrfs_free_dev_extent
      ...
      Take lock on Device tree's root node
      btrfs_cow_block("dev tree's root node");
        btrfs_reserve_extent
          find_free_extent
	    index = BTRFS_RAID_DUP;
	    have_caching_bg = false;

            When in LOOP_CACHING_NOWAIT state, Assume we find a block group
	    which is being cached; Hence have_caching_bg is set to true

            When repeating the search for the next RAID index, we set
	    have_caching_bg to false.

Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
allocate a chunk and try to add entries corresponding to the chunk's physical
stripe into the device tree. When doing so the task deadlocks itself waiting
for the blocking lock on the root node of the device tree.

This commit fixes the issue by introducing a new local variable to help
indicate as to whether a block group of any RAID type is being cached.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03 07:44:20 -08:00
Qu Wenruo
485290a734 btrfs: Fix a data space underflow warning
Even with quota disabled, generic/127 will trigger a kernel warning by
underflow data space info.

The bug is caused by buffered write, which in case of short copy, the
start parameter for btrfs_delalloc_release_space() is wrong, and
round_up/down() in btrfs_delalloc_release() extents the range to page
aligned, decreasing one more page than expected.

This patch will fix it by passing correct start.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03 07:44:20 -08:00
Qu Wenruo
90ce321da8 btrfs: qgroup: Fix a rebase bug which will cause qgroup double free
When rebasing my patchset, I forgot to pick up a cleanup patch to remove
old hotfix in 4.2 release.

Witouth the cleanup, it will screw up new qgroup reserve framework and
always cause minus reserved number.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:44:39 -07:00
Qu Wenruo
5846a3c268 btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans
Between btrfs_allocerved_file_extent() and
btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs
are run and delayed ref head maybe freed before
btrfs_add_delayed_qgroup_reserve().

This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT,
and cause transaction to be aborted.

This patch will record qgroup reserve space info into delayed_ref_head
at btrfs_add_delayed_ref(), to eliminate the race window.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:44:39 -07:00
Jiri Kosina
6962491321 btrfs: clear PF_NOFREEZE in cleaner_kthread()
cleaner_kthread() kthread calls try_to_freeze() at the beginning of every
cleanup attempt. This operation can't ever succeed though, as the kthread
hasn't marked itself as freezable.

Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark cleaner_kthread() freezable (as my
understanding is that it can generate filesystem I/O during suspend).

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:42:30 -07:00
Qu Wenruo
0a0e8b8938 btrfs: qgroup: Don't copy extent buffer to do qgroup rescan
Ancient qgroup code call memcpy() on a extent buffer and use it for leaf
iteration.

As extent buffer contains lock, pointers to pages, it's never sane to do
such copy.

The following bug may be caused by this insane operation:
[92098.841309] general protection fault: 0000 [#1] SMP
[92098.841338] Modules linked in: ...
[92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted
4.3.0-rc1 #1
[92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper
[btrfs]
[92098.842261] Call Trace:
[92098.842277]  [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110
[btrfs]
[92098.842304]  [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70
[btrfs]
[92098.842329]  [<ffffffffc039af3d>]
btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]

Where btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(),
called in reading key from the copied extent_buffer.

This patch will use btrfs_clone_extent_buffer() to a better copy of
extent buffer to deal such case.

Reported-by: Stephane Lesimple <stephane_btrfs@lesimple.fr>
Suggested-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:42:30 -07:00
David Sterba
b66d62ba1e btrfs: add balance filters limits, stripes and usage to supported mask
Enable the extended 'limit' syntax (a range), the new 'stripes' and
extended 'usage' syntax (a range) filters in the filters mask. The patch
comes separate and not within the series that introduced the new filters
because the patch adding the mask was merged in a late rc. The
integration branch was based on an older rc and could not merge the
patch due to the missing changes.

Prerequisities:
* btrfs: check unsupported filters in balance arguments
* btrfs: extend balance filter limit to take minimum and maximum
* btrfs: add balance filter for stripes
* btrfs: extend balance filter usage to take minimum and maximum

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:30 -07:00
David Sterba
bc3094673f btrfs: extend balance filter usage to take minimum and maximum
Similar to the 'limit' filter, we can enhance the 'usage' filter to
accept a range. The change is backward compatible, the range is applied
only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

We don't have a usecase yet, the current syntax has been sufficient. The
enhancement should provide parity with other range-like filters.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:30 -07:00
Gabríel Arthúr Pétursson
dee32d0ac3 btrfs: add balance filter for stripes
Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.

Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:29 -07:00
David Sterba
12907fc798 btrfs: extend balance filter limit to take minimum and maximum
The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.

Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMIT_RANGE).

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:28 -07:00
Chris Mason
2849a85422 btrfs: fix use after free iterating extrefs
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs.  It was trying to
get the leaf out of a path after freeing the path:

	btrfs_release_path(path);
	leaf = path->nodes[0];
	item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:28 -07:00
David Sterba
849ef9286f btrfs: check unsupported filters in balance arguments
We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.

Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:26 -07:00
Filipe Manana
b06c4bf5c8 Btrfs: fix regression running delayed references when using qgroups
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:

  commit 0ed4792af0 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")

Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:

  static int run_delayed_tree_ref(...)
  {
     (...)
     BUG_ON(node->ref_mod != 1);
     (...)
  }

This happens on a scenario like the following:

  1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.

  2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
     It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.

  3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
     It's not merged with the reference at the tail of the list of refs
     for bytenr X because the reference at the tail, Ref2 is incompatible
     due to Ref2->no_quota != Ref3->no_quota.

  4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
     It's not merged with the reference at the tail of the list of refs
     for bytenr X because the reference at the tail, Ref3 is incompatible
     due to Ref3->no_quota != Ref4->no_quota.

  5) We run delayed references, trigger merging of delayed references,
     through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().

  6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
     all other conditions are satisfied too. So Ref1 gets a ref_mod
     value of 2.

  7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
     all other conditions are satisfied too. So Ref2 gets a ref_mod
     value of 2.

  8) Ref1 and Ref2 aren't merged, because they have different values
     for their no_quota field.

  9) Delayed reference Ref1 is picked for running (select_delayed_ref()
     always prefers references with an action == BTRFS_ADD_DELAYED_REF).
     So run_delayed_tree_ref() is called for Ref1 which triggers the
     BUG_ON because Ref1->red_mod != 1 (equals 2).

So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").

The use of no_quota was also buggy in at least two places:

1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
   no_quota to 0 instead of 1 when the following condition was true:
   is_fstree(ref_root) || !fs_info->quota_enabled

2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
   reset a node's no_quota when the condition "!is_fstree(root_objectid)
   || !root->fs_info->quota_enabled" was true but we did it only in
   an unused local stack variable, that is, we never reset the no_quota
   value in the node itself.

This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.

Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).

Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:

  INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
  (...)
  Call Trace:
    [<ffffffff86999201>] schedule+0x74/0x83
    [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
    [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
    [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
    [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
    [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
    [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
    [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
    [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
    [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
    [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
    [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
    [<ffffffff863e852e>] check_ref+0x64/0xc4
    [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
    [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
    [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
    [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
  (...)

The problem goes away by eleminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).

Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org  # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-25 19:53:26 +00:00
Filipe Manana
2c3cf7d5f6 Btrfs: fix regression when running delayed references
In the kernel 4.2 merge window we had a refactoring/rework of the delayed
references implementation in order to fix certain problems with qgroups.
However that rework introduced one more regression that leads to the
following trace when running delayed references for metadata:

[35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
[35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
[35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
[35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
[35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
[35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
[35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
[35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
[35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
[35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
[35908.065201] Stack:
[35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
[35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
[35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
[35908.065201] Call Trace:
[35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
[35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
[35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
[35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
[35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
[35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
[35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
[35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
[35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
[35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
[35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
[35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
[35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201]  RSP <ffff88010c4cbb08>
[35908.310885] ---[ end trace fe4299baf0666457 ]---

This happens because the new delayed references code no longer merges
delayed references that have different sequence values. The following
steps are an example sequence leading to this issue:

1) Transaction N starts, fs_info->tree_mod_seq has value 0;

2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
   bytenr A is created, with a value of 1 and a seq value of 0;

3) fs_info->tree_mod_seq is incremented to 1;

4) Extent buffer A is deleted through btrfs_del_items(), which calls
   btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
   later returns the metadata extent associated to extent buffer A to
   the free space cache (the range is not pinned), because the extent
   buffer was created in the current transaction (N) and writeback never
   happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
   in the extent buffer).
   This creates the delayed reference Ref2 for bytenr A, with a value
   of -1 and a seq value of 1;

5) Delayed reference Ref2 is not merged with Ref1 when we create it,
   because they have different sequence numbers (decided at
   add_delayed_ref_tail_merge());

6) fs_info->tree_mod_seq is incremented to 2;

7) Some task attempts to allocate a new extent buffer (done at
   extent-tree.c:find_free_extent()), but due to heavy fragmentation
   and running low on metadata space the clustered allocation fails
   and we fall back to unclustered allocation, which finds the
   extent at offset A, so a new extent buffer at offset A is allocated.
   This creates delayed reference Ref3 for bytenr A, with a value of 1
   and a seq value of 2;

8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
   all have different seq values;

9) We start running the delayed references (__btrfs_run_delayed_refs());

10) The delayed Ref1 is the first one being applied, which ends up
    creating an inline extent backref in the extent tree;

10) Next the delayed reference Ref3 is selected for execution, and not
    Ref2, because select_delayed_ref() always gives a preference for
    positive references (that have an action of BTRFS_ADD_DELAYED_REF);

11) When running Ref3 we encounter alreay the inline extent backref
    in the extent tree at insert_inline_extent_backref(), which makes
    us hit the following BUG_ON:

        BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);

    This is always true because owner corresponds to the level of the
    extent buffer/btree node in the btree.

For the scenario described above we hit the BUG_ON because we never merge
references that have different seq values.

We used to do the merging before the 4.2 kernel, more specifically, before
the commmits:

  c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
  c43d160fcd ("btrfs: delayed-ref: Cleanup the unneeded functions.")

This issue became more exposed after the following change that was added
to 4.2 as well:

  cffc3374e5 ("Btrfs: fix order by which delayed references are run")

Which in turn fixed another regression by the two commits previously
mentioned.

So fix this by bringing back the delayed reference merge code, with the
proper adaptations so that it operates against the new data structure
(linked list vs old red black tree implementation).

This issue was hit running fstest btrfs/063 in a loop. Several people have
reported this issue in the mailing list when running on kernels 4.2+.

Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).

Fixes: c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org  # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-10-25 19:52:23 +00:00
Linus Torvalds
37902bc190 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have two more small fixes this week:

  Qu's fix avoids unneeded COW during fallocate, and Christian found a
  memory leak in the error handling of an earlier fix"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix possible leak in btrfs_ioctl_balance()
  btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
2015-10-24 07:17:58 +09:00
Chris Mason
a9e6d15356 Merge branch 'allocator-fixes' into for-linus-4.4
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 19:00:38 -07:00
Josef Bacik
0584f718ed Btrfs: don't do extra bitmap search in one bit case
When we make ctl->unit allocations from a bitmap there is no point in searching
for the next 0 in the bitmap.  If we've found a bit we're done and can just exit
the loop.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:55:41 -07:00
Josef Bacik
cef4048370 Btrfs: keep track of largest extent in bitmaps
We can waste a lot of time searching through bitmaps when we are heavily
fragmented trying to find large contiguous areas that don't exist in the bitmap.
So keep track of the max extent size when we do a full search of a bitmap so
that next time around we can just skip the expensive searching if our max size
is less than what we are looking for.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:55:40 -07:00
Josef Bacik
c759c4e161 Btrfs: don't keep trying to build clusters if we are fragmented
If we are extremely fragmented then we won't be able to create a free_cluster.
So if this happens set last_ptr->fragmented so that all future allcations will
give up trying to create a cluster.  When we unpin extents we will unset
->fragmented if we free up a sufficient amount of space in a block group.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:55:39 -07:00
Josef Bacik
a5e681d9bd Btrfs: cut down on loops through the allocator
We try really really hard to make allocations, but sometimes it is just not
going to happen, especially when free space is extremely fragmented.  So add a
few short cuts through the looping states.  For example if we couldn't allocate
a chunk, just go straight to the NO_EMPTY_SIZE loop.  If there are no uncached
block groups and we've done a full search, go straight to the ALLOC_CHUNK stage.
And finally if we already have empty_size and empty_cluster set to 0 go ahead
and return -ENOSPC.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:55:37 -07:00
Josef Bacik
2968b1f48b Btrfs: don't continue setting up space cache when enospc
If we hit ENOSPC when setting up a space cache don't bother setting up any of
the other space cache's in this transaction, it'll just induce unnecessary
latency.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:55:36 -07:00
Josef Bacik
4f4db2174d Btrfs: keep track of max_extent_size per space_info
When we are heavily fragmented we can induce a lot of latency trying to make an
allocation happen that is simply not going to happen.  Thankfully we keep track
of our max_extent_size when going through the allocator, so if we get to the
point where we are exiting find_free_extent with ENOSPC then set our
space_info->max_extent_size so we can keep future allocations from having to pay
this cost.  We reset the max_extent_size whenever we release pinned bytes back
into this space info so we can redo all the work.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:55:19 -07:00
Josef Bacik
36af4e0737 Btrfs: don't loop in allocator for space cache
The space cache needs to have contiguous allocations, and the allocator tries to
make allocations by reducing the amount of bytes requested and re-searching.
But this just makes us waste time when we are very fragmented, so if we can't
find our space just exit, don't bother trying to search again.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:46 -07:00
Josef Bacik
3204d33cda Btrfs: add a flags field to btrfs_transaction
I want to set some per transaction flags, so instead of adding yet another int
lets just convert the current two int indicators to flags and add a flags field
for future use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:45 -07:00
Josef Bacik
0b670dc44c Btrfs: fix prealloc under heavy fragmentation conditions
If we are heavily fragmented we will continually try to prealloc the largest
extent size we can every time we call btrfs_reserve_extent.  This can be very
expensive when we are heavily fragmented, burning lots of CPU cycles and loops
through the allocator.  So instead notice when we get a smaller chunk from the
allocator than what we specified and use this as the new maximum size we try to
allocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:44 -07:00
Josef Bacik
d0bd456074 Btrfs: add fragment=* debug mount option
In tracking down these weird bitmap problems it was helpful to artificially
create an extremely fragmented file system.  These mount options let us either
fragment data or metadata or both.  With these options I could reproduce all
sorts of weird latencies and hangs that occur under extreme fragmentation and
get them fixed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:43 -07:00
Josef Bacik
d9ee522ba3 Btrfs: fix qgroup sanity tests
With my changes to allow us to find old roots when resolving indirect refs I
introduced a regression to the sanity tests.  Since we don't really care to go
down into the fs roots we just need to have the old behavior of returning ENOENT
for dummy roots for the sanity tests.  In the future if we want to get fancy we
can populate the test fs trees with the references as well.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:41 -07:00
Josef Bacik
161c3549b4 Btrfs: change how we wait for pending ordered extents
We have a mechanism to make sure we don't lose updates for ordered extents that
were logged in the transaction that is currently running.  We add the ordered
extent to a transaction list and then the transaction waits on all the ordered
extents in that list.  However are substantially large file systems this list
can be extremely large, and can give us soft lockups, since the ordered extents
don't remove themselves from the list when they do complete.

To fix this we simply add a counter to the transaction that is incremented any
time we have a logged extent that needs to be completed in the current
transaction.  Then when the ordered extent finally completes it decrements the
per transaction counter and wakes up the transaction if we are the last ones.
This will eliminate the softlockup.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:40 -07:00
Qu Wenruo
56fa9d0762 btrfs: qgroup: Check if qgroup reserved space leaked
Add check at btrfs_destroy_inode() time to detect qgroup reserved space
leak.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:10 -07:00
Qu Wenruo
51773bec7e btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook
In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or invalidatepage.

So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() will lock io_tree, so it can't be called
with io_tree lock hold.

This patch will add a new function
btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease
the lockdep warning.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:09 -07:00
Qu Wenruo
14524a846e btrfs: fallocate: Add support to accurate qgroup reserve
Now fallocate will do accurate qgroup reserve space check, unlike old
method, which will always reserve the whole length of the range.

With this patch, fallocate will:
1) Iterate the desired range and mark in data rsv map
   Only range which is going to be allocated will be recorded in data
   rsv map and reserve the space.
   For already allocated range (normal/prealloc extent) they will be
   skipped.
   Also, record the marked range into a new list for later use.

2) If 1) succeeded, do real file extent allocate.
   And at file extent allocation time, corresponding range will be
   removed from the range in data rsv map.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:09 -07:00
Qu Wenruo
81fb6f77a0 btrfs: qgroup: Add new trace point for qgroup data reserve
Now each qgroup reserve for data will has its ftrace event for better
debugging.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:08 -07:00
Qu Wenruo
b9d0b38928 btrfs: Add handler for invalidate page
For btrfs_invalidatepage() and its variant evict_inode_truncate_page(),
there will be pages don't reach disk.
In that case, their reserved space won't be release nor freed by
finish_ordered_io() nor delayed_ref handler.

So we must free their qgroup reserved space, or we will leaking reserved
space again.

So this will patch will call btrfs_qgroup_free_data() for
invalidatepage() and its variant evict_inode_truncate_page().

And due to the nature of new btrfs_qgroup_reserve/free_data() reserved
space will only be reserved or freed once, so for pages which are
already flushed to disk, their reserved space will be released and freed
by delayed_ref handler.

Double free won't be a problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:07 -07:00
Qu Wenruo
94ed938aba btrfs: qgroup: Add handler for NOCOW and inline
For NOCOW and inline case, there will be no delayed_ref created for
them, so we should free their reserved data space at proper
time(finish_ordered_io for NOCOW and cow_file_inline for inline).

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:07 -07:00
Qu Wenruo
7cf5b97650 btrfs: qgroup: Cleanup old inaccurate facilities
Cleanup the old facilities which use old btrfs_qgroup_reserve() function
call, replace them with the newer version, and remove the "__" prefix in
them.

Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside qgroup codes.

Now, the whole btrfs qgroup is swithed to use the new reserve facilities.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:06 -07:00
Qu Wenruo
df480633b8 btrfs: extent-tree: Switch to new delalloc space reserve and release
Use new __btrfs_delalloc_reserve_space() and
__btrfs_delalloc_release_space() to reserve and release space for
delalloc.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:05 -07:00
Qu Wenruo
1ada3a62b5 btrfs: extent-tree: Add new version of btrfs_delalloc_reserve/release_space
Add new version of btrfs_delalloc_reserve_space() and
btrfs_delalloc_release_space() functions, which supports accurate qgroup
reserve.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:05 -07:00
Qu Wenruo
d9d8b2a51a btrfs: extent-tree: Switch to new check_data_free_space and free_reserved_data_space
Use new reserve/free for buffered write and inode cache.

For buffered write case, as nodatacow write won't increase quota account,
so unlike old behavior which does reserve before check nocow, now we
check nocow first and then only reserve data if we can't do nocow write.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:04 -07:00
Qu Wenruo
4ceff0792d btrfs: extent-tree: Add new version of btrfs_check_data_free_space and btrfs_free_reserved_data_space.
Add new functions __btrfs_check_data_free_space() and
__btrfs_free_reserved_data_space() to work with new accurate qgroup
reserved space framework.

The new function will replace old btrfs_check_data_free_space() and
btrfs_free_reserved_data_space() respectively, but until all the change
is done, let's just use the new name.

Also, export internal use function btrfs_alloc_data_chunk_ondemand(), as
now qgroup reserve requires precious bytes, some operation can't get the
accurate number in advance(like fallocate).
But data space info check and data chunk allocate doesn't need to be
that accurate, and can be called at the beginning.

So export it for later operations.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:03 -07:00
Qu Wenruo
7174109c65 btrfs: qgroup: Use new metadata reservation.
As we have the new metadata reservation functions, use them to replace
the old btrfs_qgroup_reserve() call for metadata.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:40:40 -07:00
Qu Wenruo
55eeaf0578 btrfs: qgroup: Introduce new functions to reserve/free metadata
Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free
metadata reserved space.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:47 -07:00
Qu Wenruo
297d750b9f btrfs: delayed_ref: release and free qgroup reserved at proper timing
Qgroup reserved space needs to be released from inode dirty map and get
freed at different timing:

1) Release when the metadata is written into tree
After corresponding metadata is written into tree, any newer write will
be COWed(don't include NOCOW case yet).
So we must release its range from inode dirty range map, or we will
forget to reserve needed range, causing accounting exceeding the limit.

2) Free reserved bytes when delayed ref is run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs and qgroup accounting are all done at
commit_transaction() time, we are safe to free reserved space in
run_delayed_ref time().

With these timing to release/free reserved space, we should be able to
resolve the long existing qgroup reserve space leak problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:47 -07:00
Qu Wenruo
f64d5ca868 btrfs: delayed_ref: Add new function to record reserved space into delayed ref
Add new function btrfs_add_delayed_qgroup_reserve() function to record
how much space is reserved for that extent.

As btrfs only accounts qgroup at run_delayed_refs() time, so newly
allocated extent should keep the reserved space until then.

So add needed function with related members to do it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:46 -07:00
Qu Wenruo
f695fdcef8 btrfs: qgroup: Introduce functions to release/free qgroup reserve data
space

Introduce functions btrfs_qgroup_release/free_data() to release/free
reserved data range.

Release means, just remove the data range from io_tree, but doesn't
free the reserved space.
This is for normal buffered write case, when data is written into disc
and its metadata is added into tree, its reserved space should still be
kept until commit_trans().
So in that case, we only release dirty range, but keep the reserved
space recorded some other place until commit_tran().

Free means not only remove data range, but also free reserved space.
This is used for case for cleanup and invalidate page.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:46 -07:00
Qu Wenruo
5247255370 btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
Introduce a new function, btrfs_qgroup_reserve_data(), which will use
io_tree to accurate qgroup reserve, to avoid reserved space leaking.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:45 -07:00
Qu Wenruo
fefdc55702 btrfs: extent_io: Introduce new function clear_record_extent_bits()
Introduce new function clear_record_extent_bits(), which will clear bits
for given range and record the details about which ranges are cleared
and how many bytes in total it changes.

This provides the basis for later qgroup reserve codes.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:44 -07:00
Qu Wenruo
d38ed27f04 btrfs: extent_io: Introduce new function set_record_extent_bits
Introduce new function set_record_extent_bits(), which will not only set
given bits, but also record how many bytes are changed, and detailed
range info.

This is quite important for later qgroup reserve framework.
The number of bytes will be used to do qgroup reserve, and detailed
range info will be used to cleanup for EQUOT case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:44 -07:00
Qu Wenruo
ac46777213 btrfs: extent_io: Introduce needed structure for recoding set/clear bits
Add a new structure, extent_change_set, to record how many bytes are
changed in one set/clear_extent_bits() operation, with detailed changed
ranges info.

This provides the needed facilities for later qgroup reserve framework.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:43 -07:00
Chris Mason
a408365c62 Merge branch 'integration-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4 2015-10-21 18:23:59 -07:00
Chris Mason
a0d58e48db Merge branch 'cleanups/for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-21 18:21:40 -07:00
Christian Engelmayer
0f89abf56a btrfs: fix possible leak in btrfs_ioctl_balance()
Commit 8eb934591f ("btrfs: check unsupported filters in balance
arguments") adds a jump to exit label out_bargs in case the argument
check fails. At this point in addition to the bargs memory, the
memory for struct btrfs_balance_control has already been allocated.
Ownership of bctl is passed to btrfs_balance() in the good case,
thus the memory is not freed due to the introduced jump. Make sure
that the memory gets freed in any case as necessary. Detected by
Coverity CID 1328378.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:10:02 -07:00
Luis de Bethencourt
ddd664f447 btrfs: reada: Fix returned errno code
reada is using -1 instead of the -ENOMEM defined macro to specify that
a buffer allocation failed. Since the error number is propagated, the
caller will get a -EPERM which is the wrong error condition.

Also, updating the caller to return the exact value from
reada_add_block.

Smatch tool warning:
reada_add_block() warn: returning -1 instead of -ENOMEM is sloppy

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:29:50 +02:00
Luis de Bethencourt
0b8d8ce029 btrfs: check-integrity: Fix returned errno codes
check-integrity is using -1 instead of the -ENOMEM defined macro to
specify that a buffer allocation failed. Since the error number is
propagated, the caller will get a -EPERM which is the wrong error
condition.

Also, the smatch tool complains with the following warnings:
btrfsic_process_superblock() warn: returning -1 instead of -ENOMEM is sloppy
btrfsic_read_block() warn: returning -1 instead of -ENOMEM is sloppy

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:29:44 +02:00
Byongho Lee
d91876496b btrfs: compress: put variables defined per compress type in struct to make cache friendly
Below variables are defined per compress type.
 - struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
 - spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
 - int comp_num_workspace[BTRFS_COMPRESS_TYPES]
 - atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
 - wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]

BTW, while accessing one compress type of these variables, the next or
before address is other compress types of it.
So this patch puts these variables in a struct to make cache friendly.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Byongho Lee
619ed39242 btrfs: cleanup iterating over prop_handlers array
This patch eliminates the last item of prop_handlers array which is used
to check end of array and instead uses ARRAY_SIZE macro.
Though this is a very tiny optimization, using ARRAY_SIZE macro is a
good practice to iterate array.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Geliang Tang
8cd1e73111 btrfs: fix a comment typo
Just fix a typo in the code comment.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
6e4d6fa12c btrfs: declare rsv_count as unsigned int instead of int
rsv_count ultimately gets passed to start_transaction() which
now takes an unsigned int as its num_items parameter.
The value of rsv_count should always be positive so declare it
as being unsigned.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
5aed1dd8b4 btrfs: change num_items type from u64 to unsigned int
The value of num_items that start_transaction() ultimately
always takes is a small one, so a 64 bit integer is overkill.

Also change num_items for btrfs_start_transaction() and
btrfs_start_transaction_lflush() as well.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
bdcd3c97d1 btrfs: cleanup btrfs_balance profile validity checks
Improve readability by generalizing the profile validity checks.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Shan Hai
bb78915203 btrfs/file.c: remove an unsed varialbe first_index
The commit b37392ea86 ("Btrfs: cleanup unnecessary parameter
and variant of prepare_pages()") makes it redundant.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Shan Hai <haishan.bai@hotmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Zhao Lei
9c170b2644 btrfs: use btrfs_raid_array in btrfs_reduce_alloc_profile
btrfs_raid_array[] holds attributes of all raid types.

Use btrfs_raid_array[].devs_min is best way for request
in btrfs_reduce_alloc_profile(), instead of use complex
condition of each raid types.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Zhao Lei
8789f4fe60 btrfs: use btrfs_raid_array for btrfs_get_num_tolerated_disk_barrier_failures()
btrfs_raid_array[] is used to define all raid attributes, use it
to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(),
instead of complex condition in function.

It can make code simple and auto-support other possible raid-type in
future.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Zhao Lei
af90204750 btrfs: Move btrfs_raid_array to public
This array is used to record attributes of each raid type,
make it public, and many functions will benifit with this array.

For example, num_tolerated_disk_barrier_failures(), we can
avoid complex conditions in this function, and get raid attribute
simply by accessing above array.

It can also make code logic simple, reduce duplication code, and
increase maintainability.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
e9cf439f0d btrfs: use a single if() statement for one outcome in get_block_rsv()
Rather than have three separate if() statements for the same outcome
we should just OR them together in the same if() statement.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
a099d0fdb3 btrfs: memset cur_trans->delayed_refs to zero
Use memset() to null out the btrfs_delayed_ref_root of
btrfs_transaction instead of setting all the members to 0 by hand.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Byongho Lee
568b1c9cca btrfs: remove unnecessary list_del
We can safely iterate whole list items, without using list_del macro.
So remove the list_del call.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Byongho Lee
d7641a49a5 btrfs: replace unnecessary list_for_each_entry_safe to list_for_each_entry
There is no removing list element while iterating over list.
So, replace list_for_each_entry_safe to list_for_each_entry.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
f2f767e734 btrfs: trimming some start_transaction() code away
Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc().
We're just initializing most fields to 0, false and NULL later on
_anyway_, so to make the code mode readable and potentially gain
a bit of performance (completely untested claim), we should fill our
btrfs_trans_handle with zeros on allocation then just initialize
those five remaining fields (not counting the list_heads) as normal.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
0412e58c6d btrfs: Fixed declaration of old_len
old_len is used to store the return value of btrfs_item_size_nr().
The return value of btrfs_item_size_nr() is of type u32.
To improve code correctness and avoid mixing signed and unsigned
integers I've changed old_len to be of type u32 as well.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise
ce0eac2a1d btrfs: Fixed dsize and last_off declarations
The return values of btrfs_item_offset_nr and btrfs_item_size_nr are of
type u32. To avoid mixing signed and unsigned integers we should also
declare dsize and last_off to be of type u32.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Chandan Rajendra
0d51e28a11 Btrfs: btrfs_submit_bio_hook: Use btrfs_wq_endio_type values instead of integer constants
btrfs_submit_bio_hook() uses integer constants instead of values from "enum
btrfs_wq_endio_type". Fix this.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:47 +02:00
Qu Wenruo
0f6925fa29 btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
Current code will always truncate tailing page if its alloc_start is
smaller than inode size.

For example, the file extent layout is like:
0	4K	8K	16K	32K
|<-----Extent A---------------->|
|<--Inode size: 18K---------->|

But if calling fallocate even for range [0,4K), it will cause btrfs to
re-truncate the range [16,32K), causing COW and a new extent.

0	4K	8K	16K	32K
|///////|	<- Fallocate call range
|<-----Extent A-------->|<--B-->|

The cause is quite easy, just a careless btrfs_truncate_inode() in a
else branch without extra judgment.
Fix it by add judgment on whether the fallocate range is beyond isize.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-20 19:07:29 -07:00
Filipe Manana
0305cd5f7f Btrfs: fix truncation of compressed and inlined extents
When truncating a file to a smaller size which consists of an inline
extent that is compressed, we did not discard (or made unusable) the
data between the new file size and the old file size, wasting metadata
space and allowing for the truncated data to be leaked and the data
corruption/loss mentioned below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.

Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:

   1) User A creates a file which consists of an inline and compressed
      extent with a size of 2000 bytes - the file is not accessible to
      any other users (no read, write or execution permission for anyone
      else);

   2) The user truncates the file to a size of 1000 bytes;

   3) User A makes the file world readable;

   4) User B creates a file consisting of an inline extent of 2000 bytes;

   5) User B issues a clone operation from user A's file into its own
      file (using a length argument of 0, clone the whole range);

   6) User B now gets to see the 1000 bytes that user A truncated from
      its file before it made its file world readbale. User B also lost
      the bytes in the range [1000, 2000[ bytes from its own file, but
      that might be ok if his/her intention was reading stale data from
      user A that was never supposed to be public.

Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.

This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").

So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing, if the data isn't already
in the page cache, compressing the truncated version of the extent, writing
the compressed content into the inline extent and then truncate it.

The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create our test files. File foo is going to be the source of a clone operation
  # and consists of a single inline extent with an uncompressed size of 512 bytes,
  # while file bar consists of a single inline extent with an uncompressed size of
  # 256 bytes. For our test's purpose, it's important that file bar has an inline
  # extent with a size smaller than foo's inline extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128"   \
          -c "pwrite -S 0x2a 128 384" \
          $SCRATCH_MNT/foo | _filter_xfs_io
  $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io

  # Now durably persist all metadata and data. We do this to make sure that we get
  # on disk an inline extent with a size of 512 bytes for file foo.
  sync

  # Now truncate our file foo to a smaller size. Because it consists of a
  # compressed and inline extent, btrfs did not shrink the inline extent to the
  # new size (if the extent was not compressed, btrfs would shrink it to 128
  # bytes), it only updates the inode's i_size to 128 bytes.
  $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo

  # Now clone foo's inline extent into bar.
  # This clone operation should fail with errno EOPNOTSUPP because the source
  # file consists only of an inline extent and the file's size is smaller than
  # the inline extent of the destination (128 bytes < 256 bytes). However the
  # clone ioctl was not prepared to deal with a file that has a size smaller
  # than the size of its inline extent (something that happens only for compressed
  # inline extents), resulting in copying the full inline extent from the source
  # file into the destination file.
  #
  # Note that btrfs' clone operation for inline extents consists of removing the
  # inline extent from the destination inode and copy the inline extent from the
  # source inode into the destination inode, meaning that if the destination
  # inode's inline extent is larger (N bytes) than the source inode's inline
  # extent (M bytes), some bytes (N - M bytes) will be lost from the destination
  # file. Btrfs could copy the source inline extent's data into the destination's
  # inline extent so that we would not lose any data, but that's currently not
  # done due to the complexity that would be needed to deal with such cases
  # (specially when one or both extents are compressed), returning EOPNOTSUPP, as
  # it's normally not a very common case to clone very small files (only case
  # where we get inline extents) and copying inline extents does not save any
  # space (unlike for normal, non-inlined extents).
  $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Now because the above clone operation used to succeed, and due to foo's inline
  # extent not being shinked by the truncate operation, our file bar got the whole
  # inline extent copied from foo, making us lose the last 128 bytes from bar
  # which got replaced by the bytes in range [128, 256[ from foo before foo was
  # truncated - in other words, data loss from bar and being able to read old and
  # stale data from foo that should not be possible to read anymore through normal
  # filesystem operations. Contrast with the case where we truncate a file from a
  # size N to a smaller size M, truncate it back to size N and then read the range
  # [M, N[, we should always get the value 0x00 for all the bytes in that range.

  # We expected the clone operation to fail with errno EOPNOTSUPP and therefore
  # not modify our file's bar data/metadata. So its content should be 256 bytes
  # long with all bytes having the value 0xbb.
  #
  # Without the btrfs bug fix, the clone operation succeeded and resulted in
  # leaking truncated data from foo, the bytes that belonged to its range
  # [128, 256[, and losing data from bar in that same range. So reading the
  # file gave us the following content:
  #
  # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
  # *
  # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
  # *
  # 0000400
  echo "File bar's content after the clone operation:"
  od -t x1 $SCRATCH_MNT/bar

  # Also because the foo's inline extent was not shrunk by the truncate
  # operation, btrfs' fsck, which is run by the fstests framework everytime a
  # test completes, failed reporting the following error:
  #
  #  root 5 inode 257 errors 400, nbytes wrong

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-16 21:02:53 +01:00
Linus Torvalds
6aa8ca4df0 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have two more bug fixes for btrfs.

  My commit fixes a bug we hit last week at FB, a combination of lots of
  hard links and an admin command to resolve inode numbers.

  Dave is adding checks to make sure balance on current kernels ignores
  filters it doesn't understand.  The penalty for being wrong is just
  doing more work (not crashing etc), but it's a good fix"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix use after free iterating extrefs
  btrfs: check unsupported filters in balance arguments
2015-10-16 12:55:34 -07:00
Filipe Manana
5e6ecb362b Btrfs: fix double range unlock of hole region when reading page
If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.

Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.

Fix this by leaving the unlock exclusively to the caller.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-14 04:37:00 +01:00
Filipe Manana
8039d87d9e Btrfs: fix file corruption and data loss after cloning inline extents
Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.

Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M > N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).

So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.

Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
the we got data loss.

The following test case for fstests exercises all these cases that result
in file corruption and data loss:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _require_btrfs_fs_feature "no_holes"
  _require_btrfs_mkfs_feature "no-holes"

  rm -f $seqres.full

  test_cloning_inline_extents()
  {
      local mkfs_opts=$1
      local mount_opts=$2

      _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # File bar, the source for all the following clone operations, consists
      # of a single inline extent (50 bytes).
      $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
          | _filter_xfs_io

      # Test cloning into a file with an extent (non-inlined) where the
      # destination offset overlaps that extent. It should not be possible to
      # clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo data after clone operation:"
      # All bytes should have the value 0xaa (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo
      $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io

      # Test cloning the inline extent against a file which has a hole in its
      # first 4K followed by a non-inlined extent. It should not be possible
      # as well to clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo2 data after clone operation:"
      # All bytes should have the value 0x00 (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo2
      $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io

      # Test cloning the inline extent against a file which has a size of zero
      # but has a prealloc extent. It should not be possible as well to clone
      # the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "First 50 bytes of foo3 after clone operation:"
      # Should not be able to read any bytes, file has 0 bytes i_size (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo3
      $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size not greater than the size of
      # bar's inline extent (40 < 50).
      # It should be possible to do the extent cloning from bar to this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4

      # Doing IO against any range in the first 4K of the file should work.
      echo "File foo4 data after clone operation:"
      # Must match file bar's content.
      od -t x1 $SCRATCH_MNT/foo4
      $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size greater than the size of bar's
      # inline extent (60 > 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5

      # Reading the file should not fail.
      echo "File foo5 data after clone operation:"
      # Must have a size of 60 bytes, with all bytes having a value of 0x03
      # (the clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo5

      # Test cloning the inline extent against a file which has no extents but
      # has a size greater than bar's inline extent (16K > 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6

      # Reading the file should not fail.
      echo "File foo6 data after clone operation:"
      # Must have a size of 16K, with all bytes having a value of 0x00 (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo6

      # Test cloning the inline extent against a file which has no extents but
      # has a size not greater than bar's inline extent (30 < 50).
      # It should be possible to clone the inline extent from file bar into
      # this file.
      $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7

      # Reading the file should not fail.
      echo "File foo7 data after clone operation:"
      # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
      od -t x1 $SCRATCH_MNT/foo7

      # Test cloning the inline extent against a file which has a size not
      # greater than the size of bar's inline extent (20 < 50) but has
      # a prealloc extent that goes beyond the file's size. It should not be
      # possible to clone the inline extent from bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                      -c "pwrite -S 0x88 0 20" \
                      $SCRATCH_MNT/foo8 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8

      echo "File foo8 data after clone operation:"
      # Must have a size of 20 bytes, with all bytes having a value of 0x88
      # (the clone operation did not modify our file).
      od -t x1 $SCRATCH_MNT/foo8

      _scratch_unmount
  }

  echo -e "\nTesting without compression and without the no-holes feature...\n"
  test_cloning_inline_extents

  echo -e "\nTesting with compression and without the no-holes feature...\n"
  test_cloning_inline_extents "" "-o compress"

  echo -e "\nTesting without compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" ""

  echo -e "\nTesting with compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" "-o compress"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-14 04:36:43 +01:00
Chris Mason
dc6c5fb3b5 btrfs: fix use after free iterating extrefs
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs.  It was trying to
get the leaf out of a path after freeing the path:

	btrfs_release_path(path);
	leaf = path->nodes[0];
	item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-10-13 18:54:44 -07:00
David Sterba
8eb934591f btrfs: check unsupported filters in balance arguments
We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.

Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-13 18:53:03 -07:00
Robin Ruede
b96b1db039 btrfs: fix resending received snapshot with parent
This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.

When a snapshot is received, its received_uuid is set to the original
uuid of the subvolume. When that snapshot is then resent to a third
filesystem, it's received_uuid is set to the second uuid
instead of the original one. The same was true for the parent_uuid.
This behaviour was partially changed in 37b8d27d, but in that patch
only the parent_uuid was taken from the real original,
not the uuid itself, causing the search for the parent to fail in
the case below.

This happens for example when trying to send a series of linked
snapshots (e.g. created by snapper) from the backup file system back
to the original one.

The following commands reproduce the issue in v4.2.1
(no error in 4.1.6)

    # setup three test file systems
    for i in 1 2 3; do
	    truncate -s 50M fs$i
	    mkfs.btrfs fs$i
	    mkdir $i
	    mount fs$i $i
    done
    echo "content" > 1/testfile
    btrfs su snapshot -r 1/ 1/snap1
    echo "changed content" > 1/testfile
    btrfs su snapshot -r 1/ 1/snap2

    # works fine:
    btrfs send 1/snap1 | btrfs receive 2/
    btrfs send -p 1/snap1 1/snap2 | btrfs receive 2/

    # ERROR: could not find parent subvolume
    btrfs send 2/snap1 | btrfs receive 3/
    btrfs send -p 2/snap1 2/snap2 | btrfs receive 3/

Signed-off-by: Robin Ruede <rruede+git@gmail.com>
Fixes: 37b8d27de5 ("Btrfs: use received_uuid of parent during send")
Cc: stable@vger.kernel.org # v4.2+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Ed Tomlinson <edt@aei.ca>
2015-10-13 20:04:10 +01:00
Filipe Manana
d906d49fc5 Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.

So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).

The following test case for fstests reproduces the problem.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root
  _require_cp_reflink
  _require_xfs_io_command "fpunch"

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single 100K extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
     $SCRATCH_MNT/foo | _filter_xfs_io

  # Clone our file into a new file named bar.
  cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Now overwrite parts of our foo file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
     -c "pwrite -S 0xcc 90K 10K" \
     -c "fpunch 70K 10k" \
     $SCRATCH_MNT/foo | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
     $SCRATCH_MNT/snap

  echo "File digests in the original filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap

  # Now recreate the filesystem by receiving the send stream and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap

  # We expect the destination filesystem to have exactly the same file
  # data as the original filesystem.
  # The btrfs send implementation had a bug where it sent a clone
  # operation from file foo into file bar covering the whole [0, 100K[
  # range after creating and writing the file foo. This was incorrect
  # because the file bar now included the updates done to file foo after
  # we cloned foo to bar, breaking the COW nature of reflink copies
  # (cloned extents).
  echo "File digests in the new filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  status=0
  exit

Another test case that reproduces the problem when we have compressed
extents:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root
  _require_cp_reflink

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create our file with an extent of 100K starting at file offset 0K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K"       \
                  -c "fsync"                        \
                  $SCRATCH_MNT/foo | _filter_xfs_io

  # Rewrite part of the previous extent (its first 40K) and write a new
  # 100K extent starting at file offset 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K"    \
          -c "pwrite -S 0xcc 100K 100K"      \
          $SCRATCH_MNT/foo | _filter_xfs_io

  # Our file foo now has 3 file extent items in its metadata:
  #
  # 1) One covering the file range 0 to 40K;
  # 2) One covering the file range 40K to 100K, which points to the first
  #    extent we wrote to the file and has a data offset field with value
  #    40K (our file no longer uses the first 40K of data from that
  #    extent);
  # 3) One covering the file range 100K to 200K.

  # Now clone our file foo into file bar.
  cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Create our snapshot for the send operation.
  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
          $SCRATCH_MNT/snap

  echo "File digests in the original filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap

  # Now recreate the filesystem by receiving the send stream and verify we
  # get the same file contents that the original filesystem had.
  # Btrfs send used to issue a clone operation from foo's range
  # [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
  # to by foo's second file extent item, this was incorrect because of bad
  # accounting of the file extent item's data offset field. The correct
  # range to clone from should have been [40K, 100K[.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap

  echo "File digests in the new filesystem:"
  # Must match the digests we got in the original filesystem.
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-13 01:05:27 +01:00
Chris Mason
6db4a7335d Merge branch 'fix/waitqueue-barriers' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-12 16:24:40 -07:00
Chris Mason
62fb50ab7c Merge branch 'anand/sysfs-updates-v4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-12 16:24:15 -07:00
Chris Mason
640926ffdd Merge branch 'cleanup/messages' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-12 16:22:26 -07:00
David Sterba
ee86395458 btrfs: comment the rest of implicit barriers before waitqueue_active
There are atomic operations that imply the barrier for waitqueue_active
mixed in an if-condition.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:42:00 +02:00
David Sterba
779adf0f64 btrfs: remove extra barrier before waitqueue_active
Removing barriers is scary, but a call to atomic_dec_and_test implies
a barrier, so we don't need to issue another one.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:40:33 +02:00
David Sterba
a83342aa0c btrfs: add comments to barriers before waitqueue_active
Reduce number of undocumented barriers out there.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:40:04 +02:00
David Sterba
33a9eca7e4 btrfs: comment waitqueue_active implied by locks
Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:35:10 +02:00
David Sterba
b666a9cd99 btrfs: add barrier for waitqueue_active in clear_btree_io_tree
waitqueue_active should be preceded by a barrier, in this function we
don't need to call it all the time.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:24:48 +02:00
David Sterba
730d9ec36b btrfs: remove waitqueue_active check from btrfs_rm_dev_replace_unblocked
Normally the waitqueue_active would need a barrier, but this is not
necessary here because it's not a performance sensitive context and we
can call wake_up directly.

Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:16:38 +02:00
Linus Torvalds
175d58cfed Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are small and assorted.  Neil's is the oldest, I dropped the
  ball thinking he was going to send it in"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: support NFSv2 export
  Btrfs: open_ctree: Fix possible memory leak
  Btrfs: fix deadlock when finalizing block group creation
  Btrfs: update fix for read corruption of compressed and shared extents
  Btrfs: send, fix corner case for reference overwrite detection
2015-10-09 16:39:35 -07:00
David Sterba
f14d104dbd btrfs: switch more printks to our helpers
Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 13:08:03 +02:00
David Sterba
9464732266 btrfs: switch message printers to ratelimited variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 13:04:06 +02:00
David Sterba
1dd6d7ca9d btrfs: introduce ratelimited variants of message printing functions
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:56 +02:00
David Sterba
b14af3b46f btrfs: switch message printers to ratelimited _in_rcu variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba
24aa6b41d4 btrfs: introduce ratelimited _in_rcu variants of message printing functions
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba
ecaeb14b91 btrfs: switch message printers to _in_rcu variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba
08a84e25a8 btrfs: introduce _in_rcu variants of message printing functions
Due to the missing variants there are messages that lack the information
printed by btrfs_info etc helpers.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
NeilBrown
7d35199e15 BTRFS: support NFSv2 export
The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as
that returned by encode_fh - it may be larger.

With NFSv2, the filehandle is fixed length, so it may appear longer
than expected and be zero-padded.

So we must test that fh_len is at least some value, not exactly equal
to it.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Sterba <dsterba@suse.cz>
2015-10-06 06:55:23 -07:00
chandan
e5fffbac4a Btrfs: open_ctree: Fix possible memory leak
After reading one of chunk or tree root tree's root node from disk, if the
root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release
the memory used by the root node. Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
2015-10-06 06:55:22 -07:00
Filipe Manana
d9a0540a79 Btrfs: fix deadlock when finalizing block group creation
Josef ran into a deadlock while a transaction handle was finalizing the
creation of its block groups, which produced the following trace:

  [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
  [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
  [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
  [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
  [260445.593137] Call Trace:
  [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
  [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
  [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
  [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
  [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
  [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
  [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
  [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
  [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
  [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
  [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
  [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
  [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
  [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
  [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
  [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
  [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
  [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
  [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
  [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
  [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
  [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
  [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
  [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
  [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
  [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
  [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
  [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
  [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71

This happened because the same transaction handle created a large number
of block groups and while finalizing their creation (inserting new items
and updating existing items in the chunk and device trees) a new metadata
extent had to be allocated and no free space was found in the current
metadata block groups, which made find_free_extent() attempt to allocate
a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
ended up allocating a new system chunk too and exceeded the threshold
of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
final part of block group creation again (at
btrfs_create_pending_block_groups()) and attempt to lock again the root
of the chunk tree when it's already write locked by the same task.

Similarly we can deadlock on extent tree nodes/leafs if while we are
running delayed references we end up creating a new metadata block group
in order to allocate a new node/leaf for the extent tree (as part of
a CoW operation or growing the tree), as btrfs_create_pending_block_groups
inserts items into the extent tree as well. In this case we get the
following trace:

  [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
  [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
  [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
  [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
  [14242.773606] Call Trace:
  [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
  [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
  [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
  [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
  [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
  [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
  [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
  [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
  [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
  [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
  [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
  [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
  [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
  [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
  [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
  [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
  [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
  [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
  [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
  [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
  [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
  [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
  [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
  [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
  [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
  [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
  [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
  [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
  [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
  [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
  [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by never recursing into the finalization phase of block group
creation and making sure we never trigger the finalization of block group
creation while running delayed references.

Reported-by: Josef Bacik <jbacik@fb.com>
Fixes: 00d80e342c ("Btrfs: fix quick exhaustion of the system array in the superblock")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-05 16:56:38 -07:00
Filipe Manana
808f80b467 Btrfs: update fix for read corruption of compressed and shared extents
My previous fix in commit 005efedf2c ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create our test file with a single extent of 64Kb that is going to
      # be compressed no matter which compression algo is used (zlib/lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
          $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone the compressed extent into an adjacent file offset.
      $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      echo "File digest before unmount:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch

      # Remount the fs or clear the page cache to trigger the bug in
      # btrfs. Because the extent has an uncompressed length that is a
      # multiple of 16 pages, all the pages belonging to the second range
      # of the file (64K to 128K), which points to the same extent as the
      # first range (0K to 64K), had their contents full of zeroes instead
      # of the byte 0xaa. This was a bug exclusively in the read path of
      # compressed extents, the correct data was stored on disk, btrfs
      # just failed to fill in the pages correctly.
      _scratch_remount

      echo "File digest after remount:"
      # Must match the digest we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
2015-10-05 16:56:27 -07:00
Filipe Manana
b786f16ac3 Btrfs: send, fix corner case for reference overwrite detection
When the inode given to did_overwrite_ref() matches the current progress
and has a reference that collides with the reference of other inode that
has the same number as the current progress, we were always telling our
caller that the inode's reference was overwritten, which is incorrect
because the other inode might be a new inode (different generation number)
in which case we must return false from did_overwrite_ref() so that its
callers don't use an orphanized path for the inode (as it will never be
orphanized, instead it will be unlinked and the new inode created later).

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K.
  mkdir -p $SCRATCH_MNT/foo
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  echo "File digest before being replaced:"
  md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch

  # Remove the file and then create a new one in the same location with
  # the same name but with different content. This new file ends up
  # getting the same inode number as the previous one, because that inode
  # number was the highest inode number used by the snapshot's root and
  # therefore when attempting to find the a new inode number for the new
  # file, we end up reusing the same inode number. This happens because
  # currently btrfs uses the highest inode number summed by 1 for the
  # first inode created once a snapshot's root is loaded (done at
  # fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
  # tree).
  # Having these two different files in the snapshots with the same inode
  # number (but different generation numbers) caused the btrfs send code
  # to emit an incorrect path for the file when issuing an unlink
  # operation because it failed to realize they were different files.
  rm -f $SCRATCH_MNT/mysnap2/foo/bar
  $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
      $SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
      $SCRATCH_MNT/mysnap2_ro

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
      $SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap

  echo "File digest in the original filesystem after being replaced:"
  md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  # Must match the digest from the new file.
  md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch

  status=0
  exit

Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: 8b191a6849 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-05 16:56:27 -07:00
Liu Bo
73416dab23 Btrfs: move kobj stuff out of dev_replace lock range
To avoid deadlock described in commit 084b6e7c76 ("btrfs: Fix a
lockdep warning when running xfstest."), we should move kobj stuff out
of dev_replace lock range.

  "It is because the btrfs_kobj_{add/rm}_device() will call memory
  allocation with GFP_KERNEL,
  which may flush fs page cache to free space, waiting for it self to do
  the commit, causing the deadlock.

  To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
  dev_replace lock range, also involing split the
  btrfs_rm_dev_replace_srcdev() function into remove and free parts.

  Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
  lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
  called out of the lock range."

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[added lockup description]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 18:07:59 +02:00
Anand Jain
f190aa471a Btrfs: add helper for closing one device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 18:00:05 +02:00
Anand Jain
097efc966a Btrfs: don't log error from btrfs_get_bdev_and_sb
Originally the message was not in a helper but ended up there. We should
print error messages from callers instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:56:47 +02:00
Anand Jain
9e271ae27e Btrfs: kernel operation should come after user input has been verified
By general rule of thumb there shouldn't be any way that user land
could trigger a kernel operation just by sending wrong arguments.

Here do commit cleanups after user input has been verified.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:45:10 +02:00
Anand Jain
12b1c2637b Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks
This patch updates and renames btrfs_scratch_superblocks, (which is used
by the replace device thread), with those fixes from the scratch
superblock code section of btrfs_rm_device(). The fixes are:
  Scratch all copies of superblock
  Notify kobject that superblock has been changed
  Update time on the device

So that btrfs_rm_device() can use the function
btrfs_scratch_superblocks() instead of its own scratch code. And further
replace deivce code which similarly releases device back to the system,
will have the fixes from the btrfs device delete.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed to btrfs_scratch_superblock]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:37:34 +02:00
Anand Jain
29c36d7253 Btrfs: add btrfs_read_dev_one_super() to read one specific SB
This uses a chunk of code from btrfs_read_dev_super() and creates
a function called btrfs_read_dev_one_super() so that next patch
can use it for scratch superblock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed bufhead to bh]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:29:38 +02:00
Anand Jain
d74a625987 Btrfs: use BTRFS_ERROR_DEV_MISSING_NOT_FOUND when missing device is not found
Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead
of -ENOENT.  Next this removes the logging when user specifies "missing"
and we don't find it in the kernel device list. Logging are for system
events not for user input errors.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 16:47:16 +02:00
Anand Jain
a4553fefb5 Btrfs: consolidate btrfs_error() to btrfs_std_error()
btrfs_error() and btrfs_std_error() does the same thing
and calls _btrfs_std_error(), so consolidate them together.
And the main motivation is that btrfs_error() is closely
named with btrfs_err(), one handles error action the other
is to log the error, so don't closely name them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:30:00 +02:00
Anand Jain
57d816a15b Btrfs: __btrfs_std_error() logic should be consistent w/out CONFIG_PRINTK defined
error handling logic behaves differently with or without
CONFIG_PRINTK defined, since there are two copies of the same
function which a bit of different logic

One, when CONFIG_PRINTK is defined, code is

__btrfs_std_error(..)
{
::
       save_error_info(fs_info);
       if (sb->s_flags & MS_BORN)
               btrfs_handle_error(fs_info);
}

and two when CONFIG_PRINTK is not defined, the code is

__btrfs_std_error(..)
{
::
       if (sb->s_flags & MS_BORN) {
               save_error_info(fs_info);
               btrfs_handle_error(fs_info);
        }
}

I doubt if this was intentional ? and appear to have caused since
we maintain two copies of the same function and they got diverged
with commits.

Now to decide which logic is correct reviewed changes as below,

 533574c6bc
Commit added two copies of this function

 cf79ffb5b7
Commit made change to only one copy of the function and to the
copy when CONFIG_PRINTK is defined.

To fix this, instead of maintaining two copies of same function
approach, maintain single function, and just put the extra
portion of the code under CONFIG_PRINTK define.

This patch just does that. And keeps code of with CONFIG_PRINTK
defined.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:30:00 +02:00
Anand Jain
92fc03fbdc Btrfs: SB read failure should return EIO for __bread failure
This will return EIO when __bread() fails to read SB,
instead of EINVAL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
c1b7e47459 Btrfs: rename super_kobj to fsid_kobj
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
3257604048 Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
e3bd6973bc Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
6618a59bfc Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:58 +02:00
Anand Jain
96f3136e51 Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:57 +02:00
Linus Torvalds
03e8f64486 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assorted set I've been queuing up:

  Jeff Mahoney tracked down a tricky one where we ended up starting IO
  on the wrong mapping for special files in btrfs_evict_inode.  A few
  people reported this one on the list.

  Filipe found (and provided a test for) a difficult bug in reading
  compressed extents, and Josef fixed up some quota record keeping with
  snapshot deletion.  Chandan killed off an accounting bug during DIO
  that lead to WARN_ONs as we freed inodes"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: keep dropped roots in cache until transaction commit
  Btrfs: Direct I/O: Fix space accounting
  btrfs: skip waiting on ordered range for special files
  Btrfs: fix read corruption of compressed and shared extents
  Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
  Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-25 12:08:41 -07:00
Josef Bacik
2b9dbef272 Btrfs: keep dropped roots in cache until transaction commit
When dropping a snapshot we need to account for the qgroup changes.  If we drop
the snapshot in all one go then the backref code will fail to find blocks from
the snapshot we dropped since it won't be able to find the root in the fs root
cache.  This can lead to us failing to find refs from other roots that pointed
at blocks in the now deleted root.  To handle this we need to not remove the fs
roots from the cache until after we process the qgroup operations.  Do this by
adding dropped roots to a list on the transaction, and letting the transaction
remove the roots at the same time it drops the commit roots.  This will keep all
of the backref searching code in sync properly, and fixes a problem Mark was
seeing with snapshot delete and qgroups.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-09-22 10:22:56 -07:00
chandan
50745b0a7f Btrfs: Direct I/O: Fix space accounting
The following call trace is seen when generic/095 test is executed,

WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
Modules linked in:
CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
Call Trace:
 [<ffffffff81984058>] dump_stack+0x45/0x57
 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
 [<ffffffff8117ce07>] destroy_inode+0x37/0x60
 [<ffffffff8117cf39>] evict+0x109/0x170
 [<ffffffff8117cfd5>] dispose_list+0x35/0x50
 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
 [<ffffffff81165951>] kill_anon_super+0x11/0x20
 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
 [<ffffffff811660cf>] deactivate_super+0x5f/0x70
 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
 [<ffffffff81069c06>] task_work_run+0x96/0xb0
 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
 [<ffffffff8198cbc2>] int_signal+0x12/0x17

This means that the inode had non-zero "outstanding extents" during
eviction. This occurs because, during direct I/O a task which successfully
used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does
not clear the bit after finishing the DIO write. A future DIO write could
actually fail and the unused reserve space won't be freed because of the
previously set BTRFS_INODE_DIO_READY bit.

Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
following issue,
|-----------------------------------+-------------------------------------|
| Task A                            | Task B                              |
|-----------------------------------+-------------------------------------|
| Start direct i/o write on inode X.|                                     |
| reserve space                     |                                     |
| Allocate ordered extent           |                                     |
| release reserved space            |                                     |
| Set BTRFS_INODE_DIO_READY bit.    |                                     |
|                                   | splice()                            |
|                                   | Transfer data from pipe buffer to   |
|                                   | destination file.                   |
|                                   | - kmap(pipe buffer page)            |
|                                   | - Start direct i/o write on         |
|                                   |   inode X.                          |
|                                   |   - reserve space                   |
|                                   |   - dio_refill_pages()              |
|                                   |     - sdio->blocks_available == 0   |
|                                   |     - Since a kernel address is     |
|                                   |       being passed instead of a     |
|                                   |       user space address,           |
|                                   |       iov_iter_get_pages() returns  |
|                                   |       -EFAULT.                      |
|                                   |   - Since BTRFS_INODE_DIO_READY is  |
|                                   |     set, we don't release reserved  |
|                                   |     space.                          |
|                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
| -EIOCBQUEUED is returned.         |                                     |
|-----------------------------------+-------------------------------------|

Hence this commit introduces "struct btrfs_dio_data" to track the usage of
reserved data space. The remaining unused "reserve space" can now be freed
reliably.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-09-21 13:47:55 -07:00
Jeff Mahoney
a30e577c96 btrfs: skip waiting on ordered range for special files
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.

filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
next depends on whether there's an open file handle associated with the
inode.  If there is, we write to the block device, which is unexpected
behavior.  If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.

Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.

Cc: <stable@vger.kernel.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-09-15 02:21:08 +01:00
Filipe Manana
005efedf2c Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.

Consider the following example:

  File layout
  [0 - 8K]                      [8K - 24K]
      |                             |
      |                             |
   points to extent X,         points to extent X,
   offset 4K, length of 8K     offset 0, length 16K

  [extent X, compressed length = 4K uncompressed length = 16K]

If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).

So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.

The following test case for fstests triggers the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create a test file with a single extent that is compressed (the
      # data we write into it is highly compressible no matter which
      # compression algorithm is used, zlib or lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                      -c "pwrite -S 0xbb 4K 8K"        \
                      -c "pwrite -S 0xcc 12K 4K"       \
                      $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone our extent into an adjacent offset.
      $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      # Same as before but for this file we clone the extent into a lower
      # file offset.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                      -c "pwrite -S 0xbb 12K 8K"        \
                      -c "pwrite -S 0xcc 20K 4K"        \
                      $SCRATCH_MNT/bar | _filter_xfs_io

      $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
          $SCRATCH_MNT/bar $SCRATCH_MNT/bar

      echo "File digests before unmounting filesystem:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch
      md5sum $SCRATCH_MNT/bar | _filter_scratch

      # Evicting the inode or clearing the page cache before reading
      # again the file would also trigger the bug - reads were returning
      # all bytes in the range corresponding to the second reference to
      # the extent with a value of 0, but the correct data was persisted
      # (it was a bug exclusively in the read path). The issue happened
      # only if the same readpages() call targeted pages belonging to the
      # first and second ranges that point to the same compressed extent.
      _scratch_remount

      echo "File digests after mounting filesystem again:"
      # Must match the same digests we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
      md5sum $SCRATCH_MNT/bar | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-15 00:59:31 +01:00
Linus Torvalds
e91eb6204f Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "These are small cleanups, and also some fixes for our async worker
  thread initialization.

  I was having some trouble testing these, but it ended up being a
  combination of changing around my test servers and a shiny new
  schedule while atomic from the new start/finish_plug in
  writeback_sb_inodes().

  That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
  changing a bunch of things in my test setup at once it would have been
  really clear.  Fix for writeback_sb_inodes() on the way as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
  btrfs: async_thread: Fix workqueue 'max_active' value when initializing
  btrfs: Add raid56 support for updating  num_tolerated_disk_barrier_failures in btrfs_balance
  btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
  btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
  btrfs: Update out-of-date "skip parity stripe" comment
2015-09-11 12:38:25 -07:00
Filipe Manana
85e0a0f21a Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
After commmit e44163e177 ("btrfs: explictly delete unused block groups
in close_ctree and ro-remount"), added in the 4.3 merge window, we have
calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex.
This can cause a deadlock with a concurrent block group relocation (when
a filesystem balance or shrink operation is in progress for example)
because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the
relocation path locks first delete_unused_bgs_mutex and then it locks
cleaner_mutex, resulting in a classic ABBA deadlock:

         CPU 0                                        CPU 1

lock fs_info->cleaner_mutex

                                           __btrfs_balance() || btrfs_shrink_device()
                                             lock fs_info->delete_unused_bgs_mutex
                                             btrfs_relocate_chunk()
                                               btrfs_relocate_block_group()
                                                 lock fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
  lock fs_info->delete_unused_bgs_mutex

Fix this by not taking the cleaner_mutex before calling
btrfs_delete_unused_bgs() because it's no longer needed after
commit 67c5e7d464 ("Btrfs: fix race between balance and unused block
group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the
spinlock fs_info->unused_bgs_lock and a block group's spinlock are
enough to get correct serialization between tasks running relocation
and unused block group deletion (as well as between multiple tasks
concurrently calling btrfs_delete_unused_bgs()).

This issue was discussed (in the mailing list) during the review of
the patch titled "btrfs: explictly delete unused block groups in
close_ctree and ro-remount" and it was agreed that acquiring the
cleaner mutex had to be dropped after the patch titled
"Btrfs: fix race between balance and unused block group deletion"
got merged (both patches were submitted at about the same time, but
one landed in kernel 4.2 and the other in the 4.3 merge window).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-09-10 11:27:57 +01:00
Filipe Manana
6af3e3adca Btrfs: don't initialize a space info as full to prevent ENOSPC
Commit 2e6e518335 ("Btrfs: fix block group ->space_info null pointer
dereference") accidently marked a space info as full when initializing
it with a value of 0 total bytes. This introduces an ENOSPC problem when
writing file data if we mount a filesystem that has no data block groups
allocated, because the data space info is initialized with 0 total bytes,
marked as full, and it never gets its total bytes incremented by a
(positive) value to unmark it as full (because there are no data block
groups loaded when the fs is mounted).
For metadata and system spaces this issue can never happen since we always
have at least one metadata block group and one system block group (even
for an empty filesystem).

So fix this by just not initializing a space info as full, reverting the
offending part of the commit mentioned above.

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1

  # Mount our filesystem without space caches enabled so that we do not
  # get any space used from the initial data block group that mkfs creates
  # (space caches used space from data block groups).
  _scratch_mount "-o nospace_cache"

  # Need an fs with at least 2Gb to make sure mkfs.btrfs does not create
  # an fs using mixed block groups (used both for data and metadata). We
  # really need to have dedicated block groups for data to reproduce the
  # issue and mkfs.btrfs defaults to mixed block groups only for small
  # filesystems (up to 1Gb).
  _require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024))

  # Run balance with the purpose of deleting the unused data block group
  # that mkfs created. We could also wait for the background kthread to
  # automatically delete the unused block group, but we do not have a way
  # to make it run and wait for it to complete, so just do a balance
  # instead of some unreliable sleep
  _run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT

  # Now unmount the filesystem, mount it again (either with or with space
  # caches enabled, it does not matter to trigger the problem) and attempt
  # to create a file with some data - this used to fail with ENOSPC
  # because there were no data block groups when the filesystem was
  # mounted and the data space info object was marked as full when
  # initialized (because it had 0 total bytes), which prevented the file
  # write path from attempting to allocate a data block group and fail
  # immediately with ENOSPC.
  _scratch_remount
  echo "hello world" > $SCRATCH_MNT/foobar

  echo "Silence is golden"
  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-09-08 03:25:10 +01:00
Linus Torvalds
7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
Linus Torvalds
22365979ab Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has Jeff Mahoney's long standing trim patch that fixes corners
  where trims were missing.  Omar has some raid5/6 fixes, especially for
  using scrub and device replace when devices are missing.

  Zhao Lie continues cleaning and fixing things, this series fixes some
  really hard to hit corners in xfstests.  I had to pull it last merge
  window due to some deadlocks, but those are now resolved.

  I added support for Tejun's new blkio controllers.  It seems to work
  well for single devices, we'll expand to multi-device as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
  btrfs: fix compile when block cgroups are not enabled
  Btrfs: fix file read corruption after extent cloning and fsync
  Btrfs: check if previous transaction aborted to avoid fs corruption
  btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  btrfs: Prevent from early transaction abort
  btrfs: Remove unused arguments in tree-log.c
  btrfs: Remove useless condition in start_log_trans()
  Btrfs: add support for blkio controllers
  Btrfs: remove unused mutex from struct 'btrfs_fs_info'
  Btrfs: fix parity scrub of RAID 5/6 with missing device
  Btrfs: fix device replace of a missing RAID 5/6 device
  Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
  Btrfs: count devices correctly in readahead during RAID 5/6 replace
  Btrfs: remove misleading handling of missing device scrub
  btrfs: fix clone / extent-same deadlocks
  Btrfs: fix defrag to merge tail file extent
  Btrfs: fix warning in backref walking
  btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
  btrfs: Remove root argument in extent_data_ref_count()
  btrfs: Fix wrong comment of btrfs_alloc_tree_block()
  ...
2015-09-05 15:14:43 -07:00
Linus Torvalds
1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Linus Torvalds
089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
Tsutomu Itoh
527afb4493 Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
We need not check path before btrfs_free_path() is called because
path is checked in btrfs_free_path().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:41 -07:00
Qu Wenruo
c6dd6ea557 btrfs: async_thread: Fix workqueue 'max_active' value when initializing
At initializing time, for threshold-able workqueue, it's max_active
of kernel workqueue should be 1 and grow if it hits threshold.

But due to the bad naming, there is both 'max_active' for kernel
workqueue and btrfs workqueue.
So wrong value is given at workqueue initialization.

This patch fixes it, and to avoid further misunderstanding, change the
member name of btrfs_workqueue to 'current_active' and 'limit_active'.

Also corresponding comment is added for readability.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:40 -07:00
Zhao Lei
943c6e9925 btrfs: Add raid56 support for updating
num_tolerated_disk_barrier_failures in btrfs_balance

Code for updating fs_info->num_tolerated_disk_barrier_failures in
btrfs_balance() lacks raid56 support.

Reason:
 Above code was wroten in 2012-08-01, together with
 btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.

 Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
 later to support raid56, but code in btrfs_balance() was not
 updated together.

Fix:
 Merge above similar code to a common function:
 btrfs_get_num_tolerated_disk_barrier_failures()
 and make it support both case.

 It can fix this bug with a bonus of cleanup, and make these code
 never in above no-sync state from now on.

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:48 -07:00
Zhao Lei
2c4580454f btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
1: Use ARRAY_SIZE(types) to replace a static-value variant:
   int num_types = 4;

2: Use 'continue' on condition to reduce one level tab
   if (!XXX) {
       code;
       ...
   }
   ->
   if (XXX)
       continue;
   code;
   ...

3: Put setting 'num_tolerated_disk_barrier_failures = 2' to
   (num_tolerated_disk_barrier_failures > 2) condition to make
   make logic neat.
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   else if (num_tolerated_disk_barrier_failures > 1) {
       if (XXX)
           num_tolerated_disk_barrier_failures = 1;
       else if (XXX)
           num_tolerated_disk_barrier_failures = 2;
   ->
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   if (num_tolerated_disk_barrier_failures > 1 && XXX)
       num_tolerated_disk_barrier_failures = ;
   if (num_tolerated_disk_barrier_failures > 2 && XXX)
       num_tolerated_disk_barrier_failures = 2;

4: Remove comment of:
   num_mirrors - 1: if RAID1 or RAID10 is configured and more
   than 2 mirrors are used.
   which is not fit with code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:47 -07:00
Zhao Lei
8c204c9657 btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
These variables are not used from introduced version, remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:46 -07:00
Zhao Lei
7955323bdc btrfs: Update out-of-date "skip parity stripe" comment
Because btrfs support scrub raid56 parity stripe now.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:45 -07:00
Chris Mason
3a9508b022 btrfs: fix compile when block cgroups are not enabled
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them.  It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.

The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-21 10:08:13 -07:00
Filipe Manana
b84b8390d6 Btrfs: fix file read corruption after extent cloning and fsync
If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.

Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with a single 100K extent starting at file
  # offset 800K. We fsync the file here to make the fsync log tree gets
  # a single csum item that covers the whole 100K extent, which causes
  # the second fsync, done after the cloning operation below, to not
  # leave in the log tree two csum items covering two sub-ranges
  # ([0, 20K[ and [20K, 100K[)) of our extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                  -c "fsync"                     \
                   $SCRATCH_MNT/foo | _filter_xfs_io

  # Now clone part of our extent into file offset 400K. This adds a file
  # extent item to our inode's metadata that points to the 100K extent
  # we created before, using a data offset of 20K and a data length of
  # 20K, so that it refers to the sub-range [20K, 40K[ of our original
  # extent.
  $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
      -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # Now fsync our file to make sure the extent cloning is durably
  # persisted. This fsync will not add a second csum item to the log
  # tree containing the checksums for the blocks in the sub-range
  # [20K, 40K[ of our extent, because there was already a csum item in
  # the log tree covering the whole extent, added by the first fsync
  # we did before.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  # The fsync log replay first processes the file extent item
  # corresponding to the file offset 400K (the one which refers to the
  # [20K, 40K[ sub-range of our 100K extent) and then processes the file
  # extent item for file offset 800K. It used to happen that when
  # processing the later, it erroneously left in the csum tree 2 csum
  # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
  # 1 for the whole range of our extent. This introduced a problem where
  # subsequent lookups for the checksums of blocks within the range
  # [40K, 100K[ of our extent would not find anything because lookups in
  # the csum tree ended up looking only at the smaller csum item, the
  # one covering the subrange [20K, 40K[. This made read requests assume
  # an expected checksum with a value of 0 for those blocks, which caused
  # checksum verification failure when the read operations finished.
  # However those checksum failure did not result in read requests
  # returning an error to user space (like -EIO for e.g.) because the
  # expected checksum value had the special value 0, and in that case
  # btrfs set all bytes of the corresponding pages with the value 0x01
  # and produce the following warning in dmesg/syslog:
  #
  #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
  #   1322675045 expected csum 0"
  #
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  # Must match the same digest he had after cloning the extent and
  # before the power failure happened.
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:46 -07:00
Filipe Manana
1f9b8c8fbc Btrfs: check if previous transaction aborted to avoid fs corruption
While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:

          CPU 0                                                        CPU 1

  transaction N starts

  (...)

  btrfs_commit_transaction(N)

    cur_trans->state = TRANS_STATE_COMMIT_START;
    (...)
    cur_trans->state = TRANS_STATE_COMMIT_DOING;
    (...)

    cur_trans->state = TRANS_STATE_UNBLOCKED;
    root->fs_info->running_transaction = NULL;

                                                              btrfs_start_transaction()
                                                                 --> starts transaction N + 1

    btrfs_write_and_wait_transaction(trans, root);
      --> starts writing all new or COWed ebs created
          at transaction N

                                                              creates some new ebs, COWs some
                                                              existing ebs but doesn't COW or
                                                              deletes eb X

                                                              btrfs_commit_transaction(N + 1)
                                                                (...)
                                                                cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                (...)
                                                                wait_for_commit(root, prev_trans);
                                                                  --> prev_trans == transaction N

    btrfs_write_and_wait_transaction() continues
    writing ebs
       --> fails writing eb X, we abort transaction N
           and set bit BTRFS_FS_STATE_ERROR on
           fs_info->fs_state, so no new transactions
           can start after setting that bit

       cleanup_transaction()
         btrfs_cleanup_one_transaction()
           wakes up task at CPU 1

                                                                continues, doesn't abort because
                                                                cur_trans->aborted (transaction N + 1)
                                                                is zero, and no checks for bit
                                                                BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                are made

                                                                btrfs_write_and_wait_transaction(trans, root);
                                                                  --> succeeds, no errors during writeback

                                                                write_ctree_super(trans, root, 0);
                                                                  --> succeeds
                                                                  --> we have now a superblock that points us
                                                                      to some root that uses eb X, which was
                                                                      never written to disk

In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".

So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:31 -07:00
Michal Hocko
277fb5fc17 btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
transaction but this allocation context is rather weak wrt. reclaim
capabilities. The page allocator currently tries hard to not fail these
allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
still fail if the _current_ process is the OOM killer victim. Moreover
there is an attempt to move away from the default no-fail behavior and
allow these allocation to fail more eagerly. This would lead to:

[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

which is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Michal Hocko
d1b5c5671d btrfs: Prevent from early transaction abort
Btrfs relies on GFP_NOFS allocation when committing the transaction but
this allocation context is rather weak wrt. reclaim capabilities. The
page allocator currently tries hard to not fail these allocations if
they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
currently but there is an attempt to move away from the default no-fail
behavior and allow these allocation to fail more eagerly. And this would
lead to a pre-mature transaction abort as follows:

[   55.328093] Call Trace:
[   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by being explicit about the no-fail behavior of this allocation
path and use __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Zhaolei
60d53eb310 btrfs: Remove unused arguments in tree-log.c
Following arguments are not used in tree-log.c:
 insert_one_name(): path, type
 wait_log_commit(): trans
 wait_for_writer(): trans

This patch remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Zhaolei
34eb2a5249 btrfs: Remove useless condition in start_log_trans()
Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
for start_log_trans():
 fs/btrfs/tree-log.c:178 start_log_trans()
 warn: we tested 'root->log_root' before and it was 'false'

 fs/btrfs/tree-log.c
 147          if (root->log_root) {
 We test "root->log_root" here.
 ...

Reason:
 Condition of:
 fs/btrfs/tree-log.c:178: if (!root->log_root) {
 is not necessary after commit: 7237f1833

 It caused a smatch warning, and no functionally error.

Fix:
 Deleting above condition will make smatch shut up,
 but a better way is to do cleanup for start_log_trans()
 to remove duplicated code and make code more readable.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:24:49 -07:00
Oleg Nesterov
bee9182d95 introduce __sb_writers_{acquired,release}() helpers
Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:08 +02:00
Kent Overstreet
b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Kent Overstreet
0e28997ec4 btrfs: remove bio splitting and merge_bvec_fn() calls
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:43 -06:00
Greg Kroah-Hartman
5d44f4b348 Merge 4.2-rc6 into char-misc-next
We want the fixes in Linus's tree in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09 16:28:09 -07:00
Chris Mason
46cd28555f Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
Chris Mason
da2f0f74cf Btrfs: add support for blkio controllers
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.

Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.

Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.

The end result is able to throttle nicely on single disk filesystems.  A
little more work is required for multi-device filesystems.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:35:06 -07:00
Byongho Lee
a4027a20c5 Btrfs: remove unused mutex from struct 'btrfs_fs_info'
The code using 'ordered_extent_flush_mutex' mutex has removed by below
commit.
 - 8d875f95da
   btrfs: disable strict file flushes for renames and truncates
But the mutex still lives in struct 'btrfs_fs_info'.

So, this patch removes the mutex from struct 'btrfs_fs_info' and its
initialization code.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:27 -07:00
Omar Sandoval
4a770891d9 Btrfs: fix parity scrub of RAID 5/6 with missing device
When testing the previous patch, Zhao Lei reported a similar bug when
attempting to scrub a degraded RAID 5/6 filesystem with a missing
device, leading to NULL pointer dereferences from the RAID 5/6 parity
scrubbing code.

The first cause was the same as in the previous patch: attempting to
call bio_add_page() on a missing block device. To fix this,
scrub_extent_for_parity() can just mark the sectors on the missing
device as errors instead of attempting to read from it.

Additionally, the code uses scrub_remap_extent() to map the extent of
the corresponding data stripe, but the extent wasn't already mapped. If
scrub_remap_extent() finds a missing block device, it doesn't initialize
extent_dev, so we're left with a NULL struct btrfs_device. The solution
is to use btrfs_map_block() directly.

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval
73ff61dbe5 Btrfs: fix device replace of a missing RAID 5/6 device
The original implementation of device replace on RAID 5/6 seems to have
missed support for replacing a missing device. When this is attempted,
we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
crashes when we try to dereference it. This happens because
btrfs_map_block() has no choice but to return us the missing device
because RAID 5/6 don't have any alternate mirrors to read from, and a
missing device has a NULL bdev.

The idea implemented here is to handle the missing device case
separately, which better only happen when we're replacing a missing RAID
5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
reconstruct the data from parity, check it with
scrub_recheck_block_checksum(), and write it out with
scrub_write_block_to_dev_replace().

Reported-by: Philip <bugzilla@philip-seeger.de>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval
b4ee178268 Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
The current RAID 5/6 recovery code isn't quite prepared to handle
missing devices. In particular, it expects a bio that we previously
attempted to use in the read path, meaning that it has valid pages
allocated. However, missing devices have a NULL blkdev, and we can't
call bio_add_page() on a bio with a NULL blkdev. We could do manual
manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
a separate path that allows us to manually add pages to the rbio.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval
7cb2c4202e Btrfs: count devices correctly in readahead during RAID 5/6 replace
Commit 5fbc7c59fd ("Btrfs: fix unfinished readahead thread for raid5/6
degraded mounting") fixed a problem where we would skip a missing device
when we shouldn't have because there are no other mirrors to read from
in RAID 5/6. After commit 2c8cdd6ee4 ("Btrfs, replace: write dirty
pages into the replace target device"), the fix doesn't work when we're
doing a missing device replace on RAID 5/6 because the replace device is
counted as a mirror so we're tricked into thinking we can safely skip
the missing device. The fix is to count only the real stripes and decide
based on that.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval
03679ade86 Btrfs: remove misleading handling of missing device scrub
scrub_submit() claims that it can handle a bio with a NULL block device,
but this is misleading, as calling bio_add_page() on a bio with a NULL
->bi_bdev would've already crashed. Delete this, as we're about to
properly handle a missing block device.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Mark Fasheh
293a8489f3 btrfs: fix clone / extent-same deadlocks
Clone and extent same lock their source and target inodes in opposite order.
In addition to this, the range locking in clone doesn't take ordering into
account. Fix this by having clone use the same locking helpers as
btrfs-extent-same.

In addition, I do a small cleanup of the locking helpers, removing a case
(both inodes being the same) which was poorly accounted for and never
actually used by the callers.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:25 -07:00
Liu Bo
4a3560c4f3 Btrfs: fix defrag to merge tail file extent
The file layout is

[extent 1]...[extent n][4k extent][HOLE][extent x]

extent 1~n and 4k extent can be merged during defrag, and the whole
defrag bytes is larger than our defrag thresh(256k), 4k extent as a
tail is left unmerged since we check if its next extent can be merged
(the next one is a hole, so the check will fail), the layout thus can
be

[new extent][4k extent][HOLE][extent x]
 (1~n)

To fix it, beside looking at the next one, this also looks at the
previous one by checking @defrag_end, which is set to 0 when we
decide to stop merging contiguous extents, otherwise, we can merge
the previous one with our extent.

Also, this makes btrfs behave consistent with how xfs and ext4 do.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:33:50 -07:00
Liu Bo
acdf898de8 Btrfs: fix warning in backref walking
When we do backref walking, we search firstly in queued delayed refs
and then the on-disk backrefs, but we parse differently for shared
references, for delayed refs we also add 'ref->root' while for on-disk
backrefs we don't, this can prevent us from merging refs indexed
by the same bytenr and cause find_parent_nodes() to throw a warning at
'WARN_ON(ref->count < 0)', for example, when we have a shared data extent
with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF,
that happens.

For shared references, no matter if it's delayed or on-disk, ref->root is
not at all used, instead it's ref->parent that really matters, so this has
delayed refs handled as the same way as on-disk refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:33:50 -07:00
Zhaolei
166f66d0bc btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
When a task trying to double lock a extent buffer, there are no
lockdep warning about it because this lock may be in "blocking_lock"
state, and make us hard to debug.

This patch add a WARN_ON() for above condition, it can not report
all deadlock cases(as lock between tasks), but at least helps us
some.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei
9ed0dea09f btrfs: Remove root argument in extent_data_ref_count()
Because it is never used.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei
d02207512d btrfs: Fix wrong comment of btrfs_alloc_tree_block()
These wrong comment was copyed from another function(expired) from
init, this patch fixed them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei
93314e3b64 btrfs: abort transaction on btrfs_reloc_cow_block()
When btrfs_reloc_cow_block() failed in __btrfs_cow_block(), current
code just return a err-value to caller, but leave new_created extent
buffer exist and locked.

Then subsequent code (in relocate) try to lock above eb again,
and caused deadlock without any dmesg.
(eb lock use wait_event(), so no lockdep message)

It is hard to do recover work in __btrfs_cow_block() at this error
point, but we can abort transaction to avoid deadlock and operate on
unstable state.a

It also helps developer to find wrong place quickly.
(better than a frozen fs without any dmesg before patch)

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei
147d256e09 btrfs: Remove unnecessary variants in relocation.c
These arguments are not used in functions, remove them for cleanup
and make kernel stack happy.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei
dc2ee4e244 btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
Remove chunk_objectid argument from btrfs_relocate_chunk() because
it is not necessary, it can also cleanup some code in caller for
prepare its value.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei
4624900dd3 btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()
objectid's init-value is not used in any case, remove it.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei
4b3576e450 btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
We need error checking code for get_ref_objectid_v0() in
relocate_block_group(), to avoid unpredictable result, especially
for accessing uninitialized value(when function failed) after
this line.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei
55e3a601c8 btrfs: Fix data checksum error cause by replace with io-load.
xfstests btrfs/070 sometimes failed.
In my test machine, its fail rate is about 30%.
In another vm(vmware), its fail rate is about 50%.

Reason:
  btrfs/070 do replace and defrag with fsstress simultaneously,
  after above operation, checksum error is found by scrub.

  Actually, it have no relationship with defrag operation, only
  replace with fsstress can trigger this bug.

  New data writen to target device have possibility rewrited by
  old data from source device by replace code in debug, to avoid
  above problem, we can set target block group to readonly in
  replace period, so new data requested by other operation will
  not write to same place with replace code.

  Before patch(4.1-rc3):
    30% failed in 100 xfstests.
  After patch:
    0% failed in 300 xfstests.

It also happened in btrfs/071 as it's another scrub with IO load tests.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei
b708ce969a btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()
Use new intruduced scrub_pause_on/off() can make this code block
clean and more readable.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhaolei
0e22be890e btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()
It can reduce current duplicated code which is similar to
scrub_blocked_if_needed() but can not call it because little
different.
It also used by my next patch which is in same case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhaolei
868f401ae3 btrfs: Use ref_cnt for set_block_group_ro()
More than one code call set_block_group_ro() and restore rw in fail.

Old code use bool bit to save blockgroup's ro state, it can not
support parallel case(it is confirmd exist in my debug log).

This patch use ref count to store ro state, and rename
set_block_group_ro/set_block_group_rw
to
inc_block_group_ro/dec_block_group_ro.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhao Lei
d7cad23895 btrfs: Bypass unrelated items before accessing its contents in scrub
When we access extent_root in scrub_stripe() and
scrub_raid56_parity(), we need bypass unrelated tree item firstly
before using its contents to do other condition.

It is not a bug fix, only making code sequence in logic.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhao Lei
fe8cf654b1 btrfs: Load only necessary csums into list in scrub
We need not load csum of whole strip in scrub because strip is trimed
before use, it is to say, what we really need to calculate csum is
data between [extent_logical, extent_len).

This patch changed to use above segment for btrfs_lookup_csums_range()
in scrub_stripe()

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei
a0dd59de3c btrfs: Fix calculate typo caused by ambiguous meaning of logic_end
For example, in scrub_raid56_parity(), following lines are used
to judge is all data processed:
 place1: if (key.objectid > logic_end) ...
 place2: if (logic_start >= logic_end) ...
 ...
 (place2 is typo, is should be ">", it is copied from other
  place, where logic_end's meaning is different, long story...)

We can fix above typo directly, but the root reason is ambiguous
meaning of logic_end in scrub raid56 parity.

In other place, XXX_end is pointed to data which is not included,
and we need to process segment of [XXX_start, XXX_end).

But for scrub raid56 parity, logic_end is pointed to lattest data
need to process, and introduced many "+ 1" and "- 1" in code as
below:
 length = sparity->logic_end - sparity->logic_start + 1
 logic_end - logic_start + 1
 stripe_logical + increment - 1

This patch changed logic_end's meaning to make it in normal understanding
in raid56 parity functions and data struct alone with above bugfix.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei
6fa96d72f7 btrfs: Free checksum list on scrub_extent() fail
When scrub_extent() failed, we need to free previois created
checksum list.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei
f2f66a2f88 btrfs: Check cancel and pause in interval of scrub operation
Old code checking cancel and pause request inside scrub stripe
operation, like:
  loop() {
    if (parity) {
      scrub_parity_stripe();
      continue;
    }

    check_cancel_and_pause()

    scrub_normal_stripe();
  }

Reason is when introduce raid56 stripe scrub, new code is inserted
simplely to front of loop.

Better to:
  loop() {
    check_cancel_and_pause()

    if (parity)
      scrub_parity_stripe();
    else
      scrub_normal_stripe();
  }

This patch adjusted code place to realize above sequence.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei
78fa177029 btrfs: Show detail information when mount failed on missing devices
When mount failed because missing device, we can see following
dmesg:
 [ 1060.267743] BTRFS: too many missing devices, writeable mount is not allowed
 [ 1060.273158] BTRFS: open_ctree failed

This patch add missing_device_number and tolerated_missing_device_number
to above output, to let user know what really happened, and helps
bug-report and debug.

dmesg after patch:
 [  127.050367] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed
 [  127.056099] BTRFS: open_ctree failed

Changelog v1->v2:
1: Changed to more clear description, suggested-by:
   Anand Jain <anand.jain@oracle.com>

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:10 -07:00
Zhao Lei
a323e8139c btrfs: Fix scrub panic when leaf crosses stripes
Scrub panic in following operation:
  mkfs.ext4 /dev/vdh
  btrfs-convert /dev/vdh
  mount /dev/vdh /mnt/tmp1
  btrfs scrub start -B /dev/vdh
  (panic)

Reason:
  1: In some case, leaf created by btrfs-convert was splited into 2
     strips.
  2: Scrub bypassed part of above wrong leaf data, but remain data
     caused panic in scrub_checksum_tree_block().

For reason 1:
  we can get following information after some simple operation.
  a. mkfs.ext4 /dev/vdh
     btrfs-convert /dev/vdh
  b. btrfs-debug-tree /dev/vdh
     we can see following item in extent tree:
     item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33
     Its logical address is [27054080, 27070464)
     and acrossed 2 strips:
     [27000832, 27066368)
     [27066368, 27131904)
  Will be fixed in btrfs-progs(btrfs-convert, btrfsck, ...)

For reason 2:
  Scrub is trying to do a "bypass" in this case, but the result is
  "panic", because current code lacks of some condition in bypass,
  and let some wrong leaf data escaped.

This patch fixed above scrub code.

Before patch:
  # btrfs scrub start -B /dev/vdh
  (panic)

After patch:
  # btrfs scrub start -B /dev/vdh
  scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e
  ...
  # dmesg
  ...
  [   59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
  [   59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
  #

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:00:31 -07:00
Filipe Manana
18aa092297 Btrfs: fix stale dir entries after removing a link and fsync
We have one more case where after a log tree is replayed we get
inconsistent metadata leading to stale directory entries, due to
some directories having entries pointing to some inode while the
inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item.

To trigger the problem we need to have a file with multiple hard links
belonging to different parent directories. Then if one of those hard
links is removed and we fsync the file using one of its other links
that belongs to a different parent directory, we end up not logging
the fact that the removed hard link doesn't exists anymore in the
parent directory.

Simple reproducer:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test directory and file.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/foo
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3

  # Make sure everything done so far is durably persisted.
  sync

  # Now we remove one of our file's hardlinks in the directory testdir.
  unlink $SCRATCH_MNT/testdir/foo3

  # We now fsync our file using the "foo" link, which has a parent that
  # is not the directory "testdir".
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Silently drop all writes and unmount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger journal/log replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the journal/log is replayed we expect to not see the "foo3"
  # link anymore and we should be able to remove all names in the
  # directory "testdir" and then remove it (no stale directory entries
  # left after the journal/log replay).
  echo "Entries in testdir:"
  ls -1 $SCRATCH_MNT/testdir

  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  status=0
  exit

The test fails with:

  $ ./check generic/107
  FSTYP         -- btrfs
  PLATFORM      -- Linux/x86_64 debian3 4.1.0-rc6-btrfs-next-11+
  MKFS_OPTIONS  -- /dev/sdc
  MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

  generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad)
    --- tests/generic/107.out	2015-08-01 01:39:45.807462161 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad
    @@ -1,3 +1,5 @@
     QA output created by 107
     Entries in testdir:
     foo2
    +foo3
    +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
    ...
    _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \
      (see /home/fdmanana/git/hub/xfstests/results//generic/107.full)
    _check_dmesg: something found in dmesg (see .../results/generic/107.dmesg)
  Ran: generic/107
  Failures: generic/107
  Failed 1 of 1 tests

  $ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full
  (...)
  checking fs roots
  root 5 inode 257 errors 200, dir isize wrong
	unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref
  (...)

And produces the following warning in dmesg:

  [127298.759064] BTRFS info (device dm-0): failed to delete reference to foo3, inode 258 parent 257
  [127298.762081] ------------[ cut here ]------------
  [127298.763311] WARNING: CPU: 10 PID: 7891 at fs/btrfs/inode.c:3956 __btrfs_unlink_inode+0x182/0x35a [btrfs]()
  [127298.767327] BTRFS: Transaction aborted (error -2)
  (...)
  [127298.788611] Call Trace:
  [127298.789137]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [127298.790090]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
  [127298.791157]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [127298.792323]  [<ffffffffa065ad09>] ? __btrfs_unlink_inode+0x182/0x35a [btrfs]
  [127298.793633]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
  [127298.794699]  [<ffffffffa065ad09>] __btrfs_unlink_inode+0x182/0x35a [btrfs]
  [127298.797640]  [<ffffffffa065be8f>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
  [127298.798876]  [<ffffffffa065bf11>] btrfs_unlink+0x60/0x9b [btrfs]
  [127298.800154]  [<ffffffff8116fb48>] vfs_unlink+0x9c/0xed
  [127298.801303]  [<ffffffff81173481>] do_unlinkat+0x12b/0x1fb
  [127298.802450]  [<ffffffff81253855>] ? lockdep_sys_exit_thunk+0x12/0x14
  [127298.803797]  [<ffffffff81174056>] SyS_unlinkat+0x29/0x2b
  [127298.805017]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
  [127298.806310] ---[ end trace bbfddacb7aaada7b ]---
  [127298.807325] BTRFS warning (device dm-0): __btrfs_unlink_inode:3956: Aborting unused transaction(No such entry).

So fix this by logging all parent inodes, current and old ones, to make
sure we do not get stale entries after log replay. This is not a simple
solution such as triggering a full transaction commit because it would
imply full transaction commit when an inode is fsynced in the same
transaction that modified it and reloaded it after eviction (because its
last_unlink_trans is set to the same value as its last_trans as of the
commit with the title "Btrfs: fix stale dir entries after unlink, inode
eviction and fsync"), and it would also make fstest generic/066 fail
since one of the fsyncs triggers a full commit and the next fsync will
not find the inode in the log anymore (therefore not removing the xattr).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:04 -07:00
Naohiro Aota
dd81d459a3 btrfs: fix search key advancing condition
The search key advancing condition used in copy_to_sk() is loose. It can
advance the key even if it reaches sk->max_*: e.g. when the max key = (512,
1024, -1) and the current key = (512, 1025, 10), it increments the
offset by 1, continues hopeless search from (512, 1025, 11). This issue
make ioctl() to take unexpectedly long time scanning all the leaf a blocks
one by one.

This commit fix the problem using standard way of key comparison:
btrfs_comp_cpu_keys()

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:02 -07:00
Filipe Manana
d6589101b6 Btrfs: teach backref walking about backrefs with underflowed offset values
When cloning/deduplicating file extents (through the clone and extent_same
ioctls) we can get data back references with offset values that are a
result of an unsigned integer arithmetic underflow, that is, values that
are much larger then they could be otherwise.

This is not a problem when decrementing or dropping the back references
(happens when we overwrite the extents or punch a hole for example, through
__btrfs_drop_extents()), since we compute the same too large offset value,
but it is a problem for the backref walking code, used by an incremental
send and the ioctls that are used by the btrfs tool "inspect-internal"
commands, as it makes it miss the corresponding file extent items because
the search key is set for an extent item that starts at an offset matching
the exceptionally large offset value of the data back reference. For an
incremental send this causes the send ioctl to fail with -EIO.

So teach the backref walking code to deal with these cases by setting the
search key's offset to 0 if the backref's offset value is larger than
LLONG_MAX (the largest possible file offset). This makes sure the backref
walking code finds the corresponding file extent items at the expense of
scanning more items and leafs in the btree.

Fixing the clone/dedup ioctls to not produce such underflowed results would
require major changes breaking backward compatibility, updating user space
tools, etc.

Simple reproducer case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K starting at file
  # offset 128K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1

  # Now clone parts of the original extent into lower offsets of the file.
  #
  # The first clone operation adds a file extent item to file offset 0
  # that points to our initial extent with a data offset of 16K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709535232, which is the result of file_offset - data_offset
  # = 0 - 16K.
  #
  # The second clone operation adds a file extent item to file offset 16K
  # that points to our initial extent with a data offset of 48K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709518848, which is the result of file_offset - data_offset
  # = 16K - 48K.
  #
  # Those large back reference offsets (result of unsigned arithmetic
  # underflow) confused the back reference walking code (used by an
  # incremental send and the multiple inspect-internal ioctls) and made it
  # miss the back references, which for the case of an incremental send it
  # made it fail with -EIO and print a message like the following to
  # dmesg:
  #
  # "BTRFS error (device sdc): did not find backref in send_root. \
  #  inode=257, offset=0, disk_byte=12845056 found extent=12845056"
  #
  $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo
  $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \
      -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
      -f $send_files_dir/2.snap

  echo "File digest in the original filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  status=0
  exit

The test's expected golden output is:

  wrote 65536/65536 bytes at offset 131072
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File digest in the original filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
  File digest in the new filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo

But it failed with:

    (...)
    @@ -1,7 +1,5 @@
     QA output created by 097
     wrote 65536/65536 bytes at offset 131072
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -File digest in the original filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    -File digest in the new filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    ...

  $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full
  (...)
  ERROR: send ioctl failed with -5: Input/output error

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:00 -07:00
Filipe Manana
bde6c24202 Btrfs: fix stale dir entries after unlink, inode eviction and fsync
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.

This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach but it always ensures
correctness with the trade off of ocassional full transaction commits
when an fsync is done against the inode in the same transaction where
it was evicted and reloaded when our inode is a directory and often
logging its parent unnecessarily when our inode is not a directory.

The following test case for fstests triggers the problem:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with 2 hard links.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/testdir/foo
  ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar

  # Make sure everything done so far is durably persisted.
  sync

  # Now remove one of the links, trigger inode eviction and then fsync
  # our inode.
  unlink $SCRATCH_MNT/testdir/bar
  echo 2 > /proc/sys/vm/drop_caches
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo

  # Silently drop all writes on our scratch device to simulate a power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount the fs to trigger log/journal replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now verify our directory entries.
  echo "Entries in testdir:"
  ls -1 $SCRATCH_MNT/testdir

  # If we remove our inode, its parent should become empty and therefore we should
  # be able to remove the parent.
  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  # The fstests framework will call fsck against our filesystem which will verify
  # that all metadata is in a consistent state.

  status=0
  exit

The test failed on btrfs with:

  generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
    --- tests/generic/098.out	2015-07-23 18:01:12.616175932 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad	2015-07-23 18:04:58.924138308 +0100
    @@ -1,3 +1,6 @@
     QA output created by 098
     Entries in testdir:
    +bar
     foo
    +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
    +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
    ...
    (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad'  to see the entire diff)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)

  $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
  (...)
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
     unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
     unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
  Checking filesystem on /dev/sdc
  (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:16:58 -07:00
Filipe Manana
bb53eda902 Btrfs: fix stale directory entries after fsync log replay
We have another case where after an fsync log replay we get an inode with
a wrong link count (smaller than it should be) and a number of directory
entries greater than its link count. This happens when we add a new link
hard link to our inode A and then we fsync some other inode B that has
the side effect of logging the parent directory inode too. In this case
at log replay time we add the new hard link to our inode (the item with
key BTRFS_INODE_REF_KEY) when processing the parent directory but we
never adjust the link count of our inode A. As a result we get stale dir
entries for our inode A that can never be deleted and therefore it makes
it impossible to remove the parent directory (as its i_size can never
decrease back to 0).

A simple reproducer for fstests that triggers this issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test directory and files.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/testdir/foo
  touch $SCRATCH_MNT/testdir/bar

  # Make sure everything done so far is durably persisted.
  sync

  # Create one hard link for file foo and another one for file bar. After
  # that fsync only the file bar.
  ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
  ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar

  # Silently drop all writes on scratch device to simulate power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount the fs to trigger log/journal replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now verify both our files have a link count of 2.
  echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
  echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"

  # We should be able to remove all the links of our files in testdir, and
  # after that the parent directory should become empty and therefore
  # possible to remove it.
  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  # The fstests framework will call fsck against our filesystem which will verify
  # that all metadata is in a consistent state.

  status=0
  exit

The test fails with:

 -Link count for file foo: 2
 +Link count for file foo: 1
  Link count for file bar: 2
 +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
 +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
 (...)
 _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent

And fsck's output:

  (...)
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
      unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
  Checking filesystem on /dev/sdc
  (...)

So fix this by marking inodes for link count fixup at log replay time
whenever a directory entry is replayed if the entry was created in the
transaction where the fsync was made and if it points to a non-directory
inode.

This isn't a new problem/regression, the issue exists for a long time,
possibly since the log tree feature was added (2008).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:16:56 -07:00
Linus Torvalds
af0b3152bb Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have a btrfs quota regression fix.

  I merged this one on Thursday and have run it through tests against
  current master.

  Normally I wouldn't have sent this while you were finalizing rc6, but
  I'm feeding mosquitoes in the adirondacks next week, so I wanted to
  get this one out before leaving.  I'll leave longer tests running and
  check on things during the week, but I don't expect any problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: qgroup: Fix a regression in qgroup reserved space.
2015-08-09 05:56:31 +03:00
Geert Uytterhoeven
d41e36a0ab Btrfs: Spelling s/consitent/consistent/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:13:21 +02:00
Qu Wenruo
c05f9429e1 btrfs: qgroup: Fix a regression in qgroup reserved space.
During the change to new btrfs extent-oriented qgroup implement, due to
it doesn't use the old __qgroup_excl_accounting() for exclusive extent,
it didn't free the reserved bytes.

The bug will cause limit function go crazy as the reserved space is
never freed, increasing limit will have no effect and still cause
EQOUT.

The fix is easy, just free reserved bytes for newly created exclusive
extent as what it does before.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-06 14:51:15 -07:00
Greg Kroah-Hartman
f368ed6088 char: make misc_deregister a void function
With well over 200+ users of this api, there are a mere 12 users that
actually checked the return value of this function.  And all of them
really didn't do anything with that information as the system or module
was shutting down no matter what.

So stop pretending like it matters, and just return void from
misc_deregister().  If something goes wrong in the call, you will get a
WARNING splat in the syslog so you know how to fix up your driver.
Other than that, there's nothing that can go wrong.

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 10:35:49 -07:00
Linus Torvalds
acea568fa9 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe fixed up a hard to trigger ENOSPC regression from our merge
  window pull, and we have a few other smaller fixes"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix quick exhaustion of the system array in the superblock
  btrfs: its btrfs_err() instead of btrfs_error()
  btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
  btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
2015-07-31 17:05:37 -07:00
Jeff Mahoney
e33e17ee10 btrfs: add missing discards when unpinning extents with -o discard
When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.

The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit.  As a result, any extents that
would be freed when the block group is removed aren't discarded.  In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.

To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:29 -07:00
Jeff Mahoney
e44163e177 btrfs: explictly delete unused block groups in close_ctree and ro-remount
The cleaner thread may already be sleeping by the time we enter
close_ctree.  If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard.  We also
explicitly remove unused block groups in the ro-remount path
for the same reason.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:27 -07:00
Jeff Mahoney
499f377f49 btrfs: iterate over unused chunk space in FITRIM
Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.

This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used.  This is
a change for btrfs but is consistent with other file systems.

We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock.  That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls.  We release it
between searches to allow the file system to perform more-or-less
normally.  Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search.  This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:26 -07:00
Jeff Mahoney
86557861df btrfs: skip superblocks during discard
Btrfs doesn't track superblocks with extent records so there is nothing
persistent on-disk to indicate that those blocks are in use.  We track
the superblocks in memory to ensure they don't get used by removing them
from the free space cache when we load a block group from disk.  Prior
to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
was fine since the block group would never be reclaimed so the superblock
was always safe.  Once we started removing the empty block groups, we
were protected by the fact that discards weren't being properly issued
for unused space either via FITRIM or -odiscard.  The block groups were
still being released, but the blocks remained on disk.

In order to properly discard unused block groups, we need to filter out
the superblocks from the discard range.  Superblocks are located at fixed
locations on each device, so it makes sense to filter them out in
btrfs_issue_discard, which is used by both -odiscard and FITRIM.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:25 -07:00
Jeff Mahoney
4d89d377bb btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries
It's possible, though unexpected, to pass unaligned offsets and lengths
to btrfs_issue_discard.  We then shift the offset/length values to sector
units.  If an unaligned offset has been passed, it will result in the
entire sector being discarded, possibly losing data.  An unaligned
length is safe but we'll end up returning an inaccurate number of
discarded bytes.

This patch aligns the offset to the 512B boundary, adjusts the length,
and warns, since we shouldn't be discarding on an offset that isn't
aligned with our sector size.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:24 -07:00
Jeff Mahoney
d04c6b8832 btrfs: make btrfs_issue_discard return bytes discarded
Initially this will just be the length argument passed to it,
but the following patches will adjust that to reflect re-alignment
and skipped blocks.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:22 -07:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Filipe Manana
00d80e342c Btrfs: fix quick exhaustion of the system array in the superblock
Omar reported that after commit 4fbcdf6694 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:

  #!/bin/bash

  truncate -s 100T big.img
  mkfs.btrfs big.img
  mount -o loop big.img /mnt/loop

  num=5
  sz=10T
  for ((i = 0; i < $num; i++)); do
      echo fallocate $i $sz
      fallocate -l $sz /mnt/loop/testfile$i
  done
  btrfs filesystem sync /mnt/loop

  for ((i = 0; i < $num; i++)); do
        echo rm $i
        rm /mnt/loop/testfile$i
        btrfs filesystem sync /mnt/loop
  done
  umount /mnt/loop

This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and when committing
the transaction we start the writeout of the block group caches for all
the new new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to creation of more block groups, and all get
attached to the new_bgs list of the same transaction handle to the point
of getting a list with over 1500 elements, and creation of new block groups
leads to the need of reserving space in the chunk block reserve and often
creating a new system block group too.

So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before, we do reserve space for each
new block group in the chunk block reserve, unlike before where we would
not and would at most allocate one new system block group and therefore
would only ensure that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the computed required space at check_system_chunk() is
very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
that all nodes/leafs in a path will be COWed and split) and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between) and that
compensated for the need of creating any new nodes/leads in the chunk
tree.

So fix this by ensuring we don't accumulate a too large list of new block
groups in a transaction's handles new_bgs list, inserting/updating the
chunk tree for all accumulated new block groups and releasing the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).

Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:54 -07:00
Anand Jain
3e303ea60d btrfs: its btrfs_err() instead of btrfs_error()
sorry I indented to use btrfs_err() and I have no idea
how btrfs_error() got there.
infact I was thinking about these kind of oversights
since these two func are too closely named.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:53 -07:00
Zhao Lei
95ab1f6490 btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
When read_tree_block() failed, we can see following dmesg:
 [  134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063
 [  134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
 [  134.372236] PGD 0
 [  134.372236] Oops: 0000 [#1] SMP
 [  134.372236] Modules linked in:
 [  134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115
 [  134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
 [  134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000
 [  134.372236] RIP: 0010:[<ffffffff813a4a51>]  [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
 ...
 [  134.372236] Call Trace:
 [  134.372236]  [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0
 [  134.372236]  [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190
 [  134.372236]  [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0
 [  134.372236]  [<ffffffff8144d017>] ? disk_name+0x97/0xb0
 [  134.372236]  [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0
 ...

Reason:
 read_tree_block() changed to return error number on fail,
 and this value(not NULL) is set to tree_root->node, then subsequent
 code will run to:
  free_root_pointers()
  ->free_root_extent_buffers()
  ->free_extent_buffer()
  ->atomic_read((extent_buffer *)(-E_XXX)->refs);
 and trigger above error.

Fix:
 Set tree_root->node to NULL on fail to make error_handle code
 happy.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:52 -07:00
Zhao Lei
8a73301304 btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
delayed_iput_sem in xfstests generic/241:
  [ 2061.345955] =============================================
  [ 2061.346027] [ INFO: possible recursive locking detected ]
  [ 2061.346027] 4.1.0+ #268 Tainted: G        W
  [ 2061.346027] ---------------------------------------------
  [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
  [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
  [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
  [ 2061.346027] but task is already holding lock:
  [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
  [ 2061.346027] other info that might help us debug this:
  [ 2061.346027]  Possible unsafe locking scenario:

  [ 2061.346027]        CPU0
  [ 2061.346027]        ----
  [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
  [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
  [ 2061.346027]
   *** DEADLOCK ***
It is rarely happened, about 1/400 in my test env.

The reason is recursion of btrfs_run_delayed_iputs():
  cleaner_kthread
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock *2
  -> iput()
  -> ...
  -> btrfs_commit_transaction()
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock (dead lock) *2
  *1: recursion of btrfs_run_delayed_iputs()
  *2: warning of lockdep about delayed_iput_sem

When fs is in high stress, new iputs may added into fs_info->delayed_iputs
list when btrfs_run_delayed_iputs() is running, which cause
second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
again, and cause above lockdep warning.

Actually, it will not cause real problem because both locks are read lock,
but to avoid lockdep warning, we can do a fix.

Fix:
  Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
  cleaner_kthread thread to break above recursion path.
  cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
  and don't need to call btrfs_run_delayed_iputs() again in
  btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.

Test:
  No above lockdep warning after patch in 1200 generic/241 tests.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:51 -07:00
Linus Torvalds
8be5701342 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are all from Filipe, and cover a few problems we've had reported
  on the list recently (along with ones he found on his own)"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file corruption after cloning inline extents
  Btrfs: fix order by which delayed references are run
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
2015-07-17 21:46:57 -07:00
Filipe Manana
ed95876264 Btrfs: fix file corruption after cloning inline extents
Using the clone ioctl (or extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing copy an inline extent from
the source file into a non-zero offset of the destination file. This is
something not expected and that the btrfs code is not prepared to deal
with - all inline extents must be at a file offset equals to 0.

For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test files. File foo has the same 2K of data at offset 4K
  # as file bar has at its offset 0.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
      -c "pwrite -S 0xbb 4k 2K" \
      -c "pwrite -S 0xcc 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # File bar consists of a single inline extent (2K size).
  $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
     $SCRATCH_MNT/bar | _filter_xfs_io

  # Now call the clone ioctl to clone the extent of file bar into file
  # foo at its offset 4K. This made file foo have an inline extent at
  # offset 4K, something which the btrfs code can not deal with in future
  # IO operations because all inline extents are supposed to start at an
  # offset of 0, resulting in all sorts of chaos.
  # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
  # what it returns for other cases dealing with inlined extents.
  $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
      $SCRATCH_MNT/bar $SCRATCH_MNT/foo

  # Because of the inline extent at offset 4K, the following write made
  # the kernel crash with a BUG_ON().
  $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io

  status=0
  exit

The stack trace of the BUG_ON() triggered by the last write is:

  [152154.035903] ------------[ cut here ]------------
  [152154.036424] kernel BUG at mm/page-writeback.c:2286!
  [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
  [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
  [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
  [152154.036424] RIP: 0010:[<ffffffff8111a9d5>]  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
  [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
  [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
  [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
  [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
  [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
  [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
  [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
  [152154.036424] Stack:
  [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
  [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
  [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
  [152154.036424] Call Trace:
  [152154.036424]  [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
  [152154.036424]  [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
  [152154.036424]  [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
  [152154.036424]  [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
  [152154.036424]  [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
  [152154.036424]  [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5
  [152154.036424]  [<ffffffff81165f89>] vfs_write+0xa0/0xe4
  [152154.036424]  [<ffffffff81166855>] SyS_pwrite64+0x64/0x82
  [152154.036424]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
  [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$
  [152154.036424] RIP  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424]  RSP <ffff880429effc68>
  [152154.242621] ---[ end trace e3d3376b23a57041 ]---

Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:

   00fdf13a2e ("Btrfs: fix a crash of clone with inline extents's split")
   3f9e3df8da ("btrfs: replace error code from btrfs_drop_extents")

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-14 16:09:39 +01:00
Filipe Manana
cffc3374e5 Btrfs: fix order by which delayed references are run
When we have an extent that got N references removed and N new references
added in the same transaction, we must run the insertion of the references
first because otherwise the last removed reference will remove the extent
item from the extent tree, resulting in a failure for the insertions.

This is a regression introduced in the 4.2-rc1 release and this fix just
brings back the behaviour of selecting reference additions before any
reference removals.

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create prealloc extent covering range [160K, 620K[
  $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo

  # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
  # space that immediately follows it. This creates a new extent of 40K that spans
  # the range [620K, 660K[.
  $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now 2 back references to the prealloc extent in our
  # extent tree. Both are for our file offset 160K and one relates to a file
  # extent item with a data offset of 0 and a length of 380K, while the other
  # relates to a file extent item with a data offset of 380K and a length of 80K.

  # Make sure everything done so far is durably persisted (all back references are
  # in the extent tree, etc).
  sync

  # Now clone all extents of our file that cover the offset 160K up to its eof
  # (660K at this point) into itself at offset 2M. This leaves a hole in the file
  # covering the range [660K, 2M[. The prealloc extent will now be referenced by
  # the file twice, once for offset 160K and once for offset 2M. The 40K extent
  # that follows the prealloc extent will also be referenced twice by our file,
  # once for offset 620K and once for offset 2M + 460K.
  $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
	$SCRATCH_MNT/foo

  # Now create one new extent in our file with a size of 100Kb. It will span the
  # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
  # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now (in memory) 4 back references to the prealloc
  # extent.
  #
  # Two of them are for file offset 160K, related to file extent items
  # matching the file offsets 160K and 540K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
  #
  # The other two references are for file offset 2M, related to file extent items
  # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
  #
  # The 40K extent has 2 back references, one for file offset 620K and the other
  # for file offset 2M + 460K.
  #
  # The 100K extent has a single back reference and it relates to file offset 3M.

  # Now clone our 100K extent into offset 600K. That offset covers the last 20K
  # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
  # offset 660K.
  $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # At this point there's only one reference to the 40K extent, at file offset
  # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
  # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
  # file offset 3M and a new one for file offset 600K).

  # Now fsync our file to make all its new data and metadata updates are durably
  # persisted and present if a power failure/crash happens after a successful
  # fsync and before the next transaction commit.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file contents.
  # During log replay, the btrfs delayed references implementation used to run the
  # deletion of back references before the addition of new back references, which
  # made the addition fail as it didn't find the key in the extent tree that it
  # was looking for. The failure triggered by this test was related to the 40K
  # extent, which got 1 reference dropped and 1 reference added during the fsync
  # log replay - when running the delayed references at transaction commit time,
  # btrfs was applying the deletion before the insertion, resulting in a failure
  # of the insertion that ended up turning the fs into read-only mode.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

This issue turned the filesystem into read-only mode (current transaction
aborted) and produced the following traces:

  [ 8247.578385] ------------[ cut here ]------------
  [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
  (...)
  [ 8247.601697] Call Trace:
  [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
  [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
  [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
  [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
  [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  (...)
  [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---

  [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
  [ 8247.728954] BTRFS: Transaction aborted (error -5)
  (...)
  [ 8247.760866] Call Trace:
  [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
  [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
  [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
  [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
  [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
  [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
  [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
  [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
  (...)
  [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
  [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
  [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)

Fixes: c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Acked-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-07-11 22:36:44 +01:00
Filipe Manana
d3efe08400 Btrfs: fix list transaction->pending_ordered corruption
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.

This produces the following warning on a kernel with linked list
debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
[109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
[109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
[109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
[109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
[109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
[109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
[109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().

Cc: stable@vger.kernel.org
Fixes: 50d9aa99bd ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2015-07-11 22:35:05 +01:00
Filipe Manana
497b4050e0 Btrfs: fix memory leak in the extent_same ioctl
We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.

This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.

Reported-by: Julian Taylor <jtaylor.debian@googlemail.com>
Reported-by: Marcel Ritter <ritter.marcel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-07-11 22:34:26 +01:00
Filipe Manana
c1aa45759e Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.

This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.

Example reproducer:

  $ mkfs.btrfs -O no-holes -f /dev/sdd
  $ mount /dev/sdd /mnt

  # Create our test file with some data and durably persist it.
  $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
  $ sync

  # Append some data to the file, increasing its size, and leave a hole
  # between the old size and the start offset if the following write. So
  # our file gets a hole in the range [128Kb, 256Kb[.
  $ xfs_io -c "truncate 160K" /mnt/foo

  # We expect to see our file with a size of 160Kb, with the first 128Kb
  # of data all having the value 0xaa and the remaining 32Kb of data all
  # having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0500000

  # Now cleanly unmount and mount again the filesystem.
  $ umount /mnt
  $ mount /dev/sdd /mnt

  # We expect to get the same result as before, a file with a size of
  # 160Kb, with the first 128Kb of data all having the value 0xaa and the
  # remaining 32Kb of data all having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000

In the example above the file size/data do not match what they were before
the remount.

Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11 22:33:14 +01:00
Linus Torvalds
31b7a57c9e Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of fixes.  Most of the commits are from Filipe
  (fsync, the inode allocation cache and a few others).  Mark kicked in
  a series fixing corners in the extent sharing ioctls, and everyone
  else fixed up on assorted other problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong check for btrfs_force_chunk_alloc()
  Btrfs: fix warning of bytes_may_use
  Btrfs: fix hang when failing to submit bio of directIO
  Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
  Btrfs: fix memory corruption on failure to submit bio for direct IO
  btrfs: don't update mtime/ctime on deduped inodes
  btrfs: allow dedupe of same inode
  btrfs: fix deadlock with extent-same and readpage
  btrfs: pass unaligned length to btrfs_cmp_data()
  Btrfs: fix fsync after truncate when no_holes feature is enabled
  Btrfs: fix fsync xattr loss in the fast fsync path
  Btrfs: fix fsync data loss after append write
  Btrfs: fix crash on close_ctree() if cleaner starts new transaction
  Btrfs: fix race between caching kthread and returning inode to inode cache
  Btrfs: use kmem_cache_free when freeing entry in inode cache
  Btrfs: fix race between balance and unused block group deletion
  btrfs: add error handling for scrub_workers_get()
  btrfs: cleanup noused initialization of dev in btrfs_end_bio()
  btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-11 10:26:34 -07:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Shilong Wang
9689457b5b Btrfs: fix wrong check for btrfs_force_chunk_alloc()
btrfs_force_chunk_alloc() return 1 for allocation chunk successfully.
This problem exists since commit c87f08ca4.

With this patch, we might fix some enospc problems for balances.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:22 -07:00
Liu Bo
ddba1bfc23 Btrfs: fix warning of bytes_may_use
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().

Test generic/019 produces some disk failures so sumbit dio will get errors,
in which case, btrfs_direct_IO() goes to the error handling and free
bytes_may_use, but the problem is that bytes_may_use has been free'd
during get_block().

This adds a runtime flag to show if we've gone through get_block(), if so,
don't do the cleanup work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:21 -07:00
Liu Bo
ad9ee2053f Btrfs: fix hang when failing to submit bio of directIO
The hang is uncoverd by generic/019.

btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
an error, thus those added ordered extents will never get processed, which
block processes that waiting for them via btrfs_start_ordered_extent().

This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:20 -07:00
Filipe Manana
9c6429d96d Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
The comment was not correct about the part where it says the endio
callback of the bio might have not yet been called - update it
to mention that by that time the endio callback execution might
still be in progress only.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:19 -07:00
Filipe Manana
61de718fce Btrfs: fix memory corruption on failure to submit bio for direct IO
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:

[161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
[161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
[161779.860636] PGD 34d818067 PUD 0
[161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[161779.860636] Call Trace:
[161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
[161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
[161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
[161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
[161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
[161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
[161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
(...)

We were also not freeing the btrfs_dio_private we allocated previously,
which kmemleak reported with the following trace in its sysfs file:

unreferenced object 0xffff8803f553bf80 (size 96):
  comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
  hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81161ffe>] create_object+0x172/0x29a
    [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
    [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
    [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
    [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
    [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
    [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
    [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
    [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
    [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
    [<ffffffff81165da9>] vfs_write+0xa0/0xe4
    [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
    [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
    [<ffffffffffffffff>] 0xffffffffffffffff

For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.

So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:18 -07:00
Mark Fasheh
1c919a5e13 btrfs: don't update mtime/ctime on deduped inodes
One issue users have reported is that dedupe changes mtime on files,
resulting in tools like rsync thinking that their contents have changed when
in fact the data is exactly the same. We also skip the ctime update as no
user-visible metadata changes here and we want dedupe to be transparent to
the user.

Clone still wants time changes, so we special case this in the code.

This was tested with the btrfs-extent-same tool.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:17 -07:00
Mark Fasheh
0efa9f48c7 btrfs: allow dedupe of same inode
clone() supports cloning within an inode so extent-same can do
the same now. This patch fixes up the locking in extent-same to
know about the single-inode case. In addition to that, we add a
check for overlapping ranges, which clone does not allow.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:15 -07:00
Mark Fasheh
f441460202 btrfs: fix deadlock with extent-same and readpage
->readpage() does page_lock() before extent_lock(), we do the opposite in
extent-same. We want to reverse the order in btrfs_extent_same() but it's
not quite straightforward since the page locks are taken inside btrfs_cmp_data().

So I split btrfs_cmp_data() into 3 parts with a small context structure that
is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
pages needed (taking page lock as required) and puts them on our context
structure. At this point, we are safe to lock the extent range. Afterwards,
we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
to clean up our context.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:14 -07:00
Mark Fasheh
207910ddee btrfs: pass unaligned length to btrfs_cmp_data()
In the case that we dedupe the tail of a file, we might expand the dedupe
len out to the end of our last block. We don't want to compare data past
i_size however, so pass the original length to btrfs_cmp_data().

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:13 -07:00
Filipe Manana
a89ca6f24f Btrfs: fix fsync after truncate when no_holes feature is enabled
When we have the no_holes feature enabled, if a we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
Which means if the fsync log is replayed, the file will remain with the
state it had before both truncate operations.

Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.

Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.

The issue is easy to reproduce with the following test case for fstests:

  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey

  # This test was motivated by an issue found in btrfs when the btrfs
  # no-holes feature is enabled (introduced in kernel 3.14). So enable
  # the feature if the fs being tested is btrfs.
  if [ $FSTYP == "btrfs" ]; then
      _require_btrfs_fs_feature "no_holes"
      _require_btrfs_mkfs_feature "no-holes"
      MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
  fi

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test files and make sure everything is durably persisted.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K"         \
                  -c "pwrite -S 0xbb 64K 61K"       \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K"         \
                  -c "pwrite -S 0xff 64K 61K"       \
                  $SCRATCH_MNT/bar | _filter_xfs_io
  sync

  # Now truncate our file foo to a smaller size (64Kb) and then truncate
  # it to the size it had before the shrinking truncate (125Kb). Then
  # fsync our file. If a power failure happens after the fsync, we expect
  # our file to have a size of 125Kb, with the first 64Kb of data having
  # the value 0xaa and the second 61Kb of data having the value 0x00.
  $XFS_IO_PROG -c "truncate 64K" \
               -c "truncate 125K" \
               -c "fsync" \
               $SCRATCH_MNT/foo

  # Do something similar to our file bar, but the first truncation sets
  # the file size to 0 and the second truncation expands the size to the
  # double of what it was initially.
  $XFS_IO_PROG -c "truncate 0" \
               -c "truncate 253K" \
               -c "fsync" \
               $SCRATCH_MNT/bar

  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # We expect foo to have a size of 125Kb, the first 64Kb of data all
  # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
  # with value 0x00).
  echo "File foo content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  # We expect bar to have a size of 253Kb and no extents (any byte read
  # from bar has the value 0x00).
  echo "File bar content after log replay:"
  od -t x1 $SCRATCH_MNT/bar

  status=0
  exit

The expected file contents in the golden output are:

  File foo content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0372000
  File bar content after log replay:
  0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0772000

Without this fix, their contents are:

  File foo content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0372000
  File bar content after log replay:
  0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
  *
  0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0772000

A test case submission for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:12 -07:00
Linus Torvalds
043cd04950 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "Outside of our usual batch of fixes, this integrates the subvolume
  quota updates that Qu Wenruo from Fujitsu has been working on for a
  few releases now.  He gets an extra gold star for making btrfs smaller
  this time, and fixing a number of quota corners in the process.

  Dave Sterba tested and integrated Anand Jain's sysfs improvements.
  Outside of exporting a symbol (ack'd by Greg) these are all internal
  to btrfs and it's mostly cleanups and fixes.  Anand also attached some
  of our sysfs objects to our internal device management structs instead
  of an object off the super block.  It will make device management
  easier overall and it's a better fit for how the sysfs files are used.
  None of the existing sysfs files are moved around.

  Thanks for all the fixes everyone"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
  btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
  Btrfs: Check if kobject is initialized before put
  lib: export symbol kobject_move()
  Btrfs: sysfs: add support to show replacing target in the sysfs
  Btrfs: free the stale device
  Btrfs: use received_uuid of parent during send
  Btrfs: fix use-after-free in btrfs_replay_log
  btrfs: wait for delayed iputs on no space
  btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
  btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
  btrfs: ulist: Add ulist_del() function.
  btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
  btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
  btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
  btrfs: qgroup: Switch rescan to new mechanism.
  btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
  btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
  btrfs: qgroup: Add new function to record old_roots.
  btrfs: qgroup: Record possible quota-related extent for qgroup.
  btrfs: qgroup: Add function qgroup_update_counters().
  ...
2015-06-30 20:07:45 -07:00
Filipe Manana
36283bf777 Btrfs: fix fsync xattr loss in the fast fsync path
After commit 4f764e5153 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.

This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree and therefore at log replay time, which makes the replay code delete
the xattr from the fs/subvol tree as it thinks that xattr was deleted
prior to the last fsync.

Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.

This issue is reproducible with the following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey
  . ./common/attr

  # real QA test starts here

  # We create a lot of xattrs for a single file. Only btrfs and xfs are currently
  # able to store such a large mount of xattrs per file, other filesystems such
  # as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
  # less than 1000 xattrs with very small values.
  _supported_fs btrfs xfs
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_attrs
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and make sure everything is
  # durably persisted.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add many small xattrs to our file.
  # We create such a large amount because it's needed to trigger the issue found
  # in btrfs - we need to have an amount that causes the fs to have at least 3
  # btree leafs with xattrs stored in them, and it must work on any leaf size
  # (maximum leaf/node size is 64Kb).
  num_xattrs=2000
  for ((i = 1; i <= $num_xattrs; i++)); do
      name="user.attr_$(printf "%04d" $i)"
      $SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
  done

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now update our file's data and fsync the file.
  # After a successful fsync, if the fsync log/journal is replayed we expect to
  # see all the xattrs we added before with the same values (and the updated file
  # data of course). Btrfs used to delete some of these xattrs when it replayed
  # its fsync log/journal.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount. This makes the fs replay its fsync log.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  echo "File xattrs after crash and log replay:"
  for ((i = 1; i <= $num_xattrs; i++)); do
      name="user.attr_$(printf "%04d" $i)"
      echo -n "$name="
      $GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
      echo
  done

  status=0
  exit

The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:47 -07:00
Filipe Manana
e4545de5b0 Btrfs: fix fsync data loss after append write
If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.

This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.

Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).

This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
          _cleanup_flakey
          rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
          # Simulate a crash/power loss.
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
          # Allow writes again and mount. This makes the fs replay its fsync log.
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and then fsync it.
  # The fsync here is only needed to trigger the issue in btrfs, as it causes the
  # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                  -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a hard link to our file.
  # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
  # which is a necessary condition to trigger the issue.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now append more data to our file, increasing its size, and fsync the file.
  # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
  # write path did not update the inode item in the btree nor the delayed inode
  # item (in memory struture) in the current transaction (created by the fsync
  # handler), the fsync did not record the inode's new i_size in the fsync
  # log/journal. This made the data unavailable after the fsync log/journal is
  # replayed.
  $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  echo "File content after fsync and before crash:"
  od -t x1 $SCRATCH_MNT/foo

  _crash_and_mount

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected file output before and after the crash/power failure expects the
appended data to be available, which is:

  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0200000

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:47 -07:00
Filipe Manana
da288d280d Btrfs: fix crash on close_ctree() if cleaner starts new transaction
Often when running fstests btrfs/079 I was running into the following
trace during umount on one of my qemu/kvm test vms:

[ 8245.682441] WARNING: CPU: 8 PID: 25064 at fs/btrfs/extent-tree.c:138 btrfs_put_block_group+0x51/0x69 [btrfs]()
[ 8245.685039] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
[ 8245.693860] CPU: 8 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[ 8245.695081] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[ 8245.697583]  0000000000000009 ffff88020d047ce8 ffffffff8145eec7 ffffffff81095dce
[ 8245.699234]  0000000000000000 ffff88020d047d28 ffffffff8104b399 0000000000000028
[ 8245.700995]  ffffffffa04db07b ffff8801c6036c00 ffff8801c6036d68 ffff880202eb40b0
[ 8245.702510] Call Trace:
[ 8245.703006]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[ 8245.705393]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[ 8245.706569]  [<ffffffff8104b399>] warn_slowpath_common+0xa1/0xbb
[ 8245.707747]  [<ffffffffa04db07b>] ? btrfs_put_block_group+0x51/0x69 [btrfs]
[ 8245.709101]  [<ffffffff8104b456>] warn_slowpath_null+0x1a/0x1c
[ 8245.710274]  [<ffffffffa04db07b>] btrfs_put_block_group+0x51/0x69 [btrfs]
[ 8245.711823]  [<ffffffffa04e3473>] btrfs_free_block_groups+0x145/0x322 [btrfs]
[ 8245.713251]  [<ffffffffa04ef31a>] close_ctree+0x1ef/0x325 [btrfs]
[ 8245.714448]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
[ 8245.715539]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
[ 8245.716835]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
[ 8245.718015]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
[ 8245.719101]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 8245.720316]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
[ 8245.721517]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
[ 8245.722581]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
[ 8245.723538]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
[ 8245.724572]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
[ 8245.725598]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
[ 8245.726892]  [<ffffffff814651ac>] int_signal+0x12/0x17
[ 8245.737887] ---[ end trace a01d038397e99b92 ]---
[ 8245.769363] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 8245.770737] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
[ 8245.772641] CPU: 2 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[ 8245.772641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[ 8245.772641] task: ffff880013005810 ti: ffff88020d044000 task.ti: ffff88020d044000
[ 8245.772641] RIP: 0010:[<ffffffffa051c8e6>]  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
[ 8245.772641] RSP: 0018:ffff88020d0478b8  EFLAGS: 00010202
[ 8245.772641] RAX: 0000000000000004 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffa0581488
[ 8245.772641] RDX: 0000000000000000 RSI: ffff880194b7bf48 RDI: ffff880144b6a7a0
[ 8245.772641] RBP: ffff88020d0478d8 R08: 0000000000000000 R09: 000000000000ffff
[ 8245.772641] R10: 0000000000000004 R11: 0000000000000005 R12: ffff880194b7bf48
[ 8245.772641] R13: ffff880194b7bf48 R14: 0000000000000410 R15: 0000000000000000
[ 8245.772641] FS:  00007f991e77d840(0000) GS:ffff88023e280000(0000) knlGS:0000000000000000
[ 8245.772641] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 8245.772641] CR2: 00007fbbd325ee68 CR3: 000000021de8e000 CR4: 00000000000006e0
[ 8245.772641] Stack:
[ 8245.772641]  ffff880194b7bf00 ffff880202eb4000 ffff880194b7bf48 0000000000000410
[ 8245.772641]  ffff88020d047958 ffffffffa04ec6d5 ffff8801629b2ee8 0000000082987570
[ 8245.772641]  0000000000a5813f 0000000000000001 ffff880013006100 0000000000000002
[ 8245.772641] Call Trace:
[ 8245.772641]  [<ffffffffa04ec6d5>] btrfs_wq_submit_bio+0xe1/0x17b [btrfs]
[ 8245.772641]  [<ffffffff81086bff>] ? check_irq_usage+0x76/0x87
[ 8245.772641]  [<ffffffffa04ec825>] btree_submit_bio_hook+0xb6/0xd9 [btrfs]
[ 8245.772641]  [<ffffffffa04ebb7c>] ? btree_csum_one_bio+0xad/0xad [btrfs]
[ 8245.772641]  [<ffffffffa04eb1a6>] ? btree_io_failed_hook+0x5e/0x5e [btrfs]
[ 8245.772641]  [<ffffffffa050a6e7>] submit_one_bio+0x8c/0xc7 [btrfs]
[ 8245.772641]  [<ffffffffa050d75b>] submit_extent_page.isra.18+0x9d/0x186 [btrfs]
[ 8245.772641]  [<ffffffffa050d95b>] write_one_eb+0x117/0x1ae [btrfs]
[ 8245.772641]  [<ffffffffa050a79b>] ? end_extent_buffer_writeback+0x21/0x21 [btrfs]
[ 8245.772641]  [<ffffffffa0510510>] btree_write_cache_pages+0x2ab/0x385 [btrfs]
[ 8245.772641]  [<ffffffffa04eb2b8>] btree_writepages+0x23/0x5c [btrfs]
[ 8245.772641]  [<ffffffff8111c661>] do_writepages+0x23/0x2c
[ 8245.772641]  [<ffffffff81189cd4>] __writeback_single_inode+0xda/0x5bd
[ 8245.772641]  [<ffffffff8118aa60>] ? writeback_single_inode+0x2b/0x173
[ 8245.772641]  [<ffffffff8118aafd>] writeback_single_inode+0xc8/0x173
[ 8245.772641]  [<ffffffff8118ac95>] write_inode_now+0x8a/0x95
[ 8245.772641]  [<ffffffff81247bf0>] ? _atomic_dec_and_lock+0x30/0x4e
[ 8245.772641]  [<ffffffff8117cc5e>] iput+0x17d/0x26a
[ 8245.772641]  [<ffffffffa04ef355>] close_ctree+0x22a/0x325 [btrfs]
[ 8245.772641]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
[ 8245.772641]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
[ 8245.772641]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
[ 8245.772641]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
[ 8245.772641]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 8245.772641]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
[ 8245.772641]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
[ 8245.772641]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
[ 8245.772641]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
[ 8245.772641]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
[ 8245.772641]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
[ 8245.772641]  [<ffffffff814651ac>] int_signal+0x12/0x17
[ 8245.772641] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 5c ff 74 04 f0 ff 43 50 49 83 7c 24 08 00 74 2c 4c 8d 6b
[ 8245.772641] RIP  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
[ 8245.772641]  RSP <ffff88020d0478b8>
[ 8245.845040] ---[ end trace a01d038397e99b93 ]---

For logical reasons such as the phase of the moon, this happened more
often with "-o inode_cache" than without any mount options.

After some debugging it turned out to be simple to understand what was
happening:

1) close_ctree() is called;

2) It then stops the transaction kthread, which commits the current
   transaction;

3) It asks the cleaner kthread to stop, which is currently running
   btrfs_delete_unused_bgs();

4) btrfs_delete_unused_bgs() finds an unused block group, starts a new
   transaction, deletes the block group, which implies COWing some
   tree nodes and leafs and dirtying their respective pages, and then
   finally it ends the transaction it started, without committing it;

5) The cleaner kthread stops;

6) close_ctree() releases (from memory) the block group objects, which
   produces the warning in the trace pasted above;

7) Then it invalidates all pages of the btree inode, by calling
   invalidate_inode_pages2(), which waits for any pages under writeback,
   and releases any non-dirty pages;

8) All work queues are destroyed (waiting first for their current tasks
   to finish execution);

9) A final iput() is called against the btree inode;

10) This iput triggers a writeback of the btree inode because it still
    has dirty pages;

11) This starts the whole chain of callbacks for the btree inode until
    it eventually reaches btrfs_wq_submit_bio() where it leads to a
    NULL pointer dereference because the work queues were already
    destroyed.

Fix this by making the cleaner commit any transaction that it started
after the transaction kthread was stopped.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Filipe Manana
ae9d8f1711 Btrfs: fix race between caching kthread and returning inode to inode cache
While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.

This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.

This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:

[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
[132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
[132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
[132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
[132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
[132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
[132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
[132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!

Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.

Fixes: 1c70d8fb4d ("Btrfs: fix inode caching vs tree log")
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Filipe Manana
c3f4a1685b Btrfs: use kmem_cache_free when freeing entry in inode cache
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:

  /*
   * (...)
   *
   * Don't free memory not originally allocated by kmalloc()
   * or you will run into trouble.
   */

So better be safe and use kmem_cache_free().

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Filipe Manana
67c5e7d464 Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:

[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)

The sequence of steps leading to this are:

           CPU 0                                         CPU 1

  btrfs_balance()
    btrfs_relocate_chunk()

      btrfs_relocate_block_group(bg X)
        btrfs_lookup_block_group(bg X)

                                               cleaner_kthread
                                                  locks fs_info->cleaner_mutex

                                                  btrfs_delete_unused_bgs()
                                                    finds bg X, which became
                                                    unused in the previous
                                                    transaction

                                                    checks bg X ->ro == 0,
                                                    so it proceeds
        sets bg X ->ro to 1
        (btrfs_set_block_group_ro(bg X))

        blocks on fs_info->cleaner_mutex
                                                    btrfs_remove_chunk(bg X)
                                                  unlocks fs_info->cleaner_mutex

        acquires fs_info->cleaner_mutex
        relocate_block_group()
          --> does nothing, no extents found in
              the extent tree from bg X
        unlocks fs_info->cleaner_mutex

      btrfs_relocate_block_group(bg X) returns

    btrfs_remove_chunk(bg X)
       extent map not found
          --> ASSERT(0)

Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.

This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:

    while true; do btrfs balance start -dusage=0 <mountpoint> ; done

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Zhao Lei
e82afc52ab btrfs: add error handling for scrub_workers_get()
Although it is a rare case, we'd better free previous allocated
memory on error.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:03 -07:00
Zhao Lei
65f5333875 btrfs: cleanup noused initialization of dev in btrfs_end_bio()
It is introduced by:
 c404e0dc2c
 Btrfs: fix use-after-free in the finishing procedure of the device replace

But seems no relationship with that bug, this patch revirt these
code block for cleanup.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:02 -07:00
Yang Dongsheng
fe7599079b btrfs: qgroup: allow user to clear the limitation on qgroup
Currently, we can only set a limitation on a qgroup, but we
can not clear it.

This patch provide a choice to user to clear a limitation on
qgroup by passing a value of CLEAR_VALUE(-1) to kernel.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:00 -07:00
Linus Torvalds
bfffa1cc9d Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
 "Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...
2015-06-25 14:29:53 -07:00
Dan Carpenter
5a5003df98 btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
There is a cut and paste error so instead of freeing "head_ref", we free
"ref" twice.

Fixes: 3368d001ba ('btrfs: qgroup: Record possible quota-related extent for qgroup.')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-24 12:28:03 -07:00
Jan Kara
5fa8e0a1c6 fs: Rename file_remove_suid() to file_remove_privs()
file_remove_suid() is a misnomer since it removes also file capabilities
stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells
something else than whether file_remove_suid() call is necessary which
leads to bugs.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:01:08 -04:00
Chris Mason
c40b7b064f Merge branch 'sysfs-fsdevices-4.2-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into anand 2015-06-23 05:34:39 -07:00
Anand Jain
f90fc54728 Btrfs: Check if kobject is initialized before put
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-06-22 14:43:31 +02:00
Anand Jain
d2ff1b2008 Btrfs: sysfs: add support to show replacing target in the sysfs
This patch will add support to show the replacing target in sysfs
during the process of replacement.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-06-19 14:03:54 +02:00
Anand Jain
4fde46f0cc Btrfs: free the stale device
When btrfs on a device is overwritten with a new btrfs (mkfs),
the old btrfs instance in the kernel becomes stale. So with this
patch, if kernel finds device is overwritten then delete the stale
fsid/uuid.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
2015-06-19 14:03:13 +02:00
Josef Bacik
37b8d27de5 Btrfs: use received_uuid of parent during send
Neil Horman pointed out a problem where if he did something like this

receive A
snap A B
change B
send -p A B

and then on another box do

recieve A
receive B

the receive B would fail because we use the UUID of A for the clone sources for
B.  This makes sense most of the time because normally you are sending from the
original sources, not a received source.  However when you use a recieved subvol
its UUID is going to be something completely different, so if you then try to
receive the diff on a different volume it won't find the UUID because the new A
will be something else.  The only constant is the received uuid.  So instead
check to see if we have received_uuid set on the root, and if so use that as the
clone source, as btrfs receive looks for matches either in received_uuid or
uuid.  Thanks,

Reported-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-12 13:20:38 -07:00
Liu Bo
0eeff2362b Btrfs: fix use-after-free in btrfs_replay_log
@log_root_tree should not be referenced after kfree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-12 11:03:21 -07:00
Zhao Lei
9a4e7276d3 btrfs: wait for delayed iputs on no space
btrfs will report no_space when we run following write and delete
file loop:
 # FILE_SIZE_M=[ 75% of fs space ]
 # DEV=[ some dev ]
 # MNT=[ some dir ]
 #
 # mkfs.btrfs -f "$DEV"
 # mount -o nodatacow "$DEV" "$MNT"
 # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
 #

Reason:
 iput() and evict() is run after write pages to block device, if
 write pages work is not finished before next write, the "rm"ed space
 is not freed, and caused above bug.

Fix:
 We can add "-o flushoncommit" mount option to avoid above bug, but
 it have performance problem. Actually, we can to wait for on-the-fly
 writes only when no-space happened, it is which this patch do.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:34 -07:00
Qu Wenruo
d672633545 btrfs: qgroup: Make snapshot accounting work with new extent-oriented
qgroup.

Make snapshot accounting work with new extent-oriented mechanism by
skipping given root in new/old_roots in create_pending_snapshot().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:29 -07:00
Qu Wenruo
9086db86e0 btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
This is used by later qgroup fix patches for snapshot.

As current snapshot accounting is done by btrfs_qgroup_inherit(), but
new extent oriented quota mechanism will account extent from
btrfs_copy_root() and other snapshot things, causing wrong result.

So add this ability to handle snapshot accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:23 -07:00
Qu Wenruo
d4b8040459 btrfs: ulist: Add ulist_del() function.
This function will delete unode with given (val,aux) pair.
And with this patch, seqnum for debug usage doesn't have any meaning
now, so remove them.

This is used by later patches to skip snapshot root.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:17 -07:00
Qu Wenruo
e69bcee376 btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
Goodbye, the old mechanisim.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:11 -07:00
Qu Wenruo
442244c963 btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
Since the self test transaction don't have delayed_ref_roots, so use
find_all_roots() and export btrfs_qgroup_account_extent() to simulate it

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:05 -07:00
Qu Wenruo
0ed4792af0 btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
Switch from old ref_node based qgroup to extent based qgroup mechanism
for normal operations.

The new mechanism should hugely reduce the overhead of btrfs quota
system, and further more, the codes and logic should be more clean and
easier to maintain.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:59 -07:00
Qu Wenruo
9d220c95f5 btrfs: qgroup: Switch rescan to new mechanism.
Switch rescan to use the new new extent oriented mechanism.

As rescan is also based on extent, new mechanism is just a perfect match
for rescan.

With re-designed internal functions, rescan is quite easy, just call
btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:54 -07:00
Qu Wenruo
550d7a2ed5 btrfs: qgroup: Add new qgroup calculation function
btrfs_qgroup_account_extents().

The new btrfs_qgroup_account_extents() function should be called in
btrfs_commit_transaction() and it will update all the qgroup according
to delayed_ref_root->dirty_extent_root.

The new function can handle both normal operation during
commit_transaction() or in rescan in a unified method with clearer
logic.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:49 -07:00
Qu Wenruo
21633fc603 btrfs: backref: Add special time_seq == (u64)-1 case for
btrfs_find_all_roots().

Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree
lock to do tree search.

This is important for later qgroup implement which will call
find_all_roots() after fs trees are committed.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:45 -07:00
Qu Wenruo
3b7d00f99c btrfs: qgroup: Add new function to record old_roots.
Add function btrfs_qgroup_prepare_account_extents() to get old_roots
which are needed for qgroup.

We do it in commit_transaction() and before switch_roots(), and only
search commit_root, so it gives a quite accurate view for previous
transaction.

With old_roots from previous transaction, we can use it to do accurate
account with current transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:39 -07:00
Qu Wenruo
3368d001ba btrfs: qgroup: Record possible quota-related extent for qgroup.
Add hook in add_delayed_ref_head() to record quota-related extent record
into delayed_ref_root->dirty_extent_record rb-tree for later qgroup
accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:32 -07:00
Qu Wenruo
823ae5b8e3 btrfs: qgroup: Add function qgroup_update_counters().
Add function qgroup_update_counters(), which will update related
qgroups' rfer/excl according to old/new_roots.

This is one of the two core functions for the new qgroup implement.

This is based on btrfs_adjust_coutners() but with clearer logic and
comment.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:28 -07:00
Qu Wenruo
d810ef2be5 btrfs: qgroup: Add function qgroup_update_refcnt().
This function is used to update refcnt for qgroups.
And is one of the two core functions used in the new qgroup implement.

This is based on the old update_old/new_refcnt, but provides a unified
logic and behavior.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:24 -07:00
Qu Wenruo
c682f9b3c2 btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent()
__btrfs_inc_extent_ref() and __btrfs_free_extent() have already had too
many parameters, but three of them can be extracted from
btrfs_delayed_ref_node struct.

So use btrfs_delayed_ref_node struct as a single parameter to replace
the bytenr/num_byte/no_quota parameters.

The real objective of this patch is to allow btrfs_qgroup_record_ref()
get the delayed_ref_node in incoming qgroup patches.

Other functions calling btrfs_qgroup_record_ref() are not affected since
the rest will only add/sub exclusive extents, where node is not used.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:18 -07:00
Qu Wenruo
9c542136fd btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read.
Use inline functions to do such things, to improve readability.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:13 -07:00
Qu Wenruo
c43d160fcd btrfs: delayed-ref: Cleanup the unneeded functions.
Cleanup the rb_tree merge/insert/update functions, since now we use list
instead of rb_tree now.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:09 -07:00
Qu Wenruo
c6fc245499 btrfs: delayed-ref: Use list to replace the ref_root in ref_head.
This patch replace the rbtree used in ref_head to list.
This has the following advantage:
1) Easier merge logic.
With the new list implement, we only need to care merging the tail
ref_node with the new ref_node.
And this can be done quite easy at insert time, no need to do a
indicated merge at run_delayed_refs().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:03 -07:00
Qu Wenruo
00db646d3f btrfs: backref: Don't merge refs which are not for same block.
Old __merge_refs() in backref.c will even merge refs whose root_id are
different, which makes qgroup gives wrong result.

Fix it by checking ref_for_same_block() before any mode specific works.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:24:59 -07:00
Zhao Lei
20b2e3029e btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx()
lockdep report following warning in test:
 [25176.843958] =================================
 [25176.844519] [ INFO: inconsistent lock state ]
 [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
 [25176.845591] ---------------------------------
 [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
 [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
 [25176.847838] {SOFTIRQ-ON-W} state was registered at:
 [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
 [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
 [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
 [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
 [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
 [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
 [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
 [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
 [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
 [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
 [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
 [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
 [25176.854935] irq event stamp: 51506
 [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
 [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
 [25176.857746]
 other info that might help us debug this:
 [25176.858845]  Possible unsafe locking scenario:
 [25176.859981]        CPU0
 [25176.860537]        ----
 [25176.861059]   lock(&wr_ctx->wr_lock);
 [25176.861705]   <Interrupt>
 [25176.862272]     lock(&wr_ctx->wr_lock);
 [25176.862881]
  *** DEADLOCK ***

Reason:
 Above warning is caused by:
 Interrupt
 -> bio_endio()
 -> ...
 -> scrub_put_ctx()
 -> scrub_free_ctx() *1
 -> ...
 -> mutex_lock(&wr_ctx->wr_lock);

 scrub_put_ctx() is allowed to be called in end_bio interrupt, but
 in code design, it will never call scrub_free_ctx(sctx) in interrupe
 context(above *1), because btrfs_scrub_dev() get one additional
 reference of sctx->refs, which makes scrub_free_ctx() only called
 withine btrfs_scrub_dev().

 Now the code runs out of our wish, because free sequence in
 scrub_pending_bio_dec() have a gap.

 Current code:
 -----------------------------------+-----------------------------------
 scrub_pending_bio_dec()            |  btrfs_scrub_dev
 -----------------------------------+-----------------------------------
 atomic_dec(&sctx->bios_in_flight); |
 wake_up(&sctx->list_wait);         |
                                    | scrub_put_ctx()
                                    | -> atomic_dec_and_test(&sctx->refs)
 scrub_put_ctx(sctx);               |
 -> atomic_dec_and_test(&sctx->refs)|
 -> scrub_free_ctx()                |
 -----------------------------------+-----------------------------------

 We expected:
 -----------------------------------+-----------------------------------
 scrub_pending_bio_dec()            |  btrfs_scrub_dev
 -----------------------------------+-----------------------------------
 atomic_dec(&sctx->bios_in_flight); |
 wake_up(&sctx->list_wait);         |
 scrub_put_ctx(sctx);               |
 -> atomic_dec_and_test(&sctx->refs)|
                                    | scrub_put_ctx()
                                    | -> atomic_dec_and_test(&sctx->refs)
                                    | -> scrub_free_ctx()
 -----------------------------------+-----------------------------------

Fix:
 Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
 in interrupt context.
 Tested by check tracelog in debug.

Changelog v1->v2:
 Use workqueue instead of adjust function call sequence in v1,
 because v1 will introduce a bug pointed out by:
 Filipe David Manana <fdmanana@gmail.com>

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:04:52 -07:00
Mark Fasheh
e1d227a42e btrfs: Handle unaligned length in extent_same
The extent-same code rejects requests with an unaligned length. This
poses a problem when we want to dedupe the tail extent of files as we
skip cloning the portion between i_size and the extent boundary.

If we don't clone the entire extent, it won't be deleted. So the
combination of these behaviors winds up giving us worst-case dedupe on
many files.

We can fix this by allowing a length that extents to i_size and
internally aligining those to the end of the block. This is what
btrfs_ioctl_clone() so we can just copy that check over.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:50 -07:00
chandan
070034bdf9 Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag.
max_to_defrag represents the number of pages to defrag rather than the last
page of the file range to be defragged.

Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the
defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag
should actually have the value 4. Instead in the current code we end up
setting it to 6.

Now, this does not (yet) cause an issue since the first part of the while loop
condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control
to flow out of the while loop before any buggy behavior is actually caused. So
the patch just makes sure that max_to_defrag ends up having the right value
rather than fixing a bug. I did run the xfstests suite to make sure that the
code does not regress.

Changelog: v1->v2:
Provide a much descriptive commit message.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:48 -07:00
chandan
e4826a5b24 Btrfs: btrfs_defrag_file: Fix ra_index computation.
Read-ahead is done for the pages in the range [ra_index, ra_index + cluster -
1]. So the next read-ahead should be starting from the page at index 'ra_index
+ cluster' (unless we deemed that the extent at 'ra_index + cluster' as
non-defraggable) rather than from the page at index 'ra_index +
max_cluster'. This patch fixes this. I did run the xfstests suite to make sure
that the code does not regress.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:47 -07:00
Filipe Manana
4617ea3a52 Btrfs: fix necessary chunk tree space calculation when allocating a chunk
When allocating a new chunk or removing one we need to update num_devs
device items and insert or remove a chunk item in the chunk tree, so
in the worst case the space needed in the chunk space_info is:

  btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
     btrfs_calc_trans_metadata_size(chunk_root, 1)

That is, in the worst case we need to cow num_devs paths and cow 1 other
path that can result in splitting every node and leaf, and each path
consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
which were unnecessary since updating the existing device items does
not result in splitting the nodes and leaf since after updating them
they remain with the same size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:46 -07:00
Filipe Manana
7558c8bc17 Btrfs: don't attach unnecessary extents to transaction on fsync
We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delaying the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:44 -07:00
Filipe Manana
b659ef0277 Btrfs: avoid syncing log in the fast fsync path when not necessary
Commit 3a8b36f378 ("Btrfs: fix data loss in the fast fsync path") added
a performance regression for that causes an unnecessary sync of the log
trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
against a file, without no writes or any metadata updates to the inode in
between them and if a transaction is committed before the second fsync is
called.

Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
after a test sysbench test that measured a -62% decrease of file io
requests per second for that tests' workload.

The test is:

  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
  mkfs -t btrfs /dev/sda2
  mount -t btrfs /dev/sda2 /fs/sda2
  cd /fs/sda2
  for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
  sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
    --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
    --file-num=1024 run

A test on kvm guest, running a debug kernel gave me the following results:

Without 3a8b36f378:             16.01 reqs/sec
With 3a8b36f378:                 3.39 reqs/sec
With 3a8b36f378 and this patch: 16.04 reqs/sec

Reported-by: Huang Ying <ying.huang@intel.com>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:43 -07:00
Chris Mason
1ab818b137 Merge branch 'send_fixes_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2 2015-06-10 07:02:41 -07:00
Filipe Manana
6ca0709756 Btrfs: fix hang during inode eviction due to concurrent readahead
Zygo Blaxell and other users have reported occasional hangs while an
inode is being evicted, leading to traces like the following:

[ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
[ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
[ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
[ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
[ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
[ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
[ 5281.981396] Call Trace:
[ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
[ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
[ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
[ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
[ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
[ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
[ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
[ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
[ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
[ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 5282.001299] 1 lock held by rm/20488:
[ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b

This happens when we have readahead, which calls readpages(), happening
right before the inode eviction handler is invoked. So the reason is
essentially:

1) readpages() is called while a reference on the inode is held, so
   eviction can not be triggered before readpages() returns. It also
   locks one or more ranges in the inode's io_tree (which is done at
   extent_io.c:__do_contiguous_readpages());

2) readpages() submits several read bios, all with an end io callback
   that runs extent_io.c:end_bio_extent_readpage() and that is executed
   by other task when a bio finishes, corresponding to a work queue
   (fs_info->end_io_workers) worker kthread. This callback unlocks
   the ranges in the inode's io_tree that were previously locked in
   step 1;

3) readpages() returns, the reference on the inode is dropped;

4) One or more of the read bios previously submitted are still not
   complete (their end io callback was not yet invoked or has not
   yet finished execution);

5) Inode eviction is triggered (through an unlink call for example).
   The inode reference count was not incremented before submitting
   the read bios, therefore this is possible;

6) The eviction handler starts executing and enters the loop that
   iterates over all extent states in the inode's io_tree;

7) The loop picks one extent state record and uses its ->start and
   ->end fields, after releasing the inode's io_tree spinlock, to
   call lock_extent_bits() and clear_extent_bit(). The call to lock
   the range [state->start, state->end] blocks because the whole
   range or a part of it was locked by the previous call to
   readpages() and the corresponding end io callback, which unlocks
   the range was not yet executed;

8) The end io callback for the read bio is executed and unlocks the
   range [state->start, state->end] (or a superset of that range).
   And at clear_extent_bit() the extent_state record state is used
   as a second argument to split_state(), which sets state->start to
   a larger value;

9) The task executing the eviction handler is woken up by the task
   executing the bio's end io callback (through clear_state_bit) and
   the eviction handler locks the range
   [old value for state->start, state->end]. Shortly after, when
   calling clear_extent_bit(), it unlocks the range
   [new value for state->start, state->end], so it ends up unlocking
   only part of the range that it locked, leaving an extent state
   record in the io_tree that represents the unlocked subrange;

10) The eviction handler loop, in its next iteration, gets the
    extent_state record for the subrange that it did not unlock in the
    previous step and then tries to lock it, resulting in an hang.

So fix this by not using the ->start and ->end fields of an existing
extent_state record. This is a simple solution, and an alternative
could be to bump the inode's reference count before submitting each
read bio and having it dropped in the bio's end io callback. But that
would be a more invasive/complex change and would not protect against
other possible places that are not holding a reference on the inode
as well. Something to consider in the future.

Many thanks to Zygo Blaxell for reporting, in the mailing list, the
issue, a set of scripts to trigger it and testing this fix.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:09 -07:00
Liu Bo
64c043de46 Btrfs: fix up read_tree_block to return proper error
The return value of read_tree_block() can confuse callers as it always
returns NULL for either -ENOMEM or -EIO, so it's likely that callers
parse it to a wrong error, for instance, in btrfs_read_tree_root().

This fixes the above issue.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:08 -07:00
Liu Bo
8635eda91e Btrfs: add missing free_extent_buffer
read_tree_block may take a reference on the 'eb', a following
free_extent_buffer is necessary.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:07 -07:00
Liu Bo
0c304304fe Btrfs: remove csum_bytes_left
After commit 8407f55326
("Btrfs: fix data corruption after fast fsync and writeback error"),
during wait_ordered_extents(), we wait for ordered extent setting
BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
already got checksum information, so we don't need to check
(csum_bytes_left == 0) in the whole logging path.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:06 -07:00
Filipe Manana
39c2d7facc Btrfs: fix -ENOSPC on block group removal
Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.

While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:

[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left

So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error our if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:05 -07:00
Filipe Manana
4fbcdf6694 Btrfs: fix -ENOSPC when finishing block group creation
While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:

[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---

This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.

So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.

Fix this by doing proper reservation in the chunk block reserve.

The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:04 -07:00
Josef Bacik
0d2b2372e0 Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap
We should be doing this, it's weird we hadn't been doing this.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:03 -07:00
Omar Sandoval
c8d3fe028f Btrfs: show subvol= and subvolid= in /proc/mounts
Now that we're guaranteed to have a meaningful root dentry, we can just
export seq_dentry() and use it in btrfs_show_options(). The subvolume ID
is easy to get and can also be useful, so put that in there, too.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:02 -07:00
Omar Sandoval
05dbe6837b Btrfs: unify subvol= and subvolid= mounting
Currently, mounting a subvolume with subvolid= takes a different code
path than mounting with subvol=. This isn't really a big deal except for
the fact that mounts done with subvolid= or the default subvolume don't
have a dentry that's connected to the dentry tree like in the subvol=
case. To unify the code paths, when given subvolid= or using the default
subvolume ID, translate it into a subvolume name by walking
ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:01 -07:00
Omar Sandoval
bb289b7be6 Btrfs: fail on mismatched subvol and subvolid mount options
There's nothing to stop a user from passing both subvol= and subvolid=
to mount, but if they don't refer to the same subvolume, someone is
going to be surprised at some point. Error out on this case, but allow
users to pass in both if they do match (which they could, for example,
get out of /proc/mounts).

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:00 -07:00
Omar Sandoval
fa33065950 Btrfs: clean up error handling in mount_subvol()
In preparation for new functionality in mount_subvol(), give it
ownership of subvol_name and tidy up the error paths.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:59 -07:00
Omar Sandoval
e6e4dbe894 Btrfs: remove all subvol options before mounting top-level
Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'.
But, this means that if the user passes both a subvol and subvolid for
some reason, we won't actually mount the top-level when we recursively
mount. For example, consider:

mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
btrfs subvol create /mnt/subvol1 # subvolid=257
btrfs subvol create /mnt/subvol2 # subvolid=258
umount /mnt
mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt

In the final mount, subvol=/subvol1,subvolid=258 becomes
subvolid=0,subvolid=258, and the last option takes precedence, so we
mount subvol2 and try to look up subvol1 inside of it, which fails.

So, instead, do a thorough scan through the argument list and remove any
subvol= and subvolid= options, then append subvolid=0 to the end. This
implicitly makes subvol= take precedence over subvolid=, but we're about
to add a stricter check for that. This also makes setup_root_args() more
generic, which we'll need soon.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:58 -07:00
Omar Sandoval
773cd04ec1 Btrfs: lock superblock before remounting for rw subvol
Since commit 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options"), when mounting a subvolume read/write when
another subvolume has previously been mounted read-only, we first do a
remount. However, this should be done with the superblock locked, as per
sync_filesystem():

	/*
	 * We need to be protected against the filesystem going from
	 * r/o to r/w or vice versa.
	 */
	WARN_ON(!rwsem_is_locked(&sb->s_umount));

This WARN_ON can easily be hit with:

mkfs.btrfs -f /dev/vdb
mount /dev/vdb /mnt
btrfs subvol create /mnt/vol1
btrfs subvol create /mnt/vol2
umount /mnt
mount -oro,subvol=/vol1 /dev/vdb /mnt
mount -orw,subvol=/vol2 /dev/vdb /mnt2

Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with different ro/rw options")
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:57 -07:00
Filipe Manana
0f31871f44 Btrfs: wake up extent state waiters on unlock through clear_extent_bits
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
through free_io_failure(), we weren't waking up any tasks waiting for the
extent's state EXTENT_LOCKED bit, leading to an hang.

So make sure clear_extent_bits() ends up waking up any waiters if the
bit EXTENT_LOCKED is supplied by its callers.

Zygo Blaxell was experiencing such hangs at inode eviction time after
file unlinks. Thanks to him for a set of scripts to reproduce the issue.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:56 -07:00
Filipe Manana
c152b63efc Btrfs: fix chunk allocation regression leading to transaction abort
With commit 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction
in case device tree has hole") introduced in the kernel 4.1 merge window,
we end up using part of a device hole for which there are already pending
chunks or pinned chunks. Before that commit we didn't use the hole and
would just move on to the next hole in the device.

However when we adjust the start offset for the chunk allocation and we
have pinned chunks, we set it blindly to the end offset of the pinned
chunk we are currently processing, which is dangerous because we can
have a pending chunk that has a start offset that matches the end offset
of our pinned chunk - leading us to a case where we end up getting two
pending chunks that start at the same physical device offset, which makes
us later abort the current transaction with -EEXIST when finishing the
chunk allocation at btrfs_create_pending_block_groups():

[194737.659017] ------------[ cut here ]------------
[194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[194737.662209] BTRFS: Transaction aborted (error -17)
[194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
[194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
[194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
[194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
[194737.687509] Call Trace:
[194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
[194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
[194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
[194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
[194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
[194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
[194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists

The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
btrfs_create_pending_block_groups(), when it attempts to insert a
duplicated device extent item via btrfs_alloc_dev_extent().

This issue was reproducible with fstests generic/038 running in a loop for
several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
Applying Jeff's recent patch titled "btrfs: add missing discards when
unpinning extents with -o discard" makes the issue much easier to reproduce
(usually within 4 to 5 hours), since it pins chunks for longer periods of
time when an unused block group is deleted by the cleaner kthread.

Fix this by making sure that we never adjust the start offset to a lower
value than it currently has.

Fixes: 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:55 -07:00
Sasha Levin
2037a0933b btrfs: use after free when closing devices
__btrfs_close_devices() would call_rcu to free the device, which is racy with
list_for_each_entry() accessing the memory to retrieve the next device on the
list.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
David Sterba
01b810b889 btrfs: make root id query unprivileged
The INO_LOOKUP ioctl can lookup path for a given inode number and is
thus restricted. As a sideefect it can find the root id of the
containing subvolume and we're using this int the 'btrfs inspect rootid'
command.

The restriction is unnecessary in case we set the ioctl args
 args::treeid    = 0
 args::objectid  = 256 (BTRFS_FIRST_FREE_OBJECTID)

Then the path will be empty and the treeid is filled with the root id of
the inode on which the ioctl is called. This behaviour is unchanged,
after the root restriction is removed.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
Filipe Manana
2e6e518335 Btrfs: fix block group ->space_info null pointer dereference
When we create a block group we add it to the rbtree of block groups
before setting its ->space_info field (while it's NULL). This is
problematic since other tasks can access the block group from the
rbtree and attempt to use its ->space_info before it is set by
btrfs_make_block_group().

This can happen for example when a concurrent fitrim ioctl operation
is ongoing, which produces a trace like the following when
CONFIG_DEBUG_PAGEALLOC is set.

[11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
[11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
[11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
[11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
[11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
[11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
[11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
[11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
[11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
[11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
[11509.608179] Stack:
[11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
[11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
[11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
[11509.608179] Call Trace:
[11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
[11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
[11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
[11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
[11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
[11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
[11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
[11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
[11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
[11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
[11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
[11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179]  RSP <ffff8801b3edf9e8>
[11509.608179] CR2: 0000000000000018
[11509.608179] ---[ end trace 570a5c6769f0e49a ]---

Which corresponds to the following access in fs/btrfs/free-space-cache.c:

  static int do_trimming(struct btrfs_block_group_cache *block_group,
                         u64 *total_trimmed, u64 start, u64 bytes,
                         u64 reserved_start, u64 reserved_bytes,
                         struct btrfs_trim_range *trim_entry)
  {
       struct btrfs_space_info *space_info = block_group->space_info;
  (...)
       spin_lock(&space_info->lock);
       ^^^^^ - block_group->space_info is NULL...

Fix this by ensuring the block group's ->space_info is set before adding
the block group to the rbtree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
Anand Jain
33b97e4327 Btrfs: check error before reporting missing device and add uuid
Report missing device when add is successful,
otherwise it would exit as ENOMEM. And add uuid
to the report.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Qu Wenruo
1f6e4b3f9f btrfs: Fix superblock csum type check.
Old csum type check is wrong and can't catch csum_type 1(not supported).

Fix it to avoid hostile 0 division.

Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Filipe Manana
619d8c4ef7 Btrfs: incremental send, fix clone operations for compressed extents
Marc reported a problem where the receiving end of an incremental send
was performing clone operations that failed with -EINVAL. This happened
because, unlike for uncompressed extents, we were not checking if the
source clone offset and length, after summing the data offset, falls
within the source file's boundaries.

So make sure we do such checks when attempting to issue clone operations
for compressed extents.

Problem reproducible with the following steps:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount -o compress /dev/sdc /mnt2

  # Create the file with a single extent of 128K. This creates a metadata file
  # extent item with a data start offset of 0 and a logical length of 128K.
  $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo

  # Now rewrite the range 64K to 112K of our file. This will make the inode's
  # metadata continue to point to the 128K extent we created before, but now
  # with an extent item that points to the extent with a data start offset of
  # 112K and a logical length of 16K.
  # That metadata file extent item is associated with the logical file offset
  # at 176K and covers the logical file range 176K to 192K.
  $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo

  # Now rewrite the range 180K to 12K. This will make the inode's metadata
  # continue to point the the 128K extent we created earlier, with a single
  # extent item that points to it with a start offset of 112K and a logical
  # length of 4K.
  # That metadata file extent item is associated with the logical file offset
  # at 176K and covers the logical file range 176K to 180K.
  $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ touch /mnt/bar
  # Calls the btrfs clone ioctl.
  $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
    -l $((4 * 1024)) /mnt/foo /mnt/bar

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 | btrfs receive /mnt2
  At subvol /mnt/snap1
  At subvol snap1

  $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
  At subvol /mnt/snap2
  At snapshot snap2
  ERROR: failed to clone extents to bar
  Invalid argument

A test case for fstests follows soon.

Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.cz>
Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Christian Engelmayer
ab3680dd18 btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation()
Commit 9c8b35b1ba ("btrfs: quota: Automatically update related qgroups or
mark INCONSISTENT flags when assigning/deleting a qgroup relations.")
introduced the allocation of a temporary ulist in function
btrfs_add_qgroup_relation() and added the corresponding cleanup to the out
path. However, the allocation was introduced before the src/dst level check
that directly returns. Fix the possible leakage of the ulist by moving the
allocation after the input validation. Detected by Coverity CID 1295988.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Filipe Manana
35c766425a Btrfs: fix mutex unlock without prior lock on space cache truncation
If the call to btrfs_truncate_inode_items() failed and we don't have a block
group, we were unlocking the cache_write_mutex without having locked it (we
do it only if we have a block group).

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
Anand Jain
816fcebe8f Btrfs: log when missing device is created
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba
6d13f5497f btrfs: fix warnings after changes in btrfs_abort_transaction
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, tree_root,
   ^
  CC [M]  fs/btrfs/ioctl.o
fs/btrfs/ioctl.c: In function ‘create_subvol’:
fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, root, PTR_ERR(new_root));

PTR_ERR returns long, but we're really using 'int' for the error codes
everywhere so just set and use the local variable.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba
c0d19e2b9a btrfs: add 'cold' compiler annotations to all error handling functions
The annotated functios will be placed into .text.unlikely section. The
annotation also hints compiler to move the code out of the hot paths,
and may implicitly mark if-statement leading to that block as unlikely.

This is a heuristic, the impact on the generated code is not
significant.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba
1a9a8a71ed btrfs: report exact callsite where transaction abort occurs
WARN is called from a single location and all bugreports say that's in
super.c __btrfs_abort_transaction. This is slightly confusing as we'd
rather want to know the exact callsite. Whereas this information is
printed in the syslog below the stacktrace, this requires further look
and we usually see only the headline from WARNING.

Moving the WARN into the macro has to inline some code and increases
code by a few kilobytes:

  text    data     bss     dec     hex filename
835481   20305   14120  869906   d4612 btrfs.ko.before
842883   20305   14120  877308   d62fc btrfs.ko.after

The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
The increase is not small and could lead to worse icache use. The code
is on error/exit paths that can be recognized by compiler as cold and
moved out of the way so the impact is speculated to be low, if
measurable at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba
13028901a4 btrfs: let tree defrag work in SSD mode
Long time ago (2008) the defrag was automatic for new b-tree writes but
has been disabled after performance problems. There was a leftover in
tree-defrag.c that effectively stops any defragmentation on b-trees.
This is a bit unexpected and IMHO undesired. The SSD mode is an
optimization and defrag is supposed to work if the users asks for it.

Related commits:

6702ed490c
Btrfs: Add run time btree defrag, and an ioctl to force btree defrag

e18e4809b1
Btrfs: Add mount -o ssd, which includes optimizations for seek free
storage

b3236e68bf
Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there

9afbb0b752
Btrfs: Disable tree defrag in SSD mode

The last three commits switch the defrag+ssd off/on/off and the last one

3f157a2fd2
Btrfs: Online btree defragmentation fixes

misses the bits from tree-defrag.c to revert to the behaviour introduced
in e18e4809b1.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Filipe Manana
53e489bc8c Btrfs: check pending chunks when shrinking fs to avoid corruption
When we shrink the usable size of a device (its total_bytes), we go over
all the device extent items in the device tree and attempt to relocate
the chunk of any device extent that goes beyond the new usable size for
the device. We do that after setting the new usable size (total_bytes) in
the device object, so that all new allocations (and reallocations) don't
use areas of the device that go beyond the new (shorter) size. However we
were not considering that before setting the new size in the device,
pending chunks might have been created that use device extents that go
beyond the new size, and those device extents are not yet in the device
tree after we search the device tree - they are still attached to the
list of new block group for some ongoing transaction handle, and they are
only added to the device tree when the transaction handle is ended (via
btrfs_create_pending_block_groups()).

So check for pending chunks with device extents that go beyond the new
size and if any exists, commit the current transaction and repeat the
search in the device tree.

Not doing this it would mean we would return success to user space while
still having extents that go beyond the new size, and later user space
could override those locations on the device while the fs still references
them, causing all sorts of corruption and unexpected events.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Omar Sandoval
64ad6c4889 Btrfs: don't invalidate root dentry when subvolume deletion fails
Since commit bafc9b754f ("vfs: More precise tests in d_invalidate"),
mounted subvolumes can be deleted because d_invalidate() won't fail.
However, we run into problems when we attempt to delete the default
subvolume while it is mounted as the root filesystem:

	# btrfs subvol list /
	ID 257 gen 306 top level 5 path rootvol
	ID 267 gen 334 top level 5 path snap1
	# btrfs subvol get-default /
	ID 267 gen 334 top level 5 path snap1
	# btrfs inspect-internal rootid /
	267
	# mount -o subvol=/ /dev/vda1 /mnt
	# btrfs subvol del /mnt/snap1
	Delete subvolume (no-commit): '/mnt/snap1'
	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
	# findmnt /
	findmnt: can't read /proc/mounts: No such file or directory
	# ls /proc
	#

Markus reported that this same scenario simply led to a kernel oops.

This happens because in btrfs_ioctl_snap_destroy(), we call
d_invalidate() before we check may_destroy_subvol(), which means that we
detach the submounts and drop the dentry before erroring out. Instead,
we should only invalidate the dentry once the deletion has succeeded.
Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
will prune the dcache for the deleted subvolume.

Cc: <stable@vger.kernel.org>
Fixes: bafc9b754f ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler <mschauler@gmail.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Filipe Manana
8b191a6849 Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.

For example, for the following reproducer where this is needed (provided
by Robbie Ko):

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2

  $ mkdir -p /mnt/data/n1/n2
  $ mkdir /mnt/data/n4
  $ mkdir -p /mnt/data/t6/t7
  $ mkdir /mnt/data/t5
  $ mkdir /mnt/data/t7
  $ mkdir /mnt/data/n4/t2
  $ mkdir /mnt/data/t4
  $ mkdir /mnt/data/t3
  $ mv /mnt/data/t7 /mnt/data/n4/t2
  $ mv /mnt/data/t4 /mnt/data/n4/t2/t7
  $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
  $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
  $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
  $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
  $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
  $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
  $ mv /mnt/data/n4/t2 /mnt/data/n4/n1
  $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
  $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
  $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
  $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
  $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
  $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 | btrfs receive /mnt2
  $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
  ERROR: send ioctl failed with -12: Cannot allocate memory

Where the parent snapshot directory hierarchy is the following:

  .                                                        (ino 256)
  |-- data/                                                (ino 257)
        |-- n4/                                            (ino 260)
             |-- t2/                                       (ino 265)
                  |-- t7/                                  (ino 264)
                       |-- t4/                             (ino 266)
                            |-- t5/                        (ino 263)
                                 |-- t6/                   (ino 261)
                                      |-- n1/              (ino 258)
                                      |-- n2/              (ino 259)
                                           |-- t7/         (ino 262)
                                                |-- t3/    (ino 267)

And the send snapshot's directory hierarchy is the following:

  .                                                        (ino 256)
  |-- data/                                                (ino 257)
        |-- n4/                                            (ino 260)
             |-- n1/                                       (ino 258)
                  |-- t2/                                  (ino 265)
                       |-- n2/                             (ino 259)
                       |-- t3/                             (ino 267)
                       |    |-- t7                         (ino 264)
                       |
                       |-- t6/                             (ino 261)
                       |    |-- t4/                        (ino 266)
                       |         |-- t5/                   (ino 263)
                       |
                       |-- t7/                             (ino 262)

While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:

  start inode 264, send progress of 265 for example
  parent of 264 -> 267
  parent of 267 -> 262
  parent of 262 -> 259
  parent of 259 -> 261
  parent of 261 -> 263
  parent of 263 -> 266
  parent of 266 -> 264
    |--> back to first iteration while current path string length
         is <= PATH_MAX, and fail with -ENOMEM otherwise

So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.

A test case for fstests follows soon.

Thanks to Robbie Ko for providing a reproducer for this problem.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-03 03:10:40 +01:00
Filipe Manana
80aa602756 Btrfs: incremental send, don't delay directory renames unnecessarily
Even though we delay the rename of directories when they become
descendents of other directories that were also renamed in the send
root to prevent infinite path build loops, we were doing it in cases
where this was not needed and was actually harmful resulting in
infinite path build loops as we ended up with a circular dependency
of delayed directory renames.

Consider the following reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2

  $ mkdir /mnt/data
  $ mkdir /mnt/data/n1
  $ mkdir /mnt/data/n1/n2
  $ mkdir /mnt/data/n4
  $ mkdir /mnt/data/n1/n2/p1
  $ mkdir /mnt/data/n1/n2/p1/p2
  $ mkdir /mnt/data/t6
  $ mkdir /mnt/data/t7
  $ mkdir -p /mnt/data/t5/t7
  $ mkdir /mnt/data/t2
  $ mkdir /mnt/data/t4
  $ mkdir -p /mnt/data/t1/t3
  $ mkdir /mnt/data/p1
  $ mv /mnt/data/t1 /mnt/data/p1
  $ mkdir -p /mnt/data/p1/p2
  $ mv /mnt/data/t4 /mnt/data/p1/p2/t1
  $ mv /mnt/data/t5 /mnt/data/n4/t5
  $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2
  $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7
  $ mv /mnt/data/t2 /mnt/data/n4/t1
  $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1
  $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2
  $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1
  $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7
  $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3
  $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1
  $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5
  $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1
  $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1
  $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/
  $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2
  $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1
  $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1
  $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3
  $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2
  $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7
  $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1
  $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5
  $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5
  $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2
  $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 | btrfs receive /mnt2
  $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2
  ERROR: send ioctl failed with -12: Cannot allocate memory

This reproducer resulted in an infinite path build loop when building the
path for inode 266 because the following circular dependency of delayed
directory renames was created:

   ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261

Where the notation "X <- Y" means the rename of inode X is delayed by the
rename of inode Y (X will be renamed after Y is renamed). This resulted
in an infinite path build loop of inode 266 because that inode has inode
261 as an ancestor in the send root and inode 261 is in the circular
dependency of delayed renames listed above.

Fix this by not delaying the rename of a directory inode if an ancestor of
the inode in the send root, which has a delayed rename operation, is not
also a descendent of the inode in the parent root.

Thanks to Robbie Ko for sending the reproducer example.
A test case for xfstests follows soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-03 03:10:20 +01:00
Anand Jain
2421a8cd5f Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain
24bd69cb0f Btrfs: sysfs: add support to add parent for fsid
To support seed sysfs layout and represent seed fsid under
the sprout we need the facility to create fsid under the
specified parent.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain
b7c35e81ad Btrfs: sysfs: separate kobject and attribute creation
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain
1d1c1be372 Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain
ef1a0daadf Btrfs: sysfs: make btrfs_sysfs_add_device() non static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain
0c10e2d482 Btrfs: sysfs: make btrfs_sysfs_add_fsid() non static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
6c14a1641b Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info
since btrfs_kobj_rm_device() does nothing with fs_info

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
1ba43816af Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info
btrfs_kobj_add_device() does not need fs_info any more.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
2e3e12815a Btrfs: sysfs: provide framework to remove all fsid sysfs kobject
Just a helper function to clean up the sysfs fsid kobjects.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
5a13f4308c Btrfs: sysfs: add pointer to access fs_info from fs_devices
adds fs_info pointer with struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
c73eccf75b Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
2e7910d6ca Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices
This patch will provide a framework and help to create attributes
from the structure btrfs_fs_devices which are available even before
fs_info is created. So by moving the parent kobject super_kobj from
fs_info to btrfs_fs_devices, it will help to create attributes
from the btrfs_fs_devices as well.

Patches on top of this patch now will be able to create the
sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices
when devices are scanned and registered to the kernel.

Just to note, this does not change any of the existing btrfs sysfs
external kobject names and its attributes and not even the life
cycle of them. Changes are internal only. And to ensure the same,
this path has been tested with various device operations and,
checking and comparing the sysfs kobjects and attributes with
sysfs kobject and attributes with out this patch, and they remain
same.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
00c921c23f Btrfs: sysfs: separate device kobject and its attribute creation
Separate device kobject and its attribute creation so that device
kobject can be created from the device discovery thread.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
0dd2906f72 Btrfs: sysfs: let default_attrs be separate from the kset
As of now btrfs_attrs are provided using the default_attrs through
the kset. Separate them and create the default_attrs using the
sysfs_create_files instead. By doing this we will have the
flexibility that device discovery thread could create fsid
kobject.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
720592157e Btrfs: sysfs: introduce function btrfs_sysfs_add_fsid() to create sysfs fsid
We need it in a seperate function so that it can be called from the
device discovery thread as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
3a08f3b72a Btrfs: sysfs: rename __btrfs_sysfs_remove_one to btrfs_sysfs_remove_fsid
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
aaf1330516 Btrfs: sysfs: reorder the kobject creations
As of now the order in which the kobjects are created
at btrfs_sysfs_add_one() is..
 fsid
 features
 unknown features (dynamic features)
 devices.

Since we would move fsid and device kobject to fs_devices
from fs_info structure, this patch will reorder in which
the kobjects are created as below.
 fsid
 devices
 features
 unknown features (dynamic features)

And hence the btrfs_sysfs_remove_one() will follow the same
in reverse order. and the device kobject destroy now can
be moved into the function __btrfs_sysfs_remove_one()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain
4d435731f9 Btrfc: sysfs: fix, check if device_dir_kobj is init before destroy
Since the failure code in the btrfs_sysfs_add_one() can
call btrfs_sysfs_remove_one() even before device_dir_kobj
has been created we need to check if its null.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain
8345ea31dc Btrfs: sysfs: fix, kobject pointer clean up needed after kobject release
The sysfs clean up self test like in the below code fails, since
fs_info->device_dir_kobject still points to its stale kobject.
Reseting this pointer will help to fix this.

open_ctree()
{

ret = btrfs_sysfs_add_one(fs_info);
::
+       btrfs_sysfs_remove_one(fs_info);
+       ret = btrfs_sysfs_add_one(fs_info);
+       if (ret) {
+               pr_err("BTRFS: failed to init sysfs interface: %d\n", ret);
+               goto fail_block_groups;
+       }

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain
e7e1aa9c91 Btrfs: sysfs: fix, undo sysfs device links
Theoritically need to remove the device links attributes, but since its entire device
kobject was removed, so there wasn't any issue of about it. Just do it nicely.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain
4e51f005a2 Btrfs: sysfs: fix, fs_info kobject_unregister has init_completion() twice
kobject_unregister is to handle the release of the kobject,
its completion init is being called in btrfs_sysfs_add_one(),
so we don't have to do the same in the open_ctree() again.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain
248d200df3 Btrfs: sysfs: fix, btrfs_release_super_kobj() should to clean up the kobject data
The following test case fails indicating that, thread tried to init an initialized object.

kernel: [232104.016513] kobject (ffff880006c1c980): tried to init an initialized object, something is seriously wrong.

btrfs_sysfs_remove_one() self test code:

open_tree()
{
 ::
        ret = btrfs_sysfs_add_one(fs_info);
	if (ret) {
              pr_err("BTRFS: failed to init sysfs interface: %d\n", ret);
                goto fail_block_groups;
        }
+       btrfs_sysfs_remove_one(fs_info);
+       ret = btrfs_sysfs_add_one(fs_info);
+       if (ret) {
+               pr_err("BTRFS: failed to init sysfs interface: %d\n", ret);
+               goto fail_block_groups;
+       }

cleaning up the unregistered kobject fixes this.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:18 +02:00
Linus Torvalds
7ce14f6ff2 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I fixed up a regression from 4.0 where conversion between different
  raid levels would sometimes bail out without converting.

  Filipe tracked down a race where it was possible to double allocate
  chunks on the drive.

  Mark has a fix for fiemap.  All three will get bundled off for stable
  as well"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix regression in raid level conversion
  Btrfs: fix racy system chunk allocation when setting block group ro
  btrfs: clear 'ret' in btrfs_check_shared() loop
2015-05-23 11:14:10 -07:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
Chris Mason
153c35b6cc Btrfs: fix regression in raid level conversion
Commit 2f0810880f changed
btrfs_set_block_group_ro to avoid trying to allocate new chunks with the
new raid profile during conversion.  This fixed failures when there was
no space on the drive to allocate a new chunk, but the metadata
reserves were sufficient to continue the conversion.

But this ended up causing a regression when the drive had plenty of
space to allocate new chunks, mostly because reduce_alloc_profile isn't
using the new raid profile.

Fixing btrfs_reduce_alloc_profile is a bigger patch.  For now, do a
partial revert of 2f0810880, and don't error out if we hit ENOSPC.

Signed-off-by: Chris Mason <clm@fb.com>
Tested-by: Dave Sterba <dsterba@suse.cz>
Reported-by: Holger Hoffstaette <holger.hoffstaette@googlemail.com>
2015-05-20 11:03:38 -07:00
Filipe Manana
a96295965b Btrfs: fix racy system chunk allocation when setting block group ro
If while setting a block group read-only we end up allocating a system
chunk, through check_system_chunk(), we were not doing it while holding
the chunk mutex which is a problem if a concurrent chunk allocation is
happening, through do_chunk_alloc(), as it means both block groups can
end up using the same logical addresses and physical regions in the
device(s). So make sure we hold the chunk mutex.

Cc: stable@vger.kernel.org  # 4.0+
Fixes: 2f0810880f ("btrfs: delete chunk allocation attemp when
                      setting block group ro")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-19 18:04:17 -07:00
Mark Fasheh
2c2ed5aa01 btrfs: clear 'ret' in btrfs_check_shared() loop
btrfs_check_shared() is leaking a return value of '1' from
find_parent_nodes(). As a result, callers (in this case, extent_fiemap())
are told extents are shared when they are not. This in turn broke fiemap on
btrfs for kernels v3.18 and up.

The fix is simple - we just have to clear 'ret' after we are done processing
the results of find_parent_nodes().

It wasn't clear to me at first what was happening with return values in
btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help
on irc. I added documentation to both functions to make things more clear
for the next hacker who might come across them.

If we could queue this up for -stable too that would be great.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-19 18:04:17 -07:00
Christoph Hellwig
b25de9d6da block: remove BIO_EOPNOTSUPP
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:03 -06:00
Linus Torvalds
c7309e88a6 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The first commit is a fix from Filipe for a very old extent buffer
  reuse race that triggered a BUG_ON.  It hasn't come up often, I looked
  through old logs at FB and we hit it a handful of times over the last
  year.

  The rest are other corners he hit during testing"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
  Btrfs: fix race between block group creation and their cache writeout
  Btrfs: fix panic when starting bg cache writeout after IO error
  Btrfs: fix crash after inode cache writeback failure
2015-05-16 15:50:58 -07:00
Filipe Manana
062c19e9dd Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us it the following BUG_ON at
btrfs_release_extent_buffer_page:

    BUG_ON(extent_buffer_under_io(eb))

The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.

Here follows a sequence of steps that leads to the BUG_ON.

      CPU 0                                                    CPU 1                                                CPU 2

path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X

btrfs_release_path(path)
    unlocks X

                                                      reads eb X
                                                         X->refs incremented to 3
                                                      locks eb X
                                                      btrfs_del_items(X)
                                                         X becomes empty
                                                         clean_tree_block(X)
                                                             clear EXTENT_BUFFER_DIRTY from X
                                                         btrfs_del_leaf(X)
                                                             unlocks X
                                                             extent_buffer_get(X)
                                                                X->refs incremented to 4
                                                             btrfs_free_tree_block(X)
                                                                X's range is not pinned
                                                                X's range added to free
                                                                  space cache
                                                             free_extent_buffer_stale(X)
                                                                lock X->refs_lock
                                                                set EXTENT_BUFFER_STALE on X
                                                                release_extent_buffer(X)
                                                                    X->refs decremented to 3
                                                                    unlocks X->refs_lock
                                                      btrfs_release_path()
                                                         unlocks X
                                                         free_extent_buffer(X)
                                                             X->refs becomes 2

                                                                                                      __btrfs_cow_block(Y)
                                                                                                          btrfs_alloc_tree_block()
                                                                                                              btrfs_reserve_extent()
                                                                                                                  find_free_extent()
                                                                                                                      gets offset == X->start
                                                                                                              btrfs_init_new_buffer(X->start)
                                                                                                                  btrfs_find_create_tree_block(X->start)
                                                                                                                      alloc_extent_buffer(X->start)
                                                                                                                          find_extent_buffer(X->start)
                                                                                                                              finds eb X in radix tree

    free_extent_buffer(X)
        lock X->refs_lock
            test X->refs == 2
            test bit EXTENT_BUFFER_STALE is set
            test !extent_buffer_under_io(eb)

                                                                                                                              increments X->refs to 3
                                                                                                                              mark_extent_buffer_accessed(X)
                                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                                    --> does nothing,
                                                                                                                                        X->refs >= 2 and
                                                                                                                                        EXTENT_BUFFER_TREE_REF
                                                                                                                                        is set in X
                                                                                                              clear EXTENT_BUFFER_STALE from X
                                                                                                              locks X
                                                                                                          btrfs_mark_buffer_dirty()
                                                                                                              set_extent_buffer_dirty(X)
                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                     --> does nothing, X->refs >= 2 and
                                                                                                                         EXTENT_BUFFER_TREE_REF is set
                                                                                                                  sets EXTENT_BUFFER_DIRTY on X

            test and clear EXTENT_BUFFER_TREE_REF
            decrements X->refs to 2
        release_extent_buffer(X)
            decrements X->refs to 1
            unlock X->refs_lock

                                                                                                      unlock X
                                                                                                      free_extent_buffer(X)
                                                                                                          lock X->refs_lock
                                                                                                          release_extent_buffer(X)
                                                                                                              decrements X->refs to 0
                                                                                                              btrfs_release_extent_buffer_page(X)
                                                                                                                   BUG_ON(extent_buffer_under_io(X))
                                                                                                                       --> EXTENT_BUFFER_DIRTY set on X

Fix this by making find_extent buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A more clean alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.

A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:

[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0

I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):

  while true; do
     mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
     mount /dev/sdd /mnt
     snapshot_cmd="btrfs subvolume snapshot -r /mnt"
     snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
     fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
     umount /mnt
  done

Which usually triggers the BUG_ON within less than 24 hours:

[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:11 -07:00
Filipe Manana
ff1f8250a9 Btrfs: fix race between block group creation and their cache writeout
So creating a block group has 2 distinct phases:

Phase 1 - creates the btrfs_block_group_cache item and adds it to the
rbtree fs_info->block_group_cache_tree and to the corresponding list
space_info->block_groups[];

Phase 2 - adds the block group item to the extent tree and corresponding
items to the chunk tree.

The first phase adds the block_group_cache_item to a list of pending block
groups in the transaction handle, and phase 2 happens when
btrfs_end_transaction() is called against the transaction handle.

It happens that once phase 1 completes, other concurrent tasks that use
their own transaction handle, but points to the same running transaction
(struct btrfs_trans_handle->transaction), can use this block group for
space allocations and therefore mark it dirty. Dirty block groups are
tracked in a list belonging to the currently running transaction (struct
btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).

This is a problem because once a task calls btrfs_commit_transaction(),
it calls btrfs_start_dirty_block_groups() which will see all dirty block
groups and attempt to start their writeout, including those that are
still attached to the transaction handle of some concurrent task that
hasn't called btrfs_end_transaction() yet - which means those block
groups haven't gone through phase 2 yet and therefore when
write_one_cache_group() is called, it won't find the block group items
in the extent tree and abort the current transaction with -ENOENT,
turning the fs into readonly mode and require a remount.

Fix this by ignoring -ENOENT when looking for block group items in the
extent tree when we attempt to start the writeout of the block group
caches outside the critical section of the transaction commit. We will
try again later during the critical section and if there we still don't
find the block group item in the extent tree, we then abort the current
transaction.

This issue happened twice, once while running fstests btrfs/067 and once
for btrfs/078, which produced the following trace:

[ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[ 3278.707329] BTRFS: Transaction aborted (error -2)
(...)
[ 3278.731555] Call Trace:
[ 3278.732396]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[ 3278.733860]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[ 3278.735312]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[ 3278.736874]  [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[ 3278.738302]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[ 3278.739520]  [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[ 3278.741222]  [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs]
[ 3278.742797]  [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
[ 3278.744492]  [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[ 3278.746084]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[ 3278.747249]  [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs]
[ 3278.748744]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
[ 3278.749958]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 3278.751218]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[ 3278.754197]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[ 3278.755192]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[ 3278.756236]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:10 -07:00
Filipe Manana
28aeeac1dd Btrfs: fix panic when starting bg cache writeout after IO error
When waiting for the writeback of block group cache we returned
immediately if there was an error during writeback without waiting
for the ordered extent to complete. This left a short time window
where if some other task attempts to start the writeout for the same
block group cache it can attempt to add a new ordered extent, starting
at the same offset (0) before the previous one is removed from the
ordered tree, causing an ordered tree panic (calls BUG()).

This normally doesn't happen in other write paths, such as buffered
writes or direct IO writes for regular files, since before marking
page ranges dirty we lock the ranges and wait for any ordered extents
within the range to complete first.

Fix this by making btrfs_wait_ordered_range() not return immediately
if it gets an error from the writeback, waiting for all ordered extents
to complete first.

This issue happened often when running the fstest btrfs/088 and it's
easy to trigger it by running in a loop until the panic happens:

  for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done

[17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
[17156.864052] ------------[ cut here ]------------
[17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70!
(...)
[17156.864052] Call Trace:
[17156.864052]  [<ffffffffa03876e3>] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
[17156.864052]  [<ffffffffa03787e2>] run_delalloc_nocow+0x5bf/0x747 [btrfs]
[17156.864052]  [<ffffffffa03789ff>] run_delalloc_range+0x95/0x353 [btrfs]
[17156.864052]  [<ffffffffa038b7fe>] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
[17156.864052]  [<ffffffffa038d75b>] __extent_writepage+0x129/0x1f7 [btrfs]
[17156.864052]  [<ffffffffa038da5a>] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
[17156.864052]  [<ffffffff810ad2af>] ? __module_text_address+0x12/0x59
[17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[17156.864052]  [<ffffffffa038df76>] extent_writepages+0x4b/0x5c [btrfs]
[17156.864052]  [<ffffffff81144431>] ? kmem_cache_free+0x9b/0xce
[17156.864052]  [<ffffffffa0376a46>] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
[17156.864052]  [<ffffffffa0389cd6>] ? free_extent_state+0x8c/0xc1 [btrfs]
[17156.864052]  [<ffffffffa0374871>] btrfs_writepages+0x28/0x2a [btrfs]
[17156.864052]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
[17156.864052]  [<ffffffff81102f36>] __filemap_fdatawrite_range+0x5a/0x61
[17156.864052]  [<ffffffff81102f6e>] filemap_fdatawrite_range+0x13/0x15
[17156.864052]  [<ffffffffa0383ef7>] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
[17156.864052]  [<ffffffffa03ab89e>] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
[17156.864052]  [<ffffffffa03ac1ab>] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
[17156.864052]  [<ffffffffa03ac1fd>] btrfs_write_out_cache+0x93/0xdc [btrfs]
[17156.864052]  [<ffffffffa0363847>] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
[17156.864052]  [<ffffffffa03638e6>] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
[17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[17156.864052]  [<ffffffffa037209e>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[17156.864052]  [<ffffffffa034c748>] btrfs_sync_fs+0xe1/0x12d [btrfs]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:10 -07:00
Filipe Manana
e43699d4b4 Btrfs: fix crash after inode cache writeback failure
If the writeback of an inode cache failed we were unnecessarilly
attempting to release again the delalloc metadata that we previously
reserved. However attempting to do this a second time triggers an
assertion at drop_outstanding_extent() because we have no more
outstanding extents for our inode cache's inode. If we were able
to start writeback of the cache the reserved metadata space is
released at btrfs_finished_ordered_io(), even if an error happens
during writeback.

So make sure we don't repeat the metadata space release if writeback
started for our inode cache.

This issue was trivial to reproduce by running the fstest btrfs/088
with "-o inode_cache", which triggered the assertion leading to a
BUG() call and requiring a reboot in order to run the remaining
fstests. Trace produced by btrfs/088:

[255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
[255289.388094] ------------[ cut here ]------------
[255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
[255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[255289.392068] Call Trace:
[255289.392068]  [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs]
[255289.392068]  [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
[255289.392068]  [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
[255289.392068]  [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
[255289.392068]  [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
[255289.392068]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[255289.392068]  [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
[255289.392068]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
[255289.392068]  [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
(...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:10 -07:00
Linus Torvalds
af6472881a Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "When an arm user reported crashes near page_address(page) in my new
  code, it became clear that I can't be trusted with GFP masks.  Filipe
  beat me to the patch, and I'll just be in the corner with my dunce cap
  on"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong mapping flags for free space inode
2015-05-08 20:59:02 -07:00
Filipe Manana
1d3c61c2eb Btrfs: fix wrong mapping flags for free space inode
We were passing a flags value that differed from the intention in commit
2b10826800 ("Btrfs: don't use highmem for free space cache pages").

This caused problems in a ARM machine, leaving btrfs unusable there.

Reported-by: Merlijn Wajer <merlijn@wizzup.org>
Tested-by: Merlijn Wajer <merlijn@wizzup.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-06 17:06:13 -07:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Linus Torvalds
64887b6882 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "A few more btrfs fixes.

  These range from corners Filipe found in the new free space cache
  writeback to a grab bag of fixes from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
  Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
  btrfs: unlock i_mutex after attempting to delete subvolume during send
  btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
  btrfs: fix race on ENOMEM in alloc_extent_buffer
  btrfs: handle ENOMEM in btrfs_alloc_tree_block
  Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
  Btrfs: don't check for delalloc_bytes in cache_save_setup
  Btrfs: fix deadlock when starting writeback of bg caches
  Btrfs: fix race between start dirty bg cache writeout and bg deletion
2015-05-01 07:46:21 -07:00
Forrest Liu
5d2361db48 Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
btrfs_release_extent_buffer_page() can't handle dummy extent that
allocated by btrfs_clone_extent_buffer() properly. That is because
reference count of pages that allocated by btrfs_clone_extent_buffer()
was 2, 1 by alloc_page(), and another by attach_extent_buffer_page().

Running following command repeatly can check this memory leak problem

    btrfs inspect-internal inode-resolve 256 /mnt/btrfs

Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-29 13:22:09 -07:00
Linus Torvalds
f583381f50 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe hit two problems in my block group cache patches.  We finalized
  the fixes last week and ran through more tests"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: prevent list corruption during free space cache processing
  Btrfs: fix inode cache writeout
2015-04-26 17:40:30 -07:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Yang Dongsheng
6e17d30bfa Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:

Problem:
	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
	(2). pwrite(test_fd, buf, 4096, 0)
	(3). close(test_fd)
	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
	(6). fsync(test_fd);
				<-------- we did not get the correct log entry for the file
Reason:
	When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.

Fix:
	This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:03 -07:00
Omar Sandoval
909e26dce3 btrfs: unlock i_mutex after attempting to delete subvolume during send
Whenever the check for a send in progress introduced in commit
521e0546c9 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:

[  +0.000059] ================================================
[  +0.000028] [ BUG: lock held when returning to user space! ]
[  +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[  +0.000026] ------------------------------------------------
[  +0.000029] btrfs/211 is leaving the kernel with locks still held!
[  +0.000029] 1 lock held by btrfs/211:
[  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0

Make sure we unlock it in the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:02 -07:00
Omar Sandoval
b86054540e btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
When we try to access them later, things will blow up in various ways.

Also fix the comment about the return value, which is an errno on error,
not -1, and update the cases where it was not.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:01 -07:00
Omar Sandoval
5ca64f45e9 btrfs: fix race on ENOMEM in alloc_extent_buffer
Consider the following interleaving of overlapping calls to
alloc_extent_buffer:

Call 1:

- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages

Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
  mark_extent_buffer_accessed on it

mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.

The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:00 -07:00
Omar Sandoval
67b7859e9b btrfs: handle ENOMEM in btrfs_alloc_tree_block
This is one of the first places to give out when memory is tight. Handle
it properly rather than with a BUG_ON.

Also fix the comment about the return value, which is an ERR_PTR, not
NULL, on error.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:00 -07:00
Forrest Liu
1b98450816 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
If device tree has hole, find_free_dev_extent() cannot find available
address properly.

The problem can be reproduce by following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
        fallocate -l 1g $mntpath/$i
    done
    sync
    for i in `seq 1 1 95`; do
        rm $mntpath/$i
    done
    sync

    # wait cleaner thread remove unused block group
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

Above script will make device tree with one big hole, and can only allocate
just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:59 -07:00
Chris Mason
e4c88f007b Btrfs: don't check for delalloc_bytes in cache_save_setup
Now that we're doing free space cache writeback outside the critical
section in the commit, there is a bigger window for delalloc_bytes to
be added after a cache has been written.  find_free_extent may do this
without putting the block group back into the dirty list, and also
without a transaction running.

Checking for delalloc_bytes in cache_save_setup means we might leave the
cache marked as written without invalidating it.  Consistency checks
during mount will toss the cache, but it's better to get rid of the
check in cache_save_setup and let it get invalidated by the checks
already done during cache write out.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:58 -07:00
Filipe Manana
24b89d08ef Btrfs: fix deadlock when starting writeback of bg caches
While starting the writes of the dirty block group caches, if we don't
find a block group item in the extent tree we were leaving without
releasing our path, running delayed references and then looping again to
process any new dirty block groups. However this second iteration of the
loop could cause a deadlock because it tries to lock some other extent
tree node/leaf which another task already locked and it's blocked because
it's waiting for a lock on some node/leaf that is in our path that was not
released before.
We could also deadlock when running the delayed references - as we could
end up trying to lock the same nodes/leafs that we have in our local path
(with a different lock type).

Got into such case when running xfstests:

[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
(...)
[20892.269378] Call Trace:
[20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
(...)
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly
[20892.298222] ------------[ cut here ]------------
[20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
(...)
[20892.326253] Call Trace:
[20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
[20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
[20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
[20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
[20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
[20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
[20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
[20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
[20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
(...)
[20892.361015] ---[ end trace 597f77e664245374 ]---
[21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
[21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.703696] Call Trace:
[21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
[21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
[21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
[21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
[21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
[21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
[21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
[21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
[21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
[21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
(...)
[21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
[21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.768457] Call Trace:
[21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
[21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
[21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
[21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
[21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
[21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
[21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
[21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
[21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
(...)
[21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
[21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.899960] Call Trace:
[21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
[21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
[21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
[21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
[21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
[21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
[21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
[21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
[21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
[21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
[21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
[21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
[21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
[21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
(...)

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:37 -07:00
Filipe Manana
b58d1a9ef9 Btrfs: fix race between start dirty bg cache writeout and bg deletion
While running xfstests I ran into the following:

[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
[20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
[20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
[20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
[20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
[20892.269378] Call Trace:
[20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
[20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
[20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly

This happens because in btrfs_start_dirty_block_groups() we splice the
transaction's list of dirty block groups into a local list and then we
keep extracting the first element of the list without holding the
cache_write_mutex mutex. This means that before we acquire that mutex
the first block group on the list might be removed by a conurrent task
running btrfs_remove_block_group(). So make sure we extract the first
element (and test the list emptyness) while holding that mutex.

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:37 -07:00
Jens Axboe
fe0f07d08e direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:28 -04:00
Chris Mason
a3bdccc4e6 Btrfs: prevent list corruption during free space cache processing
__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
a list of bitmaps to record in the free space cache.  It was dropping
the lock while it worked on other components, which made a window for
free_bitmap() to free the bitmap struct without removing it from the
list.

This changes things to hold the lock the whole time, and also makes sure
we hold the lock during enospc cleanup.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-24 11:52:25 -07:00
Linus Torvalds
ba0e4ae88f Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "I've been running these through a longer set of load tests because my
  commits change the free space cache writeout.  It fixes commit stalls
  on large filesystems (~20T space used and up) that we have been
  triggering here.  We were seeing new writers blocked for 10 seconds or
  more during commits, which is far from good.

  Josef and I fixed up ENOSPC aborts when deleting huge files (3T or
  more), that are triggered because our metadata reservations were not
  properly accounting for crcs and were not replenishing during the
  truncate.

  Also in this series, a number of qgroup fixes from Fujitsu and Dave
  Sterba collected most of the pending cleanups from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
  btrfs: quota: Update quota tree after qgroup relationship change.
  btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
  btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
  btrfs: Update btrfs qgroup status item when rescan is done.
  btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
  btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
  btrfs: Check qgroup level in kernel qgroup assign.
  btrfs: qgroup: allow to remove qgroup which has parent but no child.
  btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
  btrfs: qgroup: do a reservation in a higher level.
  Btrfs: qgroup, Account data space in more proper timings.
  Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
  Btrfs: qgroup: free reserved in exceeding quota.
  Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
  btrfs: qgroup: fix limit args override whole limit struct
  btrfs: qgroup: update limit info in function btrfs_run_qgroups().
  btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
  btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
  btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
  btrfs: Support busy loop of write and delete
  ...
2015-04-24 07:40:02 -07:00
Chris Mason
85db36cfb3 Btrfs: fix inode cache writeout
The code to fix stalls during free spache cache IO wasn't using
the correct root when waiting on the IO for inode caches.  This
is only a problem when the inode cache is enabled with

mount -o inode_cache

This fixes the inode cache writeout to preserve any error values and
makes sure not to override the root when inode cache writeout is done.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-23 17:47:34 -07:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Qu Wenruo
e082f56313 btrfs: quota: Update quota tree after qgroup relationship change.
Previous patch modified the in memory struct but it's not written in
quota tree until next commit.
So user will still get old data using "btrfs qgroup show" after
assign/remove.

This patch will call btrfs_run_qgroups in assign ioctl so it will be
updated to in memory quota trees and user will get up-to-date results.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:53:00 -07:00
Qu Wenruo
9c8b35b1ba btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
Operation like qgroups assigning/deleting qgroup relations will mostly
cause qgroup data inconsistent, since it needs to do the full rescan to
determine whether shared extents are exclusive or still shared in
parent qgroups.

But there are some exceptions, like qgroup with only exclusive extents
(qgroup->excl == qgroup->rfer), in that case, we only needs to
modify all its parents' excl and rfer.

So this patch adds a quick path for such qgroup in qgroup
assign/remove routine, and if quick path failed, the qgroup status will
be marked INCONSISTENT, and return 1 to info user-land.

BTW since the quick path is much the same of qgroup_excl_accounting(),
so move the core of it to __qgroup_excl_accounting() and reuse it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:59 -07:00
Dongsheng Yang
8ea0ec9e01 btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
we forgot to clear STATUS_FLAG_ON in quota_disable(), it
will cause a problem shown as below:

	# mount /dev/sdc /mnt
	# btrfs quota enable /mnt
	# btrfs quota disable /mnt
	# btrfs quota rescan /mnt
	quota rescan started <--- expecting it fail here.
	# echo $?
	0

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:58 -07:00
Qu Wenruo
53b7cde9d5 btrfs: Update btrfs qgroup status item when rescan is done.
Update qgroup status when rescan is done.

Before this patch, status item is not updated on rescan finish, which
causing the RESCAN and INCONSISTENT flags never cleared.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:57 -07:00
Qu Wenruo
3393168d22 btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
Old qgroup_rescan_leaf() comment indicates ret == 2 as complete and
cleared INCONSISTENT flag.

This is not true since it will never return 2, and inside it no codes
will clear INCONSISTENT flag.
The flag clearance is done in btrfs_qgroup_rescan_work().
This caused the bug that INCONSISTENT flag is never cleared.

So change the comment and fix the dead judgment.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:55 -07:00
Qu Wenruo
e09fe2d211 btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
Btrfs will create qgroup on subvolume creation if quota is enabled, but
qgroup uses the high bits(currently 16 bits) as level, to build the
inheritance.

However it is fully possible a subvolume can be created with a
subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be
considered as level 1 and can't be assigned to other qgroup in level 1.

This patch will prevent such things so qgroup inheritance will not be
screwed up.
The downside is very clear, btrfs subvolume number limit will decrease
from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to
(u48 max -256(first free objectid)).
But we still have near u48(that's 15 digits in dec), so that should not
be a huge problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:54 -07:00
Qu Wenruo
8465ecec96 btrfs: Check qgroup level in kernel qgroup assign.
Although we have qgroup level check in btrfs-progs, it's not enough
since other programe may still call ioctl directly not using
btrfs-progs. For example, systemd.

But it's btrfs-progs to be blame since we don't provide a
full-function(like subvolume create things) btrfs library with enough
check, and only rely on kernel ioctl.

So Add level checks in kernel too.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:53 -07:00
Dongsheng Yang
f5a6b1c53b btrfs: qgroup: allow to remove qgroup which has parent but no child.
When a qgroup has parents but no child, it should be removable in
Theory I think. But currently, we can not remove it when it has
either parent or child.

Example:
	# btrfs quota enable /mnt
	# btrfs qgroup create 1/0 /mnt
	# btrfs qgroup create 2/0 /mnt
	# btrfs qgroup assign 1/0 2/0 /mnt
	# btrfs qgroup show -pcre /mnt
qgroupid rfer  excl  max_rfer max_excl parent  child
-------- ----  ----  -------- -------- ------  -----
0/5      16384 16384 0        0        ---     ---
1/0      0     0     0        0        2/0     ---
2/0      0     0     0        0        ---     1/0

At this time, there is no subvol or qgroup depending on it.
Just a qgroup 2/0 is its parent, but 2/0 can work well without
1/0. So I think 1/0 should be removalbe. But:
	# btrfs qgroup destroy 1/0 /mnt
ERROR: unable to destroy quota group: Device or resource busy

This patch remove the check of qgroup->parent in removing it,
then we can remove a qgroup when it has a parent.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:52 -07:00
Dongsheng Yang
09870d2772 btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
When we create a subvol inheriting a qgroup, we need to check the level
of them. Otherwise, there is a chance a qgroup can inherit another qgroup
at the same level.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:51 -07:00
Dongsheng Yang
e2d1f92399 btrfs: qgroup: do a reservation in a higher level.
There are two problems in qgroup:

a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
is not available to user.

b). When user is writing a inline data, qgroup will not reserve it,
it means this is a window we can exceed the limit of a qgroup.

The main idea of this patch is reserving the data size of write_bytes
rather than the reserve_bytes. It means qgroup will not care about
the data size btrfs will reserve for user, but only care about the
data size user is going to write. Then reserve it when user want to
write and release it in transaction committed.

In this way, qgroup can be released from the complex procedure in
btrfs and only do the reserve when user want to write and account
when the data is written in commit_transaction().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:50 -07:00
Dongsheng Yang
237c0e9f1f Btrfs: qgroup, Account data space in more proper timings.
Currenly, in data writing, ->reserved is accounted in
fill_delalloc(), but ->may_use is released in clear_bit_hook()
which is called by btrfs_finish_ordered_io(). That's too late,
that said, between fill_delalloc() and btrfs_finish_ordered_io(),
the data is doublely accounted by qgroup. It will cause some
unexpected -EDQUOT.

Example:
	# btrfs quota enable /root/btrfs-auto-test/
	# btrfs subvolume create /root/btrfs-auto-test//sub
	Create subvolume '/root/btrfs-auto-test/sub'
	# btrfs qgroup limit 1G /root/btrfs-auto-test//sub
	dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000
	dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
	681353+0 records in
	681352+0 records out
	697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s
It's (698 MB) when we got an -EDQUOT, but we limit it by 1G.

This patch move the btrfs_qgroup_reserve/free() for data from
btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space()
and btrfs_free_reserved_data_space(). Then the accounter in qgroup
will be updated at the same time with the accounter in space_info updated.
In this way, the unexpected -EDQUOT will be killed.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:48 -07:00
Dongsheng Yang
31193213f1 Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
Currently, for pre_alloc or delay_alloc, the bytes will be accounted
in space_info by the three guys.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
But on the other hand, in qgroup, there are only two counters to account the
bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
bytes in space_info->bytes_may_use and qg->excl accounts bytes in
space_info->used. So the bytes in space_info->reserved is not accounted
in qgroup. If so, there is a window we can exceed the quota limit when
bytes is in space_info->reserved.

Example:
	# btrfs quota enable /mnt
	# btrfs qgroup limit -e 10M /mnt
	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
	# sync
	# btrfs qgroup show -pcre /mnt
qgroupid rfer     excl     max_rfer max_excl parent  child
-------- ----     ----     -------- -------- ------  -----
0/5      20987904 20987904 0        10485760 ---     ---

qg->excl is 20987904 larger than max_excl 10485760.

This patch introduce a new counter named may_use to qgroup, then
there are three counters in qgroup to account bytes in space_info
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use           --- qgroup->reserved     --- qgroup->excl

With this patch applied:
	# btrfs quota enable /mnt
	# btrfs qgroup limit -e 10M /mnt
	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
	# sync
	# btrfs qgroup show -pcre /mnt
qgroupid rfer    excl    max_rfer max_excl parent  child
-------- ----    ----    -------- -------- ------  -----
0/5      9453568 9453568 0        10485760 ---     ---

Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:47 -07:00
Dongsheng Yang
804ca127fb Btrfs: qgroup: free reserved in exceeding quota.
When we exceed quota limit in writing, we will free
some reserved extent when we need to drop but not free
account in qgroup. It means, each time we exceed quota
in writing, there will be some remain space in qg->reserved
we can not use any more. If things go on like this, the
all space will be ate up.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:46 -07:00
Dongsheng Yang
4087cf24ae Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:44 -07:00
Dongsheng Yang
03477d945f btrfs: qgroup: fix limit args override whole limit struct
btrfs_limit_group use arg limit to override the old qgroup_limit of
corresponding qgroup. However, we should override part of old qgroup_limit
according to the bit which has been set in arg limit.

Signed-off-by: Fan Chengniang <fancn.fnst@cn.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:43 -07:00
Dongsheng Yang
d3001ed3a8 btrfs: qgroup: update limit info in function btrfs_run_qgroups().
When we commit_transaction(), qgroups in btree should be updated.
But, limit info is not considered currently. It will cause a problem
when a qgroup of a snapshot inherit the limit info from srcqgroup,
then there is an inconsistency.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:42 -07:00
Dongsheng Yang
1510e71c62 btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
Cleanup: Change the parameter of update_qgroup_limit_item() to the family of
update_qgroup_xxx_item().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:41 -07:00
Dongsheng Yang
e8c8541ac3 btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
When we call btrfs_qgroup_inherit() with BTRFS_QGROUP_INHERIT_SET_LIMITS,
btrfs will update the limit info of qgroup in btree but forget to update
the qgroup in rbtree at the same time. It obviousely will cause an inconsistency.

This patch fix it by updating the rbtree at the same time.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:40 -07:00
Dongsheng Yang
3eeb4d597e btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
Currently, when we snapshot a subvol, snapshot will not copy the limits
from srcqgroup.

This patch make the qgroup in snapshot inherit the limit info when create
a snapshot.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:38 -07:00
Zhao Lei
c99f1b0c6c btrfs: Support busy loop of write and delete
Reproduce:
 while true; do
   dd if=/dev/zero of=/mnt/btrfs/file count=[75% fs_size]
   rm /mnt/btrfs/file
 done
 Then we can see above loop failed on NO_SPACE.

It it long-term problem since very beginning, because delayed-iput
after rm are not run.

We already have commit_transaction() in alloc_space code, but it is
not triggered in above case.
This patch trigger commit_transaction() to run delayed-iput and
reflash pinned-space to to make write success.

It is based on previous fix of delayed-iput in commit_transaction(),
need to be applied on top of:
btrfs: Fix NO_SPACE bug caused by delayed-iput

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:31:10 -07:00
Zhao Lei
d7c151717a btrfs: Fix NO_SPACE bug caused by delayed-iput
Steps to reproduce:
  while true; do
    dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
    rm /btrfs_dir/file
    sync
  done

  And we'll see dd failed because btrfs return NO_SPACE.

Reason:
  Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  in end to free fs space for next write, but sometimes it hadn't
  done work on time, because btrfs-cleaner thread get delayed-iputs
  from list before, but do iput() after next write.

  This is log:
  [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin

  [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
  [ 2569.087554] comm=sync func=btrfs_commit_transaction() end

  [ 2569.191081] comm=dd begin
  [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28

  [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
  [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
  ...
  [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
  [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end

Fix:
  Make btrfs_commit_transaction() wait current running btrfs-cleaner's
  delayed-iputs() done in end.

Test:
  Use script similar to above(more complex),
  before patch:
    7 failed in 100 * 20 loop.
  after patch:
    0 failed in 100 * 20 loop.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:41 -07:00
Zhao Lei
18d018ad2c btrfs: add WARN_ON() to check is space_info op current
space_info's value calculation is some complex and easy to cause
bug, add WARN_ON() to help debug.

Changelog v1->v2:
 Put WARN_ON()s under the ENOSPC_DEBUG mount option.
 Suggested by: David Sterba <dsterba@suse.cz>

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:35 -07:00
Zhao Lei
c30666d466 btrfs: Set relative data on clear btrfs_block_group_cache->pinned
Bug1:
  space_info->bytes_readonly was set to very large(negative) value in
  btrfs_remove_block_group().

Reason:
  Current code set block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(),
  but above space was not counted to space_info->bytes_readonly.

  Then in btrfs_remove_block_group():
    block_group->space_info->bytes_readonly -= block_group->key.offset;
  We can see following value in trace:
    btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728

Bug2:
  space_info->total_bytes_pinned grow to value larger than fs size.
  In a 1.2G fs, we can get following trace log:
  at first:
    ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496
  after some op:
    ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648
  after some op:
    ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064
  ...

Reason:
  Similar to bug1, we also need to adjust space_info->total_bytes_pinned
  in above code block.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:29 -07:00
Zhao Lei
264ca0f60b btrfs: Adjust commit-transaction condition to avoid NO_SPACE more
If we have any chance to make a successful write, we should not give up.

This patch adjust commit-transaction condition from:
  pinned >= wanted
to
  left + pinned >= wanted

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:24 -07:00
Zhao Lei
f2ab76188e btrfs: Fix tail space processing in find_free_dev_extent()
It is another reason for NO_SPACE case.

When we found enough free space in loop and saved them to
max_hole_start/size before, and tail space contains pending extent,
origional innocent max_hole_start/size are reset in retry.

As a result, find_free_dev_extent() returns less space than it can,
and cause NO_SPACE in user program.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:18 -07:00
Zhao Lei
94b947b2f3 btrfs: fix condition of commit transaction
Old code bypass commit transaction when we don't have enough
pinned space, but another case is there exist freed bgs in current
transction, it have possibility to make alloc_chunk success.

This patch modify the condition to:
if (have_free_bg || have_pinned_space) commit_transaction()

Confirmed above action by printk before and after patch.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:26:40 -07:00
Chris Mason
de249e66a7 Btrfs: fix uninit variable in clone ioctl
Commit 0d97a64e0 creates a new variable but doesn't always set it up.
This puts it back to the original method (key.offset + 1) for the cases
not covered by Filipe's new logic.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:03:58 -07:00
Filipe Manana
ccccf3d672 Btrfs: fix inode eviction infinite loop after cloning into it
If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater then the end offset, which triggers immediately the
following warning:

[ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3914.620886] BTRFS: end < start 4095 4096
(...)
[ 3914.638093] Call Trace:
[ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
[ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
[ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
[ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
[ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
[ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
[ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
[ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
[ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
(...)
[ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---

This later makes the inode eviction handler enter an infinite loop that
keeps dumping the following warning over and over:

[ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3915.119913] BTRFS: end < start 4095 4096
(...)
[ 3915.137394] Call Trace:
[ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
[ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
[ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
[ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
[ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
[ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
[ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
[ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
[ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
[ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
[ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
[ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
(...)
[ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---

So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).

This is trivial to reproduce. For example, the steps for the test I just
made for fstests:

  mkfs.btrfs -f SCRATCH_DEV
  mount SCRATCH_DEV $SCRATCH_MNT

  touch $SCRATCH_MNT/foo
  touch $SCRATCH_MNT/bar

  $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
  umount $SCRATCH_MNT

A test case for fstests follows soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:03:28 -07:00
Filipe Manana
113e828386 Btrfs: fix inode eviction infinite loop after extent_same ioctl
If we pass a length of 0 to the extent_same ioctl, we end up locking an
extent range with a start offset greater then its end offset (if the
destination file's offset is greater than zero). This results in a warning
from extent_io.c:insert_state through the following call chain:

  btrfs_extent_same()
    btrfs_double_lock()
      lock_extent_range()
        lock_extent(inode->io_tree, offset, offset + len - 1)
          lock_extent_bits()
            __set_extent_bit()
              insert_state()
                --> WARN_ON(end < start)

This leads to an infinite loop when evicting the inode. This is the same
problem that my previous patch titled
"Btrfs: fix inode eviction infinite loop after cloning into it" addressed
but for the extent_same ioctl instead of the clone ioctl.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:03:27 -07:00
Filipe Manana
df858e7672 Btrfs: fix range cloning when same inode used as source and destination
While searching for extents to clone we might find one where we only use
a part of it coming from its tail. If our destination inode is the same
the source inode, we end up removing the tail part of the extent item and
insert after a new one that point to the same extent with an adjusted
key file offset and data offset. After this we search for the next extent
item in the fs/subvol tree with a key that has an offset incremented by
one. But this second search leaves us at the new extent item we inserted
previously, and since that extent item has a non-zero data offset, it
it can make us call btrfs_drop_extents with an empty range (start == end)
which causes the following warning:

[23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]()
(...)
[23978.557266] Call Trace:
[23978.557978]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[23978.559191]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[23978.560699]  [<ffffffffa047f0ea>] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs]
[23978.562389]  [<ffffffff8104544d>] warn_slowpath_null+0x1a/0x1c
[23978.563613]  [<ffffffffa047f0ea>] btrfs_drop_extent_cache+0x43/0x385 [btrfs]
[23978.565103]  [<ffffffff810e3a18>] ? time_hardirqs_off+0x15/0x28
[23978.566294]  [<ffffffff81079ff8>] ? trace_hardirqs_off+0xd/0xf
[23978.567438]  [<ffffffffa047f73d>] __btrfs_drop_extents+0x6b/0x9e1 [btrfs]
[23978.568702]  [<ffffffff8107c03f>] ? trace_hardirqs_on+0xd/0xf
[23978.569763]  [<ffffffff811441c0>] ? ____cache_alloc+0x69/0x2eb
[23978.570817]  [<ffffffff81142269>] ? virt_to_head_page+0x9/0x36
[23978.571872]  [<ffffffff81143c15>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
[23978.573466]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[23978.574962]  [<ffffffffa0480d07>] btrfs_drop_extents+0x66/0x7f [btrfs]
[23978.576179]  [<ffffffffa049aa35>] btrfs_clone+0x516/0xaf5 [btrfs]
[23978.577311]  [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs]
[23978.578520]  [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
[23978.580282]  [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs]
(...)
[23978.591887] ---[ end trace 988ec2a653d03ed3 ]---

Then we attempt to insert a new extent item with a key that already
exists, which makes btrfs_insert_empty_item return -EEXIST resulting in
abortion of the current transaction:

[23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
(...)
[23978.622589] Call Trace:
[23978.623181]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[23978.624359]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[23978.625573]  [<ffffffffa044ab6c>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[23978.626971]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[23978.628003]  [<ffffffff8108a6c8>] ? vprintk_default+0x1d/0x1f
[23978.629138]  [<ffffffffa044ab6c>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[23978.630528]  [<ffffffffa049ad1b>] btrfs_clone+0x7fc/0xaf5 [btrfs]
[23978.631635]  [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs]
[23978.632886]  [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
[23978.634119]  [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs]
(...)
[23978.647714] ---[ end trace 988ec2a653d03ed4 ]---

This is wrong because we should not process the extent item that we just
inserted previously, and instead process the extent item that follows it
in the tree

For example for the test case I wrote for fstests:

   bs=$((64 * 1024))
   mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc
   mount /dev/sdc /mnt

   xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo

   $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo
   $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo

The second clone call fails with -EEXIST, because when we process the
first extent item (offset 262144), we drop part of it (counting from the
end) and then insert a new extent item with a key greater then the key we
found. The next time we search the tree we search for a key with offset
262144 + 1, which leaves us at the new extent item we have just inserted
but we think it refers to an extent that we need to clone.

Fix this by ensuring the next search key uses an offset corresponding to
the offset of the key we found previously plus the data length of the
corresponding extent item. This ensures we skip new extent items that we
inserted and works for the case of implicit holes too (NO_HOLES feature).

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:03:26 -07:00
Al Viro
2ba48ce513 mirror O_APPEND and O_DIRECT into iocb->ki_flags
... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:22 -04:00
Al Viro
3309dd04cb switch generic_write_checks() to iocb and iter
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success.  Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	fs/ext4/file.c
2015-04-11 22:30:21 -04:00
Al Viro
0fa6b005af generic_write_checks(): drop isblk argument
all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:48 -04:00
Omar Sandoval
22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
6f67376318 direct_IO: use iov_iter_rw() instead of rw everywhere
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
17f8c842d2 Remove rw from {,__,do_}blockdev_direct_IO()
Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Al Viro
5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro
c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Chris Mason
cdfb080e18 Btrfs: fix use after free when close_ctree frees the orphan_rsv
Near the end of close_ctree, we're calling btrfs_free_block_rsv
to free up the orphan rsv.  The problem is this call updates the
space_info, which has already been freed.

This adds a new __ function that directly calls kfree instead of trying
to update the space infos.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:29 -07:00
Chris Mason
1bbc621ef2 Btrfs: allow block group cache writeout outside critical section in commit
We loop through all of the dirty block groups during commit and write
the free space cache.  In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit.  We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:22 -07:00
Chris Mason
2b10826800 Btrfs: don't use highmem for free space cache pages
In order to create the free space cache concurrently with FS modifications,
we need to take a few block group locks.

The cache code also does kmap, which would schedule with the locks held.
Instead of going through kmap_atomic, lets just use lowmem for the cache
pages.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:18 -07:00
Chris Mason
c9dc4c6578 Btrfs: two stage dirty block group writeout
Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one.  This commit
switches things around to send down all the caches and then wait on them
in batches.

The end result is much faster, since we're keeping the disk pipeline
full.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:11 -07:00
Chris Mason
4c6d1d85ad btrfs: move struct io_ctl into ctree.h and rename it
We'll need to put the io_ctl into the block_group cache struct, so
name it struct btrfs_io_ctl and move it into ctree.h

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:04 -07:00
Josef Bacik
3bce876fd5 Btrfs: don't steal from the global reserve if we don't have the space
btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv.  We dont' want to end up aborting commit with ENOSPC just
because the evict_inode code was too greedy.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:59 -07:00
Josef Bacik
365c531377 Btrfs: don't commit the transaction in the async space flushing
We're triggering a huge number of commits from
btrfs_async_reclaim_metadata_space.  These aren't really requried,
because everyone calling the async reclaim code is going to end up
triggering a commit on their own.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:54 -07:00
Josef Bacik
cb723e4919 Btrfs: reserve space for block groups
This changes our delayed refs calculations to include the space needed
to write back dirty block groups.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:48 -07:00
Chris Mason
28f75a0e6c Btrfs: refill block reserves during truncate
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.

For very large files, we can easily go through all of that space as we
loop through the extents.  This changes truncate to refill the space
reservation as it progresses through the file.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:34 -07:00
Josef Bacik
1262133b8d Btrfs: account for crcs in delayed ref processing
As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs.  This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.

This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:04:47 -07:00
Chris Mason
28ed1345a5 btrfs: actively run the delayed refs while deleting large files
When we are deleting large files with large extents, we are building up
a huge set of delayed refs for processing.  Truncate isn't checking
often enough to see if we need to back off and process those, or let
a commit proceed.

The end result is long stalls after the rm, and very long commit times.
During the commits, other processes back up waiting to start new
transactions and we get into trouble.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:00:14 -07:00
Guenter Roeck
4a3d1caf8a fs: btrfs: Add missing include file
Building alpha:allmodconfig fails with

fs/btrfs/inode.c: In function 'check_direct_IO':
fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'

due to a missing include file.

Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-01 12:32:07 -07:00
Chris Mason
dd82525956 Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent()
The error handling path for alloc_reserved_tree_block is calling
btrfs_free_and_pin_reserved_extent with a spinning tree lock held.  This
might sleep as we allocate extent_state objects:

 BUG: sleeping function called from invalid context at mm/slub.c:1268
 in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
 5 locks held by kworker/u4:7/11093:
  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #2:  (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
  #3:  (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
  #4:  (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
 CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G        W 4.0.0-rc6-default+ #246
 Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
 Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
  00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
  ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
  0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
 Call Trace:
  [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
  [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
  [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
  [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
  [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
  [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
  [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
  [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
  [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
  [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
  [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
  [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
  [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
  [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
  [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
  [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
  [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
  [<ffffffff81091d51>] ? process_one_work+0x151/0x520
  [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
  [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
  [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
  [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
  [<ffffffff81098686>] kthread+0xf6/0x120
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
  [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
 ------------[ cut here ]------------

This changes things to free the path first, which will also unlock the
extent buffer.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Sterba <dsterba@suse.cz>
Tested-by: Dave Sterba <dsterba@suse.cz>
2015-04-01 08:36:05 -07:00
Liu Bo
e56a951e01 Btrfs: Remove the check for old-style mkfs
This was used to make sure that a fresh btrfs from an older mkfs.btrfs,
but it also allows us to mount a buggy btrfs if this btrfs has the right
superblock head part but has something wrong with chunk tree part[1], and
after that we can hit BUG_ON()s set in the code to prevent something
impossible.

Since David has released "Btrfs progs v3.19-rc2", just remove the check,
if anyone who wants to make a fresh btrfs, please use the latest one.

[1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:25 -07:00
Jeff Mahoney
727b9784b6 btrfs: cleanup orphans while looking up default subvolume
Orphans in the fs tree are cleaned up via open_ctree and subvolume
orphans are cleaned via btrfs_lookup_dentry -- except when a default
subvolume is in use.  The name for the default subvolume uses a manual
lookup that doesn't trigger orphan cleanup and needs to trigger it
manually as well. This doesn't apply to the remount case since the
subvolumes are cleaned up by walking the root radix tree.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:25 -07:00
Tom Van Braeckel
d862095829 btrfs: explicitly set control file's private_data
The private_data member of the Btrfs control device file
(/dev/btrfs-control) is used to hold the current transaction and needs
to be initialized to NULL to signify that no transaction is in progress.

We explicitly set the control file's private_data to NULL to be
independent of whatever value the misc subsystem initializes it to.

Backstory:
----------

The misc subsystem (which is used by /dev/btrfs-control) initializes
a file's private_data to point to the misc device when a driver has
registered a custom open file operation and initializes it to NULL
when a custom open file operation has *not* been provided.

This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc
device structure.

And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.

To simplify things in the misc subsystem, a patch [1] has been proposed
to *always* set private_data to point to the misc device instead of
only doing this when a custom open file operation has been registered.

But before we can fix this in the misc subsystem itself, we need to
modify the (few) drivers that rely on this very subtle behavior.

[1] https://lkml.org/lkml/2014/12/4/939

Signed-off-by: Martin Kepplinger <martink@posteo.de>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:24 -07:00
Chengyu Song
26e726afe0 btrfs: incorrect handling for fiemap_fill_next_extent return
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:24 -07:00
David Sterba
3c3b04d10f btrfs: don't accept bare namespace as a valid xattr
Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
works:

 $ touch file
 $ setfattr -n user. -v 1 file
 $ getfattr -d file
user.="1"

ie. the missing attribute name after the namespace.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
Reported-by: William Douglas <william.douglas@intel.com>
CC: <stable@vger.kernel.org> # 2.6.29+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:24 -07:00
Filipe Manana
dcc82f4783 Btrfs: fix log tree corruption when fs mounted with -o discard
While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.

Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).

Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().

Fixes: e688b7252f (Btrfs: fix extent pinning bugs in the tree log)
CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:56:24 -07:00
Filipe Manana
2f2ff0ee5e Btrfs: fix metadata inconsistencies after directory fsync
We can get into inconsistency between inodes and directory entries
after fsyncing a directory. The issue is that while a directory gets
the new dentries persisted in the fsync log and replayed at mount time,
the link count of the inode that directory entries point to doesn't
get updated, staying with an incorrect link count (smaller then the
correct value). This later leads to stale file handle errors when
accessing (including attempt to delete) some of the links if all the
other ones are removed, which also implies impossibility to delete the
parent directories, since the dentries can not be removed.

Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
when fsyncing a directory, new files aren't logged (their metadata and
dentries) nor any child directories. So this patch fixes this issue too,
since it has the same resolution as the incorrect inode link count issue
mentioned before.

This is very easy to reproduce, and the following excerpt from my test
case for xfstests shows how:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file and directory.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
  mkdir $SCRATCH_MNT/mydir

  # Make sure all metadata and data are durably persisted.
  sync

  # Add a hard link to 'foo' inside our test directory and fsync only the
  # directory. The btrfs fsync implementation had a bug that caused the new
  # directory entry to be visible after the fsync log replay but, the inode
  # of our file remained with a link count of 1.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2

  # Add a few more links and new files.
  # This is just to verify nothing breaks or gives incorrect results after the
  # fsync log is replayed.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
  $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
  ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2

  # Add some subdirectories and new files and links to them. This is to verify
  # that after fsyncing our top level directory 'mydir', all the subdirectories
  # and their files/links are registered in the fsync log and exist after the
  # fsync log is replayed.
  mkdir -p $SCRATCH_MNT/mydir/x/y/z
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
  touch $SCRATCH_MNT/mydir/x/y/z/qwerty

  # Now fsync only our top directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir

  # And fsync now our new file named 'hello', just to verify later that it has
  # the expected content and that the previous fsync on the directory 'mydir' had
  # no bad influence on this fsync.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
  # all with the value 0xaa.
  echo "File 'foo' content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  # Remove the first name of our inode. Because of the directory fsync bug, the
  # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
  # deleting the inode and the other names became stale directory entries (still
  # visible to applications). Attempting to remove or access the remaining
  # dentries pointing to that inode resulted in stale file handle errors and
  # made it impossible to remove the parent directories since it was impossible
  # for them to become empty.
  echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
  rm -f $SCRATCH_MNT/foo

  # Now verify that all files, links and directories created before fsyncing our
  # directory exist after the fsync log was replayed.
  [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
  [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
  [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
  [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
      echo "Link mydir/x/y/foo_y_link is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
      echo "Link mydir/x/y/z/foo_z_link is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
      echo "File mydir/x/y/z/qwerty is missing"

  # We expect our file here to have a size of 64Kb and all the bytes having the
  # value 0xff.
  echo "file 'hello' content after log replay:"
  od -t x1 $SCRATCH_MNT/hello

  # Now remove all files/links, under our test directory 'mydir', and verify we
  # can remove all the directories.
  rm -f $SCRATCH_MNT/mydir/x/y/z/*
  rmdir $SCRATCH_MNT/mydir/x/y/z
  rm -f $SCRATCH_MNT/mydir/x/y/*
  rmdir $SCRATCH_MNT/mydir/x/y
  rmdir $SCRATCH_MNT/mydir/x
  rm -f $SCRATCH_MNT/mydir/*
  rmdir $SCRATCH_MNT/mydir

  # An fsck, run by the fstests framework everytime a test finishes, also detected
  # the inconsistency and printed the following error message:
  #
  # root 5 inode 257 errors 2001, no inode item, link count wrong
  #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
  #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

  status=0
  exit

The expected golden output for the test is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 65536/65536 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File 'foo' content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000
  file 'foo' link count after log replay: 5
  file 'hello' content after log replay:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0200000

Which is the output after this patch and when running the test against
ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
output is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 65536/65536 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File 'foo' content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000
  file 'foo' link count after log replay: 1
  Link mydir/foo_2 is missing
  Link mydir/foo_3 is missing
  Link mydir/x/y/foo_y_link is missing
  Link mydir/x/y/z/foo_z_link is missing
  File mydir/x/y/z/qwerty is missing
  file 'hello' content after log replay:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0200000
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
  rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
  rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty

Fsck, without this fix, also complains about the wrong link count:

  root 5 inode 257 errors 2001, no inode item, link count wrong
      unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
      unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

So fix this by logging the inodes that the dentries point to when
fsyncing a directory.

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:56:23 -07:00
Filipe Manana
bf69196045 Btrfs: change the insertion criteria for the qgroup operations rbtree
After looking at Liu Bo's recent patch (titled
"Btrfs: fix comp_oper to get right order") I realized the search made by
qgroup_oper_exists() was buggy because its rbtree navigation comparison
function, comp_oper_exist(), only looks at the fields bytenr and ref_root
of a tree node, ignoring the seq field completely. This was wrong because
when we insert a node into the rbtree we use comp_oper(), which takes a
decision based first on bytenr, then on seq and then on the ref_root field.
That means qgroup_oper_exists() could miss the fact that at least one
operation with given bytenr and ref_root exists.

Consider the following simple example of a 3 nodes qgroup operations
rbtree (created using comp_oper before this patch), where each node's key
is a tuple with the shape (bytenr, seq, ref_root, op):

                          [ (4096, 2, 20, op X) ]
                         /                       \
                        /                         \
   [ (4096, 1, 5, op Y) ]                         [ (4096, 3, 10, op Z) ]

qgroup_oper_exists() when called to search for an existing operation for
bytenr 4096 and ref root 10 wouldn't find anything because it would go to
the left subtree instead of the right subtree, since comp_oper_exits()
ignores the seq field completely.

Fix this by changing the insertion navigation function to use the ref_root
field right after using the bytenr field and before using the seq field,
so that qgroup_oper_exists() / comp_oper_exist() work as expected.

This patch applies on top of the patch mentioned above from Liu.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:52 -07:00
Filipe Manana
3d850dd448 Btrfs: add missing inode item update in fallocate()
If we fallocate(), without the keep size flag, into an area already covered
by an extent previously fallocated, we were updating the inode's i_size but
we weren't updating the inode item in the fs/subvol tree. A following umount
+ mount would result in a loss of the inode's size (and an fsync would miss
too the fact that the inode changed).

Reproducer:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt
  $ fallocate -n -l 1M /mnt/foobar
  $ fallocate -l 512K /mnt/foobar
  $ umount /mnt
  $ mount /dev/sdd /mnt
  $ od -t x1 /mnt/foobar
  0000000

The expected result is:

  $ od -t x1 /mnt/foobar
  0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  2000000

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:52 -07:00
Filipe Manana
5f806c3ae2 Btrfs: incremental send, remove dead code
The logic to detect path loops when attempting to apply a pending
directory rename, introduced in commit
f959492fc1 (Btrfs: send, fix more issues related to directory renames)
is no longer needed, and the respective fstests test case for that commit,
btrfs/045, now passes without this code (as well as all the other test
cases for send/receive).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:52 -07:00
Filipe Manana
8996a48c0a Btrfs: incremental send, clear name from cache after orphanization
If a directory's reference ends up being orphanized, because the inode
currently being processed has a new path that matches that directory's
path, make sure we evict the name of the directory from the name cache.
This is because there might be descendent inodes (either directories or
regular files) that will be orphanized later too, and therefore the
orphan name of the ancestor must be used, otherwise we send issue rename
operations with a wrong path in the send stream.

Reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir -p /mnt/data/n1/n2/p1/p2
  $ mkdir /mnt/data/n4
  $ mkdir -p /mnt/data/p1/p2

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/p1/p2 /mnt/data
  $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1
  $ mv /mnt/data/p2 /mnt/data/n1/n2/p1
  $ mv /mnt/data/n1/n2 /mnt/data/p1
  $ mv /mnt/data/p1 /mnt/data/n4
  $ mv /mnt/data/n4/p1/n2/p1 /mnt/data

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory

Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent
snapshot are both orphanized during the incremental send, and as soon as
data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2
we use a source path of o263-6-o/p2 for the rename operation instead of
the old path data/p1/p2 (the one before the orphanization of inode 263).

A test case for xfstests follows soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Filipe Manana
2f1f465ae6 Btrfs: send, don't leave without decrementing clone root's send_progress
If the clone root was not readonly or the dead flag was set on it, we were
leaving without decrementing the root's send_progress counter (and before
we just incremented it). If a concurrent snapshot deletion was in progress
and ended up being aborted, it would be impossible to later attempt to
delete again the snapshot, since the root's send_in_progress counter could
never go back to 0.

We were also setting clone_sources_to_rollback to i + 1 too early - if we
bailed out because the clone root we got is not readonly or flagged as dead
we ended up later derreferencing a null pointer because we didn't assign
the clone root to sctx->clone_roots[i].root:

		for (i = 0; sctx && i < clone_sources_to_rollback; i++)
			btrfs_root_dec_send_in_progress(
					sctx->clone_roots[i].root);

So just don't increment the send_in_progress counter if the root is readonly
or flagged as dead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Filipe Manana
5cc2b17e80 Btrfs: send, add missing check for dead clone root
After we locked the root's root item, a concurrent snapshot deletion
call might have set the dead flag on it. So check if the dead flag
is set and abort if it is, just like we do for the parent root.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Filipe Manana
4f764e5153 Btrfs: remove deleted xattrs on fsync log replay
If we deleted xattrs from a file and fsynced the file, after a log replay
the xattrs would remain associated to the file. This was an unexpected
behaviour and differs from what other filesystems do, such as for example
xfs and ext3/4.

Fix this by, on fsync log replay, check if every xattr in the fs/subvol
tree (that belongs to a logged inode) has a matching xattr in the log,
and if it does not, delete it from the fs/subvol tree. This is a similar
approach to what we do for dentries when we replay a directory from the
fsync log.

This issue is trivial to reproduce, and the following excerpt from my
test for xfstests triggers the issue:

  _crash_and_mount()
  {
       # Simulate a crash/power loss.
       _load_flakey_table $FLAKEY_DROP_WRITES
       _unmount_flakey
       _load_flakey_table $FLAKEY_ALLOW_WRITES
       _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create out test file and add 3 xattrs to it.
  touch $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr1 -v val1 $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr2 -v val2 $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr3 -v val3 $SCRATCH_MNT/foobar

  # Make sure everything is durably persisted.
  sync

  # Now delete the second xattr and fsync the inode.
  $SETFATTR_PROG -x user.attr2 $SCRATCH_MNT/foobar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  _crash_and_mount

  # After the fsync log is replayed, the file should have only 2 xattrs, the ones
  # named user.attr1 and user.attr3. The btrfs fsync log replay bug left the file
  # with the 3 xattrs that we had before deleting the second one and fsyncing the
  # file.
  echo "xattr names and values after first fsync log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch

  # Now write some data to our file, fsync it, remove the first xattr, add a new
  # hard link to our file and commit the fsync log by fsyncing some other new
  # file. This is to verify that after log replay our first xattr does not exist
  # anymore.
  echo "hello world!" >> $SCRATCH_MNT/foobar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
  $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
  ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
  touch $SCRATCH_MNT/qwerty
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty

  _crash_and_mount

  # Now only the xattr with name user.attr3 should be set in our file.
  echo "xattr names and values after second fsync log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch

  status=0
  exit

The expected golden output, which is produced with this patch applied or
when testing against xfs or ext3/4, is:

  xattr names and values after first fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr3="val3"

  xattr names and values after second fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr3="val3"

Without this patch applied, the output is:

  xattr names and values after first fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr2="val2"
  user.attr3="val3"

  xattr names and values after second fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr2="val2"
  user.attr3="val3"

A patch with a test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Christoph Hellwig
e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Chris Mason
fc4c3c872f Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/disk-io.c
2015-03-25 10:52:48 -07:00
Chris Mason
9deed229fa Merge branch 'cleanups-for-4.1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:43:16 -07:00
Linus Torvalds
521d474631 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Most of these are fixing extent reservation accounting, or corners
  with tree writeback during commit.

  Josef's set does add a test, which isn't strictly a fix, but it'll
  keep us from making this same mistake again"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix outstanding_extents accounting in DIO
  Btrfs: add sanity test for outstanding_extents accounting
  Btrfs: just free dummy extent buffers
  Btrfs: account merges/splits properly
  Btrfs: prepare block group cache before writing
  Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
  Btrfs: account for the correct number of extents for delalloc reservations
  Btrfs: fix merge delalloc logic
  Btrfs: fix comp_oper to get right order
  Btrfs: catch transaction abortion after waiting for it
  btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-21 10:53:37 -07:00
Josef Bacik
e1cbbfa5f5 Btrfs: fix outstanding_extents accounting in DIO
We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested.  This isn't quite right since we will
be limited to our max extent size.  So instead lets do something horrible!  Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent.  If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly.  Thanks,

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:35 -04:00
Josef Bacik
6a3891c551 Btrfs: add sanity test for outstanding_extents accounting
I introduced a regression wrt outstanding_extents accounting.  These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time.  So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:31 -04:00
Josef Bacik
bcb7e449ec Btrfs: just free dummy extent buffers
If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU.  So check for
this case and just free the things directly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:30:18 -04:00
Josef Bacik
ba11721355 Btrfs: account merges/splits properly
My fix

Btrfs: fix merge delalloc logic

only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent.  We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent.  If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation.  This shows up with a case like this

[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]

Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation.  But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations.  The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:28:21 -04:00
Josef Bacik
dcdf7f6ddb Btrfs: prepare block group cache before writing
Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff.  To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 10:56:55 -04:00
Josef Bacik
ea526d1899 Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Dave could hit this assert consistently running btrfs/078.  This is because
when we update the block groups we could truncate the free space, which would
try to delete the csums for that range and dirty the csum root.  For this to
happen we have to have already written out the csum root so it's kind of hard to
hit this case.  This patch fixes this by changing the logic to only write the
dirty block groups if the dirty_cowonly_roots list is empty.  This will get us
the same effect as before since we add the extent root last, and will cover the
case that we dirty some other root again but not the extent root.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:47:04 -07:00
Josef Bacik
6a41dd0922 Btrfs: account for the correct number of extents for delalloc reservations
Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Josef Bacik
8461a3de77 Btrfs: fix merge delalloc logic
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas.  We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic.  This fixes the regression that was reported by a user on
list.  Thanks,

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Liu Bo
48da5f0a4c Btrfs: fix comp_oper to get right order
Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Liu Bo
b4924a0fa1 Btrfs: catch transaction abortion after waiting for it
This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297.

Fsync() can report success when it actually doesn't.  When we
have several threads running fsync() at the same tiem and in one fsync() we
get a transaction abortion due to some problems(in the test case it's disk
failures), and other fsync()s may return successfully which makes userspace
programs think that data is now safely flushed into disk.

It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
they get to try btrfs_commit_transaction() where it finds that there is
already a transaction being committed, and they'll just call wait_for_commit()
and return.  Note that we actually check "trans->aborted" in btrfs_end_transaction,
but it's likely that the error message is still not yet throwed out and only after
wait_for_commit() we're sure whether the transaction is committed successfully.

This add the necessary check and it now passes the test.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:38:23 -07:00
Fabian Frederick
d22071293f btrfs: fix sizeof format specifier in btrfs_check_super_valid()
This patch fixes mips compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:38:22 -07:00
Linus Torvalds
84399bb075 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Outside of misc fixes, Filipe has a few fsync corners and we're
  pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  Btrfs: remove extra run_delayed_refs in update_cowonly_root
  Btrfs: incremental send, don't rename a directory too soon
  btrfs: fix lost return value due to variable shadowing
  Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
  Btrfs: fix off-by-one logic error in btrfs_realloc_node
  Btrfs: add missing inode update when punching hole
  Btrfs: abort the transaction if we fail to update the free space cache inode
  Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-06 13:52:54 -08:00
Quentin Casasnovas
dd9ef135e3 Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:33 -08:00
Filipe Manana
3a8b36f378 Btrfs: fix data loss in the fast fsync path
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:32 -08:00
Josef Bacik
f5c0a12280 Btrfs: remove extra run_delayed_refs in update_cowonly_root
This got added with my dirty_bgs patch, it's not needed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:30 -08:00
David Sterba
258ece0212 btrfs: remove shadowing variables in __btrfs_map_block
1) We can safely use the function's 'i'. Fixes warning

fs/btrfs/volumes.c:5257:7: warning: declaration of 'i' shadows a previous local
fs/btrfs/volumes.c:4951:6: warning: shadowed declaration is here

2) A local variable duplicates name of an argument, we can use the value
directly. Fixes warning

fs/btrfs/volumes.c:5433:8: warning: declaration of 'length' shadows a parameter
fs/btrfs/volumes.c:4935:27: warning: shadowed declaration is here

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:35 +01:00
David Sterba
093adbcedf btrfs: switch helper macros to static inlines in sysfs.h
The conversion macros use nested container_of that leads to a warning

fs/btrfs/sysfs.c: In function 'btrfs_feature_visible':
fs/btrfs/sysfs.c:183:8: warning: declaration of '__mptr' shadows a previous local
fs/btrfs/sysfs.c:183:8: warning: shadowed declaration is here

Use of functions will add proper type checking.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:02 +01:00
David Sterba
9d644a623e btrfs: cleanup, use correct type in div_u64_rem
div_u64_rem expects u32 for divisior and reminder.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba
47c5713f47 btrfs: replace remaining do_div calls with div_u64 variants
Switch to div_u64_rem that does type checking and has more obvious
semantics than do_div.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba
b8b93addde btrfs: cleanup 64bit/32bit divs, provably bounded values
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:00 +01:00
David Sterba
3284da7b7b btrfs: use explicit initializer for seq_elem
Using {} as initializer for struct seq_elem does not properly initialize
the list_head member, but it currently works because it gets set through
btrfs_get_tree_mod_seq if 'seq' is 0.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:59 +01:00
David Sterba
f64c7b12f8 btrfs: remove shadowing variables in __btrfs_buffered_write
There are lockstart and lockend defined in the function and not used
after their duplicate definition scope ends, it's safe to reuse them.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:58 +01:00
David Sterba
31e818fe73 btrfs: cleanup, use kmalloc_array/kcalloc array helpers
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:58 +01:00
David Sterba
f8c269d722 btrfs: cleanup 64bit/32bit divs, compile time constants
Switch to div_u64 if the divisor is a numeric constant or sum of
sizeof()s. We can remove a few instances of do_div that has the hidden
semtantics of changing the 1st argument.

Small power-of-two divisors are converted to bitshifts, large values are
kept intact for clarity.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:57 +01:00
David Sterba
351810c1d2 btrfs: use cond_resched_lock where possible
Clean the opencoded variant, cond_resched_lock also checks the lock for
contention so it might help in some cases that were not covered by
simple need_resched().

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:56 +01:00
David Sterba
853d8ec4b2 btrfs: need_resched not needed with cond_resched
Cleanup, no special reason to do

if (need_resched())
        cond_resched();

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:55 +01:00
Filipe Manana
84471e2429 Btrfs: incremental send, don't rename a directory too soon
There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.

Steps to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/file

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/c /mnt/x
  $ mv /mnt/a /mnt/x/y
  $ mv /mnt/b /mnt/a

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename b -> a failed. Directory not empty

A test case for xfstests follows soon.

Reported-by: Ames Cornish <ames@cornishes.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
David Sterba
1932b7be97 btrfs: fix lost return value due to variable shadowing
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

CC: <stable@vger.kernel.org>	# 3.6+
Fixes: d187663ef2 ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
5cdf83edb8 Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
The return value from btrfs_lookup_xattr() can be a pointer encoding an
error, therefore deal with it. This fixes commit 5f5bc6b1e2
("Btrfs: make xattr replace operations atomic").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
5dfe2be7ea Btrfs: fix off-by-one logic error in btrfs_realloc_node
The end_slot variable actually matches the number of pointers in the
node and not the last slot (which is 'nritems - 1'). Therefore in order
to check that the current slot in the for loop doesn't match the last
one, the correct logic is to check if 'i' is less than 'end_slot - 1'
and not 'end_slot - 2'.

Fix this and set end_slot to be 'nritems - 1', as it's less confusing
since the variable name implies it's inclusive rather then exclusive.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
e8c1c76e80 Btrfs: add missing inode update when punching hole
When punching a file hole if we endup only zeroing parts of a page,
because the start offset isn't a multiple of the sector size or the
start offset and length fall within the same page, we were not updating
the inode item. This prevented an fsync from doing anything, if no other
file changes happened in the current transaction, because the fields
in btrfs_inode used to check if the inode needs to be fsync'ed weren't
updated.

This issue is easy to reproduce and the following excerpt from the
xfstest case I made shows how to trigger it:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file.
  $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, this makes btrfs update some btrfs inode specific fields
  # that are used to track if the inode needs to be written/updated to the fsync
  # log or not. After this fsync, the new values for those fields indicate that
  # a subsequent fsync does not need to touch the fsync log.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Force a commit of the current transaction. After this point, any operation
  # that modifies the data or metadata of our file, should update those fields in
  # the btrfs inode with values that make the next fsync operation write to the
  # fsync log.
  sync

  # Punch a hole in our file. This small range affects only 1 page.
  # This made the btrfs hole punching implementation write only some zeroes in
  # one page, but it did not update the btrfs inode fields used to determine if
  # the next fsync needs to write to the fsync log.
  $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo

  # Another variation of the previously mentioned case.
  $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo

  # Now fsync the file. This was a no-operation because the previous hole punch
  # operation didn't update the inode's fields mentioned before, so they remained
  # with the values they had after the first fsync - that is, they indicate that
  # it is not needed to write to fsync log.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Enable writes and mount the fs. This makes the fsync log replay code run.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Because the last fsync didn't do anything, here the file content matched what
  # it was after the first fsync, before the holes were punched, and not what it
  # was after the holes were punched.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This issue has been around since 2012, when the punch hole implementation
was added, commit 2aaa665581 ("Btrfs: add hole punching").

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Josef Bacik
0c0ef4bc84 Btrfs: abort the transaction if we fail to update the free space cache inode
Our gluster boxes were hitting a problem where they'd run out of space when
updating the block group cache and therefore wouldn't be able to update the free
space inode.  This is a problem because this is how we invalidate the cache and
protect ourselves from errors further down the stack, so if this fails we have
to abort the transaction so we make sure we don't end up with stale free space
cache.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Filipe Manana
4d884fceaa Btrfs: fix fsync race leading to ordered extent memory leaks
We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

         CPU 1                                         CPU 2

btrfs_sync_file()

                                                 btrfs_sync_file()

  mutex_lock(inode->i_mutex)
  btrfs_log_inode()
    btrfs_get_logged_extents()
      --> collects ordered extent X
      --> increments ordered
          extent X's refcount
    btrfs_submit_logged_extents()
  mutex_unlock(inode->i_mutex)

                                                   mutex_lock(inode->i_mutex)
  btrfs_sync_log()
     btrfs_wait_logged_extents()
       --> list_del_init(&ordered->log_list)
                                                     btrfs_log_inode()
                                                       btrfs_get_logged_extents()
                                                         --> Adds ordered extent X
                                                             to logged_list because
                                                             at this point:
                                                             list_empty(&ordered->log_list)
                                                             && test_bit(BTRFS_ORDERED_LOGGED,
                                                                         &ordered->flags) == 0
                                                         --> Increments ordered extent
                                                             X's refcount
       --> check if ordered extent's io is
           finished or not, start it if
           necessary and wait for it to finish
       --> sets bit BTRFS_ORDERED_LOGGED
           on ordered extent X's flags
           and adds it to trans->ordered
  btrfs_sync_log() finishes

                                                       btrfs_submit_logged_extents()
                                                     btrfs_log_inode() finishes
                                                   mutex_unlock(inode->i_mutex)

btrfs_sync_file() finishes

                                                   btrfs_sync_log()
                                                      btrfs_wait_logged_extents()
                                                        --> Sees ordered extent X has the
                                                            bit BTRFS_ORDERED_LOGGED set in
                                                            its flags
                                                        --> X's refcount is untouched
                                                   btrfs_sync_log() finishes

                                                 btrfs_sync_file() finishes

btrfs_commit_transaction()
  --> called by transaction kthread for e.g.
  btrfs_wait_pending_ordered()
    --> waits for ordered extent X to
        complete
    --> decrements ordered extent X's
        refcount by 1 only, corresponding
        to the increment done by the fsync
        task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.

Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.

This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):

  Btrfs: make sure logged extents complete in the current transaction V3
  commit 50d9aa99bd

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
ppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
[ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
[ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
[ 2612.463163] Call Trace:
[ 2612.463719]  [<ffffffff8142425e>] dump_stack+0x4c/0x65
[ 2612.464789]  [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
[ 2612.466026]  [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247]  [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
[ 2612.468416]  [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625]  [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251]  [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536]  [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
[ 2612.473742]  [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477]  [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695]  [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911]  [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
[ 2612.479106]  [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
[ 2612.480226]  [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 2612.481471]  [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
[ 2612.482686]  [<ffffffff811547a7>] deactivate_super+0x3f/0x43
[ 2612.483791]  [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
[ 2612.484842]  [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
[ 2612.485900]  [<ffffffff8105d019>] task_work_run+0x8f/0xbc
[ 2612.486960]  [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
[ 2612.488083]  [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2612.489333]  [<ffffffff8142a17f>] int_signal+0x12/0x17
[ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
[ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...

Kmemleak confirmed the ordered extent leak (and btrfs inode specific
structures such as delayed nodes):

$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880154290db0 (size 576):
  comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
  hex dump (first 32 bytes):
    01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
    00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
  backtrace:
    [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
    [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
    [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
    [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
    [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
    [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
    [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
    [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88014ef11db0 (size 576):
  comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
  hex dump (first 32 bytes):
    02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
  backtrace:
    [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
    [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
    [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
    [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
    [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
    [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
    [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
    [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800336feda8 (size 584):
  comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
  hex dump (first 32 bytes):
    00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
    00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8114eb34>] create_object+0x172/0x29a
    [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
    [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
    [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
    [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
    [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
    [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
    [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
    [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
    [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
    [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
    [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
    [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
    [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
    [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17

CC: <stable@vger.kernel.org> # 3.19, 3.18 and 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Linus Torvalds
7dac5cb1bc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "I'm still testing more fixes, but I wanted to get out the fix for the
  btrfs raid5/6 memory corruption I mentioned in my merge window pull"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix allocation size calculations in alloc_btrfs_bio
2015-02-26 10:34:24 -08:00
Linus Torvalds
be5e6616dd Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted stuff from this cycle.  The big ones here are multilayer
  overlayfs from Miklos and beginning of sorting ->d_inode accesses out
  from David"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
  autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
  procfs: fix race between symlink removals and traversals
  debugfs: leave freeing a symlink body until inode eviction
  Documentation/filesystems/Locking: ->get_sb() is long gone
  trylock_super(): replacement for grab_super_passive()
  fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
  SELinux: Use d_is_positive() rather than testing dentry->d_inode
  Smack: Use d_is_positive() rather than testing dentry->d_inode
  TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
  Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
  Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
  VFS: Split DCACHE_FILE_TYPE into regular and special types
  VFS: Add a fallthrough flag for marking virtual dentries
  VFS: Add a whiteout dentry type
  VFS: Introduce inode-getting helpers for layered/unioned fs environments
  Infiniband: Fix potential NULL d_inode dereference
  posix_acl: fix reference leaks in posix_acl_create
  autofs4: Wrong format for printing dentry
  ...
2015-02-22 17:42:14 -08:00
David Howells
e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
Chris Mason
e57cf21e97 Btrfs: fix allocation size calculations in alloc_btrfs_bio
Since commit 8e5cfb55d3 (Btrfs: Make raid_map array be inlined in
btrfs_bio structure), the raid map array is allocated along with the
btrfs bio in alloc_btrfs_bio.  The calculation used to decide how much
we need to allocate was using the wrong parameter passed into the
allocation function.

The passed in real_stripes will be zero if a target replace operation
is not currently running.  We want to use total_stripes instead.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: David Sterba <dsterba@suse.cz>
Tested-by: David Sterba <dsterba@suse.cz>
2015-02-20 06:55:15 -08:00
Linus Torvalds
2b9fb532d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is mostly cleanups and fixes:

   - The raid5/6 cleanups from Zhao Lei fixup some long standing warts
     in the code and add improvements on top of the scrubbing support
     from 3.19.

   - Josef has round one of our ENOSPC fixes coming from large btrfs
     clusters here at FB.

   - Dave Sterba continues a long series of cleanups (thanks Dave), and
     Filipe continues hammering on corner cases in fsync and others

  This all was held up a little trying to track down a use-after-free in
  btrfs raid5/6.  It's not clear yet if this is just made easier to
  trigger with this pull or if its a new bug from the raid5/6 cleanups.
  Dave Sterba is the only one to trigger it so far, but he has a
  consistent way to reproduce, so we'll get it nailed shortly"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
  Btrfs: don't remove extents and xattrs when logging new names
  Btrfs: fix fsync data loss after adding hard link to inode
  Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
  Btrfs: account for large extents with enospc
  Btrfs: don't set and clear delalloc for O_DIRECT writes
  Btrfs: only adjust outstanding_extents when we do a short write
  btrfs: Fix out-of-space bug
  Btrfs: scrub, fix sleep in atomic context
  Btrfs: fix scheduler warning when syncing log
  Btrfs: Remove unnecessary placeholder in btrfs_err_code
  btrfs: cleanup init for list in free-space-cache
  btrfs: delete chunk allocation attemp when setting block group ro
  btrfs: clear bio reference after submit_one_bio()
  Btrfs: fix scrub race leading to use-after-free
  Btrfs: add missing cleanup on sysfs init failure
  Btrfs: fix race between transaction commit and empty block group removal
  btrfs: add more checks to btrfs_read_sys_array
  btrfs: cleanup, rename a few variables in btrfs_read_sys_array
  btrfs: add checks for sys_chunk_array sizes
  btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
  ...
2015-02-19 14:36:00 -08:00
David Sterba
a4f3d2c4ef btrfs: cleanup, reduce temporary variables in btrfs_read_roots
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:47 +01:00
David Sterba
6f01105819 btrfs: use correct type for workqueue flags
Through all the local wrappers to alloc_workqueue, __alloc_workqueue_key
takes an unsigned int.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:47 +01:00
Eric Sandeen
4bbcaa648d btrfs: factor btrfs_read_roots() out of open_ctree()
Also, remove the two local variables create_uuid_tree
and check_uuid_tree; we can use the existence of
the uuid root and/or the RESCAN_UUID_TREE flag to
determine what action to take.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen
63443bf54a btrfs: factor btrfs_replay_log() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen
2a4581983f btrfs: factor btrfs_init_workqueues() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_workqueues]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen
f9e92e40b5 btrfs: factor btrfs_init_qgroup() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_qgroup]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen
ad6183680c btrfs: factor btrfs_init_dev_replace_locks() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_dev_replace_locks]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen
f37938e0e2 btrfs: factor btrfs_init_btree_inode() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_btree_inode]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen
779a65a495 btrfs: factor btrfs_init_balance() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_balance]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen
638aa7ed46 btrfs: factor btrfs_init_scrub() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_scrub]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen
0489234077 btrfs: consistently use fs_info in close_ctree()
close_ctree() has a local fs_info var for convienience;
use it consistently.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen
9eaed21ef9 btrfs: remove unused fs_info arg from btrfs_close_extra_devices()
The commit:
8dabb74 Btrfs: change core code of btrfs to support the
        device replace operations
added the fs_info argument, but never used it -
just remove it again.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Fabian Frederick
41d6b13ea7 btrfs: fix sizeof format specifier in btrfs_check_super_valid()
This patch fixes mips compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei
08da757d3f btrfs: cleanup: use for() loop in btrfs_map_bio()
for() is obviously better in these code block, and remove noused
init-value to reduce about 6 bytes binary size.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei
a688a04aab btrfs: remove unused chunk_tree argument in several functions
There functions include unused chunk_tree argument from the begining,
it is time to remove them and clean up relative code to prepare value
of this argument in caller.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei
b9fd47cde5 btrfs: cleanup: remove no-used alloc_chunk in btrfs_check_data_free_space()
int alloc_chunk is never used in this function, remove it.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
David Sterba
e8c9f18603 btrfs: constify structs with op functions or static definitions
There are some op tables that can be easily made const, similarly the
sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Wang Shilong
f749303bda Btrfs: switch to kvfree() helper
A new helper kvfree() in mm/utils.c will do this.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Daniel Dressler
01d58472a8 Btrfs: disk-io: replace root args iff only fs_info used
This is the 3rd independent patch of a larger project to cleanup btrfs's
internal usage of btrfs_root. Many functions take btrfs_root only to
grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) csum_tree_block
  2) csum_dirty_buffer
  3) check_tree_block_fsid
  4) btrfs_find_tree_block
  5) clean_tree_block

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Daniel Dressler
a585e94895 Btrfs: delayed-inode: replace root args iff only fs_info used
This is the second independent patch of a larger project to cleanup
btrfs's internal usage of btrfs_root. Many functions take btrfs_root
only to grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) btrfs_wq_run_delayed_node

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Daniel Dressler
b7a0365ec7 Btrfs: ctree: reduce args where only fs_info used
This patch is part of a larger project to cleanup btrfs's internal usage
of struct btrfs_root. Many functions take btrfs_root only to grab a
pointer to fs_info.

This causes programmers to ponder which root can be passed. Since only
the fs_info is read affected functions can accept any root, except this
is only obvious upon inspection.

This patch reduces the specificty of such functions to accept the
fs_info directly.

This patch does not address the two functions in ctree.c (insert_ptr,
and split_item) which only use root for BUG_ONs in ctree.c

This patch affects the following functions:
  1) fixup_low_keys
  2) btrfs_set_item_key_safe

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Filipe Manana
a742994aa2 Btrfs: don't remove extents and xattrs when logging new names
If we are recording in the tree log that an inode has new names (new hard
links were added), we would drop items, belonging to the inode, that we
shouldn't:

1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
   flags, we ended up dropping all the extent and xattr items that were
   previously logged. This was done only in memory, since logging a new
   name doesn't imply syncing the log;

2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
   flags, we ended up dropping all the xattr items that were previously
   logged. Like the case before, this was done only in memory because
   logging a new name doesn't imply syncing the log.

This led to some surprises in scenarios such as the following:

1) write some extents to an inode;
2) fsync the inode;
3) truncate the inode or delete/modify some of its xattrs
4) add a new hard link for that inode
5) fsync some other file, to force the log tree to be durably persisted
6) power failure happens

The next time the fs is mounted, the fsync log replay code is executed,
and the resulting file doesn't have the content it had when the last fsync
against it was performed, instead if has a content matching what it had
when the last transaction commit happened.

So change the behaviour such that when a new name is logged, only the inode
item and reference items are processed.

This is easy to reproduce with the test I just made for xfstests, whose
main body is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Now shrink our file to 5000 bytes.
  $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo

  # Now do an expanding truncate to a size larger than what we had when we last
  # fsync'ed our file. This is just to verify that after power failure and
  # replaying the fsync log, our file matches what it was when we last fsync'ed
  # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
  # data had a value of 0xcc.
  $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs drop all of our file's
  # metadata from the fsync log, including the metadata relative to the
  # extent we just wrote and fsync'ed. This change was made only to the fsync
  # log in memory, so adding the hard link alone doesn't change the persisted
  # fsync log. This happened because the previous truncates set the runtime
  # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  # After this our persisted fsync log will no longer have metadata for our file
  # foo that points to the extent we wrote and fsync'ed before.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to see a file
  # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
  # the remaining bytes with value 0x00.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The expected result is to see a file with a size of 12Kb, with its first 8Kb
  # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
  # The btrfs bug used to leave the file as it used te be as of the last
  # transaction commit - that is, with a size of 8Kb with all bytes having a
  # value of 0xaa.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

The test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Filipe Manana
1a4bcf470c Btrfs: fix fsync data loss after adding hard link to inode
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).

This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.

This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create one file with data and fsync it.
  # This made the btrfs fsync log persist the data and the inode metadata with
  # a correct inode->i_size (4096 bytes).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Now add one hard link to our file. This made the btrfs code update the fsync
  # log, in memory only, with an inode metadata having a size of 0.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now force persistence of the fsync log to disk, for example, by fsyncing some
  # other file.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # Before a power loss or crash, we could read the 4Kb of data from our file as
  # expected.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the fsync log replay, because the fsync log had a value of 0 for our
  # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
  # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs write into the in memory fsync
  # log a special inode with generation 0 and an i_size of 0 too. Note that this
  # didn't update the inode in the fsync log on disk.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to read the
  # 12Kb of file data.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The btrfs fsync log replay code didn't update the i_size of the persisted
  # inode because the inode item in the log had a special generation with a
  # value of 0 (and it couldn't know the correct i_size, since that inode item
  # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
  # effectively lost.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:

  Btrfs: Add a write ahead tree log to optimize synchronous operations
  (commit e02119d5a7)

Test cases for xfstests follow soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Forrest Liu
3d84be7991 Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Removing large amount of block group in a transaction may encounters
BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata()
will grab metadata reservation from transaction handle, and
btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when
delete unused block group.

The problem can be reproduce by following script

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 1000g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    for j in `seq 1 1 1000`; do
        fallocate -l 1g $mntpath/$j
    done
    # wait cleaner thread remove unused block group
    sleep 300

The call trace that results from the BUG_ON() is:

[  613.093084] ------------[ cut here ]------------
[  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
[  613.105855] invalid opcode: 0000 [#1] SMP
[  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
[  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G            E  3.19.0-rc7-custom #2
[  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000
[  613.154969] RIP: 0010:[<ffffffffa01441c2>]  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.157780] RSP: 0018:ffff880039cf7c48  EFLAGS: 00010286
[  613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000
[  613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138
[  613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000
[  613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000
[  613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001
[  613.170932] FS:  0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
[  613.173316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0
[  613.177554] Stack:
[  613.178712]  ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800
[  613.181297]  ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60
[  613.183782]  ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0
[  613.186171] Call Trace:
[  613.187493]  [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs]
[  613.189801]  [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs]
[  613.192126]  [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs]
[  613.194267]  [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
[  613.196567]  [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs]
[  613.198687]  [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs]
[  613.200758]  [<ffffffff8108f232>] kthread+0xd2/0xf0
[  613.202616]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
[  613.204738]  [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0
[  613.206652]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
[  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
[  613.216562] RIP  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.218828]  RSP <ffff880039cf7c48>
[  613.220382] ---[ end trace 71073106deb8a457 ]---

This patch replace btrfs_join_transaction() with btrfs_start_transaction() in
btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add()

Signed-off-by: Forrest Liu <forrestl@synology.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Josef Bacik
dcab6a3b2a Btrfs: account for large extents with enospc
On our gluster boxes we stream large tar balls of backups onto our fses.  With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent.  The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need.  To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:48 -08:00
Josef Bacik
3266789f9d Btrfs: don't set and clear delalloc for O_DIRECT writes
We do this to get the space accounting, but this is just needless churn on the
io_tree, so just drop setting/clearing delalloc and just drop the reserved data
space when we have a successfull allocation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
Josef Bacik
3e05bde8c3 Btrfs: only adjust outstanding_extents when we do a short write
We have this weird dance where we always inc outstanding_extents when we do a
O_DIRECT write, even if we allocate the entire range.  To get around this we
also drop the metadata space if we successfully write.  This is an unnecessary
dance, we only need to jack up outstanding_extents if we don't satisfy the
entire range request in get_blocks_direct, otherwise we are good using our
original reservation.  So drop the unconditional inc and the drop of the
metadata space that we have for the unconditional inc.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
Zhao Lei
13212b54d1 btrfs: Fix out-of-space bug
Btrfs will report NO_SPACE when we create and remove files for several times,
and we can't write to filesystem until mount it again.

Steps to reproduce:
 1: Create a single-dev btrfs fs with default option
 2: Write a file into it to take up most fs space
 3: Delete above file
 4: Wait about 100s to let chunk removed
 5: goto 2

Script is like following:
 #!/bin/bash

 # Recommend 1.2G space, too large disk will make test slow
 DEV="/dev/sda16"
 MNT="/mnt/tmp"

 dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
 file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))

 echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"

 for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
 echo "mkfs $DEV"
 mkfs.btrfs -f "$DEV" >/dev/null || exit 2
 echo "mount $DEV $MNT"
 mount "$DEV" "$MNT" || exit 2

 for ((loop_i = 0; loop_i < 20; loop_i++)); do
     echo
     echo "loop $loop_i"

     echo "dd file..."
     cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
     "${cmd[@]}" 2>/dev/null || {
         # NO_SPACE error triggered
         echo "dd failed: ${cmd[*]}"
         exit 1
     }

     echo "rm file..."
     rm -f "$MNT"/file0 || exit 2

     for ((i = 0; i < 10; i++)); do
         df "$MNT" | tail -1
         sleep 10
     done
 done

Reason:
 It is triggered by commit: 47ab2a6c68
 which is used to remove empty block groups automatically, but the
 reason is not in that patch. Code before works well because btrfs
 don't need to create and delete chunks so many times with high
 complexity.
 Above bug is caused by many reason, any of them can trigger it.

Reason1:
 When we remove some continuous chunks but leave other chunks after,
 these disk space should be used by chunk-recreating, but in current
 code, only first create will successed.
 Fixed by Forrest Liu <forrestl@synology.com> in:
 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason2:
 contains_pending_extent() return wrong value in calculation.
 Fixed by Forrest Liu <forrestl@synology.com> in:
 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason3:
 btrfs_check_data_free_space() try to commit transaction and retry
 allocating chunk when the first allocating failed, but space_info->full
 is set in first allocating, and prevent second allocating in retry.
 Fixed in this patch by clear space_info->full in commit transaction.

 Tested for severial times by above script.

Changelog v3->v4:
 use light weight int instead of atomic_t to record have_remove_bgs in
 transaction, suggested by:
 Josef Bacik <jbacik@fb.com>

Changelog v2->v3:
 v2 fixed the bug by adding more commit-transaction, but we
 only need to reclaim space when we are really have no space for
 new chunk, noticed by:
 Filipe David Manana <fdmanana@gmail.com>

 Actually, our code already have this type of commit-and-retry,
 we only need to make it working with removed-bgs.
 v3 fixed the bug with above way.

Changelog v1->v2:
 v1 will introduce a new bug when delete and create chunk in same disk
 space in same transaction, noticed by:
 Filipe David Manana <fdmanana@gmail.com>
 V2 fix this bug by commit transaction after remove block grops.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: Filipe David Manana <fdmanana@gmail.com>
Suggested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
Filipe Manana
f55985f4dd Btrfs: scrub, fix sleep in atomic context
My previous patch "Btrfs: fix scrub race leading to use-after-free"
introduced the possibility to sleep in an atomic context, which happens
when the scrub_lock mutex is held at the time scrub_pending_bio_dec()
is called - this function can be called under an atomic context.
Chris ran into this in a debug kernel which gave the following trace:

[ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
[ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
[ 1928.981324] INFO: lockdep is turned off.
[ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G        W     3.19.0-rc7-mason+ #41
[ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1929.022207]  ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78
[ 1929.037267]  ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8
[ 1929.052315]  0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8
[ 1929.067381] Call Trace:
[ 1929.072344]  <IRQ>  [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e
[ 1929.083968]  [<ffffffff810856c4>] ___might_sleep+0x174/0x230
[ 1929.095352]  [<ffffffff810857d2>] __might_sleep+0x52/0x90
[ 1929.106223]  [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0
[ 1929.117951]  [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10
[ 1929.129708]  [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs]
[ 1929.143370]  [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs]
[ 1929.157191]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.167382]  [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs]
[ 1929.180161]  [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs]
[ 1929.194318]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.204496]  [<ffffffff8130401b>] blk_update_request+0x1eb/0x450
[ 1929.216569]  [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500
[ 1929.229176]  [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0
[ 1929.240740]  [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0
[ 1929.252654]  [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150
[ 1929.264725]  [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170
[ 1929.276635]  [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0
[ 1929.288014]  [<ffffffff8105d92e>] __do_softirq+0xde/0x600
[ 1929.298885]  [<ffffffff8105df6d>] irq_exit+0xbd/0xd0
(...)

Fix this by using a reference count on the scrub context structure
instead of locking the scrub_lock mutex.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
Filipe Manana
575849ecf5 Btrfs: fix scheduler warning when syncing log
We try to lock a mutex while the current task state is not TASK_RUNNING,
which results in the following warning when CONFIG_DEBUG_LOCK_ALLOC=y:

[30736.772501] ------------[ cut here ]------------
[30736.774545] WARNING: CPU: 9 PID: 19972 at kernel/sched/core.c:7300 __might_sleep+0x8b/0xa8()
[30736.783453] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff8107499b>] prepare_to_wait+0x43/0x89
[30736.786261] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc psmouse parport pcspkr microcode serio_raw evdev processor thermal_sys i2c_piix4 i2c_core button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod
[30736.794323] CPU: 9 PID: 19972 Comm: fsstress Not tainted 3.19.0-rc7-btrfs-next-5+ #1
[30736.795821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[30736.798788]  0000000000000009 ffff88042743fbd8 ffffffff814248ed ffff88043d32f2d8
[30736.800504]  ffff88042743fc28 ffff88042743fc18 ffffffff81045338 0000000000000001
[30736.802131]  ffffffff81064514 ffffffff817c52d1 000000000000026d 0000000000000000
[30736.803676] Call Trace:
[30736.804256]  [<ffffffff814248ed>] dump_stack+0x4c/0x65
[30736.805245]  [<ffffffff81045338>] warn_slowpath_common+0xa1/0xbb
[30736.806360]  [<ffffffff81064514>] ? __might_sleep+0x8b/0xa8
[30736.807391]  [<ffffffff81045398>] warn_slowpath_fmt+0x46/0x48
[30736.808511]  [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89
[30736.809620]  [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89
[30736.810691]  [<ffffffff81064514>] __might_sleep+0x8b/0xa8
[30736.811703]  [<ffffffff81426eaf>] mutex_lock_nested+0x2f/0x3a0
[30736.812889]  [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[30736.814138]  [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf
[30736.819878]  [<ffffffffa038cfff>] wait_for_writer.isra.12+0x91/0xaa [btrfs]
[30736.821260]  [<ffffffff810748bd>] ? signal_pending_state+0x31/0x31
[30736.822410]  [<ffffffffa0391f0a>] btrfs_sync_log+0x160/0x947 [btrfs]
[30736.823574]  [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[30736.824847]  [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf
[30736.825972]  [<ffffffffa036e555>] btrfs_sync_file+0x2b0/0x319 [btrfs]
[30736.827684]  [<ffffffff8117901a>] vfs_fsync_range+0x21/0x23
[30736.828932]  [<ffffffff81179038>] vfs_fsync+0x1c/0x1e
[30736.829917]  [<ffffffff8117928b>] do_fsync+0x34/0x4e
[30736.830862]  [<ffffffff811794b3>] SyS_fsync+0x10/0x14
[30736.831819]  [<ffffffff8142a512>] system_call_fastpath+0x12/0x17
[30736.832982] ---[ end trace c0b57df60d32ae5c ]---

Fix this my acquiring the mutex after calling finish_wait(), which sets the
task's state to TASK_RUNNING.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
Linus Torvalds
6bec003528 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block
Pull backing device changes from Jens Axboe:
 "This contains a cleanup of how the backing device is handled, in
  preparation for a rework of the life time rules.  In this part, the
  most important change is to split the unrelated nommu mmap flags from
  it, but also removing a backing_dev_info pointer from the
  address_space (and inode), and a cleanup of other various minor bits.

  Christoph did all the work here, I just fixed an oops with pages that
  have a swap backing.  Arnd fixed a missing export, and Oleg killed the
  lustre backing_dev_info from staging.  Last patch was from Al,
  unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
  Make super_blocks and sb_lock static
  mtd: export new mtd_mmap_capabilities
  fs: make inode_to_bdi() handle NULL inode
  staging/lustre/llite: get rid of backing_dev_info
  fs: remove default_backing_dev_info
  fs: don't reassign dirty inodes to default_backing_dev_info
  nfs: don't call bdi_unregister
  ceph: remove call to bdi_unregister
  fs: remove mapping->backing_dev_info
  fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
  nilfs2: set up s_bdi like the generic mount_bdev code
  block_dev: get bdev inode bdi directly from the block device
  block_dev: only write bdev inode on close
  fs: introduce f_op->mmap_capabilities for nommu mmap support
  fs: kill BDI_CAP_SWAP_BACKED
  fs: deduplicate noop_backing_dev_info
2015-02-12 13:50:21 -08:00
Konstantin Khebnikov
8d38633c3b page_writeback: put account_page_redirty() after set_page_dirty()
Helper account_page_redirty() fixes dirty pages counter for redirtied
pages.  This patch puts it after dirtying and prevents temporary
underflows of dirtied pages counters on zone/bdi and current->nr_dirtied.

Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:04 -08:00
Kirill A. Shutemov
d83a08db5b mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub
Nobody uses it anymore.

[akpm@linux-foundation.org: fix filemap_xip.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:30:30 -08:00
Linus Torvalds
23e8fe2e16 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this cycle are:

   - Documentation updates.

   - Miscellaneous fixes.

   - Preemptible-RCU fixes, including fixing an old bug in the
     interaction of RCU priority boosting and CPU hotplug.

   - SRCU updates.

   - RCU CPU stall-warning updates.

   - RCU torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  rcu: Initialize tiny RCU stall-warning timeouts at boot
  rcu: Fix RCU CPU stall detection in tiny implementation
  rcu: Add GP-kthread-starvation checks to CPU stall warnings
  rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors
  rcu: Optionally run grace-period kthreads at real-time priority
  ksoftirqd: Use new cond_resched_rcu_qs() function
  ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
  rcutorture: Add more diagnostics in rcu_barrier() test failure case
  torture: Flag console.log file to prevent holdovers from earlier runs
  torture: Add "-enable-kvm -soundhw pcspk" to qemu command line
  rcutorture: Handle different mpstat versions
  rcutorture: Check from beginning to end of grace period
  rcu: Remove redundant rcu_batches_completed() declaration
  rcutorture: Drop rcu_torture_completed() and friends
  rcu: Provide rcu_batches_completed_sched() for TINY_RCU
  rcutorture: Use unsigned for Reader Batch computations
  rcutorture: Make build-output parsing correctly flag RCU's warnings
  rcu: Make _batches_completed() functions return unsigned long
  rcutorture: Issue warnings on close calls due to Reader Batch blows
  documentation: Fix smp typo in memory-barriers.txt
  ...
2015-02-09 14:28:42 -08:00
Linus Torvalds
bdfeb5a104 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "Forrest Liu tracked down a missing blk_finish_plug in the btrfs
  logging code.  This isn't a new bug, and it's hard to hit.  But, it's
  safe enough for inclusion now, and in my for-linus branch"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: add missing blk_finish_plug in btrfs_sync_log()
2015-02-07 11:04:48 -08:00
Forrest Liu
3da5ab5648 Btrfs: add missing blk_finish_plug in btrfs_sync_log()
Add missing blk_finish_plug in btrfs_sync_log()

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-04 18:02:37 -08:00
Gui Hecheng
b76808fc26 btrfs: cleanup init for list in free-space-cache
o removed an unecessary INIT_LIST_HEAD after LIST_HEAD

o merge a declare & INIT_LIST_HEAD pair into one LIST_HEAD

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:50 -08:00
Shaohua Li
2f0810880f btrfs: delete chunk allocation attemp when setting block group ro
Below test will fail currently:
      mkfs.ext4 -F /dev/sda
      btrfs-convert /dev/sda
      mount /dev/sda /mnt
      btrfs device add -f /dev/sdb /mnt
      btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt

The reason is there are some block groups with usage 0, but the whole
disk hasn't free space to allocate new chunk, so we even can't set such
block group readonly. This patch deletes the chunk allocation when
setting block group ro. For META, we already have reserve. But for
SYSTEM, we don't have, so the check_system_chunk is still required.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:20 -08:00
Naohiro Aota
289454ad26 btrfs: clear bio reference after submit_one_bio()
After submit_one_bio(), `bio' can go away. However submit_extent_page()
leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
It will cause invalid paging request when submit_extent_page() is called
next time.

I reproduced ENOMEM case with the following script (need
CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).

  #!/bin/bash

  dmesgout=dmesg.txt
  start=100000
  end=300000
  step=1000

  # btrfs options
  device=/dev/vdb1
  directory=/mnt/btrfs

  # fault-injection options
  percent=100
  times=3

  mkdir -p $directory || exit 1
  mount -o compress $device $directory || exit 1

  rm -f $directory/file || exit 1
  dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1

  for interval in `seq $start $step $end`; do
          dmesg -C
          echo 1 > /proc/sys/vm/drop_caches
          sync
          export FAILCMD_TYPE=fail_page_alloc
          ./failcmd.sh -p $percent -t $times -i $interval \
                  --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
                  -- \
                  cat $directory/file > /dev/null
          dmesg > ${dmesgout}
          if grep -q BUG: ${dmesgout}; then
                  cat ${dmesgout}
                  exit 1
          fi
  done

  umount $directory
  exit 0

Signed-off-by: Naohiro Aota <naota@elisp.net>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:51 -08:00
Filipe Manana
de554a4fa6 Btrfs: fix scrub race leading to use-after-free
While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:

[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---

This is due to a race that can happen in a very tiny time window and is
illustrated by the following sequence diagram:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       atomic_dec(&sctx->bios_in_flight)
                                                                   wait sctx->bios_in_flight == 0
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
       wake_up(&sctx->list_wait)
          __wake_up()
              spin_lock_irqsave(&sctx->list_wait->lock, flags)

Another variation of this scenario that results in the same use-after-free
issue is:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
                                                                   wait sctx->bios_in_flight == 0
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       __wake_up(&sctx->list_wait)
          spin_lock_irqsave(&sctx->list_wait->lock, flags)
          default_wake_function()
              wake up task at CPU 2
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait->lock, flags)

Fix this by holding the scrub lock while doing the wakeup.

This isn't a recent regression, the issue as been around since the scrub
feature was added (2011, commit a2de733c78).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:50 -08:00
Filipe Manana
001a648df4 Btrfs: add missing cleanup on sysfs init failure
If we failed during initialization of sysfs, we weren't unregistering the
top level btrfs sysfs entry nor the debugfs stuff.
Not unregistering the top level sysfs entry makes future attempts to reload
the btrfs module impossible and the following is reported in dmesg:

[ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0()
[ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs'
[ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs]
[ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G        W    3.13.0-fdm-btrfs-next-24+ #7
[ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2246.451312]  0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410
[ 2246.451314]  ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290
[ 2246.451316]  ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80
[ 2246.451318] Call Trace:
[ 2246.451322]  [<ffffffff816f1da6>] dump_stack+0x4e/0x68
[ 2246.451324]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
[ 2246.451325]  [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50
[ 2246.451328]  [<ffffffff81367dc5>] ? strlcat+0x65/0x90
(....)

This fixes the following change:

    btrfs: add simple debugfs interface
    commit 1bae30982b

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:49 -08:00
Filipe Manana
d4b450cd4b Btrfs: fix race between transaction commit and empty block group removal
Committing a transaction can race with automatic removal of empty block
groups (cleaner kthread), leading to a BUG_ON() in the transaction
commit code while running btrfs_finish_extent_commit(). The following
sequence diagram shows how it can happen:

           CPU 1                                       CPU 2

btrfs_commit_transaction()
  fs_info->running_transaction = NULL
  btrfs_finish_extent_commit()
    find_first_extent_bit()
      -> found range for block group X
         in fs_info->freed_extents[]

                                               btrfs_delete_unused_bgs()
                                                 -> found block group X

                                                 Removed block group X's range
                                                 from fs_info->freed_extents[]

                                                 btrfs_remove_chunk()
                                                    btrfs_remove_block_group(bg X)

    unpin_extent_range(bg X range)
       btrfs_lookup_block_group(bg X)
          -> returns NULL
            -> BUG_ON()

The trace that results from the BUG_ON() is:

[48665.187808] ------------[ cut here ]------------
[48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
[48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
[48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
[48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
[48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
[48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
[48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
[48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
[48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
[48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
[48665.197388] Stack:
[48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
[48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
[48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
[48665.197388] Call Trace:
[48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
[48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
[48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
[48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
[48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
[48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
[48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
[48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
[48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
[48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
[48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
[48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
[48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388]  RSP <ffff8801b56a7b88>
[48665.272246] ---[ end trace b9c6ab9957521376 ]---

Fix this by ensuring that unpining the block group's range in
btrfs_finish_extent_commit() is done in a synchronized fashion
with removing the block group's range from freed_extents[]
in btrfs_delete_unused_bgs()

This race got introduced with the change:

    Btrfs: remove empty block groups automatically
    commit 47ab2a6c68

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:48 -08:00
David Sterba
e3540eab29 btrfs: add more checks to btrfs_read_sys_array
Verify that the sys_array has enough bytes to read the next item.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:39 -08:00
David Sterba
1ffb22cf8c btrfs: cleanup, rename a few variables in btrfs_read_sys_array
There's a pointer to buffer, integer offset and offset passed as
pointer, try to find matching names for them.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:28 -08:00
David Sterba
ce7fca5f57 btrfs: add checks for sys_chunk_array sizes
Verify that possible minimum and maximum size is set, validity of
contents is checked in btrfs_read_sys_array.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:23:43 -08:00
David Sterba
75d6ad382b btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
I received a few crafted images from Jiri, all got through the recently
added superblock checks. The lower bounds checks for num_devices and
sector/node -sizes were missing and caused a crash during mount.

Tools for symbolic code execution were used to prepare the images
contents.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:20:39 -08:00
chandan r
9cc97d6462 Btrfs: Add code to support file creation time
This patch adds a new member to the 'struct btrfs_inode' structure to hold
the file creation time.

Signed-off-by: chandan <chandanrmail@gmail.com>
[refreshed, removed btrfs_inode_otime]
Signed-off-by: David Sterba <dsterba@suse.cz>

Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:16 -08:00
David Sterba
a937b9791e btrfs: kill btrfs_inode_*time helpers
They just opencode taking address of the timespec member.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:07 -08:00
Linus Torvalds
bc208e0ee0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have one more fix for btrfs in my for-linus branch - this was a bug
  in the new raid5/6 scrubbing support"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix raid56 scrub failed in xfstests btrfs/072
2015-01-30 14:25:52 -08:00
Gui Hecheng
063c54dccd btrfs: fix raid56 scrub failed in xfstests btrfs/072
The xfstests btrfs/072 reports uncorrectable read errors in dmesg,
because scrub forgets to use commit_root for parity scrub routine
and scrub attempts to scrub those extents items whose contents are
not fully on disk.

To fix it, we just add the @search_commit_root flag back.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-27 15:26:16 -08:00
Linus Torvalds
c4e00f1d31 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have a few fixes in my for-linus branch.

  Qu Wenruo's batch fix a regression between some our merge window pull
  and the inode_cache feature.  The rest are smaller bugs"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
  btrfs: Fix the bug that fs_info->pending_changes is never cleared.
  btrfs: fix state->private cast on 32 bit machines
  Btrfs: fix race deleting block group from space_info->ro_bgs list
  Btrfs: fix incorrect freeing in scrub_stripe
  btrfs: sync ioctl, handle errors after transaction start
2015-01-24 14:31:27 +12:00
chandan
95449a1626 Btrfs: insert_new_root: Fix lock type of the extent buffer.
btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has
been taken. Hence assign the appropriate value to path->locks[level].

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22 05:42:23 -08:00
Anand Jain
78f55e5e1f Btrfs: fix unused members in struct btrfs_root
There isn't any real use of following members of struct btrfs_root
so delete them.

struct kobject root_kobj;
struct completion kobj_unregister;

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:37 -08:00
Yang Dongsheng
0ee13fe28c btrfs: qgroup: move WARN_ON() to the correct location.
In function qgroup_excl_accounting(), we need to WARN when
qg->excl is less than what we want to free, same to child
and parents. But currently, for parent qgroup, the WARN_ON()
is located after freeing qg->excl. It will WARN out even we
free it normally.

This patch move this WARN_ON() before freeing qg->excl.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:37 -08:00
Liu Bo
26455d3318 Btrfs: cleanup unused run_most
"run_most" is not used anymore.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:16 -08:00
Zhao Lei
570193454a Rename all ref_count to refs in struct
refs is better than ref_count to record a struct's ref count.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:50 -08:00
Zhao Lei
ffe2d2034b Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply
So we can check raid56 with:
 (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
 (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
10f1190016 Btrfs: Include map_type in raid_bio
Corrent code use many kinds of "clever" way to determine operation
target's raid type, as:
  raid_map != NULL
  or
  raid_map[MAX_NR] == RAID[56]_Q_STRIPE

To make code easy to maintenance, this patch put raid type into
bbio, and we can always get raid type from bbio with a "stupid"
way.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
be50a8ddaa Btrfs: Simplify scrub_setup_recheck_block()'s argument
scrub_setup_recheck_block() have many arguments but most of them
can be get from one of them, we can remove them to make code clean.
Some other cleanup for that function also included in this patch.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
b968fed1c3 Btrfs: Combine per-page recover in dev-replace and scrub
The code are similar, combine them to make code clean and easy to maintenance.
Some lost condition are also completed with benefit of this combination.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
8d6738c1bd Btrfs: Separate finding-right-mirror and writing-to-target's process in scrub_handle_errored_block()
In corrent code, code of finding-right-mirror and writing-to-target
are mixed in logic, if we find a right mirror but failed in writing
to target, it will treat as "hadn't found right block", and fill the
target with sblock_bad.

Actually, "failed in writing to target" does not mean "source
block is wrong", this patch separate above two condition in logic,
and do some cleanup to make code clean.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
dc5f7a3bd8 Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block()
Use break instead of useless loop should be more suitable in this
case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
7653947fe6 Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
09dd7a01c3 Btrfs: Cleanup btrfs_bio_counter_inc_blocked()
1: Remove no-need DEFINE_WAIT(wait)
2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition
3: Use while loop instead of goto

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
114ab50d82 Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replace
It is always 1 in this place, because !1 case was already jumped
out in previous code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
b25c94c580 Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON()
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
    ...
    goto nodatasum_case;
}
...
nodatasum_case:
    WARN_ON(sctx->is_dev_replace);

In above code, nodatasum_case marker should be moved after
WARN_ON().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
6e9606d2a2 Btrfs: add ref_count and free function for btrfs_bio
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
   to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
   forced data type checking in compile.

Changelog v1->v2:
 Rename following by David Sterba's suggestion:
 put_btrfs_bio() -> btrfs_put_bio()
 get_btrfs_bio() -> btrfs_get_bio()
 bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
8e5cfb55d3 Btrfs: Make raid_map array be inlined in btrfs_bio structure
It can make code more simple and clear, we need not care about
free bbio and raid_map together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
cc7539edea Btrfs: sort raid_map before adding tgtdev stripes
It can avoid complex calculation of real stripes in sort,
moreover, we can clean up code of sorting tgtdev_map because it
will be in order initially.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
e34c330d63 Btrfs: fix a out-of-bound access of raid_map
We add the number of stripes on target devices into bbio->num_stripes
if we are under device replacement, and we just sort the raid_map of
those stripes that not on the target devices, so if when we need
real raid_map, we need skip the stripes on the target devices.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Filipe Manana
df8d116ffa Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   rm -f $SCRATCH_MNT/foo_link_0002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

   # Check that the number of hard links is correct, we are able to remove all
   # the hard links and read the file's data. This is just to verify we don't
   # get stale file handle errors (due to dangling directory index entries that
   # point to inodes that no longer exist).
   echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
   [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
   for ((i = 1; i <= 3003; i++)); do
       name=foo_link_`printf "%04d" $i`
       if [ $i -eq 2 ]; then
           [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
       else
           [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
       fi
   done
   rm -f $SCRATCH_MNT/foo_link_*
   cat $SCRATCH_MNT/foo
   rm -f $SCRATCH_MNT/foo

   status=0
   exit

The fix is simply to correct the overflow condition when overwriting a
reference item because it was wrong, trying to increase the item in the
fs/subvol tree by an impossible amount. Also ensure that we don't insert
one normal ref and one ext ref for the same dentry - this happened because
processing a dir index entry from the parent in the log happened when
the normal ref item was full, which made the logic insert an extref and
later when the normal ref had enough room, it would be inserted again
when processing the ref item from the child inode in the log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:05 -08:00
Filipe Manana
2c2c452b0c Btrfs: fix fsync when extend references are added to an inode
If we added an extended reference to an inode and fsync'ed it, the log
replay code would make our inode have an incorrect link count, which
was lower then the expected/correct count.
This resulted in stale directory index entries after deleting some of
the hard links, and any access to the dangling directory entries resulted
in -ESTALE errors because the entries pointed to inode items that don't
exist anymore.

This is easy to reproduce with the test case I made for xfstests, and
the bulk of that test is:

    _scratch_mkfs "-O extref" >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create a test file with 3001 hard links. This number is large enough to
    # make btrfs start using extrefs at some point even if the fs has the maximum
    # possible leaf/node size (64Kb).
    echo "hello world" > $SCRATCH_MNT/foo
    for i in `seq 1 3000`; do
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
    done

    # Make sure all metadata and data are durably persisted.
    sync

    # Add one more link to the inode that ends up being a btrfs extref and fsync
    # the inode.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Now after the fsync log replay btrfs left our inode with a wrong link count N,
    # which was smaller than the correct link count M (N < M).
    # So after removing N hard links, the remaining M - N directory entries were
    # still visible to user space but it was impossible to do anything with them
    # because they pointed to an inode that didn't exist anymore. This resulted in
    # stale file handle errors (-ESTALE) when accessing those dentries for example.
    #
    # So remove all hard links except the first one and then attempt to read the
    # file, to verify we don't get an -ESTALE error when accessing the inodel
    #
    # The btrfs fsck tool also detected the incorrect inode link count and it
    # reported an error message like the following:
    #
    # root 5 inode 257 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
    #
    # The fstests framework automatically calls fsck after a test is run, so we
    # don't need to call fsck explicitly here.

    rm -f $SCRATCH_MNT/foo_link_*
    cat $SCRATCH_MNT/foo

    status=0
    exit

So make sure an fsync always flushes the delayed inode item, so that the
fsync log contains it (needed in order to trigger the link count fixup
code) and fix the extref counting function, which always return -ENOENT
to its caller (and made it assume there were always 0 extrefs).

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
d36808e0d4 Btrfs: fix directory inconsistency after fsync log replay
If we have an inode (file) with a link count greater than 1, remove
one of its hard links, fsync the inode, power fail/crash and then
replay the fsync log on the next mount, we end up getting the parent
directory's metadata inconsistent - its i_size still reflects the
deleted hard link and has dangling index entries (with no matching
inode reference entries). This prevents the directory from ever being
deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
even if all of its children inodes are deleted, and the dangling index
entries can never be removed (as they point to an inode that does not
exist anymore).

This is easy to reproduce with the following excerpt from the test case
for xfstests that I just made:

    _scratch_mkfs >> $seqres.full 2>&1

    _init_flakey
    _mount_flakey

    # Create a test file with 2 hard links in the same directory.
    mkdir -p $SCRATCH_MNT/a/b
    echo "hello world" > $SCRATCH_MNT/a/b/foo
    ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar

    # Make sure all metadata and data are durably persisted.
    sync

    # Now remove one of the hard links and fsync the inode.
    rm -f $SCRATCH_MNT/a/b/bar
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Remove the last hard link of the file and attempt to remove its parent
    # directory - this failed in btrfs because the fsync log and replay code
    # didn't decrement the parent directory's i_size and left dangling directory
    # index entries - this made the btrfs rmdir implementation always fail with
    # the error -ENOTEMPTY.
    #
    # The dangling directory index entries were visible to user space, but it was
    # impossible to do anything on them (unlink, open, read, write, stat, etc)
    # because the inode they pointed to did not exist anymore.
    #
    # The parent directory's metadata inconsistency (stale index entries) was
    # also detected by btrfs' fsck tool, which is run automatically by the fstests
    # framework when the test finishes. The error message reported by fsck was:
    #
    # root 5 inode 259 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
    #
    rm -f $SCRATCH_MNT/a/b/*
    rmdir $SCRATCH_MNT/a/b
    rmdir $SCRATCH_MNT/a

To fix this just make sure that after an unlink, if the inode is fsync'ed,
he parent inode is fully logged in the fsync log.

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
6219872dc6 Btrfs: lookup for block group only if needed when freeing a tree block
Very often our extent buffer's header generation doesn't match the current
transaction's id or it is also referenced by other trees (snapshots), so
we don't need the corresponding block group cache object. Therefore only
search for it if we are going to use it, so we avoid an unnecessary search
in the block groups rbtree (and acquiring and releasing its spinlock).

Freeing a tree block is performed when COWing or deleting a node/leaf,
which implies we are holding the node/leaf's parent node lock, therefore
reducing the amount of time spent when freeing a tree block helps reducing
the amount of time we are holding the parent node's lock.

For example, for a run of xfstests/generic/083, the block group cache
object was needed only 682 times for a total of 226691 calls to free
a tree block.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
730a78c741 btrfs: remove a no-op unfreeze superbock callback
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
9ee49a047d btrfs: switch extent_state state to unsigned
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.

The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.

struct extent_state {
	u64                        start;                /*     0     8 */
	u64                        end;                  /*     8     8 */
	struct rb_node             rb_node;              /*    16    24 */
	wait_queue_head_t          wq;                   /*    40    24 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	atomic_t                   refs;                 /*    64     4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          state;                /*    72     8 */
	u64                        private;              /*    80     8 */

	/* size: 88, cachelines: 2, members: 7 */
	/* sum members: 84, holes: 1, sum holes: 4 */
	/* last cacheline: 24 bytes */
};

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
5efa0490cc btrfs: set proper message level for skinny metadata
This has been confusing people for too long, the message is really just
informative.

CC: <stable@vger.kernel.org> # 3.10+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
f0954c6637 btrfs: update message levels after checksum errors
The errors are worth noting and might get missed with INFO level.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
aa8ee31209 btrfs: update message levels during failed mount
All error conditions from open_ctree shall be ERR. Warning would
suggest that something's wrong and we can continue.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
68b663d13c btrfs: update message levels for errors
Several messages that point to some internal problem, level INFO is
wrong here.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Filipe Manana
a8df6fe666 Btrfs: fix setup_leaf_for_split() to avoid leaf corruption
We were incorrectly detecting when the target key didn't exist anymore
after releasing the path and re-searching the tree. This could make
us split or duplicate (btrfs_split_item() and btrfs_duplicate_item()
are its only callers at the moment) an item when we should not.

For the case of duplicating an item, we currently only duplicate
checksum items (csum tree) and file extent items (fs/subvol trees).
For the checksum items we end up overriding the item completely,
but for file extent items we update only some of their fields in
the copy (done in __btrfs_drop_extents), which means we can end up
having a logical corruption for some values.

Also for the case where we duplicate a file extent item it will make
us produce a leaf with a wrong key order, as btrfs_duplicate_item()
advances us to the next slot and then its caller sets a smaller key
on the new item at that slot (like in __btrfs_drop_extents() e.g.).
Alternatively if the tree search in setup_leaf_for_split() leaves
with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end
up accessing beyond the leaf's end (when we check if the item's size
has changed) and make our caller insert an item at the invalid slot
btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory
access if the leaf is full or nearly full.

This issue has been present since the introduction of this function
in 2009:

    Btrfs: Add btrfs_duplicate_item
    commit ad48fd7546

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Chris Mason
57bbddd7fb Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:49:35 -08:00
Chris Mason
d354183488 Merge branch 'fix/find-item-path-leak' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:45:25 -08:00
Josef Bacik
ce93ec548c Btrfs: track dirty block groups on their own list
Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set.  This function
can get called several times during a commit.  So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time.  Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
free space cache out so it is much cleaner.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:36:52 -08:00
Josef Bacik
e7070be198 Btrfs: change how we track dirty roots
I've been overloading root->dirty_list to keep track of dirty roots and which
roots need to have their commit roots switched at transaction commit time.  This
could cause us to lose an update to the root which could corrupt the file
system.  To fix this use a state bit to know if the root is dirty, and if it
isn't set we go ahead and move the root to the dirty list.  This way if we
re-dirty the root after adding it to the switch_commit list we make sure to
update it.  This also makes it so that the extent root is always the last root
on the dirty list to try and keep the amount of churn down at this point in the
commit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:35:49 -08:00
Ingo Molnar
f49028292c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

  - SRCU updates.

  - RCU CPU stall-warning updates.

  - RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21 06:12:21 +01:00
Qu Wenruo
a53f4f8e9c btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
Commit 6b5fe46dfa (btrfs: do commit in sync_fs if there are pending
changes) will call btrfs_start_transaction() in sync_fs(), to handle
some operations needed to be done in next transaction.

However this can cause deadlock if the filesystem is frozen, with the
following sys_r+w output:
[  143.255932] Call Trace:
[  143.255936]  [<ffffffff816c0e09>] schedule+0x29/0x70
[  143.255939]  [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
[  143.255971]  [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0
[btrfs]
[  143.255992]  [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[  143.256003]  [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
[  143.256007]  [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
[  143.256011]  [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
[  143.256014]  [<ffffffff811f7d75>] sys_sync+0x55/0x90
[  143.256017]  [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
[  143.256111] Call Trace:
[  143.256114]  [<ffffffff816c0e09>] schedule+0x29/0x70
[  143.256119]  [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
[  143.256123]  [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
[  143.256131]  [<ffffffff811caae8>] thaw_super+0x28/0xc0
[  143.256135]  [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
[  143.256187]  [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
[  143.256213]  [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17

The reason is like the following:
(Holding s_umount)
VFS sync_fs staff:
|- btrfs_sync_fs()
   |- btrfs_start_transaction()
      |- sb_start_intwrite()
      (Waiting thaw_fs to unfreeze)
					VFS thaw_fs staff:
					thaw_fs()
					(Waiting sync_fs to release
					 s_umount)

So deadlock happens.
This can be easily triggered by fstest/generic/068 with inode_cache
mount option.

The fix is to check if the fs is frozen, if the fs is frozen, just
return and waiting for the next transaction.

Cc: David Sterba <dsterba@suse.cz>
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE]
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 17:20:21 -08:00
Qu Wenruo
6c9fe14f9d btrfs: Fix the bug that fs_info->pending_changes is never cleared.
Fs_info->pending_changes is never cleared since the original code uses
cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
pending_changes is already 0.

This will cause a lot of problem when mount it with inode_cache mount
option.
If the btrfs is mounted as inode_cache, pending_changes will always be
1, even when the fs is frozen.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 17:19:40 -08:00
Christoph Hellwig
df0ce26cb4 fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we can
git rid of it easily:

 - instead of using it's name for tracing unregistered bdi we just use
   "unknown"
 - btrfs and ceph can just assign the default read ahead window themselves
   like several other filesystems already do.
 - we can assign noop_backing_dev_info as the default one in alloc_super.
   All filesystems already either assigned their own or
   noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:05:38 -07:00
Christoph Hellwig
b83ae6d421 fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:05 -07:00
Christoph Hellwig
de1414a654 fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:04 -07:00
Christoph Hellwig
b4caecd480 fs: introduce f_op->mmap_capabilities for nommu mmap support
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to it's original purpose.

Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead.  Splitting this from the backing_dev_info
structure allows to remove lots of backing_dev_info instance that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device.  It also removes the need for
the mtd_inodefs filesystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:02:58 -07:00
Satoru Takeuchi
6e1103a6e9 btrfs: fix state->private cast on 32 bit machines
Suppress the following warning displayed on building 32bit (i686) kernel.

===============================================================================
...
   CC [M]  fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
    failrec = (struct io_failure_record *)state->private;
...
===============================================================================

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:06:06 -08:00
Filipe Manana
75c68e9fbb Btrfs: fix race deleting block group from space_info->ro_bgs list
When removing a block group we were deleting it from its space_info's
ro_bgs list without the correct protection - the space info's spinlock.
Fix this by doing the list delete while holding the spinlock of the
corresponding space info, which is the correct lock for any operation
on that list.

This issue was introduced in the 3.19 kernel by the following change:

    Btrfs: move read only block groups onto their own list V2
    commit 633c0aad4c

I ran into a kernel crash while a task was running statfs, which iterates
the space_info->ro_bgs list while holding the space info's spinlock,
and another task was deleting it from the same list, without holding that
spinlock, as part of the block group remove operation (while running the
function btrfs_remove_block_group). This happened often when running the
stress test xfstests/generic/038 I recently made.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:45 -08:00
Tsutomu Itoh
379d6854a2 Btrfs: fix incorrect freeing in scrub_stripe
The address that should be freed is not 'ppath' but 'path'.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:44 -08:00
David Sterba
98bd5c547e btrfs: sync ioctl, handle errors after transaction start
The version merged to 3.19 did not handle errors from start_trancaction
and could pass an invalid pointer to commit_transaction.

Fixes: 6b5fe46dfa ("btrfs: do commit in sync_fs if there are pending changes")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:44 -08:00
David Sterba
1d4c08e0a6 btrfs: expand btrfs_find_item if found_key is NULL
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.

After we've removed all such callers, passing a NULL key is not valid
anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:48 +01:00
David Sterba
9c4f61f01d btrfs: simplify insert_orphan_item
We can search and add the orphan item in one go,
btrfs_insert_orphan_item will find out if the item already exists.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:48 +01:00
David Sterba
c234a24de9 btrfs: cleanup, remove inode_ref_info helper
A simple wrapper around btrfs_find_item.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:47 +01:00
David Sterba
14692cc150 btrfs: cleanup, remove inode_item_info helper
It's only a simple wrapper around btrfs_find_item, the locally defined
key is not used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:47 +01:00
David Sterba
381cf6587f btrfs: fix leak of path in btrfs_find_item
If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.

Move the path allocation to the callers.

CC: <stable@vger.kernel.org> # 3.14+
Fixes: 3f870c2899 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:46 +01:00
Linus Torvalds
03c751a5e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "None of these are huge, but my commit does fix a regression from 3.18
  that could cause lost files during log replay.

  This also adds Dave Sterba to the list of Btrfs maintainers.  It
  doesn't mean we're doing things differently, but Dave has really been
  helping with the maintainer workload for years"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: don't delay inode ref updates during log replay
  Btrfs: correctly get tree level in tree_backref_for_extent
  Btrfs: call inode_dec_link_count() on mkdir error path
  Btrfs: abort transaction if we don't find the block group
  Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
  Btrfs: add more maintainers
2015-01-09 17:46:07 -08:00
Pranith Kumar
83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
Chris Mason
6f8960541b Btrfs: don't delay inode ref updates during log replay
Commit 1d52c78afb (Btrfs: try not to ENOSPC on log replay) added a
check to skip delayed inode updates during log replay because it
confuses the enospc code.  But the delayed processing will end up
ignoring delayed refs from log replay because the inode itself wasn't
put through the delayed code.

This can end up triggering a warning at commit time:

WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()

Which is repeated for each commit because we never process the delayed
inode ref update.

The fix used here is to change btrfs_delayed_delete_inode_ref to return
an error if we're currently in log replay.  The caller will do the ref
deletion immediately and everything will work properly.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.18 and any stable series that picked 1d52c78afb
2015-01-02 14:47:56 -05:00
Filipe Manana
a1317f455a Btrfs: correctly get tree level in tree_backref_for_extent
If we are using skinny metadata, the block's tree level is in the offset
of the key and not in a btrfs_tree_block_info structure following the
extent item (it doesn't exist). Therefore fix it.

Besides returning the correct level in the tree, this also prevents reading
past the leaf's end in the case where the extent item is the last item in
the leaf (eb) and it has only 1 inline reference - this is because
sizeof(struct btrfs_tree_block_info) is greater than
sizeof(struct btrfs_extent_inline_ref).

Got it while running a scrub which produced the following warning:

    BTRFS: checksum error at logical 42123264 on dev /dev/sde, sector 15840: metadata node (level 24) in tree 5

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:56 -05:00
Wang Shilong
c7cfb8a540 Btrfs: call inode_dec_link_count() on mkdir error path
In btrfs_mkdir(), if it fails to create dir, we should
clean up existed items, setting inode's link properly
to make sure it could be cleaned up properly.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Josef Bacik
df95e7f0d9 Btrfs: abort transaction if we don't find the block group
We shouldn't BUG_ON() if there is corruption.  I hit this while testing my block
group patch and the abort worked properly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Dan Carpenter
6b6d24b389 Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
The only way that "ret" is set is when we call scrub_pages_for_parity()
so the skip to "if (ret) " test doesn't make sense and causes a static
checker warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Linus Torvalds
ecb5ec044a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #3 from Al Viro:
 "Assorted fixes and patches from the last cycle"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [regression] chunk lost from bd9b51
  vfs: make mounts and mountstats honor root dir like mountinfo does
  vfs: cleanup show_mountinfo
  init: fix read-write root mount
  unfuck binfmt_misc.c (broken by commit e6084d4)
  vm_area_operations: kill ->migrate()
  new helper: iter_is_iovec()
  move_extent_per_page(): get rid of unused w_flags
  lustre: get rid of playing with ->fs
  btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
2014-12-19 18:19:19 -08:00
Linus Torvalds
5c68eac68b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of our merge window patches.

  These are all from Filipe, and fix some really hard to find races that
  can cause corruptions.  Most of them involved block group removal
  (balance) or discard"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove non-sense btrfs_error_discard_extent() function
  Btrfs: fix fs corruption on transaction abort if device supports discard
  Btrfs: always clear a block group node when removing it from the tree
  Btrfs: ensure deletion from pinned_chunks list is protected
2014-12-19 18:10:42 -08:00
Al Viro
98af592f5b btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 06:43:56 -05:00
Linus Torvalds
bdeb03cada Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "From a feature point of view, most of the code here comes from Miao
  Xie and others at Fujitsu to implement scrubbing and replacing devices
  on raid56.  This has been in development for a while, and it's a big
  improvement.

  Filipe and Josef have a great assortment of fixes, many of which solve
  problems corruptions either after a crash or in error conditions.  I
  still have a round two from Filipe for next week that solves
  corruptions with discard and block group removal"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: make get_caching_control unconditionally return the ctl
  Btrfs: fix unprotected deletion from pending_chunks list
  Btrfs: fix fs mapping extent map leak
  Btrfs: fix memory leak after block remove + trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block groups
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs, replace: enable dev-replace for raid56
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix crash caused by block group removal
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  ...
2014-12-12 11:15:23 -08:00
David Sterba
ce3e69847e btrfs: sink parameter len to alloc_extent_buffer
Because we're using globally known nodesize. Do the same for the sanity
test function variant.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:26:57 +01:00
David Sterba
3f556f7853 btrfs: unify extent buffer allocation api
Make the extent buffer allocation interface consistent.  Cloned eb will
set a valid fs_info.  For dummy eb, we can drop the length parameter and
set it from fs_info.

The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:26:55 +01:00
David Sterba
23d79d81b1 btrfs: use GFP_NOFS in __alloc_extent_buffer directly
Same mask from all callers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:23 +01:00
David Sterba
7476dfdaad btrfs: sink blocksize parameter to tree_block_processed
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:22 +01:00
David Sterba
a83fffb75d btrfs: sink blocksize parameter to btrfs_find_create_tree_block
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:21 +01:00
David Sterba
fe864576de btrfs: sink blocksize parameter to btrfs_init_new_buffer
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:20 +01:00
David Sterba
c0dcaa4d7b btrfs: sink blocksize parameter to reada_tree_block_flagged
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:20 +01:00
David Sterba
b6ae40ec76 btrfs: remove blocksize from reada_extent
Replace with global nodesize instead.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:19 +01:00
David Sterba
d3e46fea1b btrfs: sink blocksize parameter to readahead_tree_block
All callers pass nodesize.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:18 +01:00
Linus Torvalds
cbfe0de303 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS changes from Al Viro:
 "First pile out of several (there _definitely_ will be more).  Stuff in
  this one:

   - unification of d_splice_alias()/d_materialize_unique()

   - iov_iter rewrite

   - killing a bunch of ->f_path.dentry users (and f_dentry macro).

     Getting that completed will make life much simpler for
     unionmount/overlayfs, since then we'll be able to limit the places
     sensitive to file _dentry_ to reasonably few.  Which allows to have
     file_inode(file) pointing to inode in a covered layer, with dentry
     pointing to (negative) dentry in union one.

     Still not complete, but much closer now.

   - crapectomy in lustre (dead code removal, mostly)

   - "let's make seq_printf return nothing" preparations

   - assorted cleanups and fixes

  There _definitely_ will be more piles"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  copy_from_iter_nocache()
  new helper: iov_iter_kvec()
  csum_and_copy_..._iter()
  iov_iter.c: handle ITER_KVEC directly
  iov_iter.c: convert copy_to_iter() to iterate_and_advance
  iov_iter.c: convert copy_from_iter() to iterate_and_advance
  iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
  iov_iter.c: convert iov_iter_zero() to iterate_and_advance
  iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
  iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
  iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
  iov_iter.c: iterate_and_advance
  iov_iter.c: macros for iterating over iov_iter
  kill f_dentry macro
  dcache: fix kmemcheck warning in switch_names
  new helper: audit_file()
  nfsd_vfs_write(): use file_inode()
  ncpfs: use file_inode()
  kill f_dentry uses
  lockd: get rid of ->f_path.dentry->d_sb
  ...
2014-12-10 16:10:49 -08:00
Filipe Manana
1edb647bb9 Btrfs: remove non-sense btrfs_error_discard_extent() function
It doesn't do anything special, it just calls btrfs_discard_extent(),
so just remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:32 -08:00
Filipe Manana
678886bdc6 Btrfs: fix fs corruption on transaction abort if device supports discard
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:

   1) transaction N started, fs_info->pinned_extents pointed to
      fs_info->freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info->pinned_extents now points to
      fs_info->freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info->pinned_extents (fs_info->freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info->freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info->pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b02).

Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:31 -08:00
Filipe Manana
01eacb2779 Btrfs: always clear a block group node when removing it from the tree
Always clear a block group's rbnode after removing it from the rbtree to
ensure that any tasks that might be holding a reference on the block group
don't end up accessing stale rbnode left and right child pointers through
next_block_group().

This is a leftover from the change titled:
"Btrfs: fix invalid block group rbtree access after bg is removed"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:29 -08:00
Filipe Manana
a1e7e16ed3 Btrfs: ensure deletion from pinned_chunks list is protected
The call to remove_extent_mapping() actually deletes the extent map
from the list it's included in - fs_info->pinned_chunks - and that
list is protected by the chunk mutex. Therefore make that call
while holding the chunk mutex and remove the redundant list delete
call because it's a noop.

This fixes an overlook of the patch titled
"Btrfs: fix race between fs trimming and block group remove/allocation"
following the same obvervation from the patch titled
"Btrfs: fix unprotected deletion from pending_chunks list".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:28 -08:00
Al Viro
ba00410b81 Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
Chris Mason
9627aeee3e Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs into for-linus 2014-12-02 18:42:03 -08:00
Josef Bacik
cb83b7b816 Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space
cache loading.  However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
8dbcd10f69 Btrfs: fix unprotected deletion from pending_chunks list
On block group remove if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensure that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
495e64f4fe Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
946ddbe805 Btrfs: fix memory leak after block remove + trimming
There was a free space entry structure memeory leak if a block
group is remove while a free space entry is being trimmed, which
the following diagram explains:

           CPU 1                                          CPU 2

  btrfs_trim_block_group()
      trim_no_bitmap()
          remove free space entry from
          block group cache's rbtree
          do_trimming()

                                                btrfs_remove_block_group()
                                                    btrfs_remove_free_space_cache()

              add back free space entry to
              block group's cache rbtree
  btrfs_put_block_group()

                                                    (...)
                                                    btrfs_put_block_group()
                                                        kfree(bg->free_space_ctl)
                                                        kfree(bg)

The free space entry added after doing the discard of its respective
range ends up never being freed.
Detected after doing an "rmmod btrfs" after running the stress test
recently submitted for fstests:

[ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
[ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-2+ #1
[ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 8234.642664]  0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
[ 8234.642670]  ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
[ 8234.642676]  ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
[ 8234.642682] Call Trace:
[ 8234.642692]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[ 8234.642699]  [<ffffffff811149ce>] kmem_cache_destroy+0x4d/0x92
[ 8234.642731]  [<ffffffffa042dbe7>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
[ 8234.642757]  [<ffffffffa0487422>] exit_btrfs_fs+0x9/0xbe7 [btrfs]
[ 8234.642762]  [<ffffffff810a76a5>] SyS_delete_module+0x155/0x1c6
[ 8234.642768]  [<ffffffff8122a7eb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8234.642773]  [<ffffffff814122d2>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
c92f6be34c Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
[25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
[25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
[25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
[25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
[25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
[25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
[25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
[25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
55507ce361 Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:

        *** fsck.btrfs output ***
        checking extents
        checking free space cache
        There is no free space entry for 521764864-521781248
        There is no free space entry for 521764864-1103101952
        cache appears valid but isnt 29360128
        Checking filesystem on /dev/sdc
        UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
        found 1235681286 bytes used err is -22
        (...)

Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:

[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>]  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30  EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS:  0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017]  ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017]  ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017]  ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017]  [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017]  [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017]  [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017]  [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017]  [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017]  [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017]  [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017]  [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017]  RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---

Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
04216820fe Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Zhao Lei
5d3edd8f44 Btrfs, replace: enable dev-replace for raid56
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:20:48 +08:00
Filipe Manana
ae0ab003f2 Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).

That is:

     CPU 1                                CPU 2

                                  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info->unused_bgs
                                    got block group from the list
                                    clear_extent_bits for the whole
                                    block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info->pinned_extents)

                                  block group deleted, and a new block
                                  group with the same logical address is
                                  created

                                  reserve space from the new block group
                                  for new data or metadata - the reserved
                                  space overlaps the range specified by
                                  CPU 1 for set_extent_dirty()

                                  commit transaction
                                    find all ranges marked as dirty in
                                    fs_info->pinned_extents, clear them
                                    and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
4f69cb987e Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
292cbd51ec Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Miao Xie
4245215d6a Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:47 +08:00
Miao Xie
7603597690 Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
2c8cdd6ee4 Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
5a6ac9eacb Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
  All the data which has checksum is COW data, and we are sure
  that it is not changed though we don't lock the stripe. because
  the space of that data just can be reclaimed after the current
  transction is committed, and then the fs can use it to store the
  other data, but when doing scrub, we hold the current transaction,
  that is that data can not be recovered, it is safe that read and check
  it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
1b94b5567e Btrfs, raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
af8e2d1df9 Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
b89e1b012c Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:44 +08:00
Zhao Lei
6de6565075 Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Zhao Lei
f90523d1aa Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Chris Mason
2f19cad94c btrfs: zero out left over bytes after processing compression streams
Don Bailey noticed that our page zeroing for compression at end-io time
isn't complete.  This reworks a patch from Linus to push the zeroing
into the zlib and lzo specific functions instead of trying to handle the
corners inside btrfs_decompress_buf2page

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reported-by: Don A. Bailey <donb@securitymouse.com>
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-30 09:33:51 -08:00
Filipe Manana
9ea24bbe17 Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
e5fa8f865b Btrfs: ensure send always works on roots without orphans
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
758eb51e71 Btrfs: fix freeing used extent after removing empty block group
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).

The sequence of steps that lead to this:

1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
   the goal of freeing empty block groups;

2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
   transaction (or starts a new one if none is running) and attempts to
   clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
   and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);

3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
   -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
   to delete the block group and the respective chunk, while pinned_extents
   remains with that bit set for the whole (or a part of the) range covered
   by the block group;

4) Later while the transaction is still running, the chunk ends up being reused
   for a new block group (maybe for different purpose, data or metadata), and
   extents belonging to the new block group are allocated for file data or btree
   nodes/leafs;

5) The current transaction is committed, meaning that we unpinned one or more
   extents from the new block group (through btrfs_finish_extent_commit() and
   unpin_extent_range()) which are now being used for new file data or new
   metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
   And unpinning means we returned the extents to the free space cache of the
   new block group, which implies those extents can be used for future allocations
   while they're still in use.

Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).

Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Chris Mason
8f608de699 Btrfs: include vmalloc.h in check-integrity.c
Fengguang's build monster reported warnings on some arches because we
don't have vmalloc.h included

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: fengguang.wu@intel.com
2014-11-25 06:01:11 -08:00
Qu Wenruo
084b6e7c76 btrfs: Fix a lockdep warning when running xfstest.
The following lockdep warning is triggered during xfstests:

[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1702.986799]  Possible interrupt unsafe locking scenario:

[ 1702.987681]        CPU0                    CPU1
[ 1702.988137]        ----                    ----
[ 1702.988598]   lock(&fs_info->dev_replace.lock);
[ 1702.989069]                                local_irq_disable();
[ 1702.989534]                                lock(&delayed_node->mutex);
[ 1702.990038]                                lock(&found->groups_sem);
[ 1702.990494]   <Interrupt>
[ 1702.990938]     lock(&delayed_node->mutex);
[ 1702.991407]
 *** DEADLOCK ***

It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.

To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.

Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 05:55:38 -08:00
Chris Mason
ad27c0dab7 Merge branch 'dev/pending-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-11-25 05:45:30 -08:00
Linus Torvalds
d038a63ace Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs deadlock fix from Chris Mason:
 "This has a fix for a long standing deadlock that we've been trying to
  nail down for a while.  It ended up being a bad interaction with the
  fair reader/writer locks and the order btrfs reacquires locks in the
  btree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix lockups from btrfs_clear_path_blocking
2014-11-23 11:16:36 -08:00
Filipe Manana
b38ef71cb1 Btrfs: ensure ordered extent errors aren't missed on fsync
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:

1) We start all ordered extents by calling filemap_fdatawrite_range();

2) We do some other work like locking the inode's i_mutex, start a transaction,
   start a log transaction, etc;

3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
   ordered extents from inode's ordered tree into a list;

4) But by the time we do ordered extent collection, some ordered extents we started
   at step 1) might have already completed with an error, and therefore we didn't
   found them in the ordered tree and had no idea they finished with an error. This
   makes our fsync return success (0) to userspace, but has no bad effects on the log
   like for example insertion of file extent items into the log that point to unwritten
   extents, because the invalid extent maps were removed before the ordered extent
   completed (in inode.c:btrfs_finish_ordered_io).

So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.

This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:57 -08:00
Filipe Manana
0870295b23 Btrfs: collect only the necessary ordered extents on ranged fsync
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:56 -08:00
Filipe Manana
5ab5e44a36 Btrfs: don't ignore log btree writeback errors
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:55 -08:00
Josef Bacik
a28046956c Btrfs: do not move em to modified list when unpinning
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time.  Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created.  We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log.  The problem is when
something like this happens

log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
	at this point we assume the csum has been copied
sync log
crash

On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr.  This of course causes us to
have errors about not having csums for certain ranges of our inode.  So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged.  This
fixes my test that I could reliably reproduce this problem with.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:54 -08:00
Josef Bacik
50d9aa99bd Btrfs: make sure logged extents complete in the current transaction V3
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described.  It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits.  Consider
the following

write extent 0-4k
log extent in log tree
commit transaction
	< power fail happens here
ordered extent completes

We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed.  If we lose
power before the transaction commit we are save, otherwise we are not.

Fix this by keeping track of all extents we logged in this transaction.  Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding.  This will make sure that if we lose
power after the transaction commit we still have our data.  This also fixes the
problem of the improperly updated extent generation.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:58:32 -08:00
Josef Bacik
9dba8cf128 Btrfs: make sure we wait on logged extents when fsycning two subvols
If we have two fsync()'s race on different subvols one will do all of its work
to get into the log_tree, wait on it's outstanding IO, and then allow the
log_tree to finish it's commit.  The problem is we were just free'ing that
subvols logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk.  Fix this by waiting properly instead
of freeing the logged extents.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:10 -08:00
David Sterba
0d95c1bec9 btrfs: fix wrong accounting of raid1 data profile in statfs
The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported doubled size for raid1.

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Fixes: ba7b6e62f4 ("btrfs: adjust statfs calculations according to raid profiles")
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:09 -08:00
Gui Hecheng
321592427c btrfs: fix dead lock while running replace and defrag concurrently
This can be reproduced by fstests: btrfs/070

The scenario is like the following:

replace worker thread		defrag thread
---------------------		-------------
copy_nocow_pages_worker		btrfs_defrag_file
  copy_nocow_pages_for_inode	    ...
				  btrfs_writepages
  |A| lock_extent_bits		    extent_write_cache_pages
				|B|   lock_page
					__extent_writepage
		...			  writepage_delalloc
					    find_lock_delalloc_range
				|B| 	      lock_extent_bits
  find_or_create_page
    pagecache_get_page
  |A| lock_page

This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern which means to @unlock_extent_bits()
  before we @lock_page(), and in this way the @extent_read_full_page_nolock()
  is no longer in an locked context, so change it back to @extent_read_full_page()
  to regain protection.

o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
  the extent may not really point at the physical block we want, so we
  have to check it before write.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:08 -08:00
Filipe Manana
5f5bc6b1e2 Btrfs: make xattr replace operations atomic
Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.

This change also fixes 2 other existing issues which were:

*) Deleting the old xattr value without verifying first if the new xattr will
   fit in the existing leaf item (in case multiple xattrs are packed in the
   same item due to name hash collision);

*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
   exist but we have have an existing item that packs muliple xattrs with
   the same name hash as the input xattr. In this case we should return ENOSPC.

A test case for xfstests follows soon.

Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.

Reported-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:07 -08:00
Filipe Manana
c7bc6319c5 Btrfs: avoid premature -ENOMEM in clear_extent_bit()
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:06 -08:00
Josef Bacik
7e33fd993a Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2
Our gluster boxes get several thousand statfs() calls per second, which begins
to suck hardcore with all of the lock contention on the chunk mutex and dev list
mutex.  We don't really need to hold these things, if we have transient
weirdness with statfs() because of the chunk allocator we don't care, so remove
this locking.

We still need the dev_list lock if you mount with -o alloc_start however, which
is a good argument for nuking that thing from orbit, but that's a patch for
another day.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:05 -08:00
Josef Bacik
633c0aad4c Btrfs: move read only block groups onto their own list V2
Our gluster boxes were spending lots of time in statfs because our fs'es are
huge.  The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups.  Move the read only block groups onto a
read only list and only proces that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:04 -08:00
David Sterba
cd743fac42 btrfs: fix typos in btrfs_check_super_valid
Copy&paste errors in some messages and add few more missing macro
accessors.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:03 -08:00
Stefan Behrens
cf90c59e68 Btrfs: check-int: don't complain about balanced blocks
The xfstest btrfs/014 which tests the balance operation caused that the
check_int module complained that known blocks changed their physical
location. Since this is not an error in this case, only print such
message if the verbose mode was enabled.

Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:30 -08:00
Stefan Behrens
f382e4653f Btrfs: check_int: use the known block location
The xfstest btrfs/014 which tests the balance operation caused issues with
the check_int module. The attempt was made to use btrfs_map_block() to
find the physical location for a written block. However, this was not
at all needed since the location of the written block was known since
a hook to submit_bio() was the reason for entering the check_int module.
Additionally, after a block relocation it happened that btrfs_map_block()
failed causing misleading error messages afterwards.

This patch changes the check_int module to use the known information of
the physical location from the bio.

Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Filipe Manana
c8fd3de79f Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Filipe Manana
e38e2ed701 Btrfs: make find_first_extent_bit be able to cache any state
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).

This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Filipe Manana
663dfbb077 Btrfs: deal with convert_extent_bit errors to avoid fs corruption
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.

That function however can return an error (at the moment only -ENOMEM
is possible, specially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't got
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.

Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we endup
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this later
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Eryu Guan
2fc9f6baa2 Btrfs: return failure if btrfs_dev_replace_finishing() failed
device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.

The following steps could reproduce this issue

	mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
	mount /dev/sdb1 /mnt/btrfs
	while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
	btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
	# if this replace succeeded, do the following and repeat until
	# you see this log in dmesg
	# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
	#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

	# once you see the error log in dmesg, check return value of
	# replace
	echo $?

Introduce a new dev replace result

BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS

to catch -EINPROGRESS explicitly and return other errors directly to
userspace.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Shilong Wang
6b3a4d60db Btrfs: fix allocationg memory failure for btrfsic_state structure
size of @btrfsic_state needs more than 2M, it is very likely to
fail allocating memory using kzalloc(). see following mesage:

[91428.902148] Call Trace:
[<ffffffff816f6e0f>] dump_stack+0x4d/0x66
[<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
[<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
[<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
[<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
[<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
[<ffffffff811d1018>] kmalloc_order+0x18/0x50
[<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
[<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
[<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
[<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
[<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
[<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
[<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff81230429>] mount_fs+0x39/0x1b0
[<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
[<ffffffff812537fb>] do_mount+0x27b/0xc30
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff812544f6>] SyS_mount+0x96/0xf0
[<ffffffff81701970>] system_call_fastpath+0x16/0x1b

Since we are allocating memory for hash table array, so
it will be good if we could allocate continuous pages here.

Fix this problem by firstly trying kzalloc(), if we fail,
use vzalloc() instead.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
e6eb43142a Btrfs: report error after failure inlining extent in compressed write path
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.

This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).

This change applies on top of the previous patchset starting at the patch
titled:

    "[1/5] Btrfs: set page and mapping error on compressed write failure"

Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
728404dacf Btrfs: add helper btrfs_fdatawrite_range
To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
075bdbdbe9 Btrfs: correctly flush compressed data before/after direct IO
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Filipe Manana
c44f649e28 Btrfs: make inode.c:compress_file_range() return void
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Shilong Wang
4bcbb33255 Btrfs: fix incorrect compression ratio detection
Steps to reproduce:
 # mkfs.btrfs -f /dev/sdb
 # mount -t btrfs /dev/sdb /mnt -o compress=lzo
 # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1

after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.

Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)

Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Filipe Manana
7bdcefc103 Btrfs: don't ignore compressed bio write errors
Our compressed bio write end callback was essentially ignoring the error
parameter. When a write error happens, it must pass a value of 0 to the
inode's write_page_end_io_hook callback, SetPageError on the respective
pages and set AS_EIO in the inode's mapping flags, so that a call to
filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
happened (we surely don't want silent failures on fsync for example).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
dec8f17563 Btrfs: make inode.c:submit_compressed_extents() return void
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
3d7a820f71 Btrfs: process all async extents on compressed write failure
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
40ae837b43 Btrfs: don't leak pages and memory on compressed write error
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
fce2a4e6b2 Btrfs: fix hang on compressed write error
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:25 -08:00
Filipe Manana
704de49d2b Btrfs: set page and mapping error on compressed write failure
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.

Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:25 -08:00
Chris Mason
f82c458a2c btrfs: fix lockups from btrfs_clear_path_blocking
The fair reader/writer locks mean that btrfs_clear_path_blocking needs
to strictly follow lock ordering rules even when we already have
blocking locks on a given path.

Before we can clear a blocking lock on the path, we need to make sure
all of the locks have been converted to blocking.  This will remove lock
inversions against anyone spinning in write_lock() against the buffers
we're trying to get read locks on.  These inversions didn't exist before
the fair read/writer locks, but now we need to be more careful.

We papered over this deadlock in the past by changing
btrfs_try_read_lock() to be a true trylock against both the spinlock and
the blocking lock.  This was slower, and not sufficient to fix all the
deadlocks.  This patch adds a btrfs_tree_read_lock_atomic(), which
basically means get the spinlock but trylock on the blocking lock.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Patrick Schmid <schmid@phys.ethz.ch>
cc: stable@vger.kernel.org #v3.15+
2014-11-19 10:34:35 -08:00
Al Viro
ddb52f4fd2 btrfs: get rid of f_dentry use
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:21 -05:00
Al Viro
41d28bca2d switch d_materialise_unique() users to d_splice_alias()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:20 -05:00
David Sterba
a6f69dc801 btrfs: move commit out of sysfs when changing label
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 16:53:15 +01:00
David Sterba
0eae2747ec btrfs: move commit out of sysfs when changing features
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 16:53:14 +01:00
David Sterba
d51033d055 btrfs: introduce pending action: commit
In some contexts, like in sysfs handlers, we don't want to trigger a
transaction commit. It's a heavy operation, we don't know what external
locks may be taken. Instead, make it possible to finish the operation
through sync syscall or SYNC_FS ioctl.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 16:53:14 +01:00
David Sterba
7e1876aca8 btrfs: switch inode_cache option handling to pending changes
The pending mount option(s) now share namespace and bits with the normal
options, and the existing one for (inode_cache) is unset unconditionally
at each transaction commit.

Introduce a separate namespace for pending changes and enhance the
descriptions of the intended change to use separate bits for each
action.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 16:53:13 +01:00
David Sterba
6b5fe46dfa btrfs: do commit in sync_fs if there are pending changes
If a pending change is requested, it's not processed unless there is a
transaction commit about to happen, not even after sync or SYNC_FS
ioctl. For example a remount that toggles the inode_cache option will
not take effect after sync on a quiescent filesystem.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 16:53:13 +01:00
David Sterba
572d9ab784 btrfs: add support for processing pending changes
There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.

For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 16:53:12 +01:00
Linus Torvalds
c4c23fb6f2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "It's a one liner for an error cleanup path that leads to crashes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
2014-11-09 14:30:24 -08:00
Chris Mason
6e5aafb274 Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them.  But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.

This bug came from commit 0678b6185

btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Erik Berg <btrfs@slipsprogrammoer.no>
cc: stable@vger.kernel.org # 3.3 or newer
2014-11-04 06:59:04 -08:00
Linus Torvalds
4f4274af70 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is nailing down some problems with our skinny extent variation,
  and Dave's patch fixes endian problems in the new super block checks"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
  Btrfs: properly clean up btrfs_end_io_wq_cache
  Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
  btrfs: use macro accessors in superblock validation checks
2014-11-01 10:41:26 -07:00
Filipe Manana
d05a2b4cd9 Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
We have a race that can lead us to miss skinny extent items in the function
btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
So basically the sequence of steps is:

1) We search in the extent tree for the skinny extent, which returns > 0
   (not found);

2) We check the previous item in the returned leaf for a non-skinny extent,
   and we don't find it;

3) Because we didn't find the non-skinny extent in step 2), we release our
   path to search the extent tree again, but this time for a non-skinny
   extent key;

4) Right after we released our path in step 3), a skinny extent was inserted
   in the extent tree (delayed refs were run) - our second extent tree search
   will miss it, because it's not looking for a skinny extent;

5) After the second search returned (with ret > 0), we look for any delayed
   ref for our extent's bytenr (and we do it while holding a read lock on the
   leaf), but we won't find any, as such delayed ref had just run and completed
   after we released out path in step 3) before doing the second search.

Fix this by removing completely the path release and re-search logic. This is
safe, because if we seach for a metadata item and we don't find it, we have the
guarantee that the returned leaf is the one where the item would be inserted,
and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
non-skinny extent item is if it exists. The only case where path->slots[0] is
zero is when there are no smaller keys in the tree (i.e. no left siblings for
our leaf), in which case the re-search logic isn't needed as well.

This race has been present since the introduction of skinny metadata (change
3173a18f70).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-28 13:59:54 -07:00
Josef Bacik
5ed5f58841 Btrfs: properly clean up btrfs_end_io_wq_cache
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
unload, which makes us unable to unload and then re-load the btrfs module.  This
fixes the problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27 13:16:53 -07:00
Filipe Manana
1a4ed8fdca Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27 13:16:52 -07:00
David Sterba
21e7626b12 btrfs: use macro accessors in superblock validation checks
The initial patch c926093ec5 (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27 13:16:52 -07:00
Miklos Szeredi
cbdf35bcb8 vfs: export check_sticky()
It's already duplicated in btrfs and about to be used in overlayfs too.

Move the sticky bit check to an inline helper and call the out-of-line
helper only in the unlikly case of the sticky bit being set.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24 00:14:36 +02:00
Linus Torvalds
ef161ea1ff Merge branch 'for-linus-update' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs data corruption fix from Chris Mason:
 "I'm testing a pull with more fixes, but wanted to get this one out so
  Greg can pick it up.

  The corruption isn't easy to hit, you have to do a readonly snapshot
  and have orphans in the snapshot.  But my review and testing missed
  the bug.  Filipe has added a better xfstest to cover it"

* 'for-linus-update' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: race free update of commit root for ro snapshots"
2014-10-18 13:32:17 -07:00
Linus Torvalds
d3dc366bba Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block
Pull core block layer changes from Jens Axboe:
 "This is the core block IO pull request for 3.18.  Apart from the new
  and improved flush machinery for blk-mq, this is all mostly bug fixes
  and cleanups.

   - blk-mq timeout updates and fixes from Christoph.

   - Removal of REQ_END, also from Christoph.  We pass it through the
     ->queue_rq() hook for blk-mq instead, freeing up one of the request
     bits.  The space was overly tight on 32-bit, so Martin also killed
     REQ_KERNEL since it's no longer used.

   - blk integrity updates and fixes from Martin and Gu Zheng.

   - Update to the flush machinery for blk-mq from Ming Lei.  Now we
     have a per hardware context flush request, which both cleans up the
     code should scale better for flush intensive workloads on blk-mq.

   - Improve the error printing, from Rob Elliott.

   - Backing device improvements and cleanups from Tejun.

   - Fixup of a misplaced rq_complete() tracepoint from Hannes.

   - Make blk_get_request() return error pointers, fixing up issues
     where we NULL deref when a device goes bad or missing.  From Joe
     Lawrence.

   - Prep work for drastically reducing the memory consumption of dm
     devices from Junichi Nomura.  This allows creating clone bio sets
     without preallocating a lot of memory.

   - Fix a blk-mq hang on certain combinations of queue depths and
     hardware queues from me.

   - Limit memory consumption for blk-mq devices for crash dump
     scenarios and drivers that use crazy high depths (certain SCSI
     shared tag setups).  We now just use a single queue and limited
     depth for that"

* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
  block: Remove REQ_KERNEL
  blk-mq: allocate cpumask on the home node
  bio-integrity: remove the needless fail handle of bip_slab creating
  block: include func name in __get_request prints
  block: make blk_update_request print prefix match ratelimited prefix
  blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
  block: fix alignment_offset math that assumes io_min is a power-of-2
  blk-mq: Make bt_clear_tag() easier to read
  blk-mq: fix potential hang if rolling wakeup depth is too high
  block: add bioset_create_nobvec()
  block: use bio_clone_fast() in blk_rq_prep_clone()
  block: misplaced rq_complete tracepoint
  sd: Honor block layer integrity handling flags
  block: Replace strnicmp with strncasecmp
  block: Add T10 Protection Information functions
  block: Don't merge requests if integrity flags differ
  block: Integrity checksum flag
  block: Relocate bio integrity flags
  block: Add a disk flag to block integrity profile
  block: Add prefix to block integrity profile flags
  ...
2014-10-18 11:53:51 -07:00
Chris Mason
d37973082b Revert "Btrfs: race free update of commit root for ro snapshots"
This reverts commit 9c3b306e1c.

Switching only one commit root during a transaction is wrong because it
leads the fs into an inconsistent state. All commit roots should be
switched at once, at transaction commit time, otherwise backref walking
can often miss important references that were only accessible through
the old commit root.  Plus, the root item for the snapshot's root wasn't
getting updated and preventing the next transaction commit to do it.

This made several users get into random corruption issues after creation
of readonly snapshots.

A regression test for xfstests will follow soon.

Cc: stable@vger.kernel.org # 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-17 02:40:59 -07:00
Vinícius Tinti
0458a953d8 btrfs: LLVMLinux: Remove VLAIS
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent.  This patch instead allocates the appropriate amount of
memory using a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Vinícius Tinti <viniciustinti@gmail.com>
Reviewed-by: Jan-Simon Möller <dl9pf@gmx.de>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Acked-by: Chris Mason <clm@fb.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
2014-10-14 10:51:22 +02:00
Linus Torvalds
77c688ac87 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The big thing in this pile is Eric's unmount-on-rmdir series; we
  finally have everything we need for that.  The final piece of prereqs
  is delayed mntput() - now filesystem shutdown always happens on
  shallow stack.

  Other than that, we have several new primitives for iov_iter (Matt
  Wilcox, culled from his XIP-related series) pushing the conversion to
  ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
  cleanups and fixes (including the external name refcounting, which
  gives consistent behaviour of d_move() wrt procfs symlinks for long
  and short names alike) and assorted cleanups and fixes all over the
  place.

  This is just the first pile; there's a lot of stuff from various
  people that ought to go in this window.  Starting with
  unionmount/overlayfs mess...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
  fs/file_table.c: Update alloc_file() comment
  vfs: Deduplicate code shared by xattr system calls operating on paths
  reiserfs: remove pointless forward declaration of struct nameidata
  don't need that forward declaration of struct nameidata in dcache.h anymore
  take dname_external() into fs/dcache.c
  let path_init() failures treated the same way as subsequent link_path_walk()
  fix misuses of f_count() in ppp and netlink
  ncpfs: use list_for_each_entry() for d_subdirs walk
  vfs: move getname() from callers to do_mount()
  gfs2_atomic_open(): skip lookups on hashed dentry
  [infiniband] remove pointless assignments
  gadgetfs: saner API for gadgetfs_create_file()
  f_fs: saner API for ffs_sb_create_file()
  jfs: don't hash direct inode
  [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
  ecryptfs: ->f_op is never NULL
  android: ->f_op is never NULL
  nouveau: __iomem misannotations
  missing annotation in fs/file.c
  fs: namespace: suppress 'may be used uninitialized' warnings
  ...
2014-10-13 11:28:42 +02:00
Linus Torvalds
90d0c376f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The largest set of changes here come from Miao Xie.  He's cleaning up
  and improving read recovery/repair for raid, and has a number of
  related fixes.

  I've merged another set of fsync fixes from Filipe, and he's also
  improved the way we handle metadata write errors to make sure we force
  the FS readonly if things go wrong.

  Otherwise we have a collection of fixes and cleanups.  Dave Sterba
  gets a cookie for removing the most lines (thanks Dave)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
  btrfs: Fix compile error when CONFIG_SECURITY is not set.
  Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
  btrfs: Make btrfs handle security mount options internally to avoid losing security label.
  Btrfs: send, don't delay dir move if there's a new parent inode
  btrfs: add more superblock checks
  Btrfs: fix race in WAIT_SYNC ioctl
  Btrfs: be aware of btree inode write errors to avoid fs corruption
  Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
  btrfs: fix shadow warning on cmp
  Btrfs: fix compilation errors under DEBUG
  Btrfs: fix crash of btrfs_release_extent_buffer_page
  Btrfs: add missing end_page_writeback on submit_extent_page failure
  btrfs: Fix the wrong condition judgment about subset extent map
  Btrfs: fix build_backref_tree issue with multiple shared blocks
  Btrfs: cleanup error handling in build_backref_tree
  btrfs: move checks for DUMMY_ROOT into a helper
  btrfs: new define for the inline extent data start
  btrfs: kill extent_buffer_page helper
  btrfs: drop constant param from btrfs_release_extent_buffer_page
  btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
  ...
2014-10-11 08:03:52 -04:00
Linus Torvalds
c798360cd1 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "A lot of activities on percpu front.  Notable changes are...

   - percpu allocator now can take @gfp.  If @gfp doesn't contain
     GFP_KERNEL, it tries to allocate from what's already available to
     the allocator and a work item tries to keep the reserve around
     certain level so that these atomic allocations usually succeed.

     This will replace the ad-hoc percpu memory pool used by
     blk-throttle and also be used by the planned blkcg support for
     writeback IOs.

     Please note that I noticed a bug in how @gfp is interpreted while
     preparing this pull request and applied the fix 6ae833c7fe
     ("percpu: fix how @gfp is interpreted by the percpu allocator")
     just now.

   - percpu_ref now uses longs for percpu and global counters instead of
     ints.  It leads to more sparse packing of the percpu counters on
     64bit machines but the overhead should be negligible and this
     allows using percpu_ref for refcnting pages and in-memory objects
     directly.

   - The switching between percpu and single counter modes of a
     percpu_ref is made independent of putting the base ref and a
     percpu_ref can now optionally be initialized in single or killed
     mode.  This allows avoiding percpu shutdown latency for cases where
     the refcounted objects may be synchronously created and destroyed
     in rapid succession with only a fraction of them reaching fully
     operational status (SCSI probing does this when combined with
     blk-mq support).  It's also planned to be used to implement forced
     single mode to detect underflow more timely for debugging.

  There's a separate branch percpu/for-3.18-consistent-ops which cleans
  up the duplicate percpu accessors.  That branch causes a number of
  conflicts with s390 and other trees.  I'll send a separate pull
  request w/ resolutions once other branches are merged"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
  percpu: fix how @gfp is interpreted by the percpu allocator
  blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
  percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
  percpu_ref: add PERCPU_REF_INIT_* flags
  percpu_ref: decouple switching to percpu mode and reinit
  percpu_ref: decouple switching to atomic mode and killing
  percpu_ref: add PCPU_REF_DEAD
  percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
  percpu_ref: replace pcpu_ prefix with percpu_
  percpu_ref: minor code and comment updates
  percpu_ref: relocate percpu_ref_reinit()
  Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
  Revert "percpu: free percpu allocation info for uniprocessor system"
  percpu-refcount: make percpu_ref based on longs instead of ints
  percpu-refcount: improve WARN messages
  percpu: fix locking regression in the failure path of pcpu_alloc()
  percpu-refcount: add @gfp to percpu_ref_init()
  proportions: add @gfp to init functions
  percpu_counter: add @gfp to percpu_counter_init()
  percpu_counter: make percpu_counters_lock irq-safe
  ...
2014-10-10 07:26:02 -04:00
Eric W. Biederman
5542aa2fa7 vfs: Make d_invalidate return void
Now that d_invalidate can no longer fail, stop returning a useless
return code.  For the few callers that checked the return code update
remove the handling of d_invalidate failure.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:57 -04:00
Qu Wenruo
a43bb39b5c btrfs: Fix compile error when CONFIG_SECURITY is not set.
Fix the following compile error when CONFIG_SECURITY is not set:

error: 'struct security_mnt_opts' has no member named 'num_mnt_opts'

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-08 06:59:24 -07:00
Linus Torvalds
28596c9722 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull "trivial tree" updates from Jiri Kosina:
 "Usual pile from trivial tree everyone is so eagerly waiting for"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  Remove MN10300_PROC_MN2WS0038
  mei: fix comments
  treewide: Fix typos in Kconfig
  kprobes: update jprobe_example.c for do_fork() change
  Documentation: change "&" to "and" in Documentation/applying-patches.txt
  Documentation: remove obsolete pcmcia-cs from Changes
  Documentation: update links in Changes
  Documentation: Docbook: Fix generated DocBook/kernel-api.xml
  score: Remove GENERIC_HAS_IOMAP
  gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
  tty: doc: Fix grammar in serial/tty
  dma-debug: modify check_for_stack output
  treewide: fix errors in printk
  genirq: fix reference in devm_request_threaded_irq comment
  treewide: fix synchronize_rcu() in comments
  checkstack.pl: port to AArch64
  doc: queue-sysfs: minor fixes
  init/do_mounts: better syntax description
  MIPS: fix comment spelling
  powerpc/simpleboot: fix comment
  ...
2014-10-07 21:16:26 -04:00
Chris Mason
0d4cf4e6bf Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
Commit fccb84c94 moved added some helpers to cleanup our sanity tests,
but it looks like both Dave and I always compile with the tests enabled.

This fixes things to work when they are turned off too.

Signed-off-by: Chris Mason <clm@fb.com>
2014-10-07 13:24:20 -07:00
Qu Wenruo
f667aef6af btrfs: Make btrfs handle security mount options internally to avoid losing security label.
[BUG]
Originally when mount btrfs with "-o subvol=" mount option, btrfs will
lose all security lable.
And if the btrfs fs is mounted somewhere else, due to the lost of
security lable, SELinux will refuse to mount since the same super block
is being mounted using different security lable.

[REPRODUCER]
With SELinux enabled:
 #mkfs -t btrfs /dev/sda5
 #mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
 #btrfs subvolume create /mnt/btrfs/subvol
 #mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5
  /mnt/test

kernel message:
SELinux: mount invalid.  Same superblock, different security settings
for (dev sda5, type btrfs)

[REASON]
This happens because btrfs will call vfs_kern_mount() and then
mount_subtree() to handle subvolume name lookup.
First mount will cut off all the security lables and when it comes to
the second vfs_kern_mount(), it has no security label now.

[FIX]
This patch will makes btrfs behavior much more like nfs,
which has the type flag FS_BINARY_MOUNTDATA,
making btrfs handles the security label internally.
So security label will be set in the real mount time and won't lose
label when use with "subvol=" mount option.

Reported-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-06 06:23:32 -07:00
Chris Mason
0ec31a61f0 Merge branch 'remove-unlikely' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-10-04 09:57:44 -07:00
Chris Mason
27b19cc886 Merge branch 'cleanup/blocksize-diet-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-10-04 09:57:14 -07:00
Chris Mason
bbf65cf0b5 Merge branch 'cleanup/misc-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/extent_io.c
2014-10-04 09:56:45 -07:00
Filipe Manana
bf8e8ca6fd Btrfs: send, don't delay dir move if there's a new parent inode
If between two snapshots we rename an existing directory named X to Y and
make it a child (direct or not) of a new inode named X, we were delaying
the move/rename of the former directory unnecessarily, which would result
in attempting to rename the new directory from its orphan name to name X
prematurely.

Minimal reproducer:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ mkdir -p /mnt/merlin/RC/OSD/Source

    $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    $ mkdir /mnt/OSD
    $ mv /mnt/merlin/RC/OSD /mnt/OSD/OSD-Plane_788
    $ mv /mnt/OSD /mnt/merlin/RC

    $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    $ btrfs send /mnt/mysnap1 -f /tmp/1.snap
    $ btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap

    $ mkfs.btrfs -f /dev/vdc
    $ mount /dev/vdc /mnt2

    $ btrfs receive /mnt2 -f /tmp/1.snap
    $ btrfs receive /mnt2 -f /tmp/2.snap

The second receive (from an incremental send) failed with the following
error message: "rename o261-7-0 -> merlin/RC/OSD failed".
This is a regression introduced in the 3.16 kernel.

A test case for xfstests follows.

Reported-by: Marc Merlin <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
David Sterba
c926093ec5 btrfs: add more superblock checks
Populate btrfs_check_super_valid() with checks that try to verify
consistency of superblock by additional conditions that may arise from
corrupted devices or bitflips. Some of tests are only hints and issue
warnings instead of failing the mount, basically when the checks are
derived from the data found in the superblock.

Tested on a broken image provided by Qu.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Sage Weil
42383020be Btrfs: fix race in WAIT_SYNC ioctl
We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions.  If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller.  This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).

Fix by rechecking whether the provided transid <= last_trans_committed
after the search fails, and if so return 0.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Filipe Manana
656f30dba7 Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Fabian Frederick
15b636e1dd Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS
and keep prototype in qgroup.h only

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Fabian Frederick
b99d9a6a4a btrfs: fix shadow warning on cmp
cmp was declared twice in btrfs_compare_trees resulting in a shadow
warning. This patch renames second internal variable.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Fabian Frederick
1b6e44690d Btrfs: fix compilation errors under DEBUG
bi_sector and bi_size moved to bi_iter since commit 4f024f3797
("block: Abstract out bvec iterator")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Liu Bo
8146502820 Btrfs: fix crash of btrfs_release_extent_buffer_page
This is actually inspired by Filipe's patch.  When write_one_eb() fails on
submit_extent_page(), it'll give up writing this eb and mark it with
EXTENT_BUFFER_IOERR.  So if it's not the last page that encounter the failure,
there are some left pages which remain DIRTY, and if a later COW on this eb
happens, ie. eb is COWed and freed, it'd run into BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));

This adds the missing clear_page_dirty_for_io() for the rest pages of eb.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Filipe Manana
55e3bd2e0c Btrfs: add missing end_page_writeback on submit_extent_page failure
If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Qu Wenruo
32be3a1ac6 btrfs: Fix the wrong condition judgment about subset extent map
Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map
is using wrong condition to judgement whether the range is a subset of a
existing extent map.

This may cause bug in btrfs no-holes mode.

This patch will correct the judgment and fix the bug.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Josef Bacik
bbe9051441 Btrfs: fix build_backref_tree issue with multiple shared blocks
Marc Merlin sent me a broken fs image months ago where it would blow up in the
upper->checked BUG_ON() in build_backref_tree.  This is because we had a
scenario like this

block a -- level 4 (not shared)
   |
block b -- level 3 (reloc block, shared)
   |
block c -- level 2 (not shared)
   |
block d -- level 1 (shared)
   |
block e -- level 0 (shared)

We go to build a backref tree for block e, we notice block d is shared and add
it to the list of blocks to lookup it's backrefs for.  Now when we loop around
we will check edges for the block, so we will see we looked up block c last
time.  So we lookup block d and then see that the block that points to it is
block c and we can just skip that edge since we've already been up this path.
The problem is because we clear need_check when we see block d (as it is shared)
we never add block b as needing to be checked.  And because block c is in our
path already we bail out before we walk up to block b and add it to the backref
check list.

To fix this we need to reset need_check if we trip over a block that doesn't
need to be checked.  This will make sure that any subsequent blocks in the path
as we're walking up afterwards are added to the list to be processed.  With this
patch I can now mount Marc's fs image and it'll complete the balance without
panicing.  Thanks,

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Josef Bacik
75bfb9aff4 Btrfs: cleanup error handling in build_backref_tree
When balance panics it tends to panic in the

BUG_ON(!upper->checked);

test, because it means it couldn't build the backref tree properly.  This is
annoying to users and frankly a recoverable error, nothing in this function is
actually fatal since it is just an in-memory building of the backrefs for a
given bytenr.  So go through and change all the BUG_ON()'s to ASSERT()'s, and
fix the BUG_ON(!upper->checked) thing to just return an error.

This patch also fixes the error handling so it tears down the work we've done
properly.  This code was horribly broken since we always just panic'ed instead
of actually erroring out, so it needed to be completely re-worked.  With this
patch my broken image no longer panics when I mount it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
David Sterba
fccb84c94a btrfs: move checks for DUMMY_ROOT into a helper
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:33 +02:00
David Sterba
7ec20afbcb btrfs: new define for the inline extent data start
Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like "Btrfs: fix wrong max inline data
size limit" fixed.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:33 +02:00
David Sterba
fb85fc9a67 btrfs: kill extent_buffer_page helper
It used to be more complex but now it's just a simple array access.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:32 +02:00
David Sterba
a50924e3a4 btrfs: drop constant param from btrfs_release_extent_buffer_page
All callers use the same value, simplify the function.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:32 +02:00
David Sterba
2755a0de64 btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:31 +02:00
David Sterba
94404e82e5 btrfs: let merge_reloc_roots return void
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:31 +02:00
David Sterba
8b9456da03 btrfs: remove unused members from struct scrub_warning
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:30 +02:00
David Sterba
97eb6b69d1 btrfs: use slab for end_io_wq structures
The structure is frequently reused.  Rename it according to the slab
name.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:29 +02:00
David Sterba
af13b4922b btrfs: fix error labels in init_btrfs_fs
btrfs_interface_init rarely fails but we could leak the prelim_ref slab.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:29 +02:00
David Sterba
bfebd8b544 btrfs: use enum for wq endio metadata type
The enum exists but is not consistently used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:28 +02:00
David Sterba
01d5bc3789 btrfs: remove unused extent state bits
The last users are long gone.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:27 +02:00
Filipe David Borba Manana
95ac567af2 Btrfs: set default max_inline to 8KiB instead of 8MiB
8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:29:24 +02:00
David Sterba
4d75f8a9c8 btrfs: remove blocksize from btrfs_alloc_free_block and rename
Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:54 +02:00
David Sterba
0308af4465 btrfs: remove unused parameter blocksize from btrfs_find_tree_block
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:52 +02:00
David Sterba
ce86cd5917 btrfs: remove parameter blocksize from read_tree_block
We know the tree block size, no need to pass it around.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:50 +02:00
David Sterba
453848a05f btrfs: inline code of reada_tree_block and remove it
It's trivial with a single user. And remove one pointless BUG_ON.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:31:27 +02:00
David Sterba
6197d86eab btrfs: return void from readahead_tree_block
Errors in readahead are not fatal and ignored elsewhere in the code.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:30:40 +02:00
David Sterba
58dc4ce432 btrfs: remove unused parameter from readahead_tree_block
The parent_transid parameter has been unused since its introduction in
ca7a79ad8d ("Pass down the expected generation number when reading
tree blocks").  In reada_tree_block, it was even wrongly set to leafsize.
Transid check is done in the proper read and readahead ignores errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:30:18 +02:00
David Sterba
ee39b432b4 btrfs: remove unlikely from data-dependent branches and slow paths
There are the branch hints that obviously depend on the data being
processed, the CPU predictor will do better job according to the actual
load. It also does not make sense to use the hints in slow paths that do
a lot of other operations like locking, waiting or IO.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:15:21 +02:00
David Sterba
5d99a998f3 btrfs: remove unlikely from NULL checks
Unlikely is implicit for NULL checks of pointers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:06:19 +02:00
David Sterba
143f363618 btrfs: remove unused variable from btrfs_parse_options
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-01 19:31:33 +02:00
David Sterba
aab110abcb btrfs: defrag, use unsigned type for extent thresh
Signed type mismatches the ioctl structure, all extent calculations are
done on unsigned types.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-01 19:30:52 +02:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Josef Bacik
1d52c78afb Btrfs: try not to ENOSPC on log replay
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff.  This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush.  But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC.  Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation.  Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:36 -07:00
Josef Bacik
f6acfd5011 Btrfs: don't do async reclaim during log replay
Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
during log replay.  This is because we use fs_info->fs_root as our root for
shrinking and such.  Technically we can use whatever root we want, but let's
just not allow async reclaim while we're doing log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:31 -07:00
Josef Bacik
47ab2a6c68 Btrfs: remove empty block groups automatically
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space.  This happens because all the chunks were allocated for
data since the metadata requirements were so low.  But now there's a bunch of
empty data block groups and not enough metadata space to do anything.  This
patch solves this problem by automatically deleting empty block groups.  If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it.  This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:21 -07:00
Jens Axboe
6d11fb454b Merge branch 'for-linus' into for-3.18/core
Moving patches from for-linus to 3.18 instead, pull in this changes
that will go to Linus today.
2014-09-22 11:57:32 -06:00
Linus Torvalds
46be7b73e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've got a revert to fix a regression with btrfs device registration,
  and Filipe has part two of his fsync fix from last week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: device_list_add() should not update list when mounted"
  Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
2014-09-19 13:10:53 -07:00
Filipe Manana
8407f55326 Btrfs: fix data corruption after fast fsync and writeback error
When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO errors for example. When such error happens, it means we have parts
of the on disk extent that weren't written to, and so we end up logging
file extent items that point to these extents that contain garbage/random
data - so after a crash/reboot plus log replay, we get our inode's metadata
pointing to those extents.

This worked in contrast with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return a -EIO
error to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (like
redo the writes for e.g.) - and definitely not leave any file extent items
in the log refer to non fully written extents.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:51 -07:00
Filipe Manana
669249eea3 Btrfs: fix fsync race leading to invalid data after log replay
When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).

This is not safe for a non full sync - we need to start and wait for
writeback to finish for any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:

  "Right before acquiring the inode's mutex, we might have new
   writes dirtying pages, which won't immediately start the
   respective ordered operations - that is done through the
   fill_delalloc callbacks invoked from the writepage and
   writepages address space operations. So make sure we start
   all ordered operations before starting to log our inode. Not
   doing this means that while logging the inode, writeback
   could start and invoke writepage/writepages, which would call
   the fill_delalloc callbacks (cow_file_range,
   submit_compressed_extents). These callbacks add first an
   extent map to the modified list of extents and then create
   the respective ordered operation, which means in
   tree-log.c:btrfs_log_inode() we might capture all existing
   ordered operations (with btrfs_get_logged_extents()) before
   the fill_delalloc callback adds its ordered operation, and by
   the time we visit the modified list of extent maps (with
   btrfs_log_changed_extents()), we see and process the extent
   map they created. We then use the extent map to construct a
   file extent item for logging without waiting for the
   respective ordered operation to finish - this file extent
   item points to a disk location that might not have yet been
   written to, containing random data - so after a crash a log
   replay will make our inode have file extent items that point
   to disk locations containing invalid data, as we returned
   success to userspace without waiting for the respective
   ordered operation to finish, because it wasn't captured by
   btrfs_get_logged_extents()."

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:50 -07:00
Chris Mason
0f23ae74f5 Revert "Btrfs: device_list_add() should not update list when mounted"
This reverts commit b96de000bc.

This commit is triggering failures to mount by subvolume id in some
configurations.  The main problem is how many different ways this
scanning function is used, both for scanning while mounted and
unmounted.  A proper cleanup is too big for late rcs.

For now, just revert the commit and we'll put a better fix into a later
merge window.

Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:49:05 -07:00
Qu Wenruo
e6c4efd87a btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:14:46 -07:00
Liu Bo
4d1a40c66b Btrfs: fix up bounds checking in lseek
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END
allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't
prepare for that and convert the negative @offset into unsigned type,
so we get (end < start) warning.

[ 1269.835374] ------------[ cut here ]------------
[ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140()
[ 1269.838816] BTRFS: end < start 4094 18446744073709551615
[ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G        W      3.16.0+ #306
[ 1269.858229] Call Trace:
[ 1269.858612]  [<ffffffff81801a69>] dump_stack+0x4e/0x68
[ 1269.858952]  [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0
[ 1269.859416]  [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50
[ 1269.859929]  [<ffffffff813b0fbd>] insert_state+0x11d/0x140
[ 1269.860409]  [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0
[ 1269.860805]  [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200
[ 1269.861697]  [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0
[ 1269.862168]  [<ffffffff811f201e>] SyS_lseek+0xae/0xc0
[ 1269.862620]  [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b
[ 1269.862970] ---[ end trace 4d33ea885832054b ]---

This assumes that btrfs starts finding DATA/HOLE from the beginning of file
if the assigned @offset is negative.

Also we add alignment for lock_extent_bits 's range.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:46:30 -07:00
Miao Xie
f612496bca Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
  mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:02 -07:00
Miao Xie
8b110e393c Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:01 -07:00
Miao Xie
28e1cc7d1b Btrfs: Set real mirror number for read operation on RAID0/5/6
We need real mirror number for RAID0/5/6 when reading data, or if read error
happens, we would pass 0 as the number of the mirror on which the io error
happens. It is wrong and would cause the filesystem read the data from the
corrupted mirror again.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:00 -07:00
Miao Xie
1203b6813e Btrfs: modify clean_io_failure and make it suit direct io
We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:59 -07:00
Miao Xie
ffdd2018dd Btrfs: modify repair_io_failure and make it suit direct io
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:58 -07:00
Miao Xie
2fe6303e7c Btrfs: split bio_readpage_error into several functions
The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:56 -07:00
Miao Xie
454ff3de42 Btrfs: Cleanup unused variant and argument of IO failure handlers
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:55 -07:00
Miao Xie
6c387ab20d Btrfs: fix missing error handler if submiting re-read bio fails
We forgot to free failure record and bio after submitting re-read bio failed,
fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:54 -07:00
Miao Xie
c1dc08967f Btrfs: do file data check by sub-bio's self
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:53 -07:00
Miao Xie
dc380aea5f Btrfs: cleanup similar code of the buffered data data check and dio read data check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:52 -07:00
Miao Xie
23ea8e5a07 Btrfs: load checksum data once when submitting a direct read io
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:50 -07:00
Miao Xie
c3929c3624 Btrfs: modify rw_devices counter under chunk_mutex context
rw_devices counter is often used to tune the profile when doing chunk allocation,
so we should modify it under the chunk_mutex context to avoid getting wrong
chunk profile.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:49 -07:00
Miao Xie
5f37583569 Btrfs: move the missing device to its own fs device list
For a missing device, we don't know it belong to which fs before we read its
fsid from the chunk tree. So we add them into the current fs device list at first.
When we get its fsid, we should move them to their own fs device list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:48 -07:00
Miao Xie
416d7b802a Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs
When we open a seed filesystem, if the degraded mount option is set, we continue to
mount the fs if we don't find some devices in the seed filesystem. But we should stop
mounting if other errors happen. Fix it

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:47 -07:00
Miao Xie
82372bc816 Btrfs: make the logic of source device removing more clear
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:46 -07:00
Miao Xie
67a2c45ee7 Btrfs: fix use-after-free problem of the device during device replace
The problem is:
	Task0(device scan task)		Task1(device replace task)
	scan_one_device()
	mutex_lock(&uuid_mutex)
	device = find_device()
					mutex_lock(&device_list_mutex)
					lock_chunk()
					rm_and_free_source_device
					unlock_chunk()
					mutex_unlock(&device_list_mutex)
	check device

Destroying the target device if device replace fails also has the same problem.

We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.

It is a temporary solution, we can fix this problem and make the code more
clear by atomic counter in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:44 -07:00
Miao Xie
adbbb8631b Btrfs: fix unprotected device list access when cloning fs devices
We can build a new filesystem based a seed filesystem, and we need clone
the fs devices when we open the new filesystem. But someone might clear
the seed flag of the seed filesystem, then mount that filesystem and
remove some device. If we mount the new filesystem, we might access
a device list which was being changed when we clone the fs devices.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:43 -07:00
Miao Xie
2196d6e8a7 Btrfs: Fix misuse of chunk mutex
There were several problems about chunk mutex usage:
- Lock chunk mutex when updating metadata. It would cause the nested
  deadlock because updating metadata might need allocate new chunks
  that need acquire chunk mutex. We remove chunk mutex at this case,
  because b-tree lock and other lock mechanism can help us.
- ABBA deadlock occured between device_list_mutex and chunk_mutex.
  When we update device status, we must acquire device_list_mutex at the
  beginning, and then we might get chunk_mutex during the device status
  update because we need allocate new chunks for metadata COW. But at
  most place, we acquire chunk_mutex at first and then acquire device list
  mutex. We need change the lock order.
- Some place we needn't acquire chunk_mutex. For example we needn't get
  chunk_mutex when we free a empty seed fs_devices structure.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:42 -07:00
Miao Xie
15484377f5 Btrfs: fix unprotected device list access when getting the fs information
When we get the fs information, we forgot to acquire the mutex of device list,
it might cause the problem we might access a device that was removed. Fix
it by acquiring the device list mutex.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:41 -07:00
Miao Xie
fe48a5c00f Btrfs: fix unprotected system chunk array insertion
We didn't protect the system chunk array when we added a new
system chunk into it, it would cause the array be corrupted
if someone remove/add some system chunk into array at the same
time. Fix it by chunk lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:40 -07:00
Miao Xie
7cc8e58d53 Btrfs: fix unprotected device's variants on 32bits machine
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk
lock when we change them, but sometimes we read them without any lock,
and we might get unexpected value. We fix this problem like inode's
i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:38 -07:00
Miao Xie
1c1161870c Btrfs: update free_chunk_space during allocting a new chunk
We should update free_chunk_space in time when we allocate a new chunk,
not when we deal with the pending device update and block group insertion,
because we need the real free_chunk_space data to calculate the reserved
space, if we don't update it in time, we would consider the disk space which
has be allocated as free space, and would use it to do overcommit reservation.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:37 -07:00
Miao Xie
43530c46cc Btrfs: fix unprotected device->bytes_used update
We should update device->bytes_used in the lock context of
chunk_mutex, or we would get wrong data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:36 -07:00
Miao Xie
5d778aaeb0 Btrfs: Fix wrong free_chunk_space assignment during removing a device
During removing a device, we have modified free_chunk_space when we
shrink the device, so we needn't assign a new value to it after
the device shrink. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:35 -07:00
Miao Xie
ce7213c70c Btrfs: fix wrong device bytes_used in the super block
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.

Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:34 -07:00
Miao Xie
935e5cc935 Btrfs: fix wrong disk size when writing super blocks
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:33 -07:00
Miao Xie
1c43366d3b Btrfs: fix unprotected assignment of the target device
We didn't protect the assignment of the target device, it might cause the
problem that the super block update was skipped because we might find wrong
size of the target device during the assignment. Fix it by moving the
assignment sentences into the initialization function of the target device.
And there is another merit that we can check if the target device is suitable
more early.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:31 -07:00
Miao Xie
c7662111c7 Btrfs: cleanup double assignment of device->bytes_used when device replace finishes
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:30 -07:00
Miao Xie
90180da42c Btrfs: cleanup unused num_can_discard in fs_devices
The member variants - num_can_discard - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:29 -07:00
Li RongQing
82f70d62f7 btrfs: remove the wrong comments
This comments became wrong after c3c532[bdi: add helper function for
doing init and register of a bdi for a file system], so remove them.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:28 -07:00
Filipe Manana
a2cc11db24 Btrfs: fix directory recovery from fsync log
When replaying a directory from the fsync log, if a directory entry
exists both in the fs/subvol tree and in the log, the directory's inode
got its i_size updated incorrectly, accounting for the dentry's name
twice.

Reproducer, from a test for xfstests:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    touch $SCRATCH_MNT/foo
    sync

    touch $SCRATCH_MNT/bar
    xfs_io -c "fsync" $SCRATCH_MNT
    xfs_io -c "fsync" $SCRATCH_MNT/bar

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
    [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"

    _unmount_flakey
    _check_scratch_fs $FLAKEY_DEV

The filesystem check at the end failed with the message:
"root 5 root dir 256 error".

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:27 -07:00
Liu Bo
25ce459c1a Btrfs: fix loop writing of async reclaim
One of my tests shows that when we really don't have space to reclaim via
flush_space and also run out of space, this async reclaim work loops on adding
itself into the workqueue and keeps writing something to disk according to
iostat's results, and these writes mainly comes from commit_transaction which
writes super_block.  This's unacceptable as it can be bad to disks, especially
memeory storages.

This adds a check to avoid the above situation.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:25 -07:00
Josef Bacik
dc046b10c8 Btrfs: make fiemap not blow when you have lots of snapshots
We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared.  This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you.  So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have.  This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:24 -07:00
Filipe Manana
78a017a2c9 Btrfs: add missing compression property remove in btrfs_ioctl_setflags
The behaviour of a 'chattr -c' consists of getting the current flags,
clearing the FS_COMPR_FL bit and then sending the result to the set
flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
passed to the ioctl. This results in the compression property not being
cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
was set in the received flags.

Reproducer:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt && cd /mnt
    $ mkdir a
    $ chattr +c a
    $ touch a/file
    $ lsattr a/file
    --------c------- a/file
    $ chattr -c a
    $ touch a/file2
    $ lsattr a/file2
    --------c------- a/file2
    $ lsattr -d a
    ---------------- a

Reported-by: Andreas Schneider <asn@cryptomilk.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:23 -07:00
Qu Wenruo
12b894cb28 btrfs: Fix a deadlock in btrfs_dev_replace_finishing()
btrfs-transacion:5657
[stack snip]
btrfs_bio_map()
    btrfs_bio_counter_inc_blocked()
        percpu_counter_inc(&fs_info->bio_counter)  ###bio_counter > 0(A)
        __btrfs_bio_map()
            btrfs_dev_replace_lock()
                mutex_lock(dev_replace->lock)	   ###wait mutex(B)

btrfs:32612
[stack snip]
btrfs_dev_replace_start()
    btrfs_dev_replace_lock()
	mutex_lock(dev_replace->lock)		   ###hold mutex(B)
    btrfs_dev_replace_finishing()
        btrfs_rm_dev_replace_blocked()
            wait until percpu_counter_sum == 0	   ###wait on bio_counter(A)

This bug can be triggered quite easily by the following test script:
http://pastebin.com/MQmb37Cy

This patch will fix the ABBA problem by calling
btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked().

The consistency of btrfs devices list and their superblocks is protected
by device_list_mutex, not btrfs_dev_replace_lock/unlock().
So it is safe the move btrfs_dev_replace_unlock() before
btrfs_rm_dev_replace_blocked().

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:22 -07:00
Liu Bo
a583c02664 Btrfs: cleanup the same name in end_bio_extent_readpage
We've defined a 'offset' out of bio_for_each_segment_all.

This is just a clean rename, no function changes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:20 -07:00
Mark Fasheh
0b4699dcb6 btrfs: don't go readonly on existing qgroup items
btrfs_drop_snapshot() leaves subvolume qgroup items on disk after
completion. This can cause problems with snapshot creation. If a new
snapshot tries to claim the deleted subvolumes id, btrfs will get -EEXIST
from add_qgroup_item() and go read-only. The following commands will
reproduce this problem (assume btrfs is on /dev/sda and is mounted at
/btrfs)

mkfs.btrfs -f /dev/sda
mount -t btrfs /dev/sda /btrfs/
btrfs quota enable /btrfs/
btrfs su sna /btrfs/ /btrfs/snap
btrfs su de /btrfs/snap
sleep 45
umount /btrfs/
mount -t btrfs /dev/sda /btrfs/

We can fix this by catching -EEXIST in add_qgroup_item() and
initializing the existing items. We have the problem of orphaned
relation items being on disk from an old snapshot but that is outside
the scope of this patch.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:19 -07:00
Filipe Manana
2a39e59802 Btrfs: shrink further sizeof(struct extent_buffer)
The map_start and map_len fields aren't used anywhere, so just remove
them. On a x86_64 system, this reduced sizeof(struct extent_buffer)
from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can
now fit into a page instead of 13.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:17 -07:00
Filipe Manana
4395e0c4da Btrfs: send, lower mem requirements for processing xattrs
Maximum xattr size can be up to nearly the leaf size. For an fs with a
leaf size larger than the page size, using kmalloc requires allocating
multiple pages that are contiguous, which might not be possible if
there's heavy memory fragmentation. Therefore fallback to vmalloc if
we fail to allocate with kmalloc. Also start with a smaller buffer size,
since xattr values typically are smaller than a page.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:16 -07:00
David Sterba
f87c4318af btrfs: remove stale define after removing ordered operations
Last user removed in commit "btrfs: disable strict file flushes for
renames and truncates" (8d875f95da).

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:15 -07:00
Filipe Manana
2000552396 Btrfs: improve free space cache management and space allocation
While under random IO, a block group's free space cache eventually reaches
a state where it has a mix of extent entries and bitmap entries representing
free space regions.

As later free space regions are returned to the cache, some of them are merged
with existing extent entries if they are contiguous with them. But others are
not merged, because despite the existence of adjacent free space regions in
the cache, the merging doesn't happen because the existing free space regions
are represented in bitmap extents. Even when new free space regions are merged
with existing extent entries (enlarging the free space range they represent),
we create chances of having after an enlarged region that is contiguous with
some other region represented in a bitmap entry.

Both clustered and non-clustered space allocation work by iterating over our
extent and bitmap entries and skipping any that represents a region smaller
then the allocation request (and giving preference to extent entries before
bitmap entries). By having a contiguous free space region that is represented
by 2 (or more) entries (mix of extent and bitmap entries), we end up not
satisfying an allocation request with a size larger than the size of any of
the entries but no larger than the sum of their sizes. Making the caller assume
we're under a ENOSPC condition or force it to allocate multiple smaller space
regions (as we do for file data writes), which adds extra overhead and more
chances of causing fragmentation due to the smaller regions being all spread
apart from each other (more likely when under concurrency).

For example, if we have the following in the cache:

* extent entry representing free space range: [128Mb - 256Kb, 128Mb[

* bitmap entry covering the range [128Mb, 256Mb[, but only with the bits
  representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that
  space in this 128Mb area is marked as free

An allocation request for 1Mb, starting at offset not greater than 128Mb - 256Kb,
would fail before, despite the existence of such contiguous free space area in the
cache. The caller could only allocate up to 768Kb of space at once and later another
256Kb (or vice-versa). In between each smaller allocation request, another task
working on a different file/inode might come in and take that space, preventing the
former task of getting a contiguous 1Mb region of free space.

Therefore this change implements the ability to move free space from bitmap
entries into existing and new free space regions represented with extent
entries. This is done when a space region is added to the cache.

A test was added to the sanity tests that explains in detail the issue too.

Some performance test results with compilebench on a 4 cores machine, with
32Gb of ram and using an HDD follow.

Test: compilebench -D /mnt -i 30 -r 1000 --makej

Before this change:

   intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
   compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
   read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
   delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)

After this change:

   intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
   compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
   read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
   delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:13 -07:00
Anand Jain
3c1dbdf54a btrfs: rename total_bytes to avoid confusion
we are assigning number_devices to the total_bytes,
that's very confusing for a moment

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:12 -07:00
Anand Jain
de4c296f63 btrfs: fix typo in the log message
there is no matching open parenthesis for the closing parenthesis

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:11 -07:00
Anand Jain
b2efedca68 btrfs: rw_devices shouldn't be incremented for seed fs in btrfs_rm_dev_replace_srcdev()
seed fs devices don't participate as rw_device, so don't increment
rw_devices when the device being handled belongs to a seed fs.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:10 -07:00
Anand Jain
8bef8401a0 btrfs: fix memory leak when there is no more seed device
When we replace all the seed device in the system there is
no point in just keeping the btrfs_fs_devices with out
any device

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:09 -07:00
Anand Jain
94d5f0c2ae btrfs: update sprout seed pointer when seed fs is relinquished
We are not updating sprout fs seed pointer when all seed device
is replaced. This patch will check if all seed device has been
replaced and then update the sprout pointer accordingly.

Same reproducer as in the previous patch would apply here.
And notice that btrfs_close_device will check if seed fs is
present and spits out the error with out this patch.

int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
{
::
                seed_devices = fs_devices->seed;
::
        while (seed_devices) {
                fs_devices = seed_devices;
                seed_devices = fs_devices->seed;
                __btrfs_close_devices(fs_devices);
                free_fs_devices(fs_devices);
        }

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:08 -07:00
Anand Jain
63dd86fa79 btrfs: fix rw_devices miss match after seed replace
reproducer:
    reproducer:
    mount /dev/sdb /btrfs
    btrfs dev add /dev/sdc /btrfs
    btrfs rep start -B /dev/sdb /dev/sdd /btrfs
    umount /btrfs

WARNING: CPU: 0 PID: 3882 at fs/btrfs/volumes.c:892 __btrfs_close_devices+0x1c8/0x200 [btrfs]()

which is

        WARN_ON(fs_devices->rw_devices);

   The problem here is that we did not add one to the rw_devices when
   we replace the seed device with a writable device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:06 -07:00
Anand Jain
25e8e9113d btrfs: replace seed device followed by unmount causes kernel WARNING
reproducer:
mount /dev/sdb /btrfs
btrfs dev add /dev/sdc /btrfs
btrfs rep start -B /dev/sdb /dev/sdd /btrfs
umount /btrfs

WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]()
::

__btrfs_close_devices()
::
        WARN_ON(fs_devices->open_devices);

After the seed device has been replaced the new target device
is no more a seed device. So we need to update the device
numbers in the fs_devices as pointed by the fs_info.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:05 -07:00
Anand Jain
d51908ce4e btrfs: preparatory to make btrfs_rm_dev_replace_srcdev() seed aware
There is no logical change in this patch, just a preparatory patch,
so that changes can be easily reasoned.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:04 -07:00
Andrey Utkin
56094eecd3 btrfs: Drop stray check of fixup_workers creation
The issue was introduced in a79b7d4b3e,
adding allocation of extent_workers, so this stray check is surely not
meant to be a check of something else.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021
Reported-by: Maks Naumov <maksqwe1@ukr.net>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:03 -07:00
Filipe Manana
f98de9b9c0 Btrfs: make btrfs_search_forward return with nodes unlocked
None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.

This change is motivated by the observation that during a file
fsync we repeatdly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:02 -07:00
Anand Jain
79aec2b80d btrfs: sysfs label interface should check for read only FS
Not sure how this escaped many eyes so far

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:01 -07:00
Anand Jain
20ee0825ec btrfs: code optimize: BTRFS_ATTR_RW could set the mode
BTRFS_ATTR_RW could set the mode and be inline with BTRFS_ATTR

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:59 -07:00
Anand Jain
98b3d389eb btrfs: code optimize: BTRFS_ATTR could handle the mode
All that uses BTRFS_ATTR want mode to be set at 0444 so just do
it at the define.  And few spacing alignments.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:58 -07:00
Anand Jain
3f4b57e09d btrfs: use BTRFS_ATTR instead of btrfs_no_store()
we have BTRFS_ATTR define to create sysfs RO file, use that.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:57 -07:00
Filipe Manana
160f4089c8 Btrfs: avoid unnecessary switch of path locks to blocking mode
If we need to cow a node, increase the write lock level and retry the
tree search, there's no point of changing the node locks in our path
to blocking mode, as we only waste time and unnecessarily wake up other
tasks waiting on the spinning locks (just to block them again shortly
after) because we release our path before repeating the tree search.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:56 -07:00
Filipe Manana
24cdc847d9 Btrfs: unlock nodes earlier when inserting items in a btree
In ctree.c:setup_items_for_insert(), we can unlock all nodes in our
path before we process the leaf (shift items and data, adjust data
offsets, etc). This allows for better btree concurrency, as we're
often holding a write lock on at least the node at level 1.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:55 -07:00
Satoru Takeuchi
d1b00a4711 btrfs: use IS_ALIGNED() for assertion in btrfs_lookup_csums_range() for simplicity
btrfs_lookup_csums_range() uses ALIGN() to check if "start"
and "end + 1" are aligned to "root->sectorsize". It's better to
replace these with IS_ALIGNED() for simplicity.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:54 -07:00
Mark Fasheh
d3982100ba btrfs: add trace for qgroup accounting
We want this to debug qgroup changes on live systems.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:50 -07:00
Miao Xie
443f24fee7 Btrfs: cleanup unused latest_devid and latest_trans in fs_devices
The member variants - latest_devid and latest_trans - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:49 -07:00
Miao Xie
6ba40b615f Btrfs: update the comment of total_bytes and disk_total_bytes of btrfs_devie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:48 -07:00
Miao Xie
addc3fa74e Btrfs: Fix the problem that the dirty flag of dev stats is cleared
The io error might happen during writing out the device stats, and the
device stats information and dirty flag would be update at that time,
but the current code didn't consider this case, just clear the dirty
flag, it would cause that we forgot to write out the new device stats
information. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:46 -07:00
Miao Xie
d5ee37bcb1 Btrfs: make the device lock and its protected data in the same cacheline
The lock in btrfs_device structure was far away from its protected data, it would
make CPU load the cache line twice when we accessed them, move them together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:45 -07:00
Miao Xie
5f546063ce Btrfs: fix wrong generation check of super block on a seed device
The super block generation of the seed devices is not the same as the
filesystem which sprouted from them because we don't update the super
block on the seed devices when we change that new filesystem. So we
should not use the generation of that new filesystem to check the super
block generation on the seed devices, Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:44 -07:00
Miao Xie
17a9be2f28 Btrfs: fix wrong fsid check of scrub
All the metadata in the seed devices has the same fsid as the fsid
of the seed filesystem which is on the seed device, so we should check
them by the current filesystem. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:43 -07:00
David Sterba
2fad4e83e1 btrfs: wake up transaction thread from SYNC_FS ioctl
The transaction thread may want to do more work, namely it pokes the
cleaner ktread that will start processing uncleaned subvols.

This can be triggered by user via the 'btrfs fi sync' command, otherwise
there was a delay up to 30 seconds before the cleaner started to clean
old snapshots.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:42 -07:00
Wang Shilong
c01a5c074c Btrfs: fix wrong max inline data size limit
inline data is stored from offset of @disk_bytenr in
struct btrfs_file_extent_item. So substracting total
size of struct btrfs_file_extent_item is wrong, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:40 -07:00
Wang Shilong
354877befa Btrfs: fix off-by-one in cow_file_range_inline()
Btrfs could still inline file data if its size is same as
page size, so don't skip max value here.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:39 -07:00
Wang Shilong
7816030eb4 Btrfs: fall into nocompression codes quickly if possible
If flag NOCOMPRESS is set which means bad compression ratio,
we could avoid call cow_file_range_async() for this case earlier.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:38 -07:00
Wang Shilong
f79707b092 Btrfs: fix wrong skipping compression for an inode
If a file's compression ratios is bad, we will set NOCOMPRESS
flag for it, and it will skip compression for that inode next time.

However, if we remount fs to COMPRESS_FORCE, it still should try
if we could compress pages for that inode, this patch fix wrong
check for this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:36 -07:00
Fabian Frederick
d447d0da44 Btrfs: fix sparse warning
Fix the following sparse warning:
fs/btrfs/send.c:518:51: warning: incorrect type in argument 2 (different address spaces)
fs/btrfs/send.c:518:51:    expected char const [noderef] <asn:1>*<noident>
fs/btrfs/send.c:518:51:    got char *

We can safely use (const char __user *) with set_fs(KERNEL_DS)

__force added to avoid sparse-all warning:
fs/btrfs/send.c:518:40: warning: cast adds address space to expression (<asn:1>)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:35 -07:00
HIMANGI SARAOGI
14586651ed Btrfs: use BUG_ON
Use BUG_ON(x) rather than if(x) BUG();

The semantic patch that fixes this problem is as follows:

// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:34 -07:00
Sergey Senozhatsky
7880991344 btrfs compression: merge inflate and deflate z_streams
`struct workspace' used for zlib compression contains two zlib
z_stream-s: `def_strm' used in zlib_compress_pages(), and `inf_strm'
used in zlib_decompress/zlib_decompress_biovec(). None of these
functions use `inf_strm' and `def_strm' simultaniously, meaning that
for every compress/decompress operation we need only one z_stream
(out of two available).

`inf_strm' and `def_strm' are different in size of ->workspace. For
inflate stream we vmalloc() zlib_inflate_workspacesize() bytes, for
deflate stream - zlib_deflate_workspacesize() bytes. On my system zlib
returns the following workspace sizes, correspondingly: 42312 and 268104
(+ guard pages).

Keep only one `z_stream' in `struct workspace' and use it for both
compression and decompression. Hence, instead of vmalloc() of two
z_stream->worskpace-s, allocate only one of size:
	max(zlib_deflate_workspacesize(), zlib_inflate_workspacesize())

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:33 -07:00
Filipe Manana
555e128640 Btrfs: set error return value in btrfs_get_blocks_direct
We were returning with 0 (success) because we weren't extracting the
error code from em (PTR_ERR(em)). Fix it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:32 -07:00
Filipe Manana
27a3507de9 Btrfs: reduce size of struct extent_state
The tree field of struct extent_state was only used to figure out if
an extent state was connected to an inode's io tree or not. For this
we can just use the rb_node field itself.

On a x86_64 system with this change the sizeof(struct extent_state) is
reduced from 96 bytes down to 88 bytes, meaning that with a page size
of 4096 bytes we can now store 46 extent states per page instead of 42.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:30 -07:00
Fabian Frederick
6f84e23646 btrfs: use PTR_ERR_OR_ZERO
replace IS_ERR/PTR_ERR

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:29 -07:00
Wang Shilong
29549aec76 Btrfs: print btrfs specific info for some fatal error cases
Marc argued that if there are several btrfs filesystems mounted,
while users even don't know which filesystem hit the corrupted
errors something like generation verification failure.

Since @extent_buffer structure has a member @fs_info, let's output
btrfs device info.

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:28 -07:00
Miao Xie
d20983b40e Btrfs: fix writing data into the seed filesystem
If we mounted a seed filesystem with degraded option, and then added a new
device into the seed filesystem, then we found adding device failed because
of the IO failure.

Steps to reproduce:
 # mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1>
 # btrfstune -S 1 <dev0>
 # mount <dev0> -o degraded <mnt>
 # btrfs device add -f <dev2> <mnt>

It is because the original didn't set the chunk on the seed device to be
read-only if the degraded flag was set. It was introduced by patch f48b90756,
which fixed the problem the raid1 filesystem became read-only after one device
of it was missing. But this fix method was not right, we should set the read-only
flag according to the number of the missing devices, not the degraded mount
option, if the number of the missing devices is less than the max error number
that the profile of the chunk tolerates, we don't set it to be read-only.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:27 -07:00
Wang Shilong
47059d930f Btrfs: make defragment work with nodatacow option
Btrfs defragment will utilize COW feature, which means this
did not work for nodatacow option, this problem was detected
by xfstests generic/018 with nodatacow mount option.

Fix this problem by forcing cow for a extent with state
@EXTETN_DEFRAG setting.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:26 -07:00
Satoru Takeuchi
48fcc3ff7d btrfs: label should not contain return char
Rediffed remaining parts of original patch from Anand Jain.  This makes
sure to avoid trailing newlines in the btrfs label output

reproducer.sh:

===============================================================================

TEST_DEV=/dev/vdb
TEST_DIR=/home/sat/mnt

umount /home/sat/mnt

mkfs.btrfs -f $TEST_DEV
UUID=$(btrfs fi show $TEST_DEV | head -1 | sed -e 's/.*uuid: \([-0-9a-z]*\)$/\1/')
mount $TEST_DEV $TEST_DIR
LABELFILE=/sys/fs/btrfs/$UUID/label

echo "Test for empty label..." >&2
LINES="$(cat $LABELFILE | wc -l | awk '{print $1}')"
RET=0

if [ $LINES -eq 0 ] ; then
    echo '[PASS] Trailing \n is removed correctly.' >&2
else
    echo '[FAIL] Trailing \n still exists.' >&2
    RET=1
fi

echo "Test for non-empty label..." >&2

echo testlabel >$LABELFILE
LINES="$(cat $LABELFILE | wc -l | awk '{print $1}')"

if [ $LINES -eq 1 ] ; then
    echo '[PASS] Trailing \n is removed correctly.' >&2
else
    echo '[FAIL] Trailing \n still exists.' >&2
    RET=1
fi

exit $RET
===============================================================================

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:25 -07:00
Anand Jain
ec95d4917b btrfs: device delete must be sysloged
as in the disk add patch, disk detached from the volume must be
recorded in the syslog as well for the same reason.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:23 -07:00
Anand Jain
43d2076168 btrfs: device add must be sysloged
when we add a new disk to the mounted btrfs we don't record it
as of now, disk add is a critical change of btrfs configuration,
it must be recorded in the syslog to help offline investigations
of customer problems when reported.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:20 -07:00
Wang Shilong
4027e0f4c4 Btrfs: clear compress-force when remounting with compress option
Steps to reproduce:
 # mkfs.btrfs -f /dev/sdb
 # mount /dev/sdb /mnt -o compress-force=lzo
 # mount /dev/sdb /mnt -o remount,compress=zlib
 # cat /proc/mounts

Remounting from compress-force to compress could not clear compress-force
option. The problem is there is no way for users to clear compress-force
option separately.

Fix this problem by clearing @FORCE_COMPRESS flag when remounting to
compress=xxx.

Suggested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:19 -07:00
David Sterba
ed6078f703 btrfs: use DIV_ROUND_UP instead of open-coded variants
The form

  (value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT

is equivalent to

  (value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE

The rest is a simple subsitution, no difference in the generated
assembly code.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:17 -07:00
David Sterba
4e54b17ad6 btrfs: clean away stripe_align helper
Only wraps the ALIGN macro.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:16 -07:00
David Sterba
707e8a0715 btrfs: use nodesize everywhere, kill leafsize
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

  text    data     bss     dec     hex filename
852418   24560   23112  900090   dbbfa btrfs.ko.before
851074   24584   23112  898770   db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:14 -07:00
David Sterba
962a298f35 btrfs: kill the key type accessor helpers
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:12 -07:00
David Sterba
3abdbd780e btrfs: make close_ctree return void
There's no user of the return value and we can get rid of the comment in
put_super.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:11 -07:00
David Sterba
57cdc8db21 btrfs: cleanup ino cache members of btrfs_root
The naming is confusing, generic yet used for a specific cache. Add a
prefix 'ino_' or rename appropriately.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:09 -07:00
David Sterba
c6f83c74fd btrfs: clenaup: don't call btrfs_release_path before free_path
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:08 -07:00
David Sterba
32471dc2ba btrfs: remove obsolete comment in btrfs_clean_one_deleted_snapshot
The comment applied when there was a BUG_ON.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:07 -07:00
Filipe Manana
125c4cf9f3 Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
When a ranged fsync finishes if there are still extent maps in the modified
list, still set the inode's logged_trans and last_log_commit. This is important
in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
inode ref gets deleted from the log and the respective dentries in its parent
are deleted too from the log (if the parent directory was fsync'ed in the same
transaction).

Instead make btrfs_inode_in_log() return false if the list of modified extent
maps isn't empty.

This is an incremental on top of the v4 version of the patch:

    "Btrfs: fix fsync data loss after a ranged fsync"

which was added to its v5, but didn't make it on time.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-16 16:12:19 -07:00
Linus Torvalds
7ed641be75 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is doing a careful pass through fsync problems, and these are
  the fixes so far.  I'll have one more for rc6 that we're still
  testing.

  My big commit is fixing up some inode hash races that Al Viro found
  (thanks Al)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use insert_inode_locked4 for inode creation
  Btrfs: fix fsync data loss after a ranged fsync
  Btrfs: kfree()ing ERR_PTRs
  Btrfs: fix crash while doing a ranged fsync
  Btrfs: fix corruption after write/fsync failure + fsync + log recovery
  Btrfs: fix autodefrag with compression
2014-09-12 11:53:30 -07:00
Jens Axboe
b207892b06 Merge branch 'for-linus' into for-3.18/core
A bit of churn on the for-linus side that would be nice to have
in the core bits for 3.18, so pull it in to catch us up and make
forward progress easier.

Signed-off-by: Jens Axboe <axboe@fb.com>

Conflicts:
	block/scsi_ioctl.c
2014-09-11 09:31:18 -06:00
Chris Mason
b0d5d10f41 Btrfs: use insert_inode_locked4 for inode creation
Btrfs was inserting inodes into the hash table before we had fully
set the inode up on disk.  This leaves us open to rare races that allow
two different inodes in memory for the same [root, inode] pair.

This patch fixes things by using insert_inode_locked4 to insert an I_NEW
inode and unlock_new_inode when we're ready for the rest of the kernel
to use the inode.

It also makes sure to init the operations pointers on the inode before
going into the error handling paths.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-08 13:56:45 -07:00
Filipe Manana
49dae1bc1c Btrfs: fix fsync data loss after a ranged fsync
While we're doing a full fsync (when the inode has the flag
BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
portion of the file), we might have ordered operations that are started
before or while we're logging the inode and that fall outside the fsync
range.

Therefore when a full ranged fsync finishes don't remove every extent
map from the list of modified extent maps - as for some of them, that
fall outside our fsync range, their respective ordered operation hasn't
finished yet, meaning the corresponding file extent item wasn't inserted
into the fs/subvol tree yet and therefore we didn't log it, and we must
let the next fast fsync (one that checks only the modified list) see this
extent map and log a matching file extent item to the log btree and wait
for its ordered operation to finish (if it's still ongoing).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-08 13:56:43 -07:00
Dan Carpenter
c47ca32d3a Btrfs: kfree()ing ERR_PTRs
The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.

These kind of bugs are "One Err Bugs" where there is just one error
label that does everything.  I could set the "inherit = NULL" and keep
the single out label but it ends up being more complicated that way.  It
makes the code simpler to re-order the unwind so it's in the mirror
order of the allocation and introduce some new error labels.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-08 13:56:42 -07:00
Tejun Heo
ff9ea32381 block, bdi: an active gendisk always has a request_queue associated with it
bdev_get_queue() returns the request_queue associated with the
specified block_device.  blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.

All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().

Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08 10:00:35 -06:00
Tejun Heo
908c7f1949 percpu_counter: add @gfp to percpu_counter_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion.  This is the one with
the most users.  Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
2014-09-08 09:51:29 +09:00
Filipe Manana
dac5705cad Btrfs: fix crash while doing a ranged fsync
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* a ranged fsync call happens (via an msync for example), with a range that
  doesn't cover the whole file and the end of this range, lets call it Y, is
  smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our fsync covers only a limited range, such an ordered extent might
  exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
  extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().

So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.

Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02 16:46:05 -07:00
Filipe Manana
d9f85963e3 Btrfs: fix corruption after write/fsync failure + fsync + log recovery
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.

Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.

Some sequence of steps that lead to this:

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o commit=9999 /dev/sdd /mnt
    $ cd /mnt

    $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
    $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
    $ sync

    $ od -t x1 foo
    0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
    *
    0020000

    $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo

    # Now this write + fsync fail with -ENOMEM, which was returned by
    # btrfs_add_ordered_extent() in inode.c:cow_file_range().
    $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
    $ xfs_io -c "fsync" foo
    fsync: Cannot allocate memory

    # Now do a new write + fsync, which will succeed. Our previous
    # -ENOMEM was a transient/temporary error.
    $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
    $ xfs_io -c "fsync" foo

    # Our file content (in page cache) is now:
    $ od -t x1 foo
    0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
    *
    0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # Now reboot the machine, and mount the fs, so that fsync log replay
    # takes place.

    # The file content is now weird, in particular the first 8Kb, which
    # do not match our data before nor after the sync command above.
    $ od -t x1 foo
    0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # In fact these first 4Kb are a duplicate of the last 4kb block.
    # The last write got an extent map/file extent item that points to
    # the same disk extent that we got in the write+fsync that failed
    # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
    # verify that:

    $ btrfs-debug-tree /dev/sdd
    (...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		extent data disk byte 12582912 nr 8192
		extent data offset 0 nr 8192 ram 8192
	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 8192 ram 8192
	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096

    $ umount /dev/sdd
    $ btrfsck /dev/sdd
    Checking filesystem on /dev/sdd
    UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
    checking extents
    extent item 12582912 has multiple extent items
    ref mismatch on [12582912 4096] extent item 1, found 2
    Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
    backpointer mismatch on [12582912 4096]
    Errors found in extent allocation tree or chunk allocation
    checking free space cache
    checking fs roots
    root 5 inode 257 errors 1000, some csum missing
    found 131074 bytes used err is 1
    total csum bytes: 4
    total tree bytes: 131072
    total fs tree bytes: 32768
    total extent tree bytes: 16384
    btree space waste bytes: 123404
    file data blocks allocated: 274432
     referenced 274432
    Btrfs v3.14.1-96-gcc7fd5a-dirty

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02 16:46:05 -07:00
Linus Torvalds
1fb00cbca0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The biggest of these comes from Liu Bo, who tracked down a hang we've
  been hitting since moving to kernel workqueues (it's a btrfs bug, not
  in the generic code).  His patch needs backporting to 3.16 and 3.15
  stable, which I'll send once this is in.

  Otherwise these are assorted fixes.  Most were integrated last week
  during KS, but I wanted to give everyone the chance to test the
  result, so I waited for rc2 to come out before sending"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix task hang under heavy compressed write
  Btrfs: fix filemap_flush call in btrfs_file_release
  Btrfs: fix crash on endio of reading corrupted block
  btrfs: fix leak in qgroup_subtree_accounting() error path
  btrfs: Use right extent length when inserting overlap extent map.
  Btrfs: clone, don't create invalid hole extent map
  Btrfs: don't monopolize a core when evicting inode
  Btrfs: fix hole detection during file fsync
  Btrfs: ensure tmpfile inode is always persisted with link count of 0
  Btrfs: race free update of commit root for ro snapshots
  Btrfs: fix regression of btrfs device replace
  Btrfs: don't consider the missing device when allocating new chunks
  Btrfs: Fix wrong device size when we are resizing the device
  Btrfs: don't write any data into a readonly device when scrub
  Btrfs: Fix the problem that the replace destroys the seed filesystem
  btrfs: Return right extent when fiemap gives unaligned offset and len.
  Btrfs: fix wrong extent mapping for DirectIO
  Btrfs: fix wrong write range for filemap_fdatawrite_range()
  Btrfs: fix wrong missing device counter decrease
  Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
  ...
2014-08-27 09:14:17 -07:00
Chris Mason
e9512d72e8 Btrfs: fix autodefrag with compression
The autodefrag code skips defrag when two extents are adjacent.  But one
big advantage for autodefrag is cutting down on the number of small
extents, even when they are adjacent.  This commit changes it to defrag
all small extents.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-27 08:45:37 -07:00
Rasmus Villemoes
a71db86e86 fs/btrfs/tree-log.c: Fix closing brace followed by if
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-08-26 09:35:51 +02:00
Liu Bo
9e0af23764 Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-24 07:17:02 -07:00
Chris Mason
f6dc45c7a9 Btrfs: fix filemap_flush call in btrfs_file_release
We should only be flushing on close if the file was flagged as needing
it during truncate.  I broke this with my ordered data vs transaction
commit deadlock fix.

Thanks to Miao Xie for catching this.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2014-08-21 07:55:31 -07:00
Liu Bo
38c1c2e44b Btrfs: fix crash on endio of reading corrupted block
The crash is

------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>]  [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]

This is in fact a regression.

It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:30 -07:00
Eric Sandeen
a3c108950d btrfs: fix leak in qgroup_subtree_accounting() error path
Coverity pointed this out; in the newly added
qgroup_subtree_accounting(), if btrfs_find_all_roots()
returns an error, we leak at least the parents pointer,
and possibly the roots pointer, depending on what failure
occurs.

If btrfs_find_all_roots() returns an error, we need to
free up all allocations before we return.  "roots" is
initialized to NULL, so it should be safe to free
it unconditionally (ulist_free() handles that case).

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:29 -07:00
Qu Wenruo
51f395ad40 btrfs: Use right extent length when inserting overlap extent map.
When current btrfs finds that a new extent map is going to be insereted
but failed with -EEXIST, it will try again to insert the extent map
but with the length of sectorsize.
This is OK if we don't enable 'no-holes' feature since all extent space
is continuous, we will not go into the not found->insert routine.

But if we enable 'no-holes' feature, it will make things out of control.
e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
btrfs_get_extent() args: start:  27874 len 4100
28672		  27874		28672	27874+4100	32768
                    |-----------------------|
|---------hole--------------------|---------data----------|

1) not found and insert
Since no extent map containing the range, btrfs_get_extent() will go
into the not_found and insert routine, which will try to insert the
extent map (27874, 27847 + 4100).

2) first overlap
But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
by add_extent_mapping().

3) retry but still overlap
After catching the -EEXIST, then btrfs_get_extent() will try insert it
again but with 4K length, which still overlaps, so -EEXIST will be
returned.

This makes the following patch fail to punch hole.
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.

This patch will use the right length, which is the (exsisting->start -
em->start) to insert, making the above patch works in 'no-holes' mode.
Also, some small code style problems in above patch is fixed too.

Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe David Manana <fdmanana@suse.com>
Tested-by: Filipe David Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:27 -07:00
Filipe Manana
62e2390e1a Btrfs: clone, don't create invalid hole extent map
When cloning a file that consists of an inline extent, we were creating
an extent map that represents a non-existing trailing hole starting at a
file offset that isn't a multiple of the sector size. This happened because
when processing an inline extent we weren't aligning the extent's length to
the sector size, and therefore incorrectly treating the range
[inline_extent_length; sector_size[ as a hole.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:26 -07:00
Filipe Manana
7064dd5c36 Btrfs: don't monopolize a core when evicting inode
If an inode has a very large number of extent maps, we can spend
a lot of time freeing them, which triggers a soft lockup warning.
Therefore reschedule if we need to when freeing the extent maps
while evicting the inode.

I could trigger this all the time by running xfstests/generic/299 on
a file system with the no-holes feature enabled. That test creates
an inode with 11386677 extent maps.

    $ mkfs.btrfs -f -O no-holes $TEST_DEV
    $ MKFS_OPTIONS="-O no-holes" ./check generic/299
    generic/299 382s ...
    Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
     kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
     384s
    Ran: generic/299
    Passed all 1 tests

    $ dmesg
    (...)
    [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
    (...)
    [86304.300036] Call Trace:
    [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
    [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
    [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
    [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
    [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
    [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
    [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
    [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
    [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
    [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
    [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
    [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
    [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
    [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
    [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
    [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:25 -07:00
Filipe Manana
74121f7cbb Btrfs: fix hole detection during file fsync
The file hole detection logic during a file fsync wasn't correct,
because it didn't look back (in a previous leaf) for the last file
extent item that can be in a leaf to the left of our leaf and that
has a generation lower than the current transaction id. This made it
assume that a hole exists when it really doesn't exist in the file.

Such false positive hole detection happens in the following scenario:

* We have a file that has many file extent items, covering 3 or more
  btree leafs (the first leaf must contain non file extent items too).

* Two ranges of the file are modified, with their extent items being
  located at 2 different leafs and those leafs aren't consecutive.

* When processing the second modified leaf, we weren't checking if
  some file extent item exists that is located in some leaf that is
  between our 2 modified leafs, and therefore assumed the range defined
  between the last file extent item in the first leaf and the first file
  extent item in the second leaf matched a hole.

Fortunately this didn't result in overriding the log with wrong data,
instead it made the last loop in copy_items() attempt to insert a
duplicated key (for a hole file extent item), which makes the file
fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
turn ends up doing a full transaction commit, which is much more
expensive then writing only to the log tree and wait for it to be
durably persisted (as well as the file's modified extents/pages).
Therefore fix the hole detection logic, so that we don't pay the
cost of doing full transaction commits.

I could trigger this issue with the following test for xfstests (which
never fails, either without or with this patch). The last fsync call
results in a full transaction commit, due to the -EEXIST error mentioned
above. I could also observe this behaviour happening frequently when
running xfstests/generic/075 in a loop.

Test:

    _cleanup()
    {
        _cleanup_flakey
        rm -fr $tmp
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_flakey
    _need_to_be_root

    rm -f $seqres.full

    # Create a file with many file extent items, each representing a 4Kb extent.
    # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
    # as of btrfs-progs 3.12).
    _scratch_mkfs -l 16384 >/dev/null 2>&1
    _init_flakey
    SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
    MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
    _mount_flakey

    # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
    $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
            $SCRATCH_MNT/foo | _filter_xfs_io

    # For any of the following fsync calls, inode doesn't have the flag
    # BTRFS_INODE_NEEDS_FULL_SYNC set.
    for ((i = 1; i <= 500; i++)); do
        OFFSET=$((4096 * i))
        LEN=4096
        $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                $SCRATCH_MNT/foo | _filter_xfs_io
    done

    # Commit transaction and bump next transaction's id (to 7).
    sync

    # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
    # inode runtime flags.
    $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo

    # Commit transaction and bump next transaction's id (to 8).
    sync

    # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
    # in the middle, containing only file extent items, isn't touched. So the
    # next fsync, when calling btrfs_search_forward(), won't visit that middle
    # leaf. First and 3rd leaf have now a generation with value 8, while the
    # middle leaf remains with a generation with value 6.
    $XFS_IO_PROG \
        -c "pwrite -S 0xee -b 4096 0 4096" \
        -c "pwrite -S 0xff -b 4096 2043904 4096" \
        -c "fsync" \
        $SCRATCH_MNT/foo | _filter_xfs_io

    _load_flakey_table $FLAKEY_DROP_WRITES
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    # During mount, we'll replay the log created by the fsync above, and the file's
    # md5 digest should be the same we got before the unmount.
    _mount_flakey
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey
    MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"

    status=0
    exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:24 -07:00
Filipe Manana
5762b5c958 Btrfs: ensure tmpfile inode is always persisted with link count of 0
If we open a file with O_TMPFILE, don't do any further operation on
it (so that the inode item isn't updated) and then force a transaction
commit, we get a persisted inode item with a link count of 1, and not 0
as it should be.

Steps to reproduce it (requires a modern xfs_io with -T support):

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o /dev/sdd /mnt
    $ xfs_io -T /mnt &
    $ sync

Then btrfs-debug-tree shows the inode item with a link count of 1:

    $ btrfs-debug-tree /dev/sdd
    (...)
    fs tree key (FS_TREE ROOT_ITEM 0)
    leaf 29556736 items 4 free space 15851 generation 6 owner 5
    fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
    chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
    	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
    	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
    		inode ref index 0 namelen 2 name: ..
    	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
    		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
    	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
		orphan item
    checksum tree key (CSUM_TREE ROOT_ITEM 0)
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:23 -07:00
Filipe Manana
9c3b306e1c Btrfs: race free update of commit root for ro snapshots
This is a better solution for the problem addressed in the following
commit:

    Btrfs: update commit root on snapshot creation after orphan cleanup
    (3821f34888)

The previous solution wasn't the best because of 2 reasons:

    1) It added another full transaction commit, which is more expensive
       than just swapping the commit root with the root;

    2) If a reboot happened after the first transaction commit (the one
       that creates the snapshot) and before the second transaction commit,
       then we would end up with the same problem if a send using that
       snapshot was requested before the first transaction commit after
       the reboot.

This change addresses those 2 issues. The second issue is addressed by
switching the commit root in the dentry lookup VFS callback, which is
also called by the snapshot/subvol creation ioctl and performs orphan
cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
preventing race issues between a dentry lookup and snapshot creation.

Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:21 -07:00
Liu Bo
87fa3bb078 Btrfs: fix regression of btrfs device replace
Commit 49c6f736f34f901117c20960ebd7d5e60f12fcac(
btrfs: dev replace should replace the sysfs entry) added the missing sysfs entry
in the process of device replace, but didn't take missing devices into account,
so now we have

BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs]
...

To reproduce it,
1. mkfs.btrfs -f disk1 disk2
2. mkfs.ext4 disk1
3. mount disk2 /mnt -odegraded
4. btrfs replace start -B 1 disk3 /mnt
--------------------------

This fixes the problem.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:20 -07:00
Miao Xie
95669976bd Btrfs: don't consider the missing device when allocating new chunks
The original code allocated new chunks by the number of the writable devices
and missing devices to make sure that any RAID levels on a degraded FS continue
to be honored, but it introduced a problem that it stopped us to allocating
new chunks, the steps to reproduce is following:

 # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
 # mkfs.btrfs -f <dev1>			//Removing <dev1> from the original fs
 # mount -o degraded <dev0> <mnt>
 # dd if=/dev/null of=<mnt>/tmpfile bs=1M

It is because we allocate new chunks only on the writable devices, if we take
the number of missing devices into account, and want to allocate new chunks
with higher RAID level, we will fail becaue we don't have enough writable
device. Fix it by ignoring the number of missing devices when allocating
new chunks.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:19 -07:00
Miao Xie
7df69d3e94 Btrfs: Fix wrong device size when we are resizing the device
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:18 -07:00
Miao Xie
5d68da3b8e Btrfs: don't write any data into a readonly device when scrub
We should not write data into a readonly device especially seed device when
doing scrub, skip those devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:17 -07:00
Miao Xie
ff61d17c63 Btrfs: Fix the problem that the replace destroys the seed filesystem
The seed filesystem was destroyed by the device replace, the reproduce
method is:
 # mkfs.btrfs -f <dev0>
 # btrfstune -S 1 <dev0>
 # mount <dev0> <mnt>
 # btrfs device add <dev1> <mnt>
 # umount <mnt>
 # mount <dev1> <mnt>
 # btrfs replace start -f <dev0> <dev2> <mnt>
 # umount <mnt>
 # mount <dev0> <mnt>

It is because we erase the super block on the seed device. It is wrong,
we should not change anything on the seed device.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:16 -07:00
Qu Wenruo
2c91943b50 btrfs: Return right extent when fiemap gives unaligned offset and len.
When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.

The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:14 -07:00
Wang Shilong
e2eca69dc6 Btrfs: fix wrong extent mapping for DirectIO
btrfs_next_leaf() will use current leaf's last key to search
and then return a bigger one. So it may still return a file extent
item that is smaller than expected value and we will
get an overflow here for @em->len.

This is easy to reproduce for Btrfs Direct writting, it did not
cause any problem, because writting will re-insert right mapping later.

However, by hacking code to make DIO support compression, wrong extent
mapping is kept and it encounter merging failure(EEXIST) quickly.

Fix this problem by looping to find next file extent item that is bigger
than @start or we could not find anything more.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:13 -07:00
Wang Shilong
9a025a0860 Btrfs: fix wrong write range for filemap_fdatawrite_range()
filemap_fdatawrite_range() expect the third arg to be @end
not @len, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:12 -07:00
Miao Xie
3a7d55c84c Btrfs: fix wrong missing device counter decrease
The missing devices are accounted by its own fs device, for example
the missing devices in seed filesystem will be accounted by the fs device
of the seed filesystem, not by the new filesystem which is based on
the seed filesystem, so when we remove the missing device in the
seed filesystem, we should decrease the counter of its own fs device.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:10 -07:00
Miao Xie
69611ac810 Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
We forgot to zero some members in fs_devices when we create new fs_devices
from the one of the seed fs. It would cause the problem that we got wrong
chunk profile when allocating chunks. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:32 -07:00
Anand Jain
77bdae4d13 btrfs: check generation as replace duplicates devid+uuid
When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.

 device_list_add() function is called in the following threads,
	mount device option
	mount argument
	ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
	ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
 they have been unit tested to work fine with this patch.

 If the user knows what he is doing and really want to pair with
 replaced disk (which is not a standard operation), then he should
 first clear the kernel btrfs device list in the memory by doing
 the module unload/load and followed with the mount -o device option.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:30 -07:00
Anand Jain
b96de000bc Btrfs: device_list_add() should not update list when mounted
device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.

Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.

In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.

Which is to note that old device is neither closed nor gracefully
removed from the btrfs.

The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.

reproducer:

devmgt[1] detach /dev/sdc

replace the missing disk /dev/sdc

btrfs rep start -f 1 /dev/sde /btrfs
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
        Total devices 2 FS bytes used 32.00KiB
        devid    1 size 958.94MiB used 115.88MiB path /dev/sde
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

make /dev/sdc to reappear

devmgt attach host2

btrfs dev scan

btrfs fi show -m
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M
        Total devices 2 FS bytes used 32.00KiB^M
        devid    1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.

[1] github.com/anajain/devmgt.git

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:28 -07:00
chandan
1707e26d6a Btrfs: fill_holes: Fix slot number passed to hole_mergeable() call.
For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot
where the key could have been present, which in this case would be the slot
containing the extent item which would be the next neighbor of the file range
being punched. The current code passes an incremented path->slots[0] and we
skip to the wrong file extent item. This would mean that we would fail to
merge the "yet to be created" hole with the next neighboring hole (if one
exists). Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:26 -07:00
Miao Xie
7a5c3c9be1 Btrfs: fix put dio bio twice when we submit dio bio fail
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:24 -07:00
Linus Torvalds
e64df3ebe8 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "These are all fixes I'd like to get out to a broader audience.

  The biggest of the bunch is Mark's quota fix, which is also in the
  SUSE kernel, and makes our subvolume quotas dramatically more
  accurate.

  I've been running xfstests with these against your current git
  overnight, but I'm queueing up longer tests as well"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: disable strict file flushes for renames and truncates
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
  Btrfs: fix compressed write corruption on enospc
  btrfs: correctly handle return from ulist_add
  btrfs: qgroup: account shared subtrees during snapshot delete
  Btrfs: read lock extent buffer while walking backrefs
  Btrfs: __btrfs_mod_ref should always use no_quota
  btrfs: adjust statfs calculations according to raid profiles
2014-08-16 09:06:55 -06:00
Chris Mason
8d875f95da btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions.  Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held.  It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:42 -07:00
Filipe Manana
27b9a8122f Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:40 -07:00
Takashi Iwai
4eb1f66dce Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
We've got bug reports that btrfs crashes when quota is enabled on
32bit kernel, typically with the Oops like below:
 BUG: unable to handle kernel NULL pointer dereference at 00000004
 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
 *pde = 00000000
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
 task: f1478130 ti: f147c000 task.ti: f147c000
 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
 EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
 EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
 Stack:
  00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
  00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
  00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
 Call Trace:
  [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
  [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
  [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
  [<c025e38b>] process_one_work+0x11b/0x390
  [<c025eea1>] worker_thread+0x101/0x340
  [<c026432b>] kthread+0x9b/0xb0
  [<c0712a71>] ret_from_kernel_thread+0x21/0x30
  [<c0264290>] kthread_create_on_node+0x110/0x110

This indicates a NULL corruption in prefs_delayed list.  The further
investigation and bisection pointed that the call of ulist_add_merge()
results in the corruption.

ulist_add_merge() takes u64 as aux and writes a 64bit value into
old_aux.  The callers of this function in backref.c, however, pass a
pointer of a pointer to old_aux.  That is, the function overwrites
64bit value on 32bit pointer.  This caused a NULL in the adjacent
variable, in this case, prefs_delayed.

Here is a quick attempt to band-aid over this: a new function,
ulist_add_merge_ptr() is introduced to pass/store properly a pointer
value instead of u64.  There are still ugly void ** cast remaining
in the callers because void ** cannot be taken implicitly.  But, it's
safer than explicit cast to u64, anyway.

Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
Cc: <stable@vger.kernel.org> [v3.11+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:19 -07:00
Liu Bo
ce62003f69 Btrfs: fix compressed write corruption on enospc
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:18 -07:00
Mark Fasheh
f90e579c2b btrfs: correctly handle return from ulist_add
ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
doesn't take into account. As a result, that value can be bubbled up to
callers, causing an error to be printed. Fix this by only returning the
value of ulist_add() when it indicates an error.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:16 -07:00
Mark Fasheh
1152651a08 btrfs: qgroup: account shared subtrees during snapshot delete
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:14 -07:00
Filipe Manana
6f7ff6d783 Btrfs: read lock extent buffer while walking backrefs
Before processing the extent buffer, acquire a read lock on it, so
that we're safe against concurrent updates on the extent buffer.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:13 -07:00
Josef Bacik
e339a6b097 Btrfs: __btrfs_mod_ref should always use no_quota
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:11 -07:00
David Sterba
ba7b6e62f4 btrfs: adjust statfs calculations according to raid profiles
This has been discussed in thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528

and this patch implements this proposal:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536

Works fine for "clean" raid profiles where the raid factor correction
does the right job. Otherwise it's pessimistic and may show low space
although there's still some left.

The df nubmers are lightly wrong in case of mixed block groups, but this
is not a major usecase and can be addressed later.

The RAID56 numbers are wrong almost the same way as before and will be
addressed separately.

CC: Hugo Mills <hugo@carfax.org.uk>
CC: cwillu <cwillu@cwillu.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:10 -07:00
Linus Torvalds
f6f993328b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Stuff in here:

   - acct.c fixes and general rework of mnt_pin mechanism.  That allows
     to go for delayed-mntput stuff, which will permit mntput() on deep
     stack without worrying about stack overflows - fs shutdown will
     happen on shallow stack.  IOW, we can do Eric's umount-on-rmdir
     series without introducing tons of stack overflows on new mntput()
     call chains it introduces.
   - Bruce's d_splice_alias() patches
   - more Miklos' rename() stuff.
   - a couple of regression fixes (stable fodder, in the end of branch)
     and a fix for API idiocy in iov_iter.c.

  There definitely will be another pile, maybe even two.  I'd like to
  get Eric's series in this time, but even if we miss it, it'll go right
  in the beginning of for-next in the next cycle - the tricky part of
  prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  fix copy_tree() regression
  __generic_file_write_iter(): fix handling of sync error after DIO
  switch iov_iter_get_pages() to passing maximal number of pages
  fs: mark __d_obtain_alias static
  dcache: d_splice_alias should detect loops
  exportfs: update Exporting documentation
  dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
  dcache: remove unused d_find_alias parameter
  dcache: d_obtain_alias callers don't all want DISCONNECTED
  dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
  dcache: d_splice_alias mustn't create directory aliases
  dcache: close d_move race in d_splice_alias
  dcache: move d_splice_alias
  namei: trivial fix to vfs_rename_dir comment
  VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
  cifs: support RENAME_NOREPLACE
  hostfs: support rename flags
  shmem: support RENAME_EXCHANGE
  shmem: support RENAME_NOREPLACE
  btrfs: add RENAME_NOREPLACE
  ...
2014-08-11 11:44:11 -07:00
J. Bruce Fields
1a0a397e41 dcache: d_obtain_alias callers don't all want DISCONNECTED
There are a few d_obtain_alias callers that are using it to get the
root of a filesystem which may already have an alias somewhere else.

This is not the same as the filehandle-lookup case, and none of them
actually need DCACHE_DISCONNECTED set.

It isn't really a serious problem, but it would really be clearer if we
reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

In the btrfs case this was causing a spurious printk from
nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
dentry.  Josef worked around this by unsetting DCACHE_DISCONNECTED
manually in 3a0dfa6a12 "Btrfs: unset DCACHE_DISCONNECTED when mounting
default subvol", and this replaces that workaround.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
Miklos Szeredi
80ace85c91 btrfs: add RENAME_NOREPLACE
RENAME_NOREPLACE is trivial to implement for most filesystems: switch over
to ->rename2() and check for the supported flags.  The rest is done by the
VFS.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Ingo Molnar
ca5bc6cd5d Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:03:00 +02:00
Linus Torvalds
da83fc6e0f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have two more fixes in my for-linus branch.

  I was hoping to also include a fix for a btrfs deadlock with
  compression enabled, but we're still nailing that one down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: test for valid bdev before kobj removal in btrfs_rm_device
  Btrfs: fix abnormal long waiting in fsync
2014-07-20 20:21:05 -07:00
Eric Sandeen
0bfaa9c5cb btrfs: test for valid bdev before kobj removal in btrfs_rm_device
commit 99994cd btrfs: dev delete should remove sysfs entry
added a btrfs_kobj_rm_device, which dereferences device->bdev...
right after we check whether device->bdev might be NULL.

I don't honestly know if it's possible to have a NULL device->bdev
here, but assuming that it is (given the test), we need to move
the kobject removal to be under that test.

(Coverity spotted this)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00
Liu Bo
98ce2deda2 Btrfs: fix abnormal long waiting in fsync
xfstests generic/127 detected this problem.

With commit 7fc34a62ca, now fsync will only flush
data within the passed range.  This is the cause of the above problem,
-- btrfs's fsync has a stage called 'sync log' which will wait for all the
ordered extents it've recorded to finish.

In xfstests/generic/127, with mixed operations such as truncate, fallocate,
punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will
mmap, and then msync.  And I find that msync will wait for quite a long time
(about 20s in my case), thanks to ftrace, it turns out that the previous
fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the
range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants,
there can be some ordered extents created but not getting corresponding pages
flushed, then they're left in memory until we fsync which runs into the
stage 'sync log', and fsync will just wait for the system writeback thread
to flush those pages and get ordered extents finished, so the latency is
inevitable.

This adds a flush similar to btrfs_start_ordered_extent() in
btrfs_wait_logged_extents() to fix that.

Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00
NeilBrown
743162013d sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and        wait_on_bit_lock to
        wait_on_bit_action and wait_on_bit_lock_action
 to make it explicit that they need an action function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack.  So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS.  CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:39 +02:00
Linus Torvalds
b82207b8e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few fixes in my for-linus branch"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix crash when starting transaction
  Btrfs: fix btrfs_print_leaf for skinny metadata
  Btrfs: fix race of using total_bytes_pinned
  btrfs: use E2BIG instead of EIO if compression does not help
  btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
  Btrfs: fix use-after-free when cloning a trailing file hole
  btrfs: fix null pointer dereference in btrfs_show_devname when name is null
  btrfs: fix null pointer dereference in clone_fs_devices when name is null
  btrfs: fix nossd and ssd_spread mount option regression
  Btrfs: fix race between balance recovery and root deletion
  Btrfs: atomically set inode->i_flags in btrfs_update_iflags
  btrfs: only unlock block in verify_parent_transid if we locked it
  Btrfs: assert send doesn't attempt to start transactions
  btrfs compression: reuse recently used workspace
  Btrfs: fix crash when mounting raid5 btrfs with missing disks
  btrfs: create sprout should rename fsid on the sysfs as well
  btrfs: dev replace should replace the sysfs entry
  btrfs: dev add should add its sysfs entry
  btrfs: dev delete should remove sysfs entry
  btrfs: rename add_device_membership to btrfs_kobj_add_device
2014-07-04 08:53:53 -07:00
Filipe Manana
abdd2e80a5 Btrfs: fix crash when starting transaction
Often when starting a transaction we commit the currently running transaction,
which can end up writing block group caches when the current process has its
journal_info set to NULL (and not to a transaction). This makes our assertion
at btrfs_check_data_free_space() (current_journal != NULL) fail, resulting
in a crash/hang. Therefore fix it by setting journal_info.

Two different traces of this issue follow below.

1)

    [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [51502.242213] ------------[ cut here ]------------
    [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964!
    [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [51502.244010] Call Trace:
    [51502.244010]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [51502.244010]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [51502.244010]  [<ffffffffa0357a6a>] commit_cowonly_roots+0x164/0x226 [btrfs]
    [51502.244010]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [51502.244010]  [<ffffffff8168ec7b>] ? _raw_spin_unlock+0x2b/0x40
    [51502.244010]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
    [51502.244010]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [51502.244010]  [<ffffffffa02d73e1>] __unlink_start_trans+0x31/0xe0 [btrfs]
    [51502.244010]  [<ffffffffa02dea67>] btrfs_unlink+0x37/0xc0 [btrfs]
    [51502.244010]  [<ffffffff811bb054>] ? do_unlinkat+0x114/0x2a0
    [51502.244010]  [<ffffffff811baebc>] vfs_unlink+0xcc/0x150
    [51502.244010]  [<ffffffff811bb1a0>] do_unlinkat+0x260/0x2a0
    [51502.244010]  [<ffffffff811a9ef4>] ? filp_close+0x64/0x90
    [51502.244010]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
    [51502.244010]  [<ffffffff81349cab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [51502.244010]  [<ffffffff811be9eb>] SyS_unlinkat+0x1b/0x40
    [51502.244010]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
    [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [51502.244010] RIP  [<ffffffffa03575da>] assfail.constprop.88+0x1e/0x20 [btrfs]

2)

    [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [25405.097488] ------------[ cut here ]------------
    [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964!
    [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [25405.100008] Call Trace:
    [25405.100008]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [25405.100008]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [25405.100008]  [<ffffffffa035755a>] commit_cowonly_roots+0x164/0x226 [btrfs]
    [25405.100008]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [25405.100008]  [<ffffffff8109c170>] ? bit_waitqueue+0xc0/0xc0
    [25405.100008]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
    [25405.100008]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [25405.100008]  [<ffffffffa02e3407>] btrfs_create+0x47/0x210 [btrfs]
    [25405.100008]  [<ffffffffa02d74cc>] ? btrfs_permission+0x3c/0x80 [btrfs]
    [25405.100008]  [<ffffffff811bc63b>] vfs_create+0x9b/0x130
    [25405.100008]  [<ffffffff811bcf19>] do_last+0x849/0xe20
    [25405.100008]  [<ffffffff811b9409>] ? link_path_walk+0x79/0x820
    [25405.100008]  [<ffffffff811bd5b5>] path_openat+0xc5/0x690
    [25405.100008]  [<ffffffff810ab07d>] ? trace_hardirqs_on+0xd/0x10
    [25405.100008]  [<ffffffff811cdcd2>] ? __alloc_fd+0x32/0x1d0
    [25405.100008]  [<ffffffff811be2a3>] do_filp_open+0x43/0xa0
    [25405.100008]  [<ffffffff811cddf1>] ? __alloc_fd+0x151/0x1d0
    [25405.100008]  [<ffffffff811abcfc>] do_sys_open+0x13c/0x230
    [25405.100008]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
    [25405.100008]  [<ffffffff811abe12>] SyS_open+0x22/0x30
    [25405.100008]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
    [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [25405.100008] RIP  [<ffffffffa03570ca>] assfail.constprop.88+0x1e/0x20 [btrfs]

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:18 -07:00
Josef Bacik
be2c765dff Btrfs: fix btrfs_print_leaf for skinny metadata
We wouldn't actuall print the extent information if we had a skinny metadata
item, this fixes that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:16 -07:00
Liu Bo
d288db5dc0 Btrfs: fix race of using total_bytes_pinned
This percpu counter @total_bytes_pinned is introduced to skip unnecessary
operations of 'commit transaction', it accounts for those space we may free
but are stuck in delayed refs.

And we zero out @space_info->total_bytes_pinned every transaction period so
we have a better idea of how much space we'll actually free up by committing
this transaction.  However, we do the 'zero out' part a little earlier, before
we actually unpin space, so we end up returning ENOSPC when we actually have
free space that's just unpinned from committing transaction.

xfstests/generic/074 complained then.

This fixes it by actually accounting the percpu pinned number when 'unpin',
and since it's protected by space_info->lock, the race is gone now.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:15 -07:00
David Sterba
130d5b415a btrfs: use E2BIG instead of EIO if compression does not help
Return codes got updated in 60e1975acb
(btrfs: return errno instead of -1 from compression)
lzo wrapper returns E2BIG in this case, do the same for zlib.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03 07:04:13 -07:00
David Sterba
0a4eaea892 btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
Commit fcebe4562d (Btrfs: rework qgroup
accounting) removed the qgroup accounting after delayed refs.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03 07:04:12 -07:00
Filipe Manana
14f5979633 Btrfs: fix use-after-free when cloning a trailing file hole
The transaction handle was being used after being freed.

Cc: Chris Mason <clm@fb.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:10 -07:00
Anand Jain
0aeb8a6e67 btrfs: fix null pointer dereference in btrfs_show_devname when name is null
dev->name is null but missing flag is not set.
Strictly speaking the missing flag should have been set, but there
are more places where code just checks if name is null. For now this
patch does the same.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
IP: [<ffffffffa0228908>] btrfs_show_devname+0x58/0xf0 [btrfs]

[<ffffffff81198879>] show_vfsmnt+0x39/0x130
[<ffffffff81178056>] m_show+0x16/0x20
[<ffffffff8117d706>] seq_read+0x296/0x390
[<ffffffff8115aa7d>] vfs_read+0x9d/0x160
[<ffffffff8115b549>] SyS_read+0x49/0x90
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:09 -07:00
Anand Jain
e755f78086 btrfs: fix null pointer dereference in clone_fs_devices when name is null
when one of the device path is missing btrfs_device name is null. So this
patch will check for that.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff812e18c0>] strlen+0x0/0x30
[<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs]
[<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs]
[<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0
[<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs]
[<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0
[<ffffffff81192a61>] ? __blkdev_put+0x171/0x180
[<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590
[<ffffffff81193426>] ? blkdev_put+0x106/0x110
[<ffffffff81179175>] ? mntput+0x35/0x40
[<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0
[<ffffffff8115c72e>] ? ____fput+0xe/0x10
[<ffffffff81068033>] ? task_work_run+0xb3/0xd0
[<ffffffff8116d547>] SyS_ioctl+0x57/0x90
[<ffffffff817a793e>] ? do_page_fault+0xe/0x10
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:07 -07:00
Eric Sandeen
2aa06a35d0 btrfs: fix nossd and ssd_spread mount option regression
The commit

0780253 btrfs: Cleanup the btrfs_parse_options for remount.

broke ssd options quite badly; it stopped making ssd_spread
imply ssd, and it made "nossd" unsettable.

Put things back at least as well as they were before
(though ssd mount option handling is still pretty odd:
# mount -o "nossd,ssd_spread" works?)

Reported-by: Roman Mamedov <rm@romanrm.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:06 -07:00
Wang Shilong
5f3164813b Btrfs: fix race between balance recovery and root deletion
Balance recovery is called when RW mounting or remounting from
RO to RW, it is called to finish roots merging.

When doing balance recovery, relocation root's corresponding
fs root(whose root refs is 0) might be destroyed by cleaner
thread, this will make btrfs fail to mount.

Fix this problem by holding @cleaner_mutex when doing balance
recovery.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:04 -07:00
Filipe Manana
3cc7939255 Btrfs: atomically set inode->i_flags in btrfs_update_iflags
This change is based on the corresponding recent change for ext4:

  ext4: atomically set inode->i_flags in ext4_set_inode_flags()

That has the following commit message that applies to btrfs as well:

  "Use cmpxchg() to atomically set i_flags instead of clearing out the
   S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
   EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
   where an immutable file has the immutable flag cleared for a brief
   window of time."

Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE
and BTRFS_INODE_APPEND, respectively.

Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:03:23 -07:00
Josef Bacik
472b909ff6 btrfs: only unlock block in verify_parent_transid if we locked it
This is a regression from my patch a26e8c9f75, we
need to only unlock the block if we were the one who locked it.  Otherwise this
will trip BUG_ON()'s in locking.c  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:47 -07:00
Filipe Manana
46c4e71e9b Btrfs: assert send doesn't attempt to start transactions
When starting a transaction just assert that current->journal_info
doesn't contain a send transaction stub, since send isn't supposed
to start transactions and when it finishes (either successfully or
not) it's supposed to set current->journal_info to NULL.

This is motivated by the change titled:

    Btrfs: fix crash when starting transaction

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:46 -07:00
Sergey Senozhatsky
c39aa7056f btrfs compression: reuse recently used workspace
Add compression `workspace' in free_workspace() to
`idle_workspace' list head, instead of tail. So we have
better chances to reuse most recently used `workspace'.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:46 -07:00
Liu Bo
5588383ece Btrfs: fix crash when mounting raid5 btrfs with missing disks
The reproducer is

$ mkfs.btrfs D1 D2 D3 -mraid5
$ mkfs.ext4 D2 && mkfs.ext4 D3
$ mount D1 /btrfs -odegraded

-------------------

[   87.672992] ------------[ cut here ]------------
[   87.673845] kernel BUG at fs/btrfs/raid56.c:1828!
...
[   87.673845] RIP: 0010:[<ffffffff813efc7e>]  [<ffffffff813efc7e>] __raid_recover_end_io+0x4ae/0x4d0
...
[   87.673845] Call Trace:
[   87.673845]  [<ffffffff8116bbc6>] ? mempool_free+0x36/0xa0
[   87.673845]  [<ffffffff813f0255>] raid_recover_end_io+0x75/0xa0
[   87.673845]  [<ffffffff81447c5b>] bio_endio+0x5b/0xa0
[   87.673845]  [<ffffffff81447cb2>] bio_endio_nodec+0x12/0x20
[   87.673845]  [<ffffffff81374621>] end_workqueue_fn+0x41/0x50
[   87.673845]  [<ffffffff813ad2aa>] normal_work_helper+0xca/0x2c0
[   87.673845]  [<ffffffff8108ba2b>] process_one_work+0x1eb/0x530
[   87.673845]  [<ffffffff8108b9c9>] ? process_one_work+0x189/0x530
[   87.673845]  [<ffffffff8108c15b>] worker_thread+0x11b/0x4f0
[   87.673845]  [<ffffffff8108c040>] ? rescuer_thread+0x290/0x290
[   87.673845]  [<ffffffff810939c4>] kthread+0xe4/0x100
[   87.673845]  [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220
[   87.673845]  [<ffffffff817e7c7c>] ret_from_fork+0x7c/0xb0
[   87.673845]  [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220

-------------------

It's because that we miscalculate @rbio->bbio->error so that it doesn't
reach maximum of tolerable errors while it should have.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Satoru Takeuchi<takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:45 -07:00
Anand Jain
b2373f255c btrfs: create sprout should rename fsid on the sysfs as well
Creating sprout will change the fsid of the mounted root.
do the same on the sysfs as well.

reproducer:
 mount /dev/sdb /btrfs (seed disk)
 btrfs dev add /dev/sdc /btrfs
 mount -o rw,remount /btrfs
 btrfs dev del /dev/sdb /btrfs
 mount /dev/sdb /btrfs

Error:
kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:44 -07:00
Anand Jain
49c6f736f3 btrfs: dev replace should replace the sysfs entry
when we replace the device its corresponding sysfs
entry has to be replaced as well

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:44 -07:00
Anand Jain
0d39376aa2 btrfs: dev add should add its sysfs entry
we would need the device links to be created,
when device is added.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:43 -07:00
Anand Jain
99994cde9c btrfs: dev delete should remove sysfs entry
when we delete the device from the mounted btrfs,
we would need its corresponding sysfs enty to
be removed as well.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:42 -07:00
Anand Jain
9b4eaf43f4 btrfs: rename add_device_membership to btrfs_kobj_add_device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:41 -07:00
Linus Torvalds
e13d100beb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This fixes some lockups in btrfs reported with rc1.  It probably has
  some performance impact because it is backing off our spinning locks
  more often and switching to a blocking lock.  I'll be able to nail
  that down next week, but for now I want to get the lockups taken care
  of.

  Otherwise some more stack reduction and assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong error handle when the device is missing or is not writeable
  Btrfs: fix deadlock when mounting a degraded fs
  Btrfs: use bio_endio_nodec instead of open code
  Btrfs: fix NULL pointer crash when running balance and scrub concurrently
  btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
  Btrfs: fix broken free space cache after the system crashed
  Btrfs: make free space cache write out functions more readable
  Btrfs: remove unused wait queue in struct extent_buffer
  Btrfs: fix deadlocks with trylock on tree nodes
2014-06-21 14:21:43 -10:00
Miao Xie
8408c716d7 Btrfs: fix wrong error handle when the device is missing or is not writeable
The original bio might be submitted, so we shoud increase bi_remaining to
account for it when we deal with the error that the device is missing or
is not writeable, or we would skip the endio handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie
c55f139640 Btrfs: fix deadlock when mounting a degraded fs
The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
 # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
 # echo 1 > /sys/block/`basename <dev0>`/device/delete
 # mount -o degraded <dev1> <mnt>

The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.

Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.

Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie
e990f16763 Btrfs: use bio_endio_nodec instead of open code
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Wang Shilong
298a8f9cf1 Btrfs: fix NULL pointer crash when running balance and scrub concurrently
While running balance, scrub, fsstress concurrently we hit the
following kernel crash:

[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395]  RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078

This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.

Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Qu Wenruo
ced96edc48 btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.

However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.

So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie
e570fd27f2 Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie
5349d6c3ff Btrfs: make free space cache write out functions more readable
This patch makes the free space cache write out functions more readable,
and beisdes that, it also reduces the stack space that the function --
__btrfs_write_out_cache uses from 194bytes to 144bytes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Filipe Manana
46fefe41b5 Btrfs: remove unused wait queue in struct extent_buffer
The lock_wq wait queue is not used anywhere, therefore just remove it.
On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
bytes down to 296 bytes, which means a 4Kb page can now be used for
13 extent buffers instead of 12.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:28 -07:00
Chris Mason
ea4ebde02e Btrfs: fix deadlocks with trylock on tree nodes
The Btrfs tree trylock function is poorly named.  It always takes
the spinlock and backs off if the blocking lock is held.  This
can lead to surprising lockups because people expect it to really be a
trylock.

This commit makes it a pure trylock, both for the spinlock and the
blocking lock.  It also reworks the nested lock handling slightly to
avoid taking the read lock while a spinning write lock might be held.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:19:55 -07:00
Linus Torvalds
16d52ef7c0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This has a few fixes since our last pull and a new ioctl for doing
  btree searches from userland.  It's very similar to the existing
  ioctl, but lets us return larger items back down to the app"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix error handling in create_pending_snapshot
  btrfs: fix use of uninit "ret" in end_extent_writepage()
  btrfs: free ulist in qgroup_shared_accounting() error path
  Btrfs: fix qgroups sanity test crash or hang
  btrfs: prevent RCU warning when dereferencing radix tree slot
  Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
  btrfs: new ioctl TREE_SEARCH_V2
  btrfs: tree_search, search_ioctl: direct copy to userspace
  btrfs: new function read_extent_buffer_to_user
  btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
  btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
  btrfs: tree_search, search_ioctl: accept varying buffer
  btrfs: tree_search: eliminate redundant nr_items check
2014-06-14 19:48:43 -05:00
Eric Sandeen
47a306a748 btrfs: fix error handling in create_pending_snapshot
fcebe456 cut and pasted some code to a later point
in create_pending_snapshot(), but didn't switch
to the appropriate error handling for this stage
of the function.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:30 -07:00
Eric Sandeen
3e2426bd0e btrfs: fix use of uninit "ret" in end_extent_writepage()
If this condition in end_extent_writepage() is false:

	if (tree->ops && tree->ops->writepage_end_io_hook)

we will then test an uninitialized "ret" at:

	ret = ret < 0 ? ret : -EIO;

The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.

Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.

Signed-of-by: Eric Sandeen <sandeen@redhat.com>

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:28 -07:00
Eric Sandeen
d737278091 btrfs: free ulist in qgroup_shared_accounting() error path
If tmp = ulist_alloc(GFP_NOFS) fails, we return without
freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
and cause a memory leak.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:26 -07:00
Filipe Manana
b050f9f6dd Btrfs: fix qgroups sanity test crash or hang
Often when running the qgroups sanity test, a crash or a hang happened.
This is because the extent buffer the test uses for the root node doesn't
have an header level explicitly set, making it have a random level value.
This is a problem when it's not zero for the btrfs_search_slot() calls
the test ends up doing, resulting in crashes or hangs such as the following:

[ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
(...)
[ 6454.127760] BTRFS: selftest: Running qgroup tests
[ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
[ 6454.127966] BTRFS: selftest: Qgroup basic add
[ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
[ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
[ 6480.152005] irq event stamp: 188448
[ 6480.152005] hardirqs last  enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30
[ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80
[ 6480.152005] softirqs last  enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450
[ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0
[ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4
[ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000
[ 6480.152005] RIP: 0010:[<ffffffff81349a63>]  [<ffffffff81349a63>] __write_lock_failed+0x13/0x20
[ 6480.152005] RSP: 0018:ffff8800d0d038e8  EFLAGS: 00000287
[ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852
[ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8
[ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb
[ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858
[ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0
[ 6480.152005] FS:  00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000
[ 6480.152005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0
[ 6480.152005] Stack:
[ 6480.152005]  ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8
[ 6480.152005]  ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007
[ 6480.152005]  ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16
[ 6480.152005] Call Trace:
[ 6480.152005]  [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0
[ 6480.152005]  [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80
[ 6480.152005]  [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 6480.152005]  [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs]
[ 6480.152005]  [<ffffffff811a727f>] ? create_object+0x23f/0x300
[ 6480.152005]  [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[ 6480.152005]  [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
[ 6480.152005]  [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
[ 6480.152005]  [<ffffffff8108cad6>] ? local_clock+0x16/0x30
[ 6480.152005]  [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
[ 6480.152005]  [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
[ 6480.152005]  [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs]
[ 6480.152005]  [<ffffffff81000352>] do_one_initcall+0x102/0x150
[ 6480.152005]  [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50
[ 6480.152005]  [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74
[ 6480.152005]  [<ffffffff810d91cc>] load_module+0x1cdc/0x2630
(...)

Therefore initialize the extent buffer as an empty leaf (level 0).

Issue easy to reproduce when btrfs is built as a module via:

    $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:24 -07:00
Sasha Levin
f1e3c28949 btrfs: prevent RCU warning when dereferencing radix tree slot
Mark the dereference as protected by lock. Not doing so triggers
an RCU warning since the radix tree assumed that RCU is in use.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:22 -07:00
Wang Shilong
5fbc7c59fd Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
Steps to reproduce:

 # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
 # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device
 # mount /dev/sdb /mnt -o degraded
 # btrfs scrub start -BRd /mnt

This is because readahead would skip missing device, this is not true
for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block
mapping. If expected data locates in missing device, readahead thread
would not call __readahead_hook() which makes event @rc->elems=0
wait forever.

Fix this problem by checking return value of btrfs_map_block(),we
can only skip missing device safely if there are several mirrors.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:21 -07:00
Gerhard Heift
cc68a8a5a4 btrfs: new ioctl TREE_SEARCH_V2
This new ioctl call allows the user to supply a buffer of varying size in which
a tree search can store its results. This is much more flexible if you want to
receive items which are larger than the current fixed buffer of 3992 bytes or
if you want to fetch more items at once. Items larger than this buffer are for
example some of the type EXTENT_CSUM.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-13 09:52:19 -07:00
Gerhard Heift
ba346b357d btrfs: tree_search, search_ioctl: direct copy to userspace
By copying each found item seperatly to userspace, we do not need extra
buffer in the kernel.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:22:05 -07:00
Gerhard Heift
550ac1d85e btrfs: new function read_extent_buffer_to_user
This new function reads the content of an extent directly to user memory.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:56 -07:00
Gerhard Heift
9b6e817d02 btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
If an item in tree_search is too large to be stored in the given buffer, return
the needed size (including the header).

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:47 -07:00
Gerhard Heift
8f5f6178f3 btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
In copy_to_sk, if an item is too large for the given buffer, it now returns
-EOVERFLOW instead of copying a search_header with len = 0. For backward
compatibility for the first item it still copies such a header to the buffer,
but not any other following items, which could have fitted.

tree_search changes -EOVERFLOW back to 0 to behave similiar to the way it
behaved before this patch.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:39 -07:00
Gerhard Heift
1254444288 btrfs: tree_search, search_ioctl: accept varying buffer
rewrite search_ioctl to accept a buffer with varying size

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:26 -07:00
Gerhard Heift
25c9bc2e2b btrfs: tree_search: eliminate redundant nr_items check
If the amount of items reached the given limit of nr_items, we can leave
copy_to_sk without updating the key. Also by returning 1 we leave the loop in
search_ioctl without rechecking if we reached the given limit.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:20:39 -07:00
Linus Torvalds
16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Al Viro
9c1d5284c7 Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus
Backmerge of dcache.c changes from mainline.  It's that, or complete
rebase...

Conflicts:
	fs/splice.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:28:09 -04:00
Linus Torvalds
859862ddd2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The biggest change here is Josef's rework of the btrfs quota
  accounting, which improves the in-memory tracking of delayed extent
  operations.

  I had been working on Btrfs stack usage for a while, mostly because it
  had become impossible to do long stress runs with slab, lockdep and
  pagealloc debugging turned on without blowing the stack.  Even though
  you upgraded us to a nice king sized stack, I kept most of the
  patches.

  We also have some very hard to find corruption fixes, an awesome sysfs
  use after free, and the usual assortment of optimizations, cleanups
  and other fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
  Btrfs: convert smp_mb__{before,after}_clear_bit
  Btrfs: fix scrub_print_warning to handle skinny metadata extents
  Btrfs: make fsync work after cloning into a file
  Btrfs: use right type to get real comparison
  Btrfs: don't check nodes for extent items
  Btrfs: don't release invalid page in btrfs_page_exists_in_range()
  Btrfs: make sure we retry if page is a retriable exception
  Btrfs: make sure we retry if we couldn't get the page
  btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
  trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
  Btrfs: fix leaf corruption after __btrfs_drop_extents
  Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
  Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
  btrfs: free delayed node outside of root->inode_lock
  btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
  Btrfs: fix transaction leak during fsync call
  btrfs: Avoid trucating page or punching hole in a already existed hole.
  Btrfs: update commit root on snapshot creation after orphan cleanup
  Btrfs: ioctl, don't re-lock extent range when not necessary
  Btrfs: avoid visiting all extent items when cloning a range
  ...
2014-06-11 09:22:21 -07:00
Chris Mason
c7548af69d Btrfs: convert smp_mb__{before,after}_clear_bit
The new call is smp_mb__{before,after}_atomic.  The __ gives us extra
protection from the atomic rays.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10 13:10:47 -07:00
Liu Bo
6eda71d0c0 Btrfs: fix scrub_print_warning to handle skinny metadata extents
The skinny extents are intepreted incorrectly in scrub_print_warning(),
and end up hitting the BUG() in btrfs_extent_inline_ref_size.

Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:17 -07:00
Filipe Manana
7ffbb598a0 Btrfs: make fsync work after cloning into a file
When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:16 -07:00
Liu Bo
cd857dd6bc Btrfs: use right type to get real comparison
We want to make sure the point is still within the extent item, not to verify
the memory it's pointing to.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:15 -07:00
Josef Bacik
8a56457f5f Btrfs: don't check nodes for extent items
The backref code was looking at nodes as well as leaves when we tried to
populate extent item entries.  This is not good, and although we go away with it
for the most part because we'd skip where disk_bytenr != random_memory,
sometimes random_memory would match and suddenly boom.  This fixes that problem.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:14 -07:00
Filipe Manana
6fdef6d43c Btrfs: don't release invalid page in btrfs_page_exists_in_range()
In inode.c:btrfs_page_exists_in_range(), if the page we got from
the radix tree is an exception entry, which can't be retried, we
exit the loop with a non-NULL page and then call page_cache_release
against it, which is not ok since it's not a valid page. This could
also make us return true when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:14 -07:00
Filipe Manana
809f901625 Btrfs: make sure we retry if page is a retriable exception
In inode.c:btrfs_page_exists_in_range(), if the page we get from the
radix tree is an exception which should make us retry, set page to
NULL in order to really retry, because otherwise we don't get another
loop iteration executed (page != NULL makes the while loop exit).
This also was making us call page_cache_release after exiting the loop,
which isn't correct because page doesn't point to a valid page, and
possibly return true from the function when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:13 -07:00
Filipe Manana
91405151eb Btrfs: make sure we retry if we couldn't get the page
In inode.c:btrfs_page_exists_in_range(), if we can't get the page
we need to retry. However we weren't retrying because we weren't
setting page to NULL, which makes the while loop exit immediately
and will make us call page_cache_release after exiting the loop
which is incorrect because our page get didn't succeed. This could
also make us return true when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:12 -07:00
Gui Hecheng
c81d57679e btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
To return EOPNOTSUPP is more user friendly than to return EINVAL,
and then user-space tool will show that the dev_replace operation
for raid56 is not currently supported rather than showing that
there is an invalid argument.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:12 -07:00
Antonio Ospite
9391558411 trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
Signed-off-by: Antonio Ospite <ao2@ao2.it>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:11 -07:00
Liu Bo
0b43e04f70 Btrfs: fix leaf corruption after __btrfs_drop_extents
Several reports about leaf corruption has been floating on the list, one of them
points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
after __btrfs_drop_extents(), it's really a rare case but it does exist.

The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().

So in btrfs_next_leaf(), we release the current path to re-search the last key of
the leaf for locating next leaf, and we've taken it into account that there might
be balance operations between leafs during this 'unlock and re-lock' dance, so
we check the path again and advance it if there are now more items available.
But things are a bit different if that last key happens to be removed and balance
gets a bigger key as the last one, and btrfs_search_slot will return it with
ret > 0, IOW, nothing change in this leaf except the new last key, then we think
we're okay because there is no more item balanced in, fine, we thinks we can
go to the next leaf.

However, we should return that bigger key, otherwise we deserve leaf corruption,
for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
it has dropped all extent matched the required range and finish_ordered_io can
safely insert a new extent, but it actually doesn't and ends up a leaf
corruption.

One may be asking that why our locking on extent io tree doesn't work as
expected, ie. it should avoid this kind of race situation.  But in
__btrfs_drop_extents(), we don't always find extents which are included within
our locking range, IOW, extents can start before our searching start, in this
case locking on extent io tree doesn't protect us from the race.

This takes the special case into account.

Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:10 -07:00
Filipe Manana
337c6f6830 Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
We might have had an item with the previous key in the tree right
before we released our path. And after we released our path, that
item might have been pushed to the first slot (0) of the leaf we
were holding due to a tree balance. Alternatively, an item with the
previous key can exist as the only element of a leaf (big fat item).
Therefore account for these 2 cases, so that our callers (like
btrfs_previous_item) don't miss an existing item with a key matching
the previous key we computed above.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:09 -07:00
Filipe Manana
f82a9901b0 Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
If the NO_HOLES feature is enabled holes don't have file extent items in
the btree that represent them anymore. This made the clone operation
ignore the gaps that exist between consecutive file extent items and
therefore not create the holes at the destination. When not using the
NO_HOLES feature, the holes were created at the destination.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:09 -07:00
Jeff Mahoney
964930312a btrfs: free delayed node outside of root->inode_lock
On heavy workloads, we're seeing soft lockup warnings on
root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit
is to reduce the size of the critical section.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:08 -07:00
Gui Hecheng
902c68a4da btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
To be accurate about the error case,
if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:07 -07:00
Filipe Manana
b05fd8742f Btrfs: fix transaction leak during fsync call
If btrfs_log_dentry_safe() returns an error, we set ret to 1 and
fall through with the goal of committing the transaction. However,
in the case where the inode doesn't need a full sync, we would call
btrfs_wait_ordered_range() against the target range for our inode,
and if it returned an error, we would return without commiting or
ending the transaction.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:06 -07:00
Qu Wenruo
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.
btrfs_punch_hole() will truncate unaligned pages or punch hole on a
already existed hole.
This will cause unneeded zero page or holes splitting the original huge
hole.

This patch will skip already existed holes before any page truncating or
hole punching.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:06 -07:00
Filipe Manana
3821f34888 Btrfs: update commit root on snapshot creation after orphan cleanup
On snapshot creation (either writable or read-only), we do orphan cleanup
against the root of the snapshot. If the cleanup did remove any orphans,
then the current root node will be different from the commit root node
until the next transaction commit happens.

A send operation always uses the commit root of a snapshot - this means
it will see the orphans if it starts computing the send stream before the
next transaction commit happens (triggered by a timer or sync() for .e.g),
which is when the commit root gets assigned a reference to current root,
where the orphans are not visible anymore. The consequence of send seeing
the orphans is explained below.

For example:

    mkfs.btrfs -f /dev/sdd
    mount -o commit=999 /dev/sdd /mnt

    # open a file with O_TMPFILE and leave it open
    # write some data to the file
    btrfs subvolume snapshot -r /mnt /mnt/snap1

    btrfs send /mnt/snap1 -f /tmp/send.data

The send operation will fail with the following error:

    ERROR: send ioctl failed with -116: Stale file handle

What happens here is that our snapshot has an orphan inode still visible
through the commit root, that corresponds to the tmpfile. However send
will attempt to call inode.c:btrfs_iget(), with the goal of reading the
file's data, which will return -ESTALE because it will use the current
root (and not the commit root) of the snapshot.

Of course, there are other cases where we can get orphans, but this
example using a tmpfile makes it much easier to reproduce the issue.

Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
the commit root is different from the current root, just commit the
transaction associated with the snapshot's root (if it exists), so that
a send will not see any orphans that don't exist anymore. This also
guarantees a send will always see the same content regardless of whether
a transaction commit happened already before the send was requested and
after the orphan cleanup (meaning the commit root and current roots are
the same) or it hasn't happened yet (commit and current roots are
different).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:05 -07:00
Filipe Manana
ff5df9b884 Btrfs: ioctl, don't re-lock extent range when not necessary
In ioctl.c:lock_extent_range(), after locking our target range, the
ordered extent that btrfs_lookup_first_ordered_extent() returns us
may not overlap our target range at all. In this case we would just
unlock our target range, wait for any new ordered extents that overlap
the range to complete, lock again the range and repeat all these steps
until we don't get any ordered extent and the delalloc flag isn't set
in the io tree for our target range.

Therefore just stop if we get an ordered extent that doesn't overlap
our target range and the dealalloc flag isn't set for the range in
the inode's io tree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:04 -07:00
Filipe Manana
2c463823cb Btrfs: avoid visiting all extent items when cloning a range
When cloning a range of a file, we were visiting all the extent items in
the btree that belong to our source inode. We don't need to visit those
extent items that don't overlap the range we are cloning, as doing so only
makes us waste time and do unnecessary btree navigations (btrfs_next_leaf)
for inodes that have a large number of file extent items in the btree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:04 -07:00
Filipe Manana
c55bfa67e9 Btrfs: set dead flag on the right root when destroying snapshot
We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the
parent of our target snapshot, instead of setting it in the target
snapshot's root.

This is easy to observe by running the following scenario:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt

    btrfs subvolume create /mnt/first_subvol
    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    btrfs subvolume delete /mnt/first_subvol
    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data

The send command failed because the send ioctl returned -EPERM.
A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:03 -07:00
Filipe Manana
c125b8bff1 Btrfs: ensure readers see new data after a clone operation
We were cleaning the clone target file range from the page cache before
we did replace the file extent items in the fs tree. This was racy,
as right after cleaning the relevant range from the page cache and before
replacing the file extent items, a read against that range could be
performed by another task and populate again the page cache with stale
data (stale after the cloning finishes). This would result in reads after
the clone operation successfully finishes to get old data (and potentially
for a very long time). Therefore evict the pages after replacing the file
extent items, so that subsequent reads will always get the new data.

Similarly, we were prone to races while cloning the file extent items
because we weren't locking the target range and wait for any existing
ordered extents against that range to complete. It was possible that
after cloning the extent items, a write operation that was performed
before the clone operation and overlaps the same range, would end up
undoing all or part of the work the clone operation did (a worker task
running inode.c:btrfs_finish_ordered_io). Therefore lock the target
range in the io tree, wait for all pending ordered extents against that
range to finish and then safely perform the cloning.

The issue of reading stale data after the clone operation is easy to
reproduce by running the following C program in a loop until it exits
with return value 1.

 #include <unistd.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <errno.h>
 #include <pthread.h>
 #include <fcntl.h>
 #include <assert.h>
 #include <asm/types.h>
 #include <linux/ioctl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/ioctl.h>

 #define SRC_FILE "/mnt/sdd/foo"
 #define DST_FILE "/mnt/sdd/bar"
 #define FILE_SIZE (16 * 1024)
 #define PATTERN_SRC 'X'
 #define PATTERN_DST 'Y'

struct btrfs_ioctl_clone_range_args {
	__s64 src_fd;
	__u64 src_offset, src_length;
	__u64 dest_offset;
};

 #define BTRFS_IOCTL_MAGIC 0x94
 #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
				   struct btrfs_ioctl_clone_range_args)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int clone_done = 0;
static int reader_ready = 0;
static int stale_data = 0;

static void *reader_loop(void *arg)
{
	char buf[4096], want_buf[4096];

	memset(want_buf, PATTERN_SRC, 4096);
	pthread_mutex_lock(&mutex);
	reader_ready = 1;
	pthread_mutex_unlock(&mutex);

	while (1) {
		int done, fd, ret;

		fd = open(DST_FILE, O_RDONLY);
		assert(fd != -1);

		pthread_mutex_lock(&mutex);
		done = clone_done;
		pthread_mutex_unlock(&mutex);

		ret = read(fd, buf, 4096);
		assert(ret == 4096);
		close(fd);

		if (done) {
			ret = memcmp(buf, want_buf, 4096);
			if (ret == 0) {
				printf("Found new content\n");
			} else {
				printf("Found old content\n");
				pthread_mutex_lock(&mutex);
				stale_data = 1;
				pthread_mutex_unlock(&mutex);
			}
			break;
		}
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t reader;
	int ret, i, fd;
	struct btrfs_ioctl_clone_range_args clone_args;
	int fd1, fd2;

	ret = remove(SRC_FILE);
	if (ret == -1 && errno != ENOENT) {
		fprintf(stderr, "Error deleting src file: %s\n", strerror(errno));
		return 1;
	}
	ret = remove(DST_FILE);
	if (ret == -1 && errno != ENOENT) {
		fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno));
		return 1;
	}

	fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
	assert(fd != -1);
	for (i = 0; i < FILE_SIZE; i++) {
		char c = PATTERN_SRC;
		ret = write(fd, &c, 1);
		assert(ret == 1);
	}
	close(fd);
	fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
	assert(fd != -1);
	for (i = 0; i < FILE_SIZE; i++) {
		char c = PATTERN_DST;
		ret = write(fd, &c, 1);
		assert(ret == 1);
	}
	close(fd);
        sync();

	ret = pthread_create(&reader, NULL, reader_loop, NULL);
	assert(ret == 0);
	while (1) {
		int r;
		pthread_mutex_lock(&mutex);
		r = reader_ready;
		pthread_mutex_unlock(&mutex);
		if (r) break;
	}

	fd1 = open(SRC_FILE, O_RDONLY);
	if (fd1 < 0) {
		fprintf(stderr, "Error open src file: %s\n", strerror(errno));
		return 1;
	}
	fd2 = open(DST_FILE, O_RDWR);
	if (fd2 < 0) {
		fprintf(stderr, "Error open dst file: %s\n", strerror(errno));
		return 1;
	}
	clone_args.src_fd = fd1;
	clone_args.src_offset = 0;
	clone_args.src_length = 4096;
	clone_args.dest_offset = 0;
	ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args);
	assert(ret == 0);
	close(fd1);
	close(fd2);

	pthread_mutex_lock(&mutex);
	clone_done = 1;
	pthread_mutex_unlock(&mutex);
	ret = pthread_join(reader, NULL);
	assert(ret == 0);

	pthread_mutex_lock(&mutex);
	ret = stale_data ? 1 : 0;
	pthread_mutex_unlock(&mutex);
	return ret;
}

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:02 -07:00
Rickard Strandqvist
8321cf2596 fs: btrfs: volumes.c: Fix for possible null pointer dereference
There is otherwise a risk of a possible null pointer dereference.

Was largely found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:01 -07:00
Jeff Mahoney
c1895442be btrfs: allocate raid type kobjects dynamically
We are currently allocating space_info objects in an array when we
allocate space_info. When a user does something like:

# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
# btrfs balance start -mconvert=single -dconvert=single /mnt -f
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /

We can end up with memory corruption since the kobject hasn't
been reinitialized properly and the name pointer was left set.

The rationale behind allocating them statically was to avoid
creating a separate kobject container that just contained the
raid type. It used the index in the array to determine the index.

Ultimately, though, this wastes more memory than it saves in all
but the most complex scenarios and introduces kobject lifetime
questions.

This patch allocates the kobjects dynamically instead. Note that
we also remove the kobject_get/put of the parent kobject since
kobject_add and kobject_del do that internally.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:01 -07:00
Filipe Manana
7e3ae33efa Btrfs: send, use the right limits for xattr names and values
We were limiting the sum of the xattr name and value lengths to PATH_MAX,
which is not correct, specially on filesystems created with btrfs-progs
v3.12 or higher, where the default leaf size is max(16384, PAGE_SIZE), or
systems with page sizes larger than 4096 bytes.

Xattrs have their own specific maximum name and value lengths, which depend
on the leaf size, therefore use these limits to be able to send xattrs with
sizes larger than PATH_MAX.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:00 -07:00
Filipe Manana
1af56070e3 Btrfs: send, don't error in the presence of subvols/snapshots
If we are doing an incremental send and the base snapshot has a
directory with name X that doesn't exist anymore in the second
snapshot and a new subvolume/snapshot exists in the second snapshot
that has the same name as the directory (name X), the incremental
send would fail with -ENOENT error. This is because it attempts
to lookup for an inode with a number matching the objectid of a
root, which doesn't exist.

Steps to reproduce:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt

    mkdir /mnt/testdir
    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    rmdir /mnt/testdir
    btrfs subvolume create /mnt/testdir
    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data

A test case for xfstests follows.

Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:59 -07:00
Chris Mason
a79b7d4b3e Btrfs: async delayed refs
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.

This farms them off to async helper threads.  The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Chris Mason
40f765805f Btrfs: split up __extent_writepage to lower stack usage
__extent_writepage has two unrelated parts.  First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.

This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Alex Gartrell
fc4adbff82 btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking
In these instances, we are trying to determine if a page has been accessed
since we began the operation for the sake of retry.  This is easily
accomplished by doing a gang lookup in the page mapping radix tree, and it
saves us the dependency on the flag (so that we might eventually delete
it).

btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
the radix tree look up with a gang lookup of 1, so that we can find the
next highest page >= index and see if it falls into our lock range.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Alex Gartrell <agartrell@fb.com>
2014-06-09 17:20:57 -07:00
Chris Mason
0e378df15c Btrfs: cut down stack usage in btree_write_cache_pages
This adds noinline_for_stack to two helpers used by
btree_write_cache_pages.  It shaves us down from 424 bytes on the
stack to 280.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:56 -07:00
Chris Mason
d4452bc526 Btrfs: break up __btrfs_write_out_cache to cut down stack usage
__btrfs_write_out_cache was one of our stack pigs.  This breaks it
up into helper functions and slims it down to 194 bytes.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:55 -07:00
Josef Bacik
2a10840945 Btrfs: free tmp ulist for qgroup rescan
Memory leaks are bad mmkay?

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:55 -07:00
Anand Jain
402a0f4759 btrfs: usage error should not be logged into system log
I have an opinion that system logs /var/log/messages are
valuable info to investigate the real system issues at
the data center. People handling data center issues
do spend a lot time and efforts analyzing messages
files. Having usage error logged into /var/log/messages
is something we should avoid.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:54 -07:00
David Sterba
67a77eb147 btrfs: remove newline from inode cache kthread name
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:53 -07:00
David Sterba
351fd35321 btrfs: remove stale newlines from log messages
I've noticed an extra line after "use no compression", but search
revealed much more in messages of more critical levels and rare errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:53 -07:00
Chris Mason
7d78874273 Btrfs: fix double free in find_lock_delalloc_range
We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
2014-06-09 17:20:52 -07:00
ZhangZhen
58dfae6365 btrfs: replace simple_strtoull() with kstrtoull()
use the newer and more pleasant kstrtoull() to replace simple_strtoull(),
because simple_strtoull() is marked for obsoletion.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:51 -07:00
Wang Shilong
298658414a Btrfs: set right total device count for seeding support
Seeding device support allows us to create a new filesystem
based on existed filesystem.

However newly created filesystem's @total_devices should include seed
devices. This patch fix the following problem:

 # mkfs.btrfs -f /dev/sdb
 # btrfstune -S 1 /dev/sdb
 # mount /dev/sdb /mnt
 # btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1
 # umount /mnt
 # mount /dev/sdc /mnt               --->fs_devices->total_devices = 2

This is because we record right @total_devices in superblock, but
@fs_devices->total_devices is reset to be 0 in btrfs_prepare_sprout().

Fix this problem by not resetting @fs_devices->total_devices.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:50 -07:00
Guangliang Zhao
45ff35d6b9 Btrfs: remove OPT_acl parse when acl disabled
Even CONFIG_BTRFS_FS_POSIX_ACL is not defined, the acl still could
been enabled using a mount option, and now fs/btrfs/acl.o is not
built, so the mount options will appear to be supported but will
be silently ignored.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:50 -07:00
Josef Bacik
faa2dbf004 Btrfs: add sanity tests for new qgroup accounting code
This exercises the various parts of the new qgroup accounting code.  We do some
basic stuff and do some things with the shared refs to make sure all that code
works.  I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:49 -07:00
Josef Bacik
fcebe4562d Btrfs: rework qgroup accounting
Currently qgroups account for space by intercepting delayed ref updates to fs
trees.  It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly.  The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together.  Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc.  This patch
accomplishes this by only adding qgroup operations for real ref changes.  We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time.  This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq:  we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence.  This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point.  This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:48 -07:00
Liu Bo
5dca6eea91 Btrfs: mark mapping with error flag to report errors to userspace
According to commit 865ffef379
(fs: fix fsync() error reporting),
it's not stable to just check error pages because pages can be
truncated or invalidated, we should also mark mapping with error
flag so that a later fsync can catch the error.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:47 -07:00
Liu Bo
29cc83f69c Btrfs: fix NULL pointer crash of deleting a seed device
Same as normal devices, seed devices should be initialized with
fs_info->dev_root as well, otherwise we'll get a NULL pointer crash.

Cc: Chris Murphy <lists@colorremedies.com>
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:47 -07:00
Wang Shilong
f017f15f7c Btrfs: fix joining same transaction handle more than twice
We hit something like the following function call flows:

|->run_delalloc_range()
 |->btrfs_join_transaction()
   |->cow_file_range()
     |->btrfs_join_transaction()
       |->find_free_extent()
         |->btrfs_join_transaction()

Trace infomation can be seen as:

[ 7411.127040] ------------[ cut here ]------------
[ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]()
[ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G           O 3.13.0+ #4
[ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
[ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
[ 7411.127092] Call Trace:
[ 7411.127097]  [<ffffffff815b87b0>] dump_stack+0x45/0x56
[ 7411.127101]  [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0
[ 7411.127102]  [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20
[ 7411.127109]  [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs]
[ 7411.127115]  [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs]
[ 7411.127120]  [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs]
[ 7411.127126]  [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
[ 7411.127131]  [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs]
[ 7411.127137]  [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
[ 7411.127142]  [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs]
[ 7411.127146]  [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
[ 7411.127151]  [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
[ 7411.127157]  [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
[ 7411.127163]  [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
[ 7411.127169]  [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs]
[ 7411.127171]  [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140
[ 7411.127176]  [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[ 7411.127182]  [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs]
[ 7411.127187]  [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs]
[ 7411.127194]  [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs]
[ 7411.127200]  [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs]
[ 7411.127207]  [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs]
[ 7411.127212]  [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs]
[ 7411.127219]  [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs]
[ 7411.127222]  [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330
[ 7411.127228]  [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs]

Here we fix it by avoiding joining transaction again if we have held
a transaction handle when allocating chunk in find_free_extent().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:46 -07:00
Miao Xie
995946dd29 Btrfs: use helpers for last_trans_log_full_commit instead of opencode
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Filipe Manana
1f21ef0a34 Btrfs: check if items are ordered when a leaf is marked dirty
To ease finding bugs during development related to modifying btree leaves
in such a way that it makes its items not sorted by key anymore. Since this
is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
is set, which isn't meant to be enabled for regular users.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Filipe Manana
35045bf2fd Btrfs: don't access non-existent key when csum tree is empty
When the csum tree is empty, our leaf (path->nodes[0]) has a number
of items equal to 0 and since btrfs_header_nritems() returns an
unsigned integer (and so is our local nritems variable) the following
comparison always evaluates to false:

     if (path->slots[0] >= nritems - 1) {

As the casting rules lead to:

     if ((u32)0 >= (u32)4294967295) {

This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf
some lines below:

    btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
    if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
        found_key.type != BTRFS_EXTENT_CSUM_KEY) {
		found_next = 1;
		goto insert;
    }

So just don't access such non-existent slot and don't set found_next to 1
when the tree is empty. It's very unlikely we'll get a random key with the
objectid and type values above, which is where we could go into trouble.

If nritems is 0, just set found_next to 1 anyway as it will make us insert
a csum item covering our whole extent (or the whole leaf) when the tree is
empty.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:44 -07:00
Wang Shilong
de348ee022 Btrfs: make sure there are not any read requests before stopping workers
In close_ctree(), after we have stopped all workers,there maybe still
some read requests(for example readahead) to submit and this *maybe* trigger
an oops that user reported before:

kernel BUG at fs/btrfs/async-thread.c:619!

By hacking codes, i can reproduce this problem with one cpu available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.

Thanks to Miao for pointing out this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:43 -07:00
Tsutomu Itoh
59885b3930 Btrfs: fix possible memory leak in btrfs_create_tree()
In btrfs_create_tree(), if btrfs_insert_root() fails, we should
free root->commit_root.

Reported-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:42 -07:00
ZhangZhen
776e4aae55 btrfs: remove useless ACL check
posix_acl_xattr_set() already does the check, and it's the only
way to feed in an ACL from userspace.
So the check here is useless, remove it.

Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:42 -07:00
Anand Jain
4d90d28b1c btrfs: btrfs_rm_device() should zero mirror SB as well
This fix will ensure all SB copies on the disk is zeroed
when the disk is intentionally removed. This helps to
better manage disks in the user land.

This version of patch also merges the Zach patch as below.

 btrfs: don't double brelse on device rm

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:41 -07:00
Miao Xie
27cdeb7096 Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Filipe Manana
f959492fc1 Btrfs: send, fix more issues related to directory renames
This is a continuation of the previous changes titled:

   Btrfs: fix incremental send's decision to delay a dir move/rename
   Btrfs: part 2, fix incremental send's decision to delay a dir move/rename

There's a few more cases where a directory rename/move must be delayed which was
previously overlooked. If our immediate ancestor has a lower inode number than
ours and it doesn't have a delayed rename/move operation associated to it, it
doesn't mean there isn't any non-direct ancestor of our current inode that needs
to be renamed/moved before our current inode (i.e. with a higher inode number
than ours).

So we can't stop the search if our immediate ancestor has a lower inode number than
ours, we need to navigate the directory hierarchy upwards until we hit the root or:

1) find an ancestor with an higher inode number that was renamed/moved in the send
   root too (or already has a pending rename/move registered);
2) find an ancestor that is a new directory (higher inode number than ours and
   exists only in the send root).

Reproducer for case 1)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir -p /mnt/a/c/d
    $ mkdir /mnt/a/b/e
    $ mkdir /mnt/a/c/d/f
    $ mv /mnt/a/b /mnt/a/c/d/2b
    $ mkdir /mnt/a/x
    $ mkdir /mnt/a/y

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/x /mnt/a/y
    $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
    $ mv /mnt/a/c/d /mnt/a/h/2d
    $ mv /mnt/a/c /mnt/a/h/2d/2b/2c

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

Simple reproducer for case 2)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir /mnt/a/c
    $ mv /mnt/a/b /mnt/a/c/b2
    $ mkdir /mnt/a/e

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/c/b2 /mnt/a/e/b3
    $ mkdir /mnt/a/e/b3/f
    $ mkdir /mnt/a/h
    $ mv /mnt/a/c /mnt/a/e/b3/f/c2
    $ mv /mnt/a/e /mnt/a/h/e2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

Another simple reproducer for case 2)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir /mnt/a/c
    $ mkdir /mnt/a/b/d
    $ mkdir /mnt/a/c/e

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mkdir /mnt/a/b/d/f
    $ mkdir /mnt/a/b/g
    $ mv /mnt/a/c/e /mnt/a/b/g/e2
    $ mv /mnt/a/c /mnt/a/b/d/f/c2
    $ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

More complex reproducer for case 2)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir -p /mnt/a/c/d
    $ mkdir /mnt/a/b/e
    $ mkdir /mnt/a/c/d/f
    $ mv /mnt/a/b /mnt/a/c/d/2b
    $ mkdir /mnt/a/x
    $ mkdir /mnt/a/y

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/x /mnt/a/y
    $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
    $ mv /mnt/a/c/d /mnt/a/h/2d
    $ mv /mnt/a/c /mnt/a/h/2d/2b/2c

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

For both cases the incremental send would enter an infinite loop when building
path strings.

While solving these cases, this change also re-implements the code to detect
when directory moves/renames should be delayed. Instead of dealing with several
specific cases separately, it's now more generic handling all cases with a simple
detection algorithm and if when applying a delayed move/rename there's a path loop
detected, it further delays the move/rename registering a new ancestor inode as
the dependency inode (so our rename happens after that ancestor is renamed).

Tests for these cases is being added to xfstests too.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Filipe Manana
a10c40766c Btrfs: send, remove dead code from __get_cur_name_and_parent
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:39 -07:00
Filipe Manana
c992ec94f2 Btrfs: send, account for orphan directories when building path strings
If we have directories with a pending move/rename operation, we must take into
account any orphan directories that got created before executing the pending
move/rename. Those orphan directories are directories with an inode number higher
then the current send progress and that don't exist in the parent snapshot, they
are created before current progress reaches their inode number, with a generated
name of the form oN-M-I and at the root of the filesystem tree, and later when
progress matches their inode number, moved/renamed to their final location.

Reproducer:

          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt

          $ mkdir -p /mnt/a/b/c/d
          $ mkdir /mnt/a/b/e
          $ mv /mnt/a/b/c /mnt/a/b/e/CC
          $ mkdir /mnt/a/b/e/CC/d/f
	  $ mkdir /mnt/a/g

          $ btrfs subvolume snapshot -r /mnt /mnt/snap1
          $ btrfs send /mnt/snap1 -f /tmp/base.send

          $ mkdir /mnt/a/g/h
	  $ mv /mnt/a/b/e /mnt/a/g/h/EE
          $ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD

          $ btrfs subvolume snapshot -r /mnt /mnt/snap2
          $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The second receive command failed with the following error:

    ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:38 -07:00
Filipe Manana
b46ab97bcd Btrfs: send, avoid unnecessary inode item lookup in the btree
Regardless of whether the caller is interested or not in knowing the inode's
generation (dir_gen != NULL), get_first_ref always does a btree lookup to get
the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which
is in some cases).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:37 -07:00
Gui Hecheng
23f8f9b7ca btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel space
For RAID0,5,6,10,
For system chunk, there shouldn't be too many stripes to
make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE
For data/meta chunk, there shouldn't be too many stripes to
make a btrfs_chunk that exceeds a leaf.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:36 -07:00
Gui Hecheng
5f43f86e3f btrfs: fix wrong max system array size check in kernel space
For system chunk array,
We copy a "disk_key" and an chunk item each time,
so there should be enough space to hold both of them,
not only the chunk item.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:36 -07:00
Qu Wenruo
65d33fd7a6 btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots.
Current btrfs_orphan_cleanup will also cleanup roots which is already in
fs_info->dead_roots without protection.
This will have conditional race with fs_info->cleaner_kthread.

This patch will use refs in root->root_item to detect roots in
dead_roots and avoid conflicts.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:35 -07:00
Miao Xie
21c7e75654 Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.

So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.

Here is my test result(Tested by compilebench):
 Memory:	2GB
 CPU:		2Cores * 1CPU
 Partition:	40GB(SSD)

Test command:
 # compilebench -D <mnt> -m

Without this patch:
 intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
 compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
 read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
 delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
 intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
 compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
 read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
 delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:34 -07:00
Miao Xie
32d6b47fe6 Btrfs: output warning instead of error when loading free space cache failed
If we fail to load a free space cache, we can rebuild it from the extent tree,
so it is not a serious error, we should not output a error message that
would make the users uncomfortable. This patch uses warning message instead
of it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:33 -07:00
Qu Wenruo
5a1972bd9f btrfs: Add ctime/mtime update for btrfs device add/remove.
Btrfs will send uevent to udev inform the device change,
but ctime/mtime for the block device inode is not udpated, which cause
libblkid used by btrfs-progs unable to detect device change and use old
cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an
error message.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:33 -07:00
David Sterba
61155aa04e btrfs: assert that send is not in progres before root deletion
CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:32 -07:00
David Sterba
521e0546c9 btrfs: protect snapshots from deleting during send
The patch "Btrfs: fix protection between send and root deletion"
(18f687d538) does not actually prevent to delete the snapshot
and just takes care during background cleaning, but this seems rather
user unfriendly, this patch implements the idea presented in

http://www.spinics.net/lists/linux-btrfs/msg30813.html

- add an internal root_item flag to denote a dead root
- check if the send_in_progress is set and refuse to delete, otherwise
  set the flag and proceed
- check the flag in send similar to the btrfs_root_readonly checks, for
  all involved roots

The root lookup in send via btrfs_read_fs_root_no_name will check if the
root is really dead or not. If it is, ENOENT, aborted send. If it's
alive, it's protected by send_in_progress, send can continue.

CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:31 -07:00
Daeseok Youn
944a4515b2 btrfs: remove redundant null check in btrfs_dentry_release()
It doesn't need to check NULL for kfree()

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:31 -07:00
Filipe Manana
ef3b9af50b Btrfs: implement inode_operations callback tmpfile
This implements the tmpfile callback of struct inode_operations, introduced
in the linux kernel 3.11, and implemented already by some filesystems. This
callback is invoked by the VFS when the flag O_TMPFILE is passed to the open
system call.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-06-09 17:20:30 -07:00
David Sterba
e4ef90ff61 btrfs: make FS_INFO ioctl available to anyone
This ioctl provides basic info about the filesystem that can be obtained
in other ways (eg. sysfs), there's no reason to restrict it to
CAP_SYSADMIN.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:29 -07:00
David Sterba
7d6213c5a7 btrfs: make DEV_INFO ioctl available to anyone
This ioctl provides basic info about the devices that can be obtained in
other ways (eg. sysfs), there's no reason to restrict it to
CAP_SYSADMIN.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:28 -07:00
David Sterba
df93589a17 btrfs: export more from FS_INFO to sysfs
Similar to the FS_INFO updates, export the basic filesystem info through
sysfs: node size, sector size and clone alignment.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:28 -07:00
David Sterba
80a773fbfc btrfs: retrieve more info from FS_INFO ioctl
Provide the basic information about filesystem through the ioctl:
* b-tree node size (same as leaf size)
* sector size
* expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments

Backward compatibility: if the values are 0, kernel does not provide
this information, the applications should ignore them.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:27 -07:00
David Sterba
7d824b6f9c btrfs: balance filter: add limit of processed chunks
This started as debugging helper, to watch the effects of converting
between raid levels on multiple devices, but could be useful standalone.

In my case the usage filter was not finegrained enough and led to
converting too many chunks at once. Another example use is in connection
with drange+devid or vrange filters that allow to work with a specific
chunk or even with a chunk on a given device.

The limit filter applies last, the value of 0 means no limiting.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:26 -07:00
Filipe Manana
fc19c5e736 Btrfs: fix leaf corruption caused by ENOSPC while hole punching
While running a stress test with multiple threads writing to the same btrfs
file system, I ended up with a situation where a leaf was corrupted in that
it had 2 file extent item keys that had the same exact key. I was able to
detect this quickly thanks to the following patch which triggers an assertion
as soon as a leaf is marked dirty if there are duplicated keys or out of order
keys:

    Btrfs: check if items are ordered when a leaf is marked dirty
    (https://patchwork.kernel.org/patch/3955431/)

Basically while running the test, I got the following in dmesg:

    [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
    (...)
    [28877.415917] Call Trace:
    [28877.415922]  [<ffffffff816f1189>] dump_stack+0x4e/0x68
    [28877.415926]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
    [28877.415929]  [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
    [28877.415944]  [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
    [28877.415949]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
    [28877.415962]  [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
    [28877.415972]  [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
    [28877.415984]  [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
    (...)
    [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
    [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
    (...)
    [29854.132637] 	item 23 key (3486 108 667648) itemoff 2694 itemsize 53
    [29854.132638] 		extent data disk bytenr 14574411776 nr 286720
    [29854.132639] 		extent data offset 0 nr 286720 ram 286720
    [29854.132640] 	item 24 key (3486 108 954368) itemoff 2641 itemsize 53
    [29854.132641] 		extent data disk bytenr 0 nr 0
    [29854.132643] 		extent data offset 0 nr 0 ram 0
    [29854.132644] 	item 25 key (3486 108 954368) itemoff 2588 itemsize 53
    [29854.132645] 		extent data disk bytenr 8699670528 nr 77824
    [29854.132646] 		extent data offset 0 nr 77824 ram 77824
    [29854.132647] 	item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
    [29854.132648] 		extent data disk bytenr 8699670528 nr 77824
    [29854.132649] 		extent data offset 0 nr 77824 ram 77824
    (...)
    [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
    (...)
    [29854.132771] Call Trace:
    [29854.132779]  [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
    [29854.132791]  [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
    [29854.132794]  [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
    [29854.132797]  [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
    [29854.132800]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
    [29854.132810]  [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
    [29854.132820]  [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
    [29854.132830]  [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
    (...)

So this is caused by getting an -ENOSPC error while punching a file hole, more
specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop
of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
or more file extent items due to lack of enough free space. When this happens,
in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
block reservation object to root->fs_info->trans_block_rsv, end our transaction and
start a new transaction basically - and, we keep increasing our current offset
(cur_offset) as long as it's smaller than the end of the target range (lockend) -
this makes use leave the loop with cur_offset == drop_end which in turn makes us
call fill_holes() for inserting a file extent item that represents a 0 bytes range
hole (and this insertion succeeds, as in the meanwhile more space became available).

This 0 bytes file hole extent item is a problem because any subsequent caller of
__btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a
start file offset that is equal to the offset of the hole, will not remove this
extent item due to the following conditional in the while loop of
__btrfs_drop_extents:

    if (extent_end <= search_start) {
            path->slots[0]++;
            goto next_slot;
    }

This later makes the call to setup_items_for_insert() (at the very end of
__btrfs_drop_extents), insert a new file extent item with the same offset as
the 0 bytes file hole extent item that follows it. Needless is to say that this
causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
where we perform leaf sanity checks or in subsequent operations that manipulate
file extent items, as in the fallocate call as shown by the dmesg trace above.

Without my other patch to perform the leaf sanity checks once a leaf is marked
as dirty (if the integrity checker is enabled), it would have been much harder
to debug this issue.

This change might fix a few similar issues reported by users in the mailing
list regarding assertion failures in btrfs_set_item_key_safe calls performed
by __btrfs_drop_extents, such as the following report:

    http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938

Asking fill_holes() to create a 0 bytes wide file hole item also produced the
first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
that has an end smaller (by -1) than its start.

On 3.14 kernels this issue manifests itself through leaf corruption, as we get
duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
instead we have callers of __btrfs_drop_extents(), namely the functions
inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
btrfs_insert_empty_item() to insert the new file extent item, which would fail with
error -EEXIST, instead of inserting a duplicated key - which is still a serious
issue as it would make all similar file extent item replace operations keep
failing if they target the same file range.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:26 -07:00
Liu Bo
d2cbf2a260 Btrfs: do not increment on bio_index one by one
'bio_index' is just a index, it's really not necessary to do increment
one by one.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:25 -07:00
Filipe Manana
a1a50f60a6 Btrfs: read inode size after acquiring the mutex when punching a hole
In a previous change, commit 12870f1c9b,
I accidentally moved the roundup of inode->i_size to outside of the
critical section delimited by the inode mutex, which is not atomic and
not correct since the size can be changed by other task before we acquire
the mutex. Therefore fix it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:24 -07:00
Tobias Klauser
7fb18a0664 btrfs: Remove unnecessary check for NULL
iput() already checks for the inode being NULL, thus it's unnecessary to
check before calling.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:23 -07:00
Zach Brown
166ae5a418 btrfs: fix inline compressed read err corruption
uncompress_inline() is dropping the error from btrfs_decompress() after
testing it and zeroing the page that was supposed to hold decompressed
data.  This can silently turn compressed inline data in to zeros if
decompression fails due to corrupt compressed data or memory allocation
failure.

I verified this by manually forcing the error from btrfs_decompress()
for a silly named copy of od:

	if (!strcmp(current->comm, "failod"))
		ret = -ENOMEM;

  # od -x /mnt/btrfs/dir/80 | head -1
  0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031
  # echo 3 > /proc/sys/vm/drop_caches
  # cp $(which od) /tmp/failod
  # /tmp/failod -x /mnt/btrfs/dir/80 | head -1
  0000000 0000 0000 0000 0000 0000 0000 0000 0000

The fix is to pass the error to its caller.  Which still has a BUG_ON().
So we fix that too.

There seems to be no reason for the zeroing of the page on the error
from btrfs_decompress() but not from the allocation error a few lines
above.  So the page zeroing is removed.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:23 -07:00
Zach Brown
774bcb35f0 btrfs: return ptr error from compression workspace
The btrfs compression wrappers translated errors from workspace
allocation to either -ENOMEM or -1.  The compression type workspace
allocators are already returning a ERR_PTR(-ENOMEM).  Just return that
and get rid of the magical -1.

This helps a future patch return errors from the compression wrappers.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:22 -07:00
Zach Brown
60e1975acb btrfs: return errno instead of -1 from compression
The compression layer seems to have been built to return -1 and have
callers make up errors that make sense.  This isn't great because there
are different errors that originate down in the compression layer.

Let's return real negative errnos from the compression layer so that
callers can pass on the error without having to guess what happened.
ENOMEM for allocation failure, E2BIG when compression exceeds the
uncompressed input, and EIO for everything else.

This helps a future path return errors from btrfs_decompress().

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:21 -07:00
Stefan Behrens
98806b446d btrfs: check_int: propagate out-of-memory error upwards
This issue was not causing any harm but IMO (and in the opinion of the
static code checker) it is better to propagate this error status upwards.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:21 -07:00
Filipe Manana
61391d5622 Btrfs: fix hang on error (such as ENOSPC) when writing extent pages
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:

[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 0x00000001
[ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799]  [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)

It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.

This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:

    mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
    mount /dev/sdd /mnt

    fsstress -p 6 -d /mnt -n 100000 -x \
        "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
	    -f allocsp=0 \
	    -f bulkstat=0 \
	    -f bulkstat1=0 \
	    -f chown=0 \
	    -f creat=1 \
	    -f dread=0 \
	    -f dwrite=0 \
	    -f fallocate=1 \
	    -f fdatasync=0 \
	    -f fiemap=0 \
	    -f freesp=0 \
	    -f fsync=0 \
	    -f getattr=0 \
	    -f getdents=0 \
	    -f link=0 \
	    -f mkdir=0 \
	    -f mknod=0 \
	    -f punch=1 \
	    -f read=0 \
	    -f readlink=0 \
	    -f rename=0 \
	    -f resvsp=0 \
	    -f rmdir=0 \
	    -f setxattr=0 \
	    -f stat=0 \
	    -f symlink=0 \
	    -f sync=0 \
	    -f truncate=1 \
	    -f unlink=0 \
	    -f unresvsp=0 \
	    -f write=4

So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:20 -07:00
Linus Torvalds
3f17ea6dea Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...
2014-06-08 11:31:16 -07:00
Filipe Manana
01a9a8a9e2 Btrfs: send, fix corrupted path strings for long paths
If a path has more than 230 characters, we allocate a new buffer to
use for the path, but we were forgotting to copy the contents of the
previous buffer into the new one, which has random content from the
kmalloc call.

Test:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt

    TEST_PATH="/mnt/fdmanana/.config/google-chrome-mysetup/Default/Pepper_Data/Shockwave_Flash/WritableRoot/#SharedObjects/JSHJ4ZKN/s.wsj.net/[[IMPORT]]/players.edgesuite.net/flash/plugins/osmf/advanced-streaming-plugin/v2.7/osmf1.6/Ak#"
    mkdir -p $TEST_PATH
    echo "hello world" > $TEST_PATH/amaiAdvancedStreamingPlugin.txt

    btrfs subvolume snapshot -r /mnt /mnt/mysnap1
    btrfs send /mnt/mysnap1 -f /tmp/1.snap

A test for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Cc: Marc Merlin <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-06 12:00:46 -07:00
Mel Gorman
2457aec637 mm: non-atomically mark page accessed during page cache allocation where possible
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:10 -07:00
Linus Torvalds
776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Linus Torvalds
11da37b263 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull two btrfs fixes from Chris Mason:
 "This has two fixes that we've been testing for 3.16, but since both
  are safe and fix real bugs, it makes sense to send for 3.15 instead"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: send, fix incorrect ref access when using extrefs
  Btrfs: fix EIO on reading file after ioctl clone works on it
2014-05-22 05:40:13 +09:00
Filipe Manana
51a60253a5 Btrfs: send, fix incorrect ref access when using extrefs
When running send, if an inode only has extended reference items
associated to it and no regular references, send.c:get_first_ref()
was incorrectly assuming the reference it found was of type
BTRFS_INODE_REF_KEY due to use of the wrong key variable.
This caused weird behaviour when using the found item has a regular
reference, such as weird path string, and occasionally (when lucky)
a crash:

[  190.600652] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[  190.600994] Modules linked in: btrfs xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc psmouse serio_raw evbug pcspkr i2c_piix4 e1000 floppy
[  190.602565] CPU: 2 PID: 14520 Comm: btrfs Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[  190.602728] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  190.602868] task: ffff8800d447c920 ti: ffff8801fa79e000 task.ti: ffff8801fa79e000
[  190.603030] RIP: 0010:[<ffffffff813266b4>]  [<ffffffff813266b4>] memcpy+0x54/0x110
[  190.603262] RSP: 0018:ffff8801fa79f880  EFLAGS: 00010202
[  190.603395] RAX: ffff8800d4326e3f RBX: 000000000000036a RCX: ffff880000000000
[  190.603553] RDX: 000000000000032a RSI: ffe708844042936a RDI: ffff8800d43271a9
[  190.603710] RBP: ffff8801fa79f8c8 R08: 00000000003a4ef0 R09: 0000000000000000
[  190.603867] R10: 793a4ef09f000000 R11: 9f0000000053726f R12: ffff8800d43271a9
[  190.604020] R13: 0000160000000000 R14: ffff8802110134f0 R15: 000000000000036a
[  190.604020] FS:  00007fb423d09b80(0000) GS:ffff880216200000(0000) knlGS:0000000000000000
[  190.604020] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  190.604020] CR2: 00007fb4229d4b78 CR3: 00000001f5d76000 CR4: 00000000000006e0
[  190.604020] Stack:
[  190.604020]  ffffffffa01f4d49 ffff8801fa79f8f0 00000000000009f9 ffff8801fa79f8c8
[  190.604020]  00000000000009f9 ffff880211013260 000000000000f971 ffff88021147dba8
[  190.604020]  00000000000009f9 ffff8801fa79f918 ffffffffa02367f5 ffff8801fa79f928
[  190.604020] Call Trace:
[  190.604020]  [<ffffffffa01f4d49>] ? read_extent_buffer+0xb9/0x120 [btrfs]
[  190.604020]  [<ffffffffa02367f5>] fs_path_add_from_extent_buffer+0x45/0x60 [btrfs]
[  190.604020]  [<ffffffffa0238806>] get_first_ref+0x1f6/0x210 [btrfs]
[  190.604020]  [<ffffffffa0238994>] __get_cur_name_and_parent+0x174/0x3a0 [btrfs]
[  190.604020]  [<ffffffff8118df3d>] ? kmem_cache_alloc_trace+0x11d/0x1e0
[  190.604020]  [<ffffffffa0236674>] ? fs_path_alloc+0x24/0x60 [btrfs]
[  190.604020]  [<ffffffffa0238c91>] get_cur_path+0xd1/0x240 [btrfs]
(...)

Steps to reproduce (either crash or some weirdness like an odd path string):

    mkfs.btrfs -f -O extref /dev/sdd
    mount /dev/sdd /mnt

    mkdir /mnt/testdir
    touch /mnt/testdir/foobar

    for i in `seq 1 2550`; do
        ln /mnt/testdir/foobar /mnt/testdir/foobar_link_`printf "%04d" $i`
    done

    ln /mnt/testdir/foobar /mnt/testdir/final_foobar_name

    rm -f /mnt/testdir/foobar
    for i in `seq 1 2550`; do
        rm -f /mnt/testdir/foobar_link_`printf "%04d" $i`
    done

    btrfs subvolume snapshot -r /mnt /mnt/mysnap
    btrfs send /mnt/mysnap -f /tmp/mysnap.send

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2014-05-20 10:18:26 -07:00
Liu Bo
d3ecfcdf91 Btrfs: fix EIO on reading file after ioctl clone works on it
For inline data extent, we need to make its length aligned, otherwise,
we can get a phantom extent map which confuses readpages() to return -EIO.

This can be detected by xfstests/btrfs/035.

Reported-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-20 10:17:48 -07:00
Al Viro
b30ac0fc41 btrfs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:41 -04:00
Al Viro
aad4f8bb42 switch simple generic_file_aio_read() users to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:55 -04:00
Al Viro
0c949334a9 iov_iter_truncate()
Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.

Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:54 -04:00
Al Viro
28060d5d9b btrfs: switch check_direct_IO() to iov_iter
... and don't open-code iov_iter_alignment() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:53 -04:00
Al Viro
71d8e532b1 start adding the tag to iov_iter
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment.  Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:49 -04:00
Al Viro
31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro
a6cbcd4a4a get rid of pointless iov_length() in ->direct_IO()
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro
d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Al Viro
cb66a7a1f1 kill generic_segment_checks()
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:43 -04:00
Al Viro
0ae5e4d370 __btrfs_direct_write(): switch to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:43 -04:00
Al Viro
f8579f8673 generic_file_direct_write(): switch to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:42 -04:00
Linus Torvalds
33c0022f0e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: limit the path size in send to PATH_MAX
  Btrfs: correctly set profile flags on seqlock retry
  Btrfs: use correct key when repeating search for extent item
  Btrfs: fix inode caching vs tree log
  Btrfs: fix possible memory leaks in open_ctree()
  Btrfs: avoid triggering bug_on() when we fail to start inode caching task
  Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
  btrfs: replace error code from btrfs_drop_extents
  btrfs: Change the hole range to a more accurate value.
  btrfs: fix use-after-free in mount_subvol()
2014-04-27 13:26:28 -07:00
Chris Mason
cfd4a535b6 Btrfs: limit the path size in send to PATH_MAX
fs_path_ensure_buf is used to make sure our path buffers for
send are big enough for the path names as we construct them.
The buffer size is limited to 32K by the length field in
the struct.

But bugs in the path construction can end up trying to build
a huge buffer, and we'll do invalid memmmoves when the
buffer length field wraps.

This patch is step one, preventing the overflows.

Signed-off-by: Chris Mason <clm@fb.com>
2014-04-26 05:02:03 -07:00
Filipe Manana
f8213bdc89 Btrfs: correctly set profile flags on seqlock retry
If we had to retry on the profiles seqlock (due to a concurrent write), we
would set bits on the input flags that corresponded both to the current
profile and to previous values of the profile.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:33 -07:00
Filipe Manana
9ce49a0b4f Btrfs: use correct key when repeating search for extent item
If skinny metadata is enabled and our first tree search fails to find a
skinny extent item, we may repeat a tree search for a "fat" extent item
(if the previous item in the leaf is not the "fat" extent we're looking
for). However we were not setting the new key's objectid to the right
value, as we previously used the same key variable to peek at the previous
item in the leaf, which has a different objectid. So just set the right
objectid to avoid modifying/deleting a wrong item if we repeat the tree
search.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:33 -07:00
Miao Xie
1c70d8fb4d Btrfs: fix inode caching vs tree log
Currently, with inode cache enabled, we will reuse its inode id immediately
after unlinking file, we may hit something like following:

|->iput inode
|->return inode id into inode cache
|->create dir,fsync
|->power off

An easy way to reproduce this problem is:

mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt -o inode_cache,commit=100
dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync
inode_id=`ls -i /mnt/data | awk '{print $1}'`
rm -f /mnt/data

i=1
while [ 1 ]
do
        mkdir /mnt/dir_$i
        test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'`
        if [ $test1 -eq $inode_id ]
        then
		dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync
		echo b > /proc/sysrq-trigger
	fi
	sleep 1
        i=$(($i+1))
done

mount /dev/sdb /mnt
umount /dev/sdb
btrfs check /dev/sdb

We fix this problem by adding unlinked inode's id into pinned tree,
and we can not reuse them until committing transaction.

Cc: stable@vger.kernel.org
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:33 -07:00
Wang Shilong
28c16cbbc3 Btrfs: fix possible memory leaks in open_ctree()
Fix possible memory leaks in the following error handling paths:

read_tree_block()
btrfs_recover_log_trees
btrfs_commit_super()
btrfs_find_orphan_roots()
btrfs_cleanup_fs_roots()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Wang Shilong
e60efa8425 Btrfs: avoid triggering bug_on() when we fail to start inode caching task
When running stress test(including snapshots,balance,fstress), we trigger
the following BUG_ON() which is because we fail to start inode caching task.

[  181.131945] kernel BUG at fs/btrfs/inode-map.c:179!
[  181.137963] invalid opcode: 0000 [#1] SMP
[  181.217096] CPU: 11 PID: 2532 Comm: btrfs Not tainted 3.14.0 #1
[  181.240521] task: ffff88013b621b30 ti: ffff8800b6ada000 task.ti: ffff8800b6ada000
[  181.367506] Call Trace:
[  181.371107]  [<ffffffffa036c1be>] btrfs_return_ino+0x9e/0x110 [btrfs]
[  181.379191]  [<ffffffffa038082b>] btrfs_evict_inode+0x46b/0x4c0 [btrfs]
[  181.387464]  [<ffffffff810b5a70>] ? autoremove_wake_function+0x40/0x40
[  181.395642]  [<ffffffff811dc5fe>] evict+0x9e/0x190
[  181.401882]  [<ffffffff811dcde3>] iput+0xf3/0x180
[  181.408025]  [<ffffffffa03812de>] btrfs_orphan_cleanup+0x1ee/0x430 [btrfs]
[  181.416614]  [<ffffffffa03a6abd>] btrfs_mksubvol.isra.29+0x3bd/0x450 [btrfs]
[  181.425399]  [<ffffffffa03a6cd6>] btrfs_ioctl_snap_create_transid+0x186/0x190 [btrfs]
[  181.435059]  [<ffffffffa03a6e3b>] btrfs_ioctl_snap_create_v2+0xeb/0x130 [btrfs]
[  181.444148]  [<ffffffffa03a9656>] btrfs_ioctl+0xf76/0x2b90 [btrfs]
[  181.451971]  [<ffffffff8117e565>] ? handle_mm_fault+0x475/0xe80
[  181.459509]  [<ffffffff8167ba0c>] ? __do_page_fault+0x1ec/0x520
[  181.467046]  [<ffffffff81185b35>] ? do_mmap_pgoff+0x2f5/0x3c0
[  181.474393]  [<ffffffff811d4da8>] do_vfs_ioctl+0x2d8/0x4b0
[  181.481450]  [<ffffffff811d5001>] SyS_ioctl+0x81/0xa0
[  181.488021]  [<ffffffff81680b69>] system_call_fastpath+0x16/0x1b

We should avoid triggering BUG_ON() here, instead, we output warning messages
and clear inode_cache option.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Wang Shilong
9d89ce6587 Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
David Sterba
3f9e3df8da btrfs: replace error code from btrfs_drop_extents
There's a case which clone does not handle and used to BUG_ON instead,
(testcase xfstests/btrfs/035), now returns EINVAL. This error code is
confusing to the ioctl caller, as it normally signifies errorneous
arguments.

Change it to ENOPNOTSUPP which allows a fall back to copy instead of
clone. This does not affect the common reflink operation.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Qu Wenruo
c5f7d0bb29 btrfs: Change the hole range to a more accurate value.
Commit 3ac0d7b96a fixed the btrfs expanding
write problem but the hole punched is sometimes too large for some
iovec, which has unmapped data ranges.
This patch will change to hole range to a more accurate value using the
counts checked by the write check routines.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Peter Zijlstra
4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Christoph Jaeger
0040e606e3 btrfs: fix use-after-free in mount_subvol()
Pointer 'newargs' is used after the memory that it points to has already
been freed.

Picked up by Coverity - CID 1201425.

Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options")
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-14 11:31:08 -07:00
Linus Torvalds
5166701b36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 14:49:50 -07:00
Linus Torvalds
3123bca719 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull second set of btrfs updates from Chris Mason:
 "The most important changes here are from Josef, fixing a btrfs
  regression in 3.14 that can cause corruptions in the extent allocation
  tree when snapshots are in use.

  Josef also fixed some deadlocks in send/recv and other assorted races
  when balance is running"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
  Btrfs: fix compile warnings on on avr32 platform
  btrfs: allow mounting btrfs subvolumes with different ro/rw options
  btrfs: export global block reserve size as space_info
  btrfs: fix crash in remount(thread_pool=) case
  Btrfs: abort the transaction when we don't find our extent ref
  Btrfs: fix EINVAL checks in btrfs_clone
  Btrfs: fix unlock in __start_delalloc_inodes()
  Btrfs: scrub raid56 stripes in the right way
  Btrfs: don't compress for a small write
  Btrfs: more efficient io tree navigation on wait_extent_bit
  Btrfs: send, build path string only once in send_hole
  btrfs: filter invalid arg for btrfs resize
  Btrfs: send, fix data corruption due to incorrect hole detection
  Btrfs: kmalloc() doesn't return an ERR_PTR
  Btrfs: fix snapshot vs nocow writting
  btrfs: Change the expanding write sequence to fix snapshot related bug.
  btrfs: make device scan less noisy
  btrfs: fix lockdep warning with reclaim lock inversion
  Btrfs: hold the commit_root_sem when getting the commit root during send
  Btrfs: remove transaction from send
  ...
2014-04-11 14:16:53 -07:00
Wang Shilong
e4fbaee292 Btrfs: fix compile warnings on on avr32 platform
fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type

Since @rot is an int type, we should not use do_div(), fix it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-11 06:35:50 -07:00
Harald Hoyer
0723a0473f btrfs: allow mounting btrfs subvolumes with different ro/rw options
Given the following /etc/fstab entries:

/dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0
/dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0

you can't issue:

$ mount /mnt/foo
$ mount /mnt/bar

You would have to do:

$ mount /mnt/foo
$ mount -o remount,rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
$ mount /mnt/bar

or

$ mount /mnt/bar
$ mount --rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo

With this patch you can do

$ mount /mnt/foo
$ mount /mnt/bar

$ cat /proc/self/mountinfo
49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache
87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache

Signed-off-by: Chris Mason <clm@fb.com>
2014-04-10 13:32:50 -07:00
Kirill A. Shutemov
f1820361f8 mm: implement ->map_pages for page cache
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
David Sterba
36523e9512 btrfs: export global block reserve size as space_info
Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e806) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.

The unpatched userspace tools will show the blockgroup as 'unknown'.

CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 10:41:53 -07:00
Sergei Trofimovich
800ee2247f btrfs: fix crash in remount(thread_pool=) case
Reproducer:
    mount /dev/ubda /mnt
    mount -oremount,thread_pool=42 /mnt

Gives a crash:
    ? btrfs_workqueue_set_max+0x0/0x70
    btrfs_resize_thread_pool+0xe3/0xf0
    ? sync_filesystem+0x0/0xc0
    ? btrfs_resize_thread_pool+0x0/0xf0
    btrfs_remount+0x1d2/0x570
    ? kern_path+0x0/0x80
    do_remount_sb+0xd9/0x1c0
    do_mount+0x26a/0xbf0
    ? kfree+0x0/0x1b0
    SyS_mount+0xc4/0x110

It's a call
    btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
with
    fs_info->scrub_wr_completion_workers = NULL;

as scrub wqs get created only on user's demand.

Patch skips not-created-yet workqueues.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 10:41:52 -07:00
Josef Bacik
c4a050bbbb Btrfs: abort the transaction when we don't find our extent ref
I'm not sure why we weren't aborting here in the first place, it is obviously a
bad time from the fact that we print the leaf and yell loudly about it.  Fix
this up, otherwise we panic because our path could be pointing into oblivion.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:51 -07:00
Chris Mason
3a29bc0928 Btrfs: fix EINVAL checks in btrfs_clone
btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it.  This adds it to the
caller for inline extents, which is where we really need it.

Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:50 -07:00
Wang Shilong
a1ecaabbf9 Btrfs: fix unlock in __start_delalloc_inodes()
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock

break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:50 -07:00
Wang Shilong
3b080b2564 Btrfs: scrub raid56 stripes in the right way
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
 # mount /dev/sda8 /mnt
 # btrfs scrub start -BR /mnt
 # echo $? <--unverified errors make return value be 3

This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().

This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:49 -07:00
Wang Shilong
68bb462d42 Btrfs: don't compress for a small write
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.

This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.

An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:48 -07:00
Filipe Manana
c50d3e71c3 Btrfs: more efficient io tree navigation on wait_extent_bit
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:47 -07:00
Filipe Manana
c715e155c9 Btrfs: send, build path string only once in send_hole
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:46 -07:00
Gui Hecheng
9a40f1222a btrfs: filter invalid arg for btrfs resize
Originally following cmds will work:
	# btrfs fi resize -10A  <mnt>
	# btrfs fi resize -10Gaha <mnt>
Filter the arg by checking the return pointer of memparse.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:45 -07:00
Filipe Manana
766b5e5ae7 Btrfs: send, fix data corruption due to incorrect hole detection
During an incremental send, when we finish processing an inode (corresponding to
a regular file) we would assume the gap between the end of the last processed file
extent and the file's size corresponded to a file hole, and therefore incorrectly
send a bunch of zero bytes to overwrite that region in the file.

This affects only kernel 3.14.

Reproducer:

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt

    xfs_io -f -c "falloc -k 0 268435456" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap0

    xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo
    xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo
    xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo
    xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo
    xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo
    xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo
    xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo
    xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo
    xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo
    xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo
    xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo
    xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo
    xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo
    xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo
    xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    md5sum /mnt/foo          # returns 79e53f1466bfc09fd82b450689e6119e
    md5sum /mnt/mysnap2/foo  # returns 79e53f1466bfc09fd82b450689e6119e too

    btrfs send /mnt/mysnap1 -f /tmp/1.snap
    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt

    btrfs receive /mnt -f /tmp/1.snap
    btrfs receive /mnt -f /tmp/2.snap

    md5sum /mnt/mysnap2/foo  # returns 2bb414c5155767cedccd7063e51beabd !!

A testcase for xfstests follows soon too.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:45 -07:00
Dan Carpenter
84dbeb87d1 Btrfs: kmalloc() doesn't return an ERR_PTR
The error handling was copy and pasted from memdup_user().  It should be
checking for NULL obviously.

Fixes: abccd00f8a ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:44 -07:00
Wang Shilong
e9894fd3e3 Btrfs: fix snapshot vs nocow writting
While running fsstress and snapshots concurrently, we will hit something
like followings:

Thread 1			Thread 2

|->fallocate
  |->write pages
    |->join transaction
       |->add ordered extent
    |->end transaction
				|->flushing data
				  |->creating pending snapshots
|->write data into src root's
   fallocated space

After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.

Fix this problem by syncing snapshots with nocow writting:

 1.for nocow writting,if there are pending snapshots, we will
 fall into COW way.

 2.if there are pending nocow writes, snapshots for this root
 will be blocked until nocow writting finish.

Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:43 -07:00
Qu Wenruo
3ac0d7b96a btrfs: Change the expanding write sequence to fix snapshot related bug.
When testing fsstress with snapshot making background, some snapshot
following problem.

Snapshot 270:
inode 323: size 0

Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0	    122880			172032	      349145

Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0	    122880			172032	      349145

The fsstress operation on inode 323 is the following:
write: 		offset 	126832 	len 43124
truncate: 	size 	349145

Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.

To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.

Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:42 -07:00
David Sterba
60999ca4b4 btrfs: make device scan less noisy
Print the message only when the device is seen for the first time.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:41 -07:00
Jeff Mahoney
ed55b6ac07 btrfs: fix lockdep warning with reclaim lock inversion
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
---------------------------------------------------------
kswapd0/169 just changed the state of lock:
 (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (&found->groups_sem){+++++.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&found->groups_sem);
                               local_irq_disable();
                               lock(&delayed_node->mutex);
                               lock(&found->groups_sem);
  <Interrupt>
    lock(&delayed_node->mutex);

 *** DEADLOCK ***
2 locks held by kswapd0/169:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
 #1:  (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:40 -07:00
Josef Bacik
3f8a18cc53 Btrfs: hold the commit_root_sem when getting the commit root during send
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root.  We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:39 -07:00
Josef Bacik
9e351cc862 Btrfs: remove transaction from send
Lets try this again.  We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send.  So fix this by making sure looking at the
commit roots is always going to be consistent.  We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once.  Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes.  With this patch we no longer
deadlock and pass all the weird send/receive corner cases.  Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:39:30 -07:00
Josef Bacik
a26e8c9f75 Btrfs: don't clear uptodate if the eb is under IO
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel.  This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed.  Turns out this is because of a
bad interaction of balance and send/receive.  Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening.  So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid.  But because we think it is invalid we clear uptodate and
re-read in the block.  If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.

Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data.  So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:37 -07:00
Josef Bacik
573a075567 Btrfs: check for an extent_op on the locked ref
We could have possibly added an extent_op to the locked_ref while we dropped
locked_ref->lock, so check for this case as well and loop around.  Otherwise we
could lose flag updates which would lead to extent tree corruption.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:36 -07:00
Josef Bacik
ba8b028933 Btrfs: do not reset last_snapshot after relocation
This was done to allow NO_COW to continue to be NO_COW after relocation but it
is not right.  When relocating we will convert blocks to FULL_BACKREF that we
relocate.  We can leave some of these full backref blocks behind if they are not
cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
then just drop the reloc tree.  Then when we go to cow the block again we won't
lookup the extent flags because we won't think there has been a snapshot
recently which means we will do our normal ref drop thing instead of adding back
a tree ref and dropping the shared ref.  This will cause btrfs_free_extent to
blow up because it can't find the ref we are trying to free.  This was found
with my ref verifying tool.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:35 -07:00
Linus Torvalds
24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
 Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
 ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
 kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
 bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
 rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
 ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
 +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
 DrPKlXwIgva7zq5/S0Vr
 =4s1Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: doucple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystm() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Linus Torvalds
53c566625f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs changes from Chris Mason:
 "This is a pretty long stream of bug fixes and performance fixes.

  Qu Wenruo has replaced the btrfs async threads with regular kernel
  workqueues.  We'll keep an eye out for performance differences, but
  it's nice to be using more generic code for this.

  We still have some corruption fixes and other patches coming in for
  the merge window, but this batch is tested and ready to go"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
  Btrfs: fix a crash of clone with inline extents's split
  btrfs: fix uninit variable warning
  Btrfs: take into account total references when doing backref lookup
  Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
  Btrfs: fix incremental send's decision to delay a dir move/rename
  Btrfs: remove unnecessary inode generation lookup in send
  Btrfs: fix race when updating existing ref head
  btrfs: Add trace for btrfs_workqueue alloc/destroy
  Btrfs: less fs tree lock contention when using autodefrag
  Btrfs: return EPERM when deleting a default subvolume
  Btrfs: add missing kfree in btrfs_destroy_workqueue
  Btrfs: cache extent states in defrag code path
  Btrfs: fix deadlock with nested trans handles
  Btrfs: fix possible empty list access when flushing the delalloc inodes
  Btrfs: split the global ordered extents mutex
  Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
  Btrfs: reclaim delalloc metadata more aggressively
  Btrfs: remove unnecessary lock in may_commit_transaction()
  Btrfs: remove the unnecessary flush when preparing the pages
  Btrfs: just do dirty page flush for the inode with compression before direct IO
  ...
2014-04-04 15:31:36 -07:00
Johannes Weiner
91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Johannes Weiner
0cd6144aad mm + fs: prepare for non-page entries in page cache radix trees
shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.

The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before.  But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Dan Carpenter
45d4f85504 fs/direct-io.c: remove some left over checks
We know that "ret > 0" is true here.  These tests were left over from
commit 02afc27fae ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
Al Viro
5cb6c6c7eb generic_file_direct_write(): get rid of ppos argument
always equal to &iocb->ki_pos.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:35 -04:00
Al Viro
867c4f9329 btrfs_file_aio_write(): get rid of ppos
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:35 -04:00
Al Viro
9e8c2af96e callers of iov_copy_from_user_atomic() don't need pagecache_disable()
... it does that itself (via kmap_atomic())

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:20 -04:00
Liu Bo
00fdf13a2e Btrfs: fix a crash of clone with inline extents's split
xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
of inline extents in __btrfs_drop_extents().

For inline extents, we cannot duplicate another EXTENT_DATA item, because
it breaks the rule of inline extents, that is, 'start offset' needs to be 0.

We have set limitations for the source inode's compressed inline extents,
because it needs to decompress and recompress.  Now the destination inode's
inline extents also need similar limitations.

With this, xfstests btrfs/035 doesn't run into panic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21 17:35:18 -07:00
Chris Mason
73b802f447 btrfs: fix uninit variable warning
fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this
function

Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21 15:30:44 -07:00
Josef Bacik
4485386853 Btrfs: take into account total references when doing backref lookup
I added an optimization for large files where we would stop searching for
backrefs once we had looked at the number of references we currently had for
this extent.  This works great most of the time, but for snapshots that point to
this extent and has changes in the original root this assumption falls on it
face.  So keep track of any delayed ref mods made and add in the actual ref
count as reported by the extent item and use that to limit how far down an inode
we'll search for extents.  Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Hugo Mills <hugo@carfax.org.uk>
Tested-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21 15:28:09 -07:00
Filipe Manana
bfa7e1f8be Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
For an incremental send, fix the process of determining whether the directory
inode we're currently processing needs to have its move/rename operation delayed.

We were ignoring the fact that if the inode's new immediate ancestor has a higher
inode number than ours but wasn't renamed/moved, we might still need to delay our
move/rename, because some other ancestor directory higher in the hierarchy might
have an inode number higher than ours *and* was renamed/moved too - in this case
we have to wait for rename/move of that ancestor to happen before our current
directory's rename/move operation.

Simple steps to reproduce this issue:

      $ mkfs.btrfs -f /dev/sdd
      $ mount /dev/sdd /mnt

      $ mkdir -p /mnt/a/x1/x2
      $ mkdir /mnt/a/Z
      $ mkdir -p /mnt/a/x1/x2/x3/x4/x5

      $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      $ btrfs send /mnt/snap1 -f /tmp/base.send

      $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
      $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22

      $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory Z after its references are processed.

A more complex scenario:

      $ mkfs.btrfs -f /dev/sdd
      $ mount /dev/sdd /mnt

      $ mkdir -p /mnt/a/b/c/d
      $ mkdir /mnt/a/b/c/d/e
      $ mkdir /mnt/a/b/c/d/f
      $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
      $ mkdir /mmt/a/b/c/g
      $ mv /mnt/a/b/c/d /mnt/a/b/D2

      $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      $ btrfs send /mnt/snap1 -f /tmp/base.send

      $ mkdir /mnt/a/o
      $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
      $ mv /mnt/a/b/D2 /mnt/a/b/dd
      $ mv /mnt/a/b/c /mnt/a/C2
      $ mv /mnt/a/b/dd/f /mnt/a/o/FF
      $ mv /mnt/a/b /mnt/a/o/FF/E2/BB

      $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21 15:25:48 -07:00
Filipe Manana
7b119a8b89 Btrfs: fix incremental send's decision to delay a dir move/rename
It's possible to change the parent/child relationship between directories
in such a way that if a child directory has a higher inode number than
its parent, it doesn't necessarily means the child rename/move operation
can be performed immediately. The parent migth have its own rename/move
operation delayed, therefore in this case the child needs to have its
rename/move operation delayed too, and be performed after its new parent's
rename/move.

Steps to reproduce the issue:

      $ umount /mnt
      $ mkfs.btrfs -f /dev/sdd
      $ mount /dev/sdd /mnt

      $ mkdir /mnt/A
      $ mkdir /mnt/B
      $ mkdir /mnt/C
      $ mv /mnt/C /mnt/A
      $ mv /mnt/B /mnt/A/C
      $ mkdir /mnt/A/C/D

      $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      $ btrfs send /mnt/snap1 -f /tmp/base.send

      $ mv /mnt/A/C/D /mnt/A/D2
      $ mv /mnt/A/C/B /mnt/A/D2/B2
      $ mv /mnt/A/C /mnt/A/D2/B2/C2

      $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory C after its references are processed.

The necessary conditions here are that C has an inode number higher than both
A and B, and B as an higher inode number higher than A, and D has the highest
inode number, that is:
    inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)

The same issue could happen if after the first snapshot there's any number
of intermediary parent directories between A2 and B2, and between B2 and C2.

A test case for xfstests follows, covering this simple case and more advanced
ones, with files and hard links created inside the directories.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21 15:24:27 -07:00
Filipe Manana
425b5dafc8 Btrfs: remove unnecessary inode generation lookup in send
No need to search in the send tree for the generation number of the inode,
we already have it in the recorded_ref structure passed to us.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:28 -07:00
Filipe Manana
21543baddc Btrfs: fix race when updating existing ref head
While we update an existing ref head's extent_op, we're not holding
its spinlock, so while we're updating its extent_op contents (key,
flags) we can have a task running __btrfs_run_delayed_refs() that
holds the ref head's lock and sets its extent_op to NULL right after
the task updating the ref head just checked its extent_op was not NULL.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:28 -07:00
Qu Wenruo
c3a468915a btrfs: Add trace for btrfs_workqueue alloc/destroy
Since most of the btrfs_workqueue is printed as pointer address,
for easier analysis, add trace for btrfs_workqueue alloc/destroy.
So it is possible to determine the workqueue that a given work belongs
to(by comparing the wq pointer address with alloc trace event).

Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:28 -07:00
Filipe Manana
f094c9bd3e Btrfs: less fs tree lock contention when using autodefrag
When finding new extents during an autodefrag, don't do so many fs tree
lookups to find an extent with a size smaller then the target treshold.
Instead, after each fs tree forward search immediately unlock upper
levels and process the entire leaf while holding a read lock on the leaf,
since our leaf processing is very fast.
This reduces lock contention, allowing for higher concurrency when other
tasks want to write/update items related to other inodes in the fs tree,
as we're not holding read locks on upper tree levels while processing the
leaf and we do less tree searches.

Test:

    sysbench --test=fileio --file-num=512 --file-total-size=16G \
       --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
       --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
       --max-requests=10000000000 [prepare|run]

(fileystem mounted with -o autodefrag, averages of 5 runs)

Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb
After this change:  63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb

Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Guangyu Sun
72de6b5393 Btrfs: return EPERM when deleting a default subvolume
The error message is confusing:

 # btrfs sub delete /mnt/mysub/
 Delete subvolume '/mnt/mysub'
 ERROR: cannot delete '/mnt/mysub' - Directory not empty

The error message does not make sense to me: It's not about deleting a
directory but it's a subvolume, and it doesn't matter if the subvolume is
empty or not.

Maybe EPERM or is more appropriate in this case, combined with an explanatory
kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
it is configured as default subvolume.")

Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Filipe Manana
ef66af101a Btrfs: add missing kfree in btrfs_destroy_workqueue
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Filipe Manana
308d9800b2 Btrfs: cache extent states in defrag code path
When locking file ranges in the inode's io_tree, cache the first
extent state that belongs to the target range, so that when unlocking
the range we don't need to search in the io_tree again, reducing cpu
time and making and therefore holding the io_tree's lock for a shorter
period.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Josef Bacik
3bbb24b20a Btrfs: fix deadlock with nested trans handles
Zach found this deadlock that would happen like this

btrfs_end_transaction <- reduce trans->use_count to 0
  btrfs_run_delayed_refs
    btrfs_cow_block
      find_free_extent
	btrfs_start_transaction <- increase trans->use_count to 1
          allocate chunk
	btrfs_end_transaction <- decrease trans->use_count to 0
	  btrfs_run_delayed_refs
	    lock tree block we are cowing above ^^

We need to only decrease trans->use_count if it is above 1, otherwise leave it
alone.  This will make nested trans be the only ones who decrease their added
ref, and will let us get rid of the trans->use_count++ hack if we have to commit
the transaction.  Thanks,

cc: stable@vger.kernel.org
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Theodore Ts'o
02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Miao Xie
573bfb72f7 Btrfs: fix possible empty list access when flushing the delalloc inodes
We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:29 -04:00
Miao Xie
31f3d255c6 Btrfs: split the global ordered extents mutex
When we create a snapshot, we just need wait the ordered extents in
the source fs/file root, but because we use the global mutex to protect
this ordered extents list of the source fs/file root to avoid accessing
a empty list, if someone got the mutex to access the ordered extents list
of the other fs/file root, we had to wait.

This patch splits the above global mutex, now every fs/file root has
its own mutex to protect its own list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:28 -04:00
Miao Xie
6c255e67ce Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
We needn't flush all delalloc inodes when we doesn't get s_umount lock,
or we would make the tasks wait for a long time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:27 -04:00
Miao Xie
24af7dd188 Btrfs: reclaim delalloc metadata more aggressively
generic/074 in xfstests failed sometimes because of the enospc error,
the reason of this problem is that we just reclaimed the space we need
from the reserved space for delalloc, and then tried to reserve the space,
but if some task did no-flush reservation between the above reclamation
and reservation,
	Task1			Task2
	shrink_delalloc()
	reclaim 1 block
	(The space that can
	 be reserved now is 1
	 block)
				do no-flush reservation
				reserve 1 block
				(The space that can
				 be reserved now is 0
				 block)
	reserving 1 block failed
the reservation of Task1 failed, but in fact, there was enough space to
reserve if we could reclaim more space before.

Fix this problem by the aggressive reclamation of the reserved delalloc
metadata space.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:26 -04:00
Miao Xie
0424c54897 Btrfs: remove unnecessary lock in may_commit_transaction()
The reason is:
- The per-cpu counter has its own lock to protect itself.
- Here we needn't get a exact value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:25 -04:00
Miao Xie
b88935bf98 Btrfs: remove the unnecessary flush when preparing the pages
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:25 -04:00
Miao Xie
41bd9ca459 Btrfs: just do dirty page flush for the inode with compression before direct IO
As the comment in the btrfs_direct_IO says, only the compressed pages need be
flush again to make sure they are on the disk, but the common pages needn't,
so we add a if statement to check if the inode has compressed pages or not,
if no, skip the flush.

And in order to prevent the write ranges from intersecting, we need wait for
the running ordered extents. But the current code waits for them twice, one
is done before the direct IO starts (in btrfs_wait_ordered_range()), the other
is before we get the blocks, it is unnecessary. because we can do the direct
IO without holding i_mutex, it means that the intersected ordered extents may
happen during the direct IO, the first wait can not avoid this problem. So we
use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove
the first wait.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:24 -04:00
Miao Xie
af7a65097b Btrfs: wake up the tasks that wait for the io earlier
The tasks that wait for the IO_DONE flag just care about the io of the dirty
pages, so it is better to wake up them immediately after all the pages are
written, not the whole process of the io completes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:23 -04:00
Miao Xie
8b9d83cd6b Btrfs: fix early enospc due to the race of the two ordered extent wait
btrfs_wait_ordered_roots() moves all the list entries to a new list,
and then deals with them one by one. But if the other task invokes this
function at that time, it would get a empty list. It makes the enospc
error happens more early. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:22 -04:00
Miao Xie
8257b2dc3c Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume
If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.

So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().

These two functions are only used for nocow file write operations.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:22 -04:00
Qu Wenruo
52483bc26f btrfs: Add ftrace for btrfs_workqueue
Add ftrace for btrfs_workqueue for further workqueue tunning.
This patch needs to applied after the workqueue replace patchset.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:21 -04:00
Qu Wenruo
6db8914f97 btrfs: Cleanup the btrfs_workqueue related function type
The new btrfs_workqueue still use open-coded function defition,
this patch will change them into btrfs_func_t type which is much the
same as kernel workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:20 -04:00
Liu Bo
2131bcd38b Btrfs: add readahead for send_write
Btrfs send reads data from disk and then writes to a stream via pipe or
a file via flush.

Currently we're going to read each page a time, so every page results
in a disk read, which is not friendly to disks, esp. HDD.  Given that,
the performance can be gained by adding readahead for those pages.

Here is a quick test:
$ btrfs subvolume create send
$ xfs_io -f -c "pwrite 0 1G" send/foobar
$ btrfs subvolume snap -r send ro
$ time "btrfs send ro -f /dev/null"

           w/o             w
real    1m37.527s       0m9.097s
user    0m0.122s        0m0.086s
sys     0m53.191s       0m12.857s

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:19 -04:00
Liu Bo
a4d96d6254 Btrfs: share the same code for __record_{new,deleted}_ref
This has no functional change, only picks out the same part of two functions,
and makes it shared.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:19 -04:00
Filipe Manana
fcbd2154d1 Btrfs: avoid unnecessary utimes update in incremental send
When we're finishing processing of an inode, if we're dealing with a
directory inode that has a pending move/rename operation, we don't
need to send a utimes update instruction to the send stream, as we'll
do it later after doing the move/rename operation. Therefore we save
some time here building paths and doing btree lookups.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:18 -04:00
Filipe Manana
e2127cf008 Btrfs: make defrag not fragment files when using prealloc extents
When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.

Example:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0 1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync

Before defragmenting the file:

$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

$ btrfs-debug-tree /dev/sdb3
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 0 nr 4096
	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 4096 nr 102400 ram 1048576
		extent compression 0
	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 106496 nr 90112
	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 196608 nr 106496 ram 1048576
		extent compression 0
	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 303104 nr 593920
	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 897024 nr 106496 ram 1048576
		extent compression 0
	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 1003520 nr 45056
(...)

Now defragmenting the file results in more data space used than before:

$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:

$ btrfs-debug-tree /dev/sdb3
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 0 nr 4096 ram 1048576
		extent compression 0
	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
		extent data disk byte 13893632 nr 102400
		extent data offset 0 nr 102400 ram 102400
		extent compression 0
	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 106496 nr 90112 ram 1048576
		extent compression 0
	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
		extent data disk byte 13996032 nr 106496
		extent data offset 0 nr 106496 ram 106496
		extent compression 0
	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 303104 nr 593920
	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
		extent data disk byte 14102528 nr 106496
		extent data offset 0 nr 106496 ram 106496
		extent compression 0
	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 1003520 nr 45056 ram 1048576
		extent compression 0
(...)

With this change, the above example will no longer cause allocation of new data
space nor change the sequentiality of the file extents, that is, defragment will
be effectless, leaving all extent items pointing to the extent starting at disk
byte 12845056.

In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
400Mb each, initially consisting of a single prealloc extent of 400Mb, having
random writes happening at a low rate, lead to a total of over ~17Gb of data
space used, not far from eventually reaching an ENOSPC state.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:17 -04:00
Filipe Manana
dec8ef9055 Btrfs: correctly flush data on defrag when compression is enabled
When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
enabled, we weren't flushing completely, as writing compressed extents
is a 2 steps process, one to compress the data and another one to write
the compressed data to disk.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:16 -04:00
Qu Wenruo
d458b0540e btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:16 -04:00