linux/fs/btrfs
Filipe Manana 27b9a8122f Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:40 -07:00
..
tests Btrfs: fix qgroups sanity test crash or hang 2014-06-13 09:52:24 -07:00
acl.c btrfs: remove useless ACL check 2014-06-09 17:20:42 -07:00
async-thread.c btrfs: fix crash in remount(thread_pool=) case 2014-04-07 10:41:52 -07:00
async-thread.h btrfs: Add trace for btrfs_workqueue alloc/destroy 2014-03-20 17:15:28 -07:00
backref.c Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch 2014-08-15 07:43:19 -07:00
backref.h Btrfs: fix scrub_print_warning to handle skinny metadata extents 2014-06-09 17:21:17 -07:00
btrfs_inode.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-06-11 09:22:21 -07:00
check-integrity.c btrfs: check_int: propagate out-of-memory error upwards 2014-06-09 17:20:21 -07:00
check-integrity.h block: submit_bio_wait() conversions 2013-11-24 16:33:41 -07:00
compression.c btrfs compression: reuse recently used workspace 2014-06-28 13:48:46 -07:00
compression.h btrfs: make static code static & remove dead code 2013-05-06 15:55:23 -04:00
ctree.c Btrfs: __btrfs_mod_ref should always use no_quota 2014-08-15 07:43:11 -07:00
ctree.h Btrfs: __btrfs_mod_ref should always use no_quota 2014-08-15 07:43:11 -07:00
delayed-inode.c btrfs: free delayed node outside of root->inode_lock 2014-06-09 17:21:08 -07:00
delayed-inode.h Btrfs: introduce the delayed inode ref deletion for the single link inode 2014-01-28 13:20:09 -08:00
delayed-ref.c Btrfs: rework qgroup accounting 2014-06-09 17:20:48 -07:00
delayed-ref.h Btrfs: rework qgroup accounting 2014-06-09 17:20:48 -07:00
dev-replace.c btrfs: dev replace should replace the sysfs entry 2014-06-28 13:48:44 -07:00
dev-replace.h Btrfs: add new sources for device replace code 2012-12-12 17:15:41 -05:00
dir-item.c Btrfs: convert printk to btrfs_ and fix BTRFS prefix 2014-01-28 13:20:05 -08:00
disk-io.c Btrfs: fix race between balance recovery and root deletion 2014-07-03 07:04:04 -07:00
disk-io.h Btrfs: add sanity tests for new qgroup accounting code 2014-06-09 17:20:49 -07:00
export.c btrfs: remove fs/btrfs/compat.h 2013-11-11 22:03:19 -05:00
export.h
extent_io.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-06-14 19:48:43 -05:00
extent_io.h Btrfs: remove unused wait queue in struct extent_buffer 2014-06-19 14:20:28 -07:00
extent_map.c Btrfs: fix NULL pointer crash when running balance and scrub concurrently 2014-06-19 14:20:55 -07:00
extent_map.h Btrfs: fix NULL pointer crash when running balance and scrub concurrently 2014-06-19 14:20:55 -07:00
extent-tree.c btrfs: qgroup: account shared subtrees during snapshot delete 2014-08-15 07:43:14 -07:00
file-item.c Btrfs: fix csum tree corruption, duplicate and outdated checksums 2014-08-15 07:43:40 -07:00
file.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
free-space-cache.c Btrfs: fix broken free space cache after the system crashed 2014-06-19 14:20:54 -07:00
free-space-cache.h Btrfs: remove path arg from btrfs_truncate_free_space_cache 2013-11-11 21:51:33 -05:00
hash.c Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
hash.h Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
inode-item.c btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index' 2014-01-28 13:19:39 -08:00
inode-map.c btrfs: remove newline from inode cache kthread name 2014-06-09 17:20:53 -07:00
inode-map.h
inode.c Btrfs: fix compressed write corruption on enospc 2014-08-15 07:43:18 -07:00
ioctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-07-04 08:53:53 -07:00
Kconfig Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
locking.c Btrfs: fix deadlocks with trylock on tree nodes 2014-06-19 14:19:55 -07:00
locking.h Btrfs: remove btrfs_try_spin_lock 2013-03-14 14:57:10 -04:00
lzo.c btrfs: return errno instead of -1 from compression 2014-06-09 17:20:21 -07:00
Makefile Btrfs: add sanity tests for new qgroup accounting code 2014-06-09 17:20:49 -07:00
math.h Btrfs: cleanup duplicated division functions 2012-12-11 13:31:30 -05:00
ordered-data.c Btrfs: fix abnormal long waiting in fsync 2014-07-19 11:49:44 -07:00
ordered-data.h btrfs: Cleanup the "_struct" suffix in btrfs_workequeue 2014-03-10 15:17:16 -04:00
orphan.c btrfs: expand btrfs_find_item() to include find_orphan_item functionality 2014-01-28 13:19:37 -08:00
print-tree.c Btrfs: fix btrfs_print_leaf for skinny metadata 2014-07-03 07:04:16 -07:00
print-tree.h btrfs: make static code static & remove dead code 2013-05-06 15:55:23 -04:00
props.c Btrfs: add support for inode properties 2014-01-28 13:20:24 -08:00
props.h Btrfs: add support for inode properties 2014-01-28 13:20:24 -08:00
qgroup.c btrfs: correctly handle return from ulist_add 2014-08-15 07:43:16 -07:00
qgroup.h btrfs: qgroup: account shared subtrees during snapshot delete 2014-08-15 07:43:14 -07:00
raid56.c Btrfs: fix crash when mounting raid5 btrfs with missing disks 2014-06-28 13:48:45 -07:00
raid56.h Btrfs: RAID5 and RAID6 2013-02-01 14:24:23 -05:00
rcu-string.h
reada.c Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting 2014-06-13 09:52:21 -07:00
relocation.c btrfs: remove stale newlines from log messages 2014-06-09 17:20:53 -07:00
root-tree.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
scrub.c btrfs: Skip scrubbing removed chunks to avoid -ENOENT. 2014-06-19 14:20:54 -07:00
send.c Btrfs: send, use the right limits for xattr names and values 2014-06-09 17:21:00 -07:00
send.h btrfs: make static code static & remove dead code 2013-05-06 15:55:23 -04:00
struct-funcs.c
super.c btrfs: adjust statfs calculations according to raid profiles 2014-08-15 07:43:10 -07:00
sysfs.c btrfs: dev add should add its sysfs entry 2014-06-28 13:48:43 -07:00
sysfs.h btrfs: dev add should add its sysfs entry 2014-06-28 13:48:43 -07:00
transaction.c Btrfs: fix crash when starting transaction 2014-07-03 07:04:18 -07:00
transaction.h Btrfs: add sanity tests for new qgroup accounting code 2014-06-09 17:20:49 -07:00
tree-defrag.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
tree-log.c Btrfs: use helpers for last_trans_log_full_commit instead of opencode 2014-06-09 17:20:45 -07:00
tree-log.h Btrfs: use helpers for last_trans_log_full_commit instead of opencode 2014-06-09 17:20:45 -07:00
ulist.c Btrfs: do not export ulist functions 2014-01-29 07:06:27 -08:00
ulist.h Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch 2014-08-15 07:43:19 -07:00
uuid-tree.c Btrfs: convert printk to btrfs_ and fix BTRFS prefix 2014-01-28 13:20:05 -08:00
volumes.c btrfs: test for valid bdev before kobj removal in btrfs_rm_device 2014-07-19 11:49:44 -07:00
volumes.h Btrfs: fix deadlock when mounting a degraded fs 2014-06-19 14:20:56 -07:00
xattr.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-01-30 20:08:20 -08:00
xattr.h btrfs: use generic posix ACL infrastructure 2014-01-25 23:58:18 -05:00
zlib.c btrfs: use E2BIG instead of EIO if compression does not help 2014-07-03 07:04:13 -07:00