linux/fs/btrfs
Filipe Manana 1a4bcf470c Btrfs: fix fsync data loss after adding hard link to inode
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).

This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.

This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create one file with data and fsync it.
  # This made the btrfs fsync log persist the data and the inode metadata with
  # a correct inode->i_size (4096 bytes).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Now add one hard link to our file. This made the btrfs code update the fsync
  # log, in memory only, with an inode metadata having a size of 0.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now force persistence of the fsync log to disk, for example, by fsyncing some
  # other file.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # Before a power loss or crash, we could read the 4Kb of data from our file as
  # expected.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the fsync log replay, because the fsync log had a value of 0 for our
  # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
  # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs write into the in memory fsync
  # log a special inode with generation 0 and an i_size of 0 too. Note that this
  # didn't update the inode in the fsync log on disk.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to read the
  # 12Kb of file data.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The btrfs fsync log replay code didn't update the i_size of the persisted
  # inode because the inode item in the log had a special generation with a
  # value of 0 (and it couldn't know the correct i_size, since that inode item
  # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
  # effectively lost.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:

  Btrfs: Add a write ahead tree log to optimize synchronous operations
  (commit e02119d5a7)

Test cases for xfstests follow soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
..
tests btrfs: switch extent_state state to unsigned 2015-01-21 18:02:04 -08:00
acl.c btrfs: remove useless ACL check 2014-06-09 17:20:42 -07:00
async-thread.c btrfs: remove unlikely from NULL checks 2014-10-02 16:06:19 +02:00
async-thread.h Btrfs: implement repair function when direct read fails 2014-09-17 13:39:01 -07:00
backref.c btrfs: cleanup, remove inode_ref_info helper 2015-01-14 19:23:47 +01:00
backref.h btrfs: cleanup, remove inode_item_info helper 2015-01-14 19:23:47 +01:00
btrfs_inode.h Btrfs: Add code to support file creation time 2015-02-02 18:39:16 -08:00
check-integrity.c Btrfs: include vmalloc.h in check-integrity.c 2014-11-25 06:01:11 -08:00
check-integrity.h block: submit_bio_wait() conversions 2013-11-24 16:33:41 -07:00
compression.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-12-12 11:15:23 -08:00
compression.h btrfs: zero out left over bytes after processing compression streams 2014-11-30 09:33:51 -08:00
ctree.c Btrfs: insert_new_root: Fix lock type of the extent buffer. 2015-01-22 05:42:23 -08:00
ctree.h Btrfs: account for large extents with enospc 2015-02-14 08:22:48 -08:00
delayed-inode.c Btrfs: Add code to support file creation time 2015-02-02 18:39:16 -08:00
delayed-inode.h Btrfs: introduce the delayed inode ref deletion for the single link inode 2014-01-28 13:20:09 -08:00
delayed-ref.c Btrfs: rework qgroup accounting 2014-06-09 17:20:48 -07:00
delayed-ref.h Btrfs: rework qgroup accounting 2014-06-09 17:20:48 -07:00
dev-replace.c Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event() 2015-01-21 18:06:48 -08:00
dev-replace.h
dir-item.c Btrfs: make xattr replace operations atomic 2014-11-20 17:20:07 -08:00
disk-io.c Btrfs: fix race between transaction commit and empty block group removal 2015-02-02 19:24:48 -08:00
disk-io.h btrfs: sink blocksize parameter to btrfs_find_create_tree_block 2014-12-12 18:07:21 +01:00
export.c btrfs: kill the key type accessor helpers 2014-09-17 13:37:12 -07:00
export.h
extent_io.c Btrfs: account for large extents with enospc 2015-02-14 08:22:48 -08:00
extent_io.h btrfs: switch extent_state state to unsigned 2015-01-21 18:02:04 -08:00
extent_map.c Btrfs: do not move em to modified list when unpinning 2014-11-21 11:59:54 -08:00
extent_map.h Btrfs: fix NULL pointer crash when running balance and scrub concurrently 2014-06-19 14:20:55 -07:00
extent-tree.c Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group 2015-02-14 08:22:49 -08:00
file-item.c Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup 2014-11-04 06:59:04 -08:00
file.c Btrfs: fix snapshot inconsistency after a file write followed by truncate 2014-11-25 07:41:23 -08:00
free-space-cache.c btrfs: cleanup init for list in free-space-cache 2015-02-02 19:25:50 -08:00
free-space-cache.h Btrfs: fix race between writing free space cache and trimming 2014-12-02 18:35:09 -08:00
hash.c btrfs: LLVMLinux: Remove VLAIS 2014-10-14 10:51:22 +02:00
hash.h Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
inode-item.c Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs 2015-01-21 18:02:05 -08:00
inode-map.c Btrfs: fix race between writing free space cache and trimming 2014-12-02 18:35:09 -08:00
inode-map.h
inode.c Btrfs: account for large extents with enospc 2015-02-14 08:22:48 -08:00
ioctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-12-12 11:15:23 -08:00
Kconfig Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
locking.c btrfs: fix lockups from btrfs_clear_path_blocking 2014-11-19 10:34:35 -08:00
locking.h btrfs: fix lockups from btrfs_clear_path_blocking 2014-11-19 10:34:35 -08:00
lzo.c btrfs: zero out left over bytes after processing compression streams 2014-11-30 09:33:51 -08:00
Makefile Btrfs: add sanity tests for new qgroup accounting code 2014-06-09 17:20:49 -07:00
math.h
ordered-data.c Btrfs: collect only the necessary ordered extents on ranged fsync 2014-11-21 11:59:56 -08:00
ordered-data.h Btrfs: collect only the necessary ordered extents on ranged fsync 2014-11-21 11:59:56 -08:00
orphan.c btrfs: kill the key type accessor helpers 2014-09-17 13:37:12 -07:00
print-tree.c btrfs: remove parameter blocksize from read_tree_block 2014-10-02 17:14:50 +02:00
print-tree.h btrfs: make static code static & remove dead code 2013-05-06 15:55:23 -04:00
props.c Btrfs: add support for inode properties 2014-01-28 13:20:24 -08:00
props.h Btrfs: add support for inode properties 2014-01-28 13:20:24 -08:00
qgroup.c btrfs: qgroup: move WARN_ON() to the correct location. 2015-01-21 18:22:37 -08:00
qgroup.h btrfs: qgroup: account shared subtrees during snapshot delete 2014-08-15 07:43:14 -07:00
raid56.c Btrfs: Include map_type in raid_bio 2015-01-21 18:06:49 -08:00
raid56.h Btrfs: Make raid_map array be inlined in btrfs_bio structure 2015-01-21 18:06:47 -08:00
rcu-string.h
reada.c Btrfs: add ref_count and free function for btrfs_bio 2015-01-21 18:06:48 -08:00
relocation.c btrfs: sink blocksize parameter to tree_block_processed 2014-12-12 18:07:22 +01:00
root-tree.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
scrub.c Btrfs: scrub, fix sleep in atomic context 2015-02-14 08:19:14 -08:00
send.c btrfs: kill btrfs_inode_*time helpers 2015-02-02 18:39:07 -08:00
send.h btrfs: make static code static & remove dead code 2013-05-06 15:55:23 -04:00
struct-funcs.c
super.c btrfs: remove a no-op unfreeze superbock callback 2015-01-21 18:02:04 -08:00
sysfs.c Btrfs: add missing cleanup on sysfs init failure 2015-02-02 19:24:49 -08:00
sysfs.h btrfs: code optimize: BTRFS_ATTR_RW could set the mode 2014-09-17 13:37:59 -07:00
transaction.c btrfs: Fix out-of-space bug 2015-02-14 08:19:14 -08:00
transaction.h btrfs: Fix out-of-space bug 2015-02-14 08:19:14 -08:00
tree-defrag.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
tree-log.c Btrfs: fix fsync data loss after adding hard link to inode 2015-02-14 08:22:49 -08:00
tree-log.h Btrfs: fix data corruption after fast fsync and writeback error 2014-09-19 06:57:51 -07:00
ulist.c Btrfs: do not export ulist functions 2014-01-29 07:06:27 -08:00
ulist.h Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch 2014-08-15 07:43:19 -07:00
uuid-tree.c Btrfs: make btrfs_search_forward return with nodes unlocked 2014-09-17 13:38:02 -07:00
volumes.c btrfs: Fix out-of-space bug 2015-02-14 08:19:14 -08:00
volumes.h Btrfs: Include map_type in raid_bio 2015-01-21 18:06:49 -08:00
xattr.c Btrfs: make xattr replace operations atomic 2014-11-20 17:20:07 -08:00
xattr.h btrfs: use generic posix ACL infrastructure 2014-01-25 23:58:18 -05:00
zlib.c btrfs: zero out left over bytes after processing compression streams 2014-11-30 09:33:51 -08:00