linux

Filipe Manana 3a8b36f378 Btrfs: fix data loss in the fast fsync path

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodically commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longer sets last_trans, so it
            remained with a value of 0

  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

2015-03-05 17:28:32 -08:00
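
The eight-step scenario in the commit message above can be replayed with a small model. This is only an illustrative sketch, not kernel code: the function names `old_skip`, `new_skip` and `replay` are hypothetical, the two predicates are paraphrased from the commit message's description of the check before and after the fix, and the inode is assumed not to be in the log when the final fsync runs.

```python
N = 10  # arbitrary transaction number, chosen only for illustration


def old_skip(in_log, last_trans, last_trans_committed):
    # Pre-fix check as described in the commit message: bail out whenever
    # the inode looks like it was already covered by a committed transaction.
    return in_log or last_trans <= last_trans_committed


def new_skip(in_log, full_sync, last_trans, last_trans_committed):
    # Post-fix check as described: the last_trans comparison only counts
    # when a full sync is being done.
    return in_log or (full_sync and last_trans <= last_trans_committed)


def replay(fixed):
    """Walk steps 1-8 and return True if the final fsync is skipped."""
    last_trans_committed, generation = N - 1, N  # step 1
    full_sync = True
    in_log = False  # assumption: inode is not in the log at step 8
    last_trans = 0

    if not fixed:                     # step 2: buffered write
        last_trans = generation + 1   # only the old write path set this
    last_trans = generation           # step 3: fsync, ordered extent completes
    full_sync = False                 # step 3: full sync flag is cleared
    last_trans_committed = generation # step 4: transaction N committed
    if not fixed:                     # step 5: second buffered write
        last_trans = generation + 1
    generation += 1                   # step 6: transaction N + 1 started
    last_trans_committed = generation # step 7: transaction N + 1 committed

    # step 8: fsync without the full sync flag set
    if fixed:
        return new_skip(in_log, full_sync, last_trans, last_trans_committed)
    return old_skip(in_log, last_trans, last_trans_committed)


print("old check skips final fsync:", replay(fixed=False))
print("new check skips final fsync:", replay(fixed=True))
```

In this model the old check skips the final fsync because both values equal N + 1, while the new check does not skip it because the full sync flag is no longer set.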
..
9p	assorted conversions to %p[dD]	2014-11-19 13:01:20 -05:00
adfs	adfs: add __printf verification, fix format/argument mismatches	2014-08-08 15:57:24 -07:00
affs	fs/affs/file.c: remove obsolete pagesize check	2014-12-13 12:42:52 -08:00
afs	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next	2014-12-11 14:27:06 -08:00
autofs4	assorted conversions to %p[dD]	2014-11-19 13:01:20 -05:00
befs	befs: remove dead code	2014-12-13 12:42:51 -08:00
bfs	fs/bfs: use bfs prefix for dump_imap	2014-08-08 15:57:24 -07:00
btrfs	Btrfs: fix data loss in the fast fsync path	2015-03-05 17:28:32 -08:00
cachefiles	assorted conversions to %p[dD]	2014-11-19 13:01:20 -05:00
ceph	ceph: use %zu for len in ceph_fill_inline_data()	2015-01-08 20:36:56 +03:00
cifs	cifs: make new inode cache when file type is different	2014-12-22 14:16:21 -06:00
coda	coda_venus_readdir(): use file_inode()	2014-12-11 16:28:12 -05:00
configfs	assorted conversions to %p[dD]	2014-11-19 13:01:20 -05:00
cramfs	fs/cramfs/inode.c: use linux/uaccess.h	2014-08-08 15:57:25 -07:00
debugfs	Driver core patches for 3.19-rc1	2014-12-14 16:10:09 -08:00
devpts
dlm	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-10 16:10:49 -08:00
ecryptfs	Fixes for filename decryption and encrypted view plus a cleanup	2014-12-19 18:15:12 -08:00
efivarfs	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-10 16:10:49 -08:00
efs	fs/efs/namei.c: return is not a function	2014-08-08 15:57:18 -07:00
exofs	Boaz Harrosh - Fix broken email address	2014-10-19 20:22:32 +03:00
exportfs	move d_rcu from overlapping d_child to overlapping d_alias	2014-11-03 15:20:29 -05:00
ext2	ext2: Convert to private i_dquot field	2014-11-10 10:06:10 +01:00
ext3	ext3: Convert to private i_dquot field	2014-11-10 10:06:10 +01:00
ext4	Revert a potential seek_data/hole regression which shows up when using	2015-01-06 14:05:40 -08:00
f2fs	f2fs: avoid to ra unneeded blocks in recover flow	2014-12-08 14:19:09 -08:00
fat	fat: fix data past EOF resulting from fsx testsuite	2014-12-13 12:42:51 -08:00
freevxfs
fscache	fs/fscache/object-list.c: use __seq_open_private()	2014-10-13 17:52:21 +01:00
fuse	fuse: add memory barrier to INIT	2015-01-06 10:45:35 +01:00
gfs2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-10 16:10:49 -08:00
hfs	fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp	2014-12-10 17:41:16 -08:00
hfsplus	hfsplus: fix longname handling	2014-12-18 19:08:10 -08:00
hostfs	hostfs: support rename flags	2014-08-07 14:40:09 -04:00
hpfs	fs/hpfs/dnode.c: fix suspect code indent	2014-08-08 15:57:22 -07:00
hppfs	vfs: make first argument of dir_context.actor typed	2014-10-31 17:48:54 -04:00
hugetlbfs	mm: convert i_mmap_mutex to rwsem	2014-12-13 12:42:45 -08:00
isofs	isofs: Fix unchecked printing of ER records	2014-12-19 11:29:24 +01:00
jbd	jbd: Deletion of an unnecessary check before the function call "iput"	2014-11-18 10:15:29 +01:00
jbd2	Lots of bugs fixes, including Zheng and Jan's extent status shrinker	2014-12-12 09:28:03 -08:00
jffs2	jffs2: Drop bogus if in comment	2014-11-28 18:23:44 -08:00
jfs	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-10 16:10:49 -08:00
kernfs	kernfs: Fix kernfs_name_compare	2015-01-09 15:51:08 -08:00
lockd	LOCKD: Fix a race when initialising nlmsvc_timeout	2015-01-05 19:40:53 -08:00
logfs	fs/logfs/readwrite.c: kernel-doc warning fixes	2014-08-06 18:01:12 -07:00
minix	minix zmap block counts calculation fix	2014-08-08 15:57:20 -07:00
ncpfs	Merge branch 'akpm' (patchbomb from Andrew)	2014-12-10 18:34:42 -08:00
nfs	NFSv4: Remove incorrect check in can_open_delegated()	2015-01-05 19:40:54 -08:00
nfs_common	lockd: move lockd's grace period handling into its own module	2014-09-17 16:33:11 -04:00
nfsd	nfsd: fix fi_delegees leak when fi_had_conflict returns true	2015-01-07 13:38:21 -05:00
nilfs2	nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races	2014-12-10 17:41:16 -08:00
nls
notify	sched, fanotify: Deal with nested sleeps	2015-01-09 11:18:12 +01:00
ntfs	assorted conversions to %p[dD]	2014-11-19 13:01:20 -05:00
ocfs2	ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file	2015-01-08 15:10:51 -08:00
omfs	FS/OMFS: block number sanity check during fill_super operation	2014-10-14 02:18:22 +02:00
openpromfs
overlayfs	Merge branch 'iov_iter' into for-next	2014-12-08 20:39:29 -05:00
proc	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-12-19 13:26:08 -08:00
pstore	Driver core patches for 3.19-rc1	2014-12-14 16:10:09 -08:00
qnx4
qnx6	fs/qnx6: update debugging to current functions	2014-08-08 15:57:26 -07:00
quota	vfs: Remove i_dquot field from inode	2014-11-10 10:06:18 +01:00
ramfs	fs/ramfs/file-nommu.c: replace count*size kzalloc by kcalloc	2014-08-08 15:57:18 -07:00
reiserfs	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs	2014-12-16 15:46:01 -08:00
romfs	fs/romfs/super.c: add blank line after declarations	2014-08-08 15:57:25 -07:00
squashfs	Squashfs: Add LZ4 compression configuration option	2014-11-27 18:48:44 +00:00
sysfs	sysfs/kernfs: make read requests on pre-alloc files use the buffer.	2014-11-07 10:54:38 -08:00
sysv
ubifs	UBIFS: fix a couple bugs in UBIFS xattr length calculation	2014-11-07 12:32:22 +02:00
udf	udf: Reduce repeated dereferences	2014-12-21 22:42:37 +01:00
ufs	fs/ufs/balloc.c: remove unused variable	2014-10-14 02:18:20 +02:00
xfs	xfs: update for 3.19-rc1	2014-12-12 09:48:17 -08:00
aio.c	aio: Skip timer for io_getevents if timeout=0	2014-12-13 17:50:20 -05:00
anon_inodes.c
attr.c	fs,userns: Change inode_capable to capable_wrt_inode_uidgid	2014-06-10 13:57:22 -07:00
bad_inode.c	bad_inode: add ->rename2()	2014-08-07 14:40:09 -04:00
binfmt_aout.c	assorted conversions to %p[dD]	2014-11-19 13:01:20 -05:00
binfmt_elf_fdpic.c	handle suicide on late failure exits in execve() in search_binary_handler()	2014-10-09 02:39:00 -04:00
binfmt_elf.c	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus	2014-12-11 17:56:37 -08:00
binfmt_em86.c	syscalls: implement execveat() system call	2014-12-13 12:42:51 -08:00
binfmt_flat.c
binfmt_misc.c	unfuck binfmt_misc.c (broken by commit `e6084d4`)	2014-12-17 08:27:14 -05:00
binfmt_script.c	syscalls: implement execveat() system call	2014-12-13 12:42:51 -08:00
binfmt_som.c
block_dev.c	fs: add freeze_super/thaw_super fs hooks	2014-11-17 10:35:17 +00:00
buffer.c	fs: clarify rate limit suppressed buffer I/O errors	2014-10-21 13:55:11 -06:00
char_dev.c	fs/char_dev.c: remove pointless assignment from __register_chrdev_region()	2014-12-10 17:41:04 -08:00
compat_binfmt_elf.c
compat_ioctl.c	Bluetooth: Move HCI socket definitions into its own header file	2014-07-11 13:53:04 +03:00
compat.c	vfs: make first argument of dir_context.actor typed	2014-10-31 17:48:54 -04:00
coredump.c	coredump: add %i/%I in core_pattern to report the tid of the crashed thread	2014-10-14 02:18:21 +02:00
dcache.c	Merge branch 'iov_iter' into for-next	2014-12-08 20:39:29 -05:00
dcookies.c
direct-io.c	fuse: honour max_read and max_write in direct_io mode	2014-09-26 21:16:51 -04:00
drop_caches.c	mm: vmscan: invoke slab shrinkers from shrink_zone()	2014-12-13 12:42:48 -08:00
eventfd.c	fs: Convert show_fdinfo functions to void	2014-11-05 14:13:23 -05:00
eventpoll.c	fs: Convert show_fdinfo functions to void	2014-11-05 14:13:23 -05:00
exec.c	syscalls: implement execveat() system call	2014-12-13 12:42:51 -08:00
fcntl.c	vfs: renumber FMODE_NONOTIFY and add to uniqueness check	2015-01-08 15:10:52 -08:00
fhandle.c
file_table.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-10-13 11:28:42 +02:00
file.c	fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0)	2014-12-10 17:41:10 -08:00
filesystems.c
fs_pin.c	make fs/{namespace,super}.c forget about acct.h	2014-08-07 14:40:09 -04:00
fs_struct.c
fs-writeback.c	writeback: fix a subtle race condition in I_DIRTY clearing	2014-11-04 10:42:23 -07:00
inode.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
internal.h	take the targets of /proc/*/ns/* symlinks to separate fs	2014-12-10 21:30:20 -05:00
ioctl.c	Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux	2014-12-16 15:25:31 -08:00
Kconfig	overlay filesystem	2014-10-24 00:14:38 +02:00
Kconfig.binfmt	binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers	2014-11-24 07:45:02 +01:00
libfs.c	move d_rcu from overlapping d_child to overlapping d_alias	2014-11-03 15:20:29 -05:00
locks.c	locks: fix NULL-deref in generic_delete_lease	2015-01-13 07:00:55 -05:00
Makefile	Merge branch 'nsfs' into for-next	2014-12-10 21:31:59 -05:00
mbcache.c	fs/mbcache: replace __builtin_log2() with ilog2()	2014-06-25 22:08:29 -04:00
mount.h	common object embedded into various struct ....ns	2014-12-04 14:31:00 -05:00
mpage.c	vfs: guard end of device for mpage interface	2014-10-09 22:25:53 -04:00
namei.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
namespace.c	mnt: Fix a memory stomp in umount	2014-12-18 11:22:02 -08:00
no-block.c
nsfs.c	take the targets of /proc/*/ns/* symlinks to separate fs	2014-12-10 21:30:20 -05:00
open.c	Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux	2014-12-16 15:25:31 -08:00
pipe.c
pnode.c	mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.	2014-12-02 10:46:50 -06:00
pnode.h
posix_acl.c
proc_namespace.c	vfs: make mounts and mountstats honor root dir like mountinfo does	2014-12-17 08:27:15 -05:00
read_write.c	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security	2014-12-14 20:36:37 -08:00
readdir.c	vfs: make first argument of dir_context.actor typed	2014-10-31 17:48:54 -04:00
select.c
seq_file.c	fs, seq_file: fallback to vmalloc instead of oom kill processes	2014-12-13 12:42:49 -08:00
signalfd.c	fs: Convert show_fdinfo functions to void	2014-11-05 14:13:23 -05:00
splice.c	vfs: export do_splice_direct() to modules	2014-10-24 00:14:35 +02:00
stack.c	fs: fix comment for 'CONFIG_LBADF'	2014-08-26 09:35:56 +02:00
stat.c
statfs.c
super.c	vfs: Remove i_dquot field from inode	2014-11-10 10:06:18 +01:00
sync.c	kill f_dentry uses	2014-11-19 13:01:25 -05:00
timerfd.c	fs: Convert show_fdinfo functions to void	2014-11-05 14:13:23 -05:00
utimes.c
xattr.c	new helper: audit_file()	2014-11-19 13:01:26 -05:00