linux/fs
Filipe Manana 3660d0bcdb btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.

The conditions for this to happen are the following:

1) We have a file with a size of, for example, 1280K;

2) There is a written (non-prealloc) extent for the file range from 1024K
   to 1280K with a length of 256K;

3) This particular file extent layout is durably persisted, so that the
   existing superblock persisted on disk points to a subvolume root where
   the file has that exact file extent layout and state;

4) The file is truncated to a smaller size, to an offset lower than the
   start offset of its last extent, for example to 800K. The truncate sets
   the full sync runtime flag on the inode;

6) Fsync the file to log it and clear the full sync runtime flag;

7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
   into the file with a destination offset that starts at or beyond the
   256K file extent item we had - for example to offset 1024K;

8) Since the clone operation does not find extents in the source range,
   we end up in the if branch at the bottom of btrfs_clone() where we
   punch a hole for the file range starting at offset 1024K by calling
   btrfs_replace_file_extents(). There we end up not setting the full
   sync flag on the inode, because we don't know we are being called in
   a clone context (and not fallocate's punch hole operation), and
   neither do we create an extent map to represent a hole because the
   requested range is beyond eof;

9) A further fsync to the file will be a fast fsync, since the clone
   operation did not set the full sync flag, and therefore it relies on
   modified extent maps to correctly log the file layout. But since
   it does not find any extent map marking the range from 1024K (the
   previous eof) to the new eof, it does not log a file extent item
   for that range representing the hole;

10) After a power failure no hole for the range starting at 1024K is
   punched and we end up exposing stale data from the old 256K extent.

Turning this into exact steps:

  $ mkfs.btrfs -f -O no-holes /dev/sdi
  $ mount /dev/sdi /mnt

  # Create our test file with 3 extents of 256K and a 256K hole at offset
  # 256K. The file has a size of 1280K.
  $ xfs_io -f -s \
              -c "pwrite -S 0xab -b 256K 0 256K" \
              -c "pwrite -S 0xcd -b 256K 512K 256K" \
              -c "pwrite -S 0xef -b 256K 768K 256K" \
              -c "pwrite -S 0x73 -b 256K 1024K 256K" \
              /mnt/sdi/foobar

  # Make sure it's durably persisted. We want the last committed super
  # block to point to this particular file extent layout.
  sync

  # Now truncate our file to a smaller size, falling within a position of
  # the second extent. This sets the full sync runtime flag on the inode.
  # Then fsync the file to log it and clear the full sync flag from the
  # inode. The third extent is no longer part of the file and therefore
  # it is not logged.
  $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar

  # Now do a clone operation that only clones the hole and sets back the
  # file size to match the size it had before the truncate operation
  # (1280K).
  $ xfs_io \
        -c "reflink /mnt/foobar 256K 1024K 256K" \
        -c "fsync" \
        /mnt/foobar

  # File data before power failure:
  $ od -A d -t x1 /mnt/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
  *
  0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1310720

  <power fail>

  # Mount the fs again to replay the log tree.
  $ mount /dev/sdi /mnt

  # File data after power failure:
  $ od -A d -t x1 /mnt/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
  *
  0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
  *
  1310720

The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.

The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.

So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.

A test case for fstests will follow soon.

Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:45 +01:00
..
9p 9p for 5.11-rc1 2020-12-21 10:28:02 -08:00
adfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
affs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
afs rxrpc: Fix deadlock around release of dst cached on udp tunnel 2021-01-29 21:38:11 -08:00
autofs file: Replace ksys_close with close_fd 2020-12-10 12:42:59 -06:00
befs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
bfs bfs: don't use WARNING: string when it's just info. 2020-12-15 22:46:18 -08:00
btrfs btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled 2021-02-22 18:07:45 +01:00
cachefiles cachefiles: Drop superfluous readpages aops NULL check 2021-01-20 11:33:51 -08:00
ceph libceph, ceph: disambiguate ceph_connection_operations handlers 2021-01-04 17:31:32 +01:00
cifs cifs: report error instead of invalid when revalidating a dentry fails 2021-02-05 13:17:48 -06:00
coda docs: filesystems: convert coda.txt to ReST 2020-05-05 09:22:21 -06:00
configfs configfs: fix kernel-doc markup issue 2020-11-14 10:22:45 +01:00
cramfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
crypto f2fs-for-5.11-rc1 2020-12-17 11:18:00 -08:00
debugfs debugfs: remove return value of debugfs_create_devm_seqfile() 2020-10-30 08:37:39 +01:00
devpts
dlm fs: dlm: check on existing node address 2020-11-10 12:14:20 -06:00
ecryptfs ecryptfs: fix uid translation for setxattr on security.capability 2021-01-26 01:47:14 +00:00
efivarfs efivarfs: revert "fix memory leak in efivarfs_create()" 2020-11-25 16:55:02 +01:00
efs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
erofs erofs: avoid using generic_block_bmap 2020-12-10 11:07:40 +08:00
exfat exfat: Avoid allocating upcase table using kcalloc() 2020-12-22 12:31:17 +09:00
exportfs exportfs: Add a function to return the raw output from fh_to_dentry() 2020-12-09 09:39:38 -05:00
ext2 ext2: Fix fall-through warnings for Clang 2020-11-23 10:36:53 +01:00
ext4 A number of bug fixes for ext4: 2021-01-15 14:54:24 -08:00
f2fs f2fs-for-5.11-rc1 2020-12-17 11:18:00 -08:00
fat [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
freevxfs
fscache Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-06-03 16:27:18 -07:00
fuse fuse: fix bad inode 2020-12-10 15:33:14 +01:00
gfs2 gfs2: in signal_our_withdraw wait for unfreeze of _this_ fs only 2020-12-03 17:04:41 +01:00
hfs fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
hfsplus fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
hostfs fix hostfs_open() use of ->f_path.dentry 2020-12-21 21:42:29 -05:00
hpfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
hugetlbfs mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page 2021-02-05 11:03:47 -08:00
iomap iomap: support REQ_OP_ZONE_APPEND 2021-02-09 00:52:19 +01:00
isofs fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
jbd2 jbd2: add a helper to find out number of fast commit blocks 2020-12-17 13:30:45 -05:00
jffs2 jffs2: Fix NULL pointer dereference in rp_size fs option parsing 2020-12-13 21:57:21 +01:00
jfs jfs: Fix array index bounds check in dbAdjTree 2020-11-13 16:03:07 -06:00
kernfs kernfs: wire up ->splice_read and ->splice_write 2021-01-21 18:30:28 +01:00
lockd fs/lockd: convert comma to semicolon 2020-12-16 07:57:37 -05:00
minix [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
nfs pNFS/NFSv4: Improve rejection of out-of-order layouts 2021-01-24 20:52:31 -05:00
nfs_common nfs_common: need lock during iterate through the list 2020-12-09 09:38:34 -05:00
nfsd nfsd4: readdirplus shouldn't return parent of export 2021-01-12 08:54:14 -05:00
nilfs2 fs/nilfs2: remove some unused macros to tame gcc 2020-12-15 22:46:17 -08:00
nls treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
notify fanotify: Fix sys_fanotify_mark() on native x86-32 2020-12-28 11:58:59 +01:00
ntfs fs/ntfs: remove unused variable attr_len 2020-12-15 12:13:37 -08:00
ocfs2 ocfs2: ratelimit the 'max lookup times reached' notice 2020-12-15 12:13:37 -08:00
omfs fs: omfs: use kmemdup() rather than kmalloc+memcpy 2020-09-22 23:39:45 -04:00
openpromfs
orangefs orangefs: add splice file operations 2020-12-16 16:14:08 -05:00
overlayfs ovl: implement volatile-specific fsync error behaviour 2021-01-28 10:22:48 +01:00
proc proc_sysctl: fix oops caused by incorrect command parameters 2021-01-24 10:34:53 -08:00
pstore Tracing updates for 5.11 2020-12-17 13:22:17 -08:00
qnx4 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
qnx6 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
quota \n 2020-12-17 11:00:37 -08:00
ramfs ramfs: fix nommu mmap with gaps in the page cache 2020-10-16 11:11:22 -07:00
reiserfs reiserfs: add check for an invalid ih_entry_count 2020-11-26 16:57:28 +01:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
sysfs sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output 2020-10-02 12:02:30 +02:00
sysv [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
tracefs
ubifs This pull request contains changes for JFFS2, UBI and UBIFS: 2020-12-17 17:46:34 -08:00
udf udf: fix the problem that the disc content is not displayed 2021-01-18 12:06:33 +01:00
ufs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
unicode unicode: Add utf8_casefold_hash 2020-09-10 14:03:31 -07:00
vboxsf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2020-10-15 15:11:56 -07:00
verity Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2020-12-14 12:18:19 -08:00
xfs New code for 5.11: 2020-12-18 12:50:18 -08:00
zonefs zonefs: select CONFIG_CRC32 2021-01-04 09:06:42 +09:00
aio.c Merge branch 'akpm' (patches from Andrew) 2020-12-15 12:53:37 -08:00
anon_inodes.c
attr.c
bad_inode.c fs: move the fiemap definitions out of fs.h 2020-06-03 23:16:55 -04:00
binfmt_aout.c exec: Rename flush_old_exec begin_new_exec 2020-05-07 16:55:47 -05:00
binfmt_elf_fdpic.c binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot 2020-10-16 11:11:21 -07:00
binfmt_elf.c Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:29:43 -08:00
binfmt_em86.c Merge branch 'akpm' (patches from Andrew) 2020-06-04 19:18:29 -07:00
binfmt_flat.c binfmt_flat: revert "binfmt_flat: don't offset the data start" 2020-08-24 08:49:13 +10:00
binfmt_misc.c Merge branch 'akpm' (patches from Andrew) 2020-06-04 19:18:29 -07:00
binfmt_script.c Merge branch 'akpm' (patches from Andrew) 2020-06-04 19:18:29 -07:00
block_dev.c Revert "block: simplify set_init_blocksize" to regain lost performance 2021-01-27 09:14:12 -07:00
buffer.c for-5.11/block-2020-12-14 2020-12-16 12:57:51 -08:00
char_dev.c vfs: allow unprivileged whiteout creation 2020-05-14 16:44:23 +02:00
compat_binfmt_elf.c elf: Expose ELF header on arch_setup_additional_pages() 2020-10-26 13:46:47 +01:00
coredump.c Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:29:43 -08:00
d_path.c fs: fix NULL dereference due to data race in prepend_path() 2020-10-14 14:54:45 -07:00
dax.c mm: simplify follow_pte{,pmd} 2020-12-15 22:46:19 -08:00
dcache.c fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set 2020-12-10 17:33:17 -05:00
dcookies.c
direct-io.c \n 2020-10-15 15:03:10 -07:00
drop_caches.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
eventfd.c eventfd: Export eventfd_ctx_do_read() 2020-11-15 09:49:10 -05:00
eventpoll.c epoll: add syscall epoll_pwait2 2020-12-19 11:18:38 -08:00
exec.c Merge branch 'parisc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2020-12-16 12:10:40 -08:00
fcntl.c fcntl: Fix potential deadlock in send_sig{io, urg}() 2020-11-05 07:44:15 -05:00
fhandle.c
file_table.c epoll: take epitem list out of struct file 2020-10-25 20:02:08 -04:00
file.c kernel/io_uring: cancel io_uring before task works 2020-12-30 19:36:54 -07:00
filesystems.c fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once() 2020-04-10 15:36:22 -07:00
fs_context.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
fs_parser.c fs_parse: mark fs_param_bad_value() as static 2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c vfs: Use sequence counter with associated spinlock 2020-07-29 16:14:27 +02:00
fs_types.c
fs-writeback.c fs: fix lazytime expiration handling in __writeback_single_inode() 2021-01-13 17:26:21 +01:00
fsopen.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
init.c init: add an init_dup helper 2020-08-04 21:02:38 -04:00
inode.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-12-25 10:54:29 -08:00
internal.h for-5.11/block-2020-12-14 2020-12-16 12:57:51 -08:00
io_uring.c io_uring: drop mm/files between task_work_submit 2021-02-04 12:42:58 -07:00
io-wq.c io-wq: kill now unused io_wq_cancel_all() 2020-12-20 10:47:42 -07:00
io-wq.h io-wq: kill now unused io_wq_cancel_all() 2020-12-20 10:47:42 -07:00
ioctl.c fs: remove ksys_ioctl 2020-07-31 08:16:01 +02:00
Kconfig tmpfs: support 64-bit inums per-sb 2020-08-07 11:33:24 -07:00
Kconfig.binfmt treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
kernel_read_file.c fs/kernel_file_read: Add "offset" arg for partial reads 2020-10-05 13:37:04 +02:00
libfs.c f2fs-for-5.11-rc1 2020-12-17 11:18:00 -08:00
locks.c Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:29:43 -08:00
Makefile Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
mbcache.c
mount.h mnt: Use generic ns_common::count 2020-08-19 14:14:19 +02:00
mpage.c fs: convert mpage_readpages to mpage_readahead 2020-06-02 10:59:07 -07:00
namei.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-12-25 10:54:29 -08:00
namespace.c umount(2): move the flag validity checks first 2021-01-04 15:31:58 -05:00
no-block.c
nsfs.c nsproxy: attach to namespaces via pidfds 2020-05-13 11:41:22 +02:00
open.c Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:29:43 -08:00
pipe.c fs/pipe: allow sendfile() to pipe again 2021-01-25 12:32:26 -08:00
pnode.c propagate_one(): mnt_set_mountpoint() needs mount_lock 2020-04-27 10:37:14 -04:00
pnode.h fs/namespace.c: WARN if mnt_count has become negative 2020-12-10 17:33:17 -05:00
posix_acl.c vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK 2020-06-08 11:04:19 -07:00
proc_namespace.c proc mountinfo: make splice available again 2020-12-27 12:00:36 -08:00
read_write.c Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
readdir.c fs: remove ksys_getdents64 2020-07-31 08:16:00 +02:00
remap_range.c vfs: verify source area in vfs_dedupe_file_range_one() 2020-12-14 15:26:13 +01:00
select.c poll: fix performance regression due to out-of-line __put_user() 2021-01-08 11:06:29 -08:00
seq_file.c fix return values of seq_read_iter() 2020-11-15 22:12:53 -05:00
signalfd.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
splice.c io_uring-5.10-2020-10-24 2020-10-24 12:40:18 -07:00
stack.c
stat.c fs: remove KSTAT_QUERY_FLAGS 2020-09-26 22:55:05 -04:00
statfs.c block: remove i_bdev 2020-12-01 14:53:39 -07:00
super.c block: remove i_bdev 2020-12-01 14:53:39 -07:00
sync.c overlayfs update for 5.8 2020-06-09 15:40:50 -07:00
timerfd.c
userfaultfd.c userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob 2020-12-15 12:13:46 -08:00
utimes.c fs: expose utimes_common 2020-07-31 08:16:01 +02:00
xattr.c vfs: move cap_convert_nscap() call into vfs_setxattr() 2020-12-14 15:26:13 +01:00