linux/fs
Josef Bacik db7e68b522 btrfs: drop the backref cache during relocation if we commit
Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.

This works as long as we're only ever modifying blocks, not changing the
structure of the tree.

However relocation does in fact change the structure of the tree.  For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent.  We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.

But, a key feature of do_relocation() is the pending list.  This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes(), which
calls do_relocation() on all of the nodes that led up to that leaf.

The purpose of this is to make sure we don't break sharing unless we
absolutely have to.  Consider the case where 3 snapshots all
point to this leaf through the same nodes.  The initial COW would have
created a whole new path.  If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally.  To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.

Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.

The problem with this is that we never reflect this in the backref
cache.  If we drop a subtree this way for a node that is in the backref
cache, the node stays in the cache.  This becomes a problem when we
change the transid, as the backref cache now holds entire subtrees that
no longer exist, but that are recorded as if they were still pointed to
by the same roots.

In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.

Worst case you will double free some random block and re-use it while
there are still references to the block.

This is extremely subtle, and the consequences are quite bad.  There
isn't a way to make sure our backref cache is consistent between
transids.

In order to fix this we need to simply evict the entire backref cache
anytime we cross transids.  This reduces performance in that we have to
rebuild this backref cache every time we change transids, but fixes the
bug.
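
The eviction rule can be illustrated with a small user-space toy model.
This is only a sketch of the idea, not the kernel code: the toy_* types
and functions below are hypothetical names made up for illustration, and
the fixed-size slot array stands in for the real cache structure.  The
cache remembers the transid it was built under, and a refresh under any
other transid throws away every cached node instead of trying to patch
them up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SLOTS 8

/* Hypothetical toy types; these are NOT the kernel's backref cache. */
struct toy_backref_node {
	uint64_t bytenr;	/* logical address of the cached tree block */
	bool in_use;
};

struct toy_backref_cache {
	struct toy_backref_node slots[TOY_CACHE_SLOTS];
	size_t nr_nodes;
	uint64_t last_trans;	/* transid the contents were built under */
};

/* Empty the cache and tag it with the given transid. */
static void toy_cache_init(struct toy_backref_cache *cache, uint64_t transid)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
		cache->slots[i].in_use = false;
	cache->nr_nodes = 0;
	cache->last_trans = transid;
}

/* Remember a tree block; returns false if the toy cache is full. */
static bool toy_cache_insert(struct toy_backref_cache *cache, uint64_t bytenr)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
		if (!cache->slots[i].in_use) {
			cache->slots[i].bytenr = bytenr;
			cache->slots[i].in_use = true;
			cache->nr_nodes++;
			assert(cache->nr_nodes <= TOY_CACHE_SLOTS);
			return true;
		}
	}
	return false;
}

/* Linear lookup of a cached tree block by bytenr. */
static bool toy_cache_contains(const struct toy_backref_cache *cache,
			       uint64_t bytenr)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
		if (cache->slots[i].in_use && cache->slots[i].bytenr == bytenr)
			return true;
	}
	return false;
}

/*
 * Called before the cache is used.  If the transid moved on, nothing in
 * the cache can be trusted, so evict everything.  Returns true when an
 * eviction happened and the caller has to rebuild what it needs.
 */
static bool toy_cache_refresh(struct toy_backref_cache *cache,
			      uint64_t cur_transid)
{
	if (cache->last_trans == cur_transid)
		return false;
	toy_cache_init(cache, cur_transid);
	return true;
}
```

In this model a caller would run toy_cache_refresh() with the current
transid before every lookup.  Within one transid the cached nodes are
reused; after a commit the first refresh empties the cache, which is the
rebuild cost described above.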

This has existed since relocation was added, and is a pretty critical
bug.  There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.

A followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.

We have a reproducer that reproduced the corruption within a few minutes
of running.  With this patch it survives several iterations/hours of
running the reproducer.

Fixes: 3fd0a5585e ("Btrfs: Metadata ENOSPC handling for balance")
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:10:26 +02:00
9p 9p: Fix DIO read through netfs 2024-08-13 13:53:09 +02:00
adfs fs/adfs: add MODULE_DESCRIPTION 2024-07-18 09:50:08 +02:00
affs affs: struct slink_front: Replace 1-element array with flexible array 2024-07-11 16:14:26 +02:00
afs afs: Fix post-setattr file edit to do truncation correctly 2024-08-24 16:09:16 +02:00
autofs vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
bcachefs bcachefs: BCH_SB_MEMBER_INVALID 2024-09-03 20:43:14 -04:00
befs befs: Convert befs_symlink_read_folio() to use folio_end_read() 2024-05-31 12:31:39 +02:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs btrfs: drop the backref cache during relocation if we commit 2024-10-01 19:10:26 +02:00
cachefiles cachefiles: Set the max subreq size for cache writes to MAX_RW_COUNT 2024-07-24 10:53:13 +02:00
ceph netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty" 2024-08-21 22:32:58 +02:00
coda coda: Convert coda_symlink_filler() to use folio_end_read() 2024-05-31 12:31:39 +02:00
configfs fs/configfs: Add a callback to determine attribute visibility 2024-06-17 20:42:57 +02:00
cramfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
crypto The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
debugfs vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
devpts
dlm dlm: add rcu_barrier before destroy kmem cache 2024-06-13 12:48:46 -05:00
ecryptfs hardening updates for 6.10-rc1 2024-05-13 14:14:05 -07:00
efivarfs efivarfs: Convert to new uid/gid option parsing helpers 2024-07-02 06:21:18 +02:00
efs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
erofs erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails 2024-08-21 08:12:05 +08:00
exfat Description for this pull request: 2024-07-17 12:53:47 -07:00
exportfs fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
ext2 ext2: Verify bitmap and itable block numbers before using them 2024-06-26 12:54:11 +02:00
ext4 Many cleanups and bug fixes in ext4, especially for the fast commit 2024-07-18 17:03:42 -07:00
f2fs f2fs update for 6.11-rc1 2024-07-23 15:21:19 -07:00
fat vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
freevxfs freevxfs: Convert freevxfs to the new mount API. 2024-03-26 09:04:53 +01:00
fuse fuse fixes for 6.11-rc7 2024-09-03 12:32:00 -07:00
gfs2 gfs2: Clean up glock demote logic 2024-07-09 10:40:03 +02:00
hfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
hfsplus vfs-6.11.misc 2024-07-15 10:52:51 -07:00
hostfs vfs-6.11-rc1.fixes.3 2024-07-27 15:11:59 -07:00
hpfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
hugetlbfs - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
iomap vfs-6.11.iomap 2024-07-15 13:28:14 -07:00
isofs \n 2024-07-17 13:11:42 -07:00
jbd2 jbd2: increase maximum transaction size 2024-07-08 23:59:37 -04:00
jffs2 Kbuild updates for v6.11 2024-07-23 14:32:21 -07:00
jfs Folio conversion from Matthew Wilcox and a few various fixes 2024-07-23 15:15:16 -07:00
kernfs kernfs: mount: Remove unnecessary ‘NULL’ values from knparent 2024-05-04 19:02:39 +02:00
lockd lockd: Use *-y instead of *-objs in Makefile 2024-07-08 14:10:03 -04:00
minix vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
netfs five smb3 client fixes 2024-09-06 17:30:33 -07:00
nfs NFS: Avoid unnecessary rescanning of the per-server delegation list 2024-08-22 17:01:10 -04:00
nfs_common fs: nfs: add missing MODULE_DESCRIPTION() macros 2024-07-08 13:47:24 -04:00
nfsd nfsd-6.11 fixes: 2024-09-01 06:55:47 +12:00
nilfs2 nilfs2: fix state management in error path of log writing function 2024-09-01 17:59:00 -07:00
nls fs: nls: add missing MODULE_DESCRIPTION() macros 2024-06-03 16:37:07 +02:00
notify fsnotify: clear PARENT_WATCHED flags lazily 2024-06-05 09:52:38 +02:00
ntfs3 ntfs3 changes for 6.11-rc1 2024-07-22 10:50:18 -07:00
ocfs2 - In the series "treewide: Refactor heap related implementation", 2024-07-21 17:56:22 -07:00
omfs
openpromfs openpromfs: add missing MODULE_DESCRIPTION() macro 2024-06-20 09:46:01 +02:00
orangefs orangefs: Remove calls to set/clear the error flag 2024-05-31 12:31:41 +02:00
overlayfs Merge patch series "ovl: simplify ovl_parse_param_lowerdir()" 2024-08-24 16:00:46 +02:00
proc Random number generator updates for Linux 6.11-rc1. 2024-07-24 10:29:50 -07:00
pstore memblock: updates for 6.11-rc1 2024-07-18 14:48:11 -07:00
qnx4 qnx4: add MODULE_DESCRIPTION() 2024-05-28 11:52:53 +02:00
qnx6 qnx6: add MODULE_DESCRIPTION() 2024-05-28 11:52:49 +02:00
quota sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
ramfs mm: switch mm->get_unmapped_area() to a flag 2024-04-25 20:56:25 -07:00
reiserfs reiserfs: Remove call to folio_set_error() 2024-05-31 12:31:41 +02:00
romfs romfs: fix romfs_read_folio() 2024-08-21 22:32:58 +02:00
smb five smb3 client fixes 2024-09-06 17:30:33 -07:00
squashfs Squashfs: sanity check symbolic link size 2024-08-13 13:56:46 +02:00
sysfs Merge 6.9-rc5 into driver-core-next 2024-04-23 13:27:43 +02:00
sysv fs: sysv: add MODULE_DESCRIPTION() 2024-05-28 11:52:45 +02:00
tests execve: Move KUnit tests to tests/ subdirectory 2024-07-22 18:25:47 -07:00
tracefs eventfs: Use list_del_rcu() for SRCU protected list variable 2024-09-05 10:18:48 -04:00
ubifs ubifs: add check for crypto_shash_tfm_digest 2024-07-12 22:01:09 +02:00
udf udf: prevent integer overflow in udf_bitmap_free_blocks() 2024-06-26 12:54:11 +02:00
ufs - In the series "treewide: Refactor heap related implementation", 2024-07-21 17:56:22 -07:00
unicode unicode: add MODULE_DESCRIPTION() macros 2024-06-20 19:30:02 -04:00
vboxsf vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
verity bpf: treewide: Align kfunc signatures to prog point-of-view 2024-06-12 11:01:31 -07:00
xfs xfs: reset rootdir extent size hint after growfsrt 2024-08-27 18:32:14 +05:30
zonefs zonefs: enable support for large folios 2024-06-11 11:22:57 +09:00
aio.c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
anon_inodes.c fs: Create anon_inode_getfile_fmode() 2024-04-26 10:33:05 +02:00
attr.c nfsd-6.11 fixes: 2024-08-29 06:20:44 +12:00
backing-file.c backing-file: convert to using fops->splice_write 2024-08-23 13:08:31 +02:00
bad_inode.c
binfmt_elf_fdpic.c binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined 2024-08-26 13:00:38 -07:00
binfmt_elf.c execve fix for v6.11-rc1 2024-07-23 17:30:42 -07:00
binfmt_flat.c binfmt_flat: Fix corruption when not offsetting data start 2024-08-09 20:19:00 -07:00
binfmt_misc.c vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
binfmt_script.c fs: binfmt: add missing MODULE_DESCRIPTION() macros 2024-05-28 12:06:51 +02:00
buffer.c Many cleanups and bug fixes in ext4, especially for the fast commit 2024-07-18 17:03:42 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
d_path.c
dax.c dax: use huge_zero_folio 2024-04-25 20:56:20 -07:00
dcache.c dcache: keep dentry_hashtable or d_hash_shift even when not used 2024-08-30 12:25:50 +12:00
direct-io.c fs/direct-io: remove redundant assignment to variable retval 2024-04-11 10:21:24 +02:00
drop_caches.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c exec: Fix ToCToU between perm check and set-uid/gid usage 2024-08-13 13:24:29 -07:00
fcntl.c fcntl: add F_DUPFD_QUERY fcntl() 2024-05-10 08:26:31 +02:00
fhandle.c fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
file_table.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
file.c fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE 2024-08-05 19:23:11 -04:00
filesystems.c
fs_context.c
fs_parser.c fs_parse: add uid & gid option option parsing helpers 2024-07-02 06:20:49 +02:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
fsopen.c vfs: retire user_path_at_empty and drop empty arg from getname_flags 2024-06-05 17:03:57 +02:00
init.c
inode.c vfs: Don't evict inode under the inode lru traversing context 2024-08-13 13:52:16 +02:00
internal.h vfs-6.11.pidfs 2024-07-15 12:34:01 -07:00
ioctl.c fs/ioctl: Add a comment to keep the logic in sync with LSM policies 2024-05-13 06:58:35 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt exec: Add KUnit test for bprm_stack_limits() 2024-06-19 13:13:55 -07:00
kernel_read_file.c
libfs.c libfs: fix get_stashed_dentry() 2024-09-06 11:08:58 -07:00
locks.c filelock: fix name of file_lease slab cache 2024-08-12 22:03:25 +02:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h vfs-6.11.mount 2024-07-15 11:54:04 -07:00
mpage.c buffer: Remove calls to set and clear the folio error flag 2024-05-31 12:31:43 +02:00
namei.c vfs: correct the comments of vfs_*() helpers 2024-07-24 10:53:12 +02:00
namespace.c fs: use all available ids 2024-07-24 10:53:13 +02:00
nsfs.c nsfs: use cleanup guard 2024-07-18 09:50:08 +02:00
open.c vfs-6.11.misc 2024-07-15 10:52:51 -07:00
pidfs.c pidfs: handle kernels without namespaces cleanly 2024-07-24 10:53:13 +02:00
pipe.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
pnode.c
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c fs: rename show_mnt_opts -> show_vfsmnt_opts 2024-06-28 14:36:43 +02:00
read_write.c fs: Initial atomic write support 2024-06-20 15:19:17 -06:00
readdir.c readdir: Add missing quote in macro comment 2024-06-03 15:49:26 +02:00
remap_range.c vfs: export remap and write check helpers 2024-04-15 14:54:13 -07:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c seq_file: Simplify __seq_puts() 2024-05-02 16:28:20 +02:00
signalfd.c signalfd: drop an obsolete comment 2024-05-24 13:34:07 +02:00
splice.c remove call_{read,write}_iter() functions 2024-04-15 16:03:25 -04:00
stack.c
stat.c for-6.11/block-20240710 2024-07-15 14:20:22 -07:00
statfs.c
super.c fs/super.c: improve get_tree() error message 2024-08-22 02:07:23 -04:00
sync.c
sysctls.c
timerfd.c timerfd: convert to ->read_iter() 2024-04-10 16:23:02 -06:00
userfaultfd.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
utimes.c
xattr.c vfs: Fix potential circular locking through setxattr() and removexattr() 2024-07-24 10:53:14 +02:00