linux/fs/btrfs
Filipe Manana ac05ca913e Btrfs: fix race between using extent maps and merging them
We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
	    is pinned since there's writeback and an ordered extent in
	    progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4kb + 4KiB), task A is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).

Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 17:16:46 +01:00
..
tests btrfs: Correctly handle empty trees in find_first_clear_extent_bit 2020-01-31 14:01:29 +01:00
acl.c btrfs: cleanup btrfs_setxattr_trans and drop transaction parameter 2019-04-29 19:02:44 +02:00
async-thread.c btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
async-thread.h btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
backref.c Btrfs: fix deadlock between fiemap and transaction commits 2019-07-30 18:25:12 +02:00
backref.h btrfs: fiemap: preallocate ulists for btrfs_check_shared 2019-07-01 13:34:53 +02:00
block-group.c btrfs: take overcommit into account in inc_block_group_ro 2020-01-31 14:02:01 +01:00
block-group.h btrfs: Move and unexport btrfs_rmap_block 2020-01-23 17:24:34 +01:00
block-rsv.c btrfs: use btrfs_try_granting_tickets in update_global_rsv 2019-09-09 14:59:19 +02:00
block-rsv.h btrfs: migrate the global_block_rsv helpers to block-rsv.c 2019-07-02 12:30:55 +02:00
btrfs_inode.h Btrfs: remove unnecessary delalloc mutex for inodes 2019-11-18 17:51:46 +01:00
check-integrity.c btrfs: remove superfluous BUG_ON() in integrity checks 2020-01-20 16:40:52 +01:00
check-integrity.h
compression.c btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums() 2020-01-20 16:40:54 +01:00
compression.h btrfs: compression: remove ops pointer from workspace_manager 2019-11-18 12:46:59 +01:00
ctree.c Btrfs: fix race between adding and putting tree mod seq elements and nodes 2020-01-31 14:01:20 +01:00
ctree.h Btrfs: fix race between adding and putting tree mod seq elements and nodes 2020-01-31 14:01:20 +01:00
delalloc-space.c Btrfs: remove unnecessary delalloc mutex for inodes 2019-11-18 17:51:46 +01:00
delalloc-space.h btrfs: migrate the delalloc space stuff to it's own home 2019-07-04 17:26:17 +02:00
delayed-inode.c btrfs: use refcount_inc_not_zero in kill_all_nodes 2019-11-18 12:46:51 +01:00
delayed-inode.h
delayed-ref.c Btrfs: fix race between adding and putting tree mod seq elements and nodes 2020-01-31 14:01:20 +01:00
delayed-ref.h btrfs: migrate the delayed refs rsv code 2019-07-04 17:26:17 +02:00
dev-replace.c btrfs: sysfs, add devid/dev_state kobject and device attributes 2020-01-23 17:24:36 +01:00
dev-replace.h btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
dir-item.c btrfs: remove unused parameter fs_info from btrfs_extend_item 2019-04-29 19:02:50 +02:00
discard.c btrfs: add correction to handle -1 edge case in async discard 2020-01-20 16:41:01 +01:00
discard.h btrfs: have multiple discard lists 2020-01-20 16:41:00 +01:00
disk-io.c Btrfs: fix race between adding and putting tree mod seq elements and nodes 2020-01-31 14:01:20 +01:00
disk-io.h btrfs: drop create parameter to btrfs_get_extent() 2020-01-20 16:40:55 +01:00
export.c btrfs: drop unused parameter is_new from btrfs_iget 2019-11-18 12:46:52 +01:00
export.h
extent_io.c btrfs: drop the -EBUSY case in __extent_writepage_io 2020-01-31 14:02:11 +01:00
extent_io.h btrfs: drop create parameter to btrfs_get_extent() 2020-01-20 16:40:55 +01:00
extent_map.c Btrfs: fix race between using extent maps and merging them 2020-02-12 17:16:46 +01:00
extent_map.h btrfs: remove extent_map::bdev 2019-11-18 23:43:44 +01:00
extent-io-tree.h btrfs: move the failrec tree stuff into extent-io-tree.h 2019-11-18 12:46:47 +01:00
extent-tree.c btrfs: calculate discard delay based on number of extents 2020-01-20 16:40:59 +01:00
file-item.c btrfs: safely advance counter when looking up bio csums 2020-01-20 16:41:01 +01:00
file.c btrfs: drop create parameter to btrfs_get_extent() 2020-01-20 16:40:55 +01:00
free-space-cache.c btrfs: ensure removal of discardable_* in free_bitmap() 2020-01-20 16:41:01 +01:00
free-space-cache.h btrfs: have multiple discard lists 2020-01-20 16:41:00 +01:00
free-space-tree.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
free-space-tree.h btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
inode-item.c btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref 2019-09-09 14:59:16 +02:00
inode-map.c btrfs: keep track of which extents have been discarded 2020-01-20 16:40:57 +01:00
inode-map.h
inode.c btrfs: do not do delalloc reservation under page lock 2020-01-31 14:02:15 +01:00
ioctl.c btrfs: drop create parameter to btrfs_get_extent() 2020-01-20 16:40:55 +01:00
Kconfig btrfs: add Kconfig dependency for BLAKE2B 2019-12-09 17:56:06 +01:00
locking.c btrfs: document extent buffer locking 2019-11-18 17:51:50 +01:00
locking.h btrfs: move btrfs_unlock_up_safe to other locking functions 2019-11-18 12:46:49 +01:00
lzo.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00
Makefile btrfs: add the beginning of async discard, discard workqueue 2020-01-20 16:40:57 +01:00
misc.h btrfs: add 64bit safe helper for power of two checks 2019-11-18 12:46:50 +01:00
ordered-data.c btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item 2020-01-20 16:40:54 +01:00
ordered-data.h btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item 2020-01-20 16:40:54 +01:00
orphan.c
print-tree.c btrfs: Remove unneeded semicolon 2020-01-20 16:40:55 +01:00
print-tree.h
props.c btrfs: props: remove unnecessary hash_init() 2019-11-18 12:46:55 +01:00
props.h btrfs: delete unused function btrfs_set_prop_trans 2019-04-29 19:02:54 +02:00
qgroup.c btrfs: qgroup: return ENOTCONN instead of EINVAL when quotas are not enabled 2020-01-20 16:40:50 +01:00
qgroup.h btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
raid56.c btrfs: remove pointless local variable in lock_stripe_add() 2019-11-18 12:47:00 +01:00
raid56.h btrfs: constify map parameter for nr_parity_stripes and nr_data_stripes 2019-07-01 13:34:58 +02:00
rcu-string.h
reada.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
ref-verify.c btrfs: ref-verify: fix memory leaks 2020-02-12 17:16:31 +01:00
ref-verify.h btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod() 2019-04-29 19:02:49 +02:00
relocation.c btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item 2020-01-20 16:40:54 +01:00
root-tree.c btrfs: do not delete mismatched root refs 2020-01-08 14:44:24 +01:00
scrub.c btrfs: handle empty block_group removal for async discard 2020-01-20 16:40:57 +01:00
send.c Btrfs: send, fix emission of invalid clone operations within the same file 2020-01-31 14:02:19 +01:00
send.h
space-info.c btrfs: take overcommit into account in inc_block_group_ro 2020-01-31 14:02:01 +01:00
space-info.h btrfs: take overcommit into account in inc_block_group_ro 2020-01-31 14:02:01 +01:00
struct-funcs.c btrfs: tie extent buffer and it's token together 2019-09-09 14:59:16 +02:00
super.c btrfs: do not zero f_bavail if we have available space 2020-02-02 18:49:32 +01:00
sysfs.c btrfs: sysfs, add devid/dev_state kobject and device attributes 2020-01-23 17:24:36 +01:00
sysfs.h btrfs: sysfs, add devid/dev_state kobject and device attributes 2020-01-23 17:24:36 +01:00
transaction.c btrfs: set trans->drity in btrfs_commit_transaction 2020-01-23 17:24:37 +01:00
transaction.h btrfs: Rename btrfs_join_transaction_nolock 2019-11-18 12:46:54 +01:00
tree-checker.c btrfs: tree-checker: Verify location key for DIR_ITEM/DIR_INDEX 2020-01-20 16:40:56 +01:00
tree-checker.h btrfs: get fs_info from eb in btrfs_check_chunk_valid 2019-04-29 19:02:39 +02:00
tree-defrag.c btrfs: open code now trivial btrfs_set_lock_blocking 2019-02-25 14:13:27 +01:00
tree-log.c Btrfs: fix infinite loop during fsync after rename operations 2020-01-23 17:24:37 +01:00
tree-log.h btrfs: get fs_info from trans in btrfs_set_log_full_commit 2019-04-29 19:02:41 +02:00
ulist.c
ulist.h
uuid-tree.c btrfs: handle ENOENT in btrfs_uuid_tree_iterate 2019-12-13 14:10:45 +01:00
volumes.c btrfs: Fix split-brain handling when changing FSID to metadata uuid 2020-01-23 17:24:39 +01:00
volumes.h btrfs: sysfs, add devid/dev_state kobject and device attributes 2020-01-23 17:24:36 +01:00
xattr.c Btrfs: fix failure to persist compression property xattr deletion on fsync 2019-06-17 16:37:17 +02:00
xattr.h btrfs: cleanup btrfs_setxattr_trans and drop transaction parameter 2019-04-29 19:02:44 +02:00
zlib.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00
zstd.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00