linux/fs/xfs/libxfs
Darrick J. Wong 27dada070d xfs: change the order in which child and parent defer ops are finished
The defer ops code has been finishing items in the wrong order -- if a
top level defer op creates items A and B, and finishing item A creates
more defer ops A1 and A2, we'll put the new items on the end of the
chain and process them in the order A B A1 A2.  This is kind of weird,
since it's convenient for programmers to be able to think of A and B as
an ordered sequence where all the sub-tasks for A must finish before we
move on to B, e.g. A A1 A2 B.
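
To make the two orderings concrete, here is a minimal user-space sketch --
not XFS code; the toy queue and the item names are invented for
illustration -- that contrasts queueing newly created children at the tail
of the pending list (the pre-patch order, A B A1 A2) with queueing them at
the head (the order this patch wants, A A1 A2 B):

/*
 * Toy model of the defer ops pending list (plain user-space C, not kernel
 * code).  Finishing "A" queues two children, "A1" and "A2"; "B" queues
 * nothing.
 */
#include <stdio.h>
#include <string.h>

#define MAXQ 16

struct queue {
        const char *items[MAXQ];
        int len;
};

/* Remove and return the first pending item. */
static const char *pop_front(struct queue *q)
{
        const char *item = q->items[0];

        q->len--;
        memmove(&q->items[0], &q->items[1], q->len * sizeof(q->items[0]));
        return item;
}

/*
 * Queue a child work item.  With at_head == 0 it goes behind all pending
 * work (pre-patch order); with at_head != 0 it goes ahead of the older
 * pending work but after its previously queued siblings (patched order).
 */
static void queue_child(struct queue *q, const char *item, int at_head,
                        int *nr_head)
{
        int pos = at_head ? (*nr_head)++ : q->len;

        memmove(&q->items[pos + 1], &q->items[pos],
                (q->len - pos) * sizeof(q->items[0]));
        q->items[pos] = item;
        q->len++;
}

static void run(int children_at_head)
{
        struct queue q = { .items = { "A", "B" }, .len = 2 };

        while (q.len > 0) {
                int nr_head = 0;
                const char *item = pop_front(&q);

                printf("%s ", item);
                if (!strcmp(item, "A")) {
                        queue_child(&q, "A1", children_at_head, &nr_head);
                        queue_child(&q, "A2", children_at_head, &nr_head);
                }
        }
        printf("\n");
}

int main(void)
{
        run(0);         /* prints: A B A1 A2 */
        run(1);         /* prints: A A1 A2 B */
        return 0;
}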

Right now, our log intent items are not so complex that this matters,
but this will become important for the atomic extent swapping patchset.
In order to maintain correct reference counting of extents, we have to
unmap and remap extents in that order, and we want to complete that work
before moving on to the next range that the user wants to swap.  This
patch fixes defer ops to satisfy that requirement.

The primary symptom of the incorrect order was noticed in an early
performance analysis of the atomic extent swap code.  An astonishingly
large number of deferred work items accumulated when userspace requested
an atomic update of two very fragmented files.  The cause of this was
traced to the same ordering bug in the inner loop of
xfs_defer_finish_noroll.

If the ->finish_item method of a deferred operation queues new deferred
operations, those new deferred ops are appended to the tail of the
pending work list.  To illustrate, say that a caller creates a
transaction t0 with four deferred operations D0-D3.  The first thing
defer ops does is roll the transaction to t1, leaving us with:

t1: D0(t0), D1(t0), D2(t0), D3(t0)

Let's say that finishing each of D0-D3 will create two new deferred ops.
After we finish D0 and roll, we'll have the following chain:

t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

d4 and d5 were logged to t1.  Notice that while we're about to start
work on D1, we haven't actually completed all the work implied by D0
being finished.  So far we've been careful (or lucky) to structure the
dfops callers such that D1 doesn't depend on d4 or d5 being finished,
but this is a potential logic bomb.

There's a second problem lurking.  Let's see what happens as we finish
D1-D3:

t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

Let's say that d4-d11 are simple work items that don't queue any other
operations, which means that we can complete each of them in a single
transaction.  We finish d4 and roll to t6:

t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
...
t11: d10(t4), d11(t4)
t12: d11(t4)
<done>

When we try to roll to transaction #12, we're holding defer op d11,
which we logged way back in t4.  This means that the tail of the log is
pinned at t4.  If the log is very small or there are a lot of other
threads updating metadata, this means that we might have wrapped the log
and cannot roll to t12 because there isn't enough space left before
we'd run into t4.

Let's shift back to the original failure.  I mentioned before that I
discovered this flaw while developing the atomic file update code.  In
that scenario, we have a defer op (D0) that finds a range of file blocks
to remap, creates a handful of new defer ops to do that, and then asks
to be continued with however much work remains.

So, D0 is the original swapext deferred op.  The first thing defer ops
does is roll to t1:

t1: D0(t0)

We try to finish D0, logging d1 and d2 in the process, but can't get all
the work done.  We log a done item and a new intent item for the work
that D0 still has to do, and roll to t2:

t2: D0'(t1), d1(t1), d2(t1)

We roll and try to finish D0', but still can't get all the work done, so
we log a done item and a new intent item for it, requeue D0 a second
time, and roll to t3:

t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

If it takes 48 more rolls to complete D0, then we'll finally dispense
with D0 in t50:

t50: D0<50 primes>(t49), d1(t1), ..., d102(t50)

We then try to roll again to get a chain like this:

t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
...
t152: d102(t50)
<done>

Notice that in rolling to transaction #51, we're holding on to a log
intent item for d1 that was logged in transaction #1.  This means that
the tail of the log is pinned at t1.  If the log is very small or there
are a lot of other threads updating metadata, this means that we might
have wrapped the log and cannot roll to t51 because there isn't enough
space left before we'd run into t1.  This is of course problem #2 again.

But notice the third problem with this scenario: we have 102 defer ops
tied to this transaction!  Each of these items is backed by pinned
kernel memory, which means that we risk OOM if the chains get too long.

Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
future; problem #2 applies (rarely) to the current upstream, and problem
#3 applies to work under development.

This is not how incremental deferred operations were supposed to work.
The dfops design of logging an intent-done item and a new intent item for
the remaining work in the same transaction was meant to ensure that we only
ever have to juggle enough deferred work items to finish that one small
piece of work.  Deferred log item recovery will find that first
unfinished work item and restart it, no matter how many other intent
items might follow it in the log.  Therefore, it's ok to put the new
intents at the start of the dfops chain.

For the first example, the chains look like this:

t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)

For the second example, the chains look like this:

t1: D0(t0)
t2: d1(t1), d2(t1), D0'(t1)
t3: d2(t1), D0'(t1)
t4: D0'(t1)
t5: d3(t4), d4(t4), D0''(t4)
...
t148: D0<50 primes>(t147)
t149: d101(t148), d102(t148)
t150: d102(t148)
<done>

This actually sucks more for pinning the log tail (we try to roll to t10
while holding an intent item that was logged in t0) but we've solved
problem #1.  We've also reduced the maximum chain length from:

    sum(all the new items) + nr_original_items

to:

    max(new items that each original item creates) + nr_original_items
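
Applied to the second example above: the old worst case is 102 + 1 = 103
in-core items (D0 plus every child it will ever create), while the new
worst case is 2 + 1 = 3 (the current incarnation of D0 plus the two
children logged by the step we just finished).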

This solves problem #3 by sharply reducing the number of defer ops that
can be attached to a transaction at any given time.  The change makes
the problem of log tail pinning worse, but it is a necessary step toward
solving problem #2.  Actually solving #2, however, is left to the next
patch.

Note that a subsequent analysis of some hard-to-trigger reflink and COW
livelocks on extremely fragmented filesystems (or systems running a lot
of IO threads) showed the same symptoms -- uncomfortably large numbers
of incore deferred work items and occasional stalls in the transaction
grant code while waiting for log reservations.  I think this patch and
the next one will also solve these problems.

As originally written, the code used list_splice_tail_init instead of
list_splice_init, so change that, and leave a short comment explaining
our actions.
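
Concretely, the inner loop of xfs_defer_finish_noroll ends up looking
roughly like this (an abbreviated sketch -- error handling is simplified
and the in-code comment is paraphrased, so it may not match the file line
for line):

        /* Until we run out of pending work to finish... */
        while (!list_empty(&dop_pending) || !list_empty(&(*tp)->t_dfops)) {
                /*
                 * Work items created while finishing another work item are
                 * spliced onto the *head* of the pending list so that all
                 * of a parent's children are finished before we move on to
                 * the parent's siblings.
                 */
                xfs_defer_create_intents(*tp);
                list_splice_init(&(*tp)->t_dfops, &dop_pending);

                error = xfs_defer_trans_roll(tp);
                if (error)
                        break;

                dfp = list_first_entry(&dop_pending,
                                struct xfs_defer_pending, dfp_list);
                error = xfs_defer_finish_one(*tp, dfp);
                if (error && error != -EAGAIN)
                        break;
        }

The only functional change is the direction of the splice; with the old
list_splice_tail_init call, a parent's children would land behind all of
the parent's remaining siblings.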

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:28 -07:00
xfs_ag_resv.c xfs: fix missing header includes 2019-11-07 13:00:53 -08:00
xfs_ag_resv.h xfs: get rid of unnecessary xfs_perag_{get,put} pairs 2020-07-14 08:47:33 -07:00
xfs_ag.c xfs: store inode btree block counts in AGI header 2020-09-15 20:52:39 -07:00
xfs_ag.h
xfs_alloc_btree.c xfs: Remove kmem_zone_zalloc() usage 2020-07-28 20:24:14 -07:00
xfs_alloc_btree.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_alloc.c xfs: Remove kmem_zone_alloc() usage 2020-07-28 20:24:14 -07:00
xfs_alloc.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_attr_leaf.c xfs: Convert xfs_attr_sf macros to inline functions 2020-09-15 20:52:42 -07:00
xfs_attr_leaf.h xfs: Add xfs_has_attr and subroutines 2020-07-28 20:24:14 -07:00
xfs_attr_remote.c xfs: remove the redundant crc feature check in xfs_attr3_rmt_verify 2020-09-25 11:34:07 -07:00
xfs_attr_remote.h xfs: Refactor xfs_attr_rmtval_remove 2020-07-28 20:28:11 -07:00
xfs_attr_sf.h xfs: Convert xfs_attr_sf macros to inline functions 2020-09-15 20:52:42 -07:00
xfs_attr.c xfs: Convert xfs_attr_sf macros to inline functions 2020-09-15 20:52:42 -07:00
xfs_attr.h xfs: Add xfs_has_attr and subroutines 2020-07-28 20:24:14 -07:00
xfs_bit.c xfs: fix missing header includes 2019-11-07 13:00:53 -08:00
xfs_bit.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_bmap_btree.c xfs: Remove kmem_zone_zalloc() usage 2020-07-28 20:24:14 -07:00
xfs_bmap_btree.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_bmap.c xfs: don't free rt blocks when we're doing a REMAP bunmapi call 2020-09-23 08:58:51 -07:00
xfs_bmap.h xfs: redesign the reflink remap loop to fix blkres depletion crash 2020-07-06 10:46:57 -07:00
xfs_btree_staging.c xfs: support bulk loading of staged btrees 2020-03-18 08:12:23 -07:00
xfs_btree_staging.h xfs: xfs_btree_staging.h: delete duplicated words 2020-07-28 20:24:14 -07:00
xfs_btree.c xfs: support bulk loading of staged btrees 2020-03-18 08:12:23 -07:00
xfs_btree.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_cksum.h
xfs_da_btree.c xfs: Refactor xfs_da_state_alloc() helper 2020-07-28 20:24:14 -07:00
xfs_da_btree.h xfs: Refactor xfs_da_state_alloc() helper 2020-07-28 20:24:14 -07:00
xfs_da_format.h xfs: code cleanup in xfs_attr_leaf_entsize_{remote,local} 2020-09-25 11:34:08 -07:00
xfs_defer.c xfs: change the order in which child and parent defer ops are finished 2020-10-07 08:40:28 -07:00
xfs_defer.h xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-07 08:40:28 -07:00
xfs_dir2_block.c xfs: move the per-fork nextents fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_dir2_data.c xfs: check owner of dir3 data blocks 2020-03-12 07:58:12 -07:00
xfs_dir2_leaf.c xfs: add a function to deal with corrupt buffers post-verifiers 2020-03-12 07:58:12 -07:00
xfs_dir2_node.c xfs: Refactor xfs_da_state_alloc() helper 2020-07-28 20:24:14 -07:00
xfs_dir2_priv.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_dir2_sf.c xfs: move the fork format fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_dir2.c xfs: move the fork format fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_dir2.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_dquot_buf.c xfs: widen ondisk quota expiration timestamps to handle y2038+ 2020-09-15 20:52:41 -07:00
xfs_errortag.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_format.h xfs: enable big timestamps 2020-09-15 20:52:41 -07:00
xfs_fs.h xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_health.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_ialloc_btree.c xfs: support inode btree blockcounts in online repair 2020-09-15 20:52:40 -07:00
xfs_ialloc_btree.h xfs: add support for inode btree staging cursors 2020-03-18 08:12:23 -07:00
xfs_ialloc.c xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_ialloc.h xfs: don't commit sunit/swidth updates to disk if that would cause repair failures 2019-12-19 07:53:48 -08:00
xfs_iext_tree.c xfs: remove kmem_realloc() 2020-09-06 18:05:51 -07:00
xfs_inode_buf.c xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_inode_buf.h xfs: use the existing type definition for di_projid 2020-09-25 11:34:07 -07:00
xfs_inode_fork.c xfs: remove kmem_realloc() 2020-09-06 18:05:51 -07:00
xfs_inode_fork.h xfs: cleanup xfs_idestroy_fork 2020-05-19 09:40:59 -07:00
xfs_log_format.h xfs: redefine xfs_ictimestamp_t 2020-09-15 20:52:41 -07:00
xfs_log_recover.h xfs: remove xlog_recover_iodone 2020-09-15 20:52:39 -07:00
xfs_log_rlimit.c
xfs_quota_defs.h xfs: widen ondisk quota expiration timestamps to handle y2038+ 2020-09-15 20:52:41 -07:00
xfs_refcount_btree.c xfs: Remove kmem_zone_zalloc() usage 2020-07-28 20:24:14 -07:00
xfs_refcount_btree.h xfs: add support for refcount btree staging cursors 2020-03-18 08:12:23 -07:00
xfs_refcount.c xfs: make the btree ag cursor private union anonymous 2020-03-13 10:37:15 -07:00
xfs_refcount.h xfs: remove unnecessary int returns from deferred refcount functions 2019-08-28 08:31:02 -07:00
xfs_rmap_btree.c xfs: Remove kmem_zone_zalloc() usage 2020-07-28 20:24:14 -07:00
xfs_rmap_btree.h xfs: add support for rmap btree staging cursors 2020-03-18 08:12:23 -07:00
xfs_rmap.c xfs: avoid shared rmap operations for attr fork extents 2020-09-25 11:34:08 -07:00
xfs_rmap.h xfs: reinitialize rm_flags when unpacking an offset into an rmap irec 2019-08-28 08:31:02 -07:00
xfs_rtbitmap.c xfs: rename xfs_bmap_is_real_extent to is_written_extent 2020-07-06 10:46:57 -07:00
xfs_sb.c xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_sb.h
xfs_shared.h xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_symlink_remote.c xfs: move the fork format fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_trans_inode.c xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_trans_resv.c xfs: add a new xfs_sb_version_has_v3inode helper 2020-03-19 08:47:34 -07:00
xfs_trans_resv.h
xfs_trans_space.h xfs: fix off-by-one in inode alloc block reservation calculation 2020-08-26 14:13:21 -07:00
xfs_types.c
xfs_types.h xfs: remove unused structure members & simple typedefs 2019-11-13 18:22:41 -08:00