btrfs: add a comment explaining the data flush steps

The data flushing steps are not obvious to people other than myself and
Chris.  Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik 2020-07-21 10:22:34 -04:00 committed by David Sterba
parent 5705674081
commit 1a7a92c8dd
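
Reader's sketch (not part of the commit): the comment added below makes two checkable claims. First, ->total_bytes_pinned can overestimate when it is derived only from in-memory delayed refs. Second, COMMIT_TRANS is skipped when free space plus pinned space cannot cover the reservation. The standalone C model below illustrates both under simplifying assumptions. The struct and every function name in it are hypothetical and do not exist in btrfs; only the arithmetic follows the comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the space counters (not btrfs code). */
struct space_model {
	uint64_t free_bytes;         /* space usable right now */
	uint64_t total_bytes_pinned; /* space expected to be freed at commit */
};

/*
 * Estimate made from in-memory delayed refs alone: an extent with a
 * negative pending ref delta is assumed to be freed in full.
 */
static uint64_t pinned_estimate(int64_t pending_ref_delta, uint64_t extent_len)
{
	return pending_ref_delta < 0 ? extent_len : 0;
}

/*
 * What is really freed once the delayed refs are applied to the on-disk
 * reference count: the extent goes away only when no references remain.
 */
static uint64_t pinned_actual(int64_t on_disk_refs, int64_t pending_ref_delta,
			      uint64_t extent_len)
{
	return on_disk_refs + pending_ref_delta <= 0 ? extent_len : 0;
}

/* Commit only if free space plus pinned space can cover the reservation. */
static bool worth_committing(const struct space_model *m, uint64_t reservation)
{
	return m->free_bytes + m->total_bytes_pinned >= reservation;
}
```

With an extent that has 2 on-disk references and 1 pending drop, pinned_estimate() counts the whole extent while pinned_actual() counts nothing. In this model, that gap is what the FLUSH_DELAYED_REFS stage closes before the commit decision is made.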


@@ -998,6 +998,53 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
} while (flush_state <= COMMIT_TRANS);
}
/*
* FLUSH_DELALLOC_WAIT:
* Space is freed from flushing delalloc in one of two ways.
*
* 1) compression is on and we allocate less space than we reserved
* 2) we are overwriting existing space
*
* For #1 that extra space is reclaimed as soon as the delalloc pages are
* COWed, by way of btrfs_add_reserved_bytes() which adds the actual extent
* length to ->bytes_reserved, and subtracts the reserved space from
* ->bytes_may_use.
*
* For #2 this is trickier. Once the ordered extent runs we will drop the
* extent in the range we are overwriting, which creates a delayed ref for
* that freed extent. This however is not reclaimed until the transaction
* commits, thus the next stages.
*
* RUN_DELAYED_IPUTS
* If we are freeing inodes, we want to make sure all delayed iputs have
* completed, because they could have been on an inode with i_nlink == 0, and
* thus have been truncated and freed up space. But again this space is not
* immediately re-usable, it comes in the form of a delayed ref, which must be
* run and then the transaction must be committed.
*
* FLUSH_DELAYED_REFS
* The above two cases generate delayed refs that will affect
* ->total_bytes_pinned. However this counter can be inconsistent with
* reality if there are outstanding delayed refs. This is because we adjust
* the counter based solely on the current set of delayed refs and disregard
* any on-disk state which might include more refs. So for example, if we
* have an extent with 2 references, but we only drop 1, we'll see that there
* is a negative delayed ref count for the extent and assume that the space
* will be freed, and thus increase ->total_bytes_pinned.
*
* Running the delayed refs gives us the actual real view of what will be
* freed at the transaction commit time. This stage will not actually free
* space for us, it just makes sure that may_commit_transaction() has all of
* the information it needs to make the right decision.
*
* COMMIT_TRANS
* This is where we reclaim all of the pinned space generated by the previous
* two stages. We will not commit the transaction if we don't think we're
* likely to satisfy our request, which means if our current free space +
* total_bytes_pinned < reservation we will not commit. This is why the
* previous states are actually important, to make sure we know for sure
* whether committing the transaction will allow us to make progress.
*/
static const enum btrfs_flush_state data_flush_states[] = {
FLUSH_DELALLOC_WAIT,
RUN_DELAYED_IPUTS,