forked from Minki/linux
mainlining shenanigans
47d3db41e1
When logging an inode and the previous transaction is still committing, we have a time window where we can end up incorrectly think an inode has its last_unlink_trans field with a value greater than the last transaction committed, which results in the logging to fallback to a full transaction commit, which is usually much more expensive than doing a log commit. The race is described by the following steps: 1) We are at transaction 1000; 2) We modify an inode X (a directory) using transaction 1000 and set its last_unlink_trans field to 1000, because for example we removed one of its subdirectories; 3) We create a new inode Y with a dentry in inode X using transaction 1000, so its generation field is set to 1000; 4) The commit for transaction 1000 is started by task A; 5) The task committing transaction 1000 sets the transaction state to unblocked, writes the dirty extent buffers and the super blocks, then unlocks tree_log_mutex; 6) Some task starts a new transaction with a generation of 1001; 7) We do some modification to inode Y (using transaction 1001); 8) The transaction 1000 commit starts unpinning extents. At this point fs_info->last_trans_committed still has a value of 999; 9) Task B starts an fsync on inode Y, and gets a handle for transaction 1001. When it gets to check_parent_dirs_for_sync() it does the checking of the ancestor dentries because the following check does not evaluate to true: if (S_ISREG(inode->vfs_inode.i_mode) && inode->generation <= last_committed && inode->last_unlink_trans <= last_committed) goto out; The generation value for inode Y is 1000 and last_committed, which has the value read from fs_info->last_trans_committed, has a value of 999, so that check evaluates to false and we proceed to check the ancestor inodes. Once we get to the first ancestor, inode X, we call btrfs_must_commit_transaction() on it, which evaluates to true: static bool btrfs_must_commit_transaction(...) { struct btrfs_fs_info *fs_info = inode->root->fs_info; bool ret = false; mutex_lock(&inode->log_mutex); if (inode->last_unlink_trans > fs_info->last_trans_committed) { /* * Make sure any commits to the log are forced to be full * commits. */ btrfs_set_log_full_commit(trans); ret = true; } (...) because inode's X last_unlink_trans has a value of 1000 and fs_info->last_trans_committed still has a value of 999, it returns true to check_parent_dirs_for_sync(), making it return 1 which is propagated up to btrfs_sync_file(), causing it to fallback to a full transaction commit of transaction 1001. We should have not fallen back to commit transaction 1001, since inode X had last_unlink_trans set to 1000 and the super blocks for transaction 1000 were already written. So while not resulting in a functional problem, it leads to a lot more work and higher latencies for a fsync since committing a transaction is usually more expensive than committing a log (if other filesystem changes happened under that transaction). Similar problem happens when logging directories, for the same reason as btrfs_must_commit_transaction() returns true on an inode with its last_unlink_trans having the generation of the previous transaction and that transaction is still committing, unpinning its freed extents. So fix this by comparing last_unlink_trans with the id of the current transaction instead of fs_info->last_trans_committed. This case is often hit when running dbench for a long enough duration, as it does lots of rename and rmdir operations (both update the field last_unlink_trans of an inode) and fsyncs of files and directories. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
---|---|---|
arch | ||
block | ||
certs | ||
crypto | ||
Documentation | ||
drivers | ||
fs | ||
include | ||
init | ||
ipc | ||
kernel | ||
lib | ||
LICENSES | ||
mm | ||
net | ||
samples | ||
scripts | ||
security | ||
sound | ||
tools | ||
usr | ||
virt | ||
.clang-format | ||
.cocciconfig | ||
.get_maintainer.ignore | ||
.gitattributes | ||
.gitignore | ||
.mailmap | ||
COPYING | ||
CREDITS | ||
Kbuild | ||
Kconfig | ||
MAINTAINERS | ||
Makefile | ||
README |
Linux kernel ============ There are several guides for kernel developers and users. These guides can be rendered in a number of formats, like HTML and PDF. Please read Documentation/admin-guide/README.rst first. In order to build the documentation, use ``make htmldocs`` or ``make pdfdocs``. The formatted documentation can also be read online at: https://www.kernel.org/doc/html/latest/ There are various text files in the Documentation/ subdirectory, several of them using the Restructured Text markup notation. Please read the Documentation/process/changes.rst file, as it contains the requirements for building and running the kernel, and information about the problems which may result by upgrading your kernel.