4d884fceaa
We can have multiple fsync operations against the same file during the
same transaction, and they can collect the same ordered extents as long as
those extents have not completed (they are still accessible from the
inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

        CPU 1                                           CPU 2

btrfs_sync_file()
                                                btrfs_sync_file()

  mutex_lock(inode->i_mutex)
  btrfs_log_inode()
    btrfs_get_logged_extents()
    --> collects ordered extent X
    --> increments ordered
        extent X's refcount
    btrfs_submit_logged_extents()
  mutex_unlock(inode->i_mutex)

                                                  mutex_lock(inode->i_mutex)
  btrfs_sync_log()
    btrfs_wait_logged_extents()
    --> list_del_init(&ordered->log_list)
                                                  btrfs_log_inode()
                                                    btrfs_get_logged_extents()
                                                    --> Adds ordered extent X
                                                        to logged_list because
                                                        at this point:
                                                        list_empty(&ordered->log_list)
                                                        && test_bit(BTRFS_ORDERED_LOGGED,
                                                                    &ordered->flags) == 0
                                                    --> Increments ordered extent
                                                        X's refcount
    --> check if ordered extent's io is
        finished or not, start it if
        necessary and wait for it to finish
    --> sets bit BTRFS_ORDERED_LOGGED
        on ordered extent X's flags
        and adds it to trans->ordered
  btrfs_sync_log() finishes
                                                    btrfs_submit_logged_extents()
                                                  btrfs_log_inode() finishes
                                                  mutex_unlock(inode->i_mutex)

btrfs_sync_file() finishes
                                                  btrfs_sync_log()
                                                    btrfs_wait_logged_extents()
                                                    --> Sees ordered extent X has the
                                                        bit BTRFS_ORDERED_LOGGED set in
                                                        its flags
                                                    --> X's refcount is untouched
                                                  btrfs_sync_log() finishes

                                                btrfs_sync_file() finishes

btrfs_commit_transaction()
  --> called by transaction kthread for e.g.
  btrfs_wait_pending_ordered()
  --> waits for ordered extent X to
      complete
  --> decrements ordered extent X's
      refcount by 1 only, corresponding
      to the increment done by the fsync
      task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever. This leaks
the ordered extent structure and prevents the i_count of its inode
from ever decreasing to 0, because the delayed iput is scheduled only
when the ordered extent's refcount drops to 0. As a result, the inode
can never be evicted by the VFS.
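
The refcount arithmetic of the race can be replayed in a small user-space
model. This is only an illustrative sketch, not kernel code: the struct and
all leak_model_* names are hypothetical, the two CPUs are replayed as a
single-threaded sequence of steps in the order described above, and only the
references taken by fsync callers are counted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an ordered extent; not the kernel structure. */
struct leak_model_extent {
    int fsync_refs;     /* references taken by fsync callers only */
    bool on_log_list;   /* models !list_empty(&ordered->log_list) */
    bool logged;        /* models the BTRFS_ORDERED_LOGGED bit */
    int trans_entries;  /* times the extent was added to trans->ordered */
};

/* Collection step: take the extent if it is on no log list and not flagged. */
static bool leak_model_collect(struct leak_model_extent *oe)
{
    if (oe->on_log_list || oe->logged)
        return false;
    oe->on_log_list = true;
    oe->fsync_refs++;
    return true;
}

/* First half of the wait step: unlink the extent from the log list. */
static void leak_model_wait_unlink(struct leak_model_extent *oe)
{
    oe->on_log_list = false;
}

/*
 * Second half of the wait step: once IO is done, set the flag and hand the
 * extent to the transaction, unless the flag was already set.
 */
static void leak_model_wait_finish(struct leak_model_extent *oe)
{
    if (oe->logged)
        return;                 /* refcount is left untouched */
    oe->logged = true;
    oe->trans_entries++;
}

/* Transaction commit: drop one reference per trans->ordered entry. */
static void leak_model_commit(struct leak_model_extent *oe)
{
    assert(oe->fsync_refs >= oe->trans_entries);
    oe->fsync_refs -= oe->trans_entries;
    oe->trans_entries = 0;
}

/* Replays the interleaving and returns the number of leaked references. */
int leak_model_replay(void)
{
    struct leak_model_extent x = {0};

    leak_model_collect(&x);      /* CPU 1: collects X, takes a reference */
    leak_model_wait_unlink(&x);  /* CPU 1: list_del_init() */
    leak_model_collect(&x);      /* CPU 2: collects X again, second reference */
    leak_model_wait_finish(&x);  /* CPU 1: sets flag, adds X to the transaction */
    leak_model_wait_unlink(&x);  /* CPU 2: list_del_init() */
    leak_model_wait_finish(&x);  /* CPU 2: flag already set, nothing dropped */
    leak_model_commit(&x);       /* drops only the reference from CPU 1 */
    return x.fsync_refs;
}
```

In this model leak_model_replay() returns 1: two references were taken but
only one was handed to the transaction and dropped at commit time.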
Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.
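
The idea can be sketched with the same kind of user-space model. This is an
assumption-laden illustration and not the actual patch: the fix_model_* names
are hypothetical, and fix_model_test_and_set() is a single-threaded stand-in
for an atomic operation such as the kernel's test_and_set_bit(). The flag is
claimed at collection time, so a second fsync cannot take another reference
on an extent that a first fsync is already processing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an ordered extent; not the kernel structure. */
struct fix_model_extent {
    int fsync_refs;     /* references taken by fsync callers only */
    bool logged;        /* models BTRFS_ORDERED_LOGGED: claimed by an fsync */
    int trans_entries;  /* times the extent was added to trans->ordered */
};

/* Non-atomic model of a test-and-set: sets the flag, returns the old value. */
static bool fix_model_test_and_set(bool *flag)
{
    bool old = *flag;
    *flag = true;
    return old;
}

/* Collection step: only the caller that wins the flag takes a reference. */
static bool fix_model_collect(struct fix_model_extent *oe)
{
    if (fix_model_test_and_set(&oe->logged))
        return false;           /* another fsync already owns this extent */
    oe->fsync_refs++;
    return true;
}

/* Wait step: the owning fsync attaches the extent to the transaction. */
static void fix_model_wait(struct fix_model_extent *oe)
{
    oe->trans_entries++;
}

/* Transaction commit: drop one reference per trans->ordered entry. */
static void fix_model_commit(struct fix_model_extent *oe)
{
    assert(oe->fsync_refs >= oe->trans_entries);
    oe->fsync_refs -= oe->trans_entries;
    oe->trans_entries = 0;
}

/* Replays two overlapping fsyncs; returns the number of leaked references. */
int fix_model_replay(void)
{
    struct fix_model_extent x = {0};

    bool cpu1_owns = fix_model_collect(&x);  /* CPU 1 claims X */
    bool cpu2_owns = fix_model_collect(&x);  /* CPU 2 sees the flag, skips X */

    if (cpu1_owns)
        fix_model_wait(&x);
    if (cpu2_owns)
        fix_model_wait(&x);

    fix_model_commit(&x);
    return x.fsync_refs;
}
```

Here fix_model_replay() returns 0, because every reference taken at
collection time is matched by exactly one entry on the transaction's list.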
This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):
Btrfs: make sure logged extents complete in the current transaction V3
commit