Commit 62b99540a1 (btrfs: relocation: Fix leaking qgroups numbers
on data extents) only fixes the problem partly.
The previous fix is to trace all new data extents at transaction commit
time when balance finishes.
However balance is not done in a large transaction, every path
replacement can happen in its own transaction.
This makes the fix useless if transaction commits during relocation.
For example:
relocate_block_group()
|-merge_reloc_roots()
| |- merge_reloc_root()
| |- btrfs_start_transaction() <- Trans X
| |- replace_path() <- Cause leak
| |- btrfs_end_transaction_throttle() <- Trans X commits here
| | Leak not fixed
| |
| |- btrfs_start_transaction() <- Trans Y
| |- replace_path() <- Cause leak
| |- btrfs_end_transaction_throttle() <- Trans Y ends
| but not committed
|-btrfs_join_transaction() <- Still trans Y
|-qgroup_fix() <- Only fixes data leak
| in trans Y
|-btrfs_commit_transaction() <- Trans Y commits
In that case, qgroup fixup can only fix data leak in trans Y, data leak
in trans X is out of fix.
So the correct fix should happen in the same transaction of
replace_path().
This patch fixes it by tracing both subtrees of tree block swap, so it
can fix the problem and ensure all leaking and fix are in the same
transaction, so no leak again.
Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move account_shared_subtree() to qgroup.c and rename it to
btrfs_qgroup_trace_subtree().
Do the same thing for account_leaf_items() and rename it to
btrfs_qgroup_trace_leaf_items().
Since all these functions are only for qgroup, move them to qgroup.c and
export them is more appropriate.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
btrfs_qgroup_trace_extent(_nolock), according to the new
reserve/trace/account naming schema.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add explaination how btrfs qgroups work.
Qgroup is split into 3 main phrases:
1) Reserve
To ensure qgroup doesn't exceed its limit
2) Trace
To info qgroup to trace which extent
3) Account
Calculate qgroup number change for each traced extent.
This should save quite some time for new developers.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
And remove the bogus check for a NULL return value from kmap, which
can't happen. While we're at it: I don't think that kmapping up to 256
will work without deadlocks on highmem machines, a better idea would
be to use vm_map_ram to map all of them into a single virtual address
range. Incidentally that would also simplify the code a lot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rework the loop a little bit to use the generic bio_for_each_segment_all
helper for iterating over the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the bvec offset and len members to prepare for multipage bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using bi_vcnt to calculate it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use bio_for_each_segment_all to iterate over the segments instead.
This requires a bit of reshuffling so that we only lookup up the ordered
item once inside the bio_for_each_segment_all loop.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just use bio_for_each_segment_all to iterate over all segments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just use bio_for_each_segment_all to iterate over all segments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass the full bio to the decompression routines and use bio iterators
to iterate over the data in the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
with qgroups enabled.
I was able to reproduce this by setting up a small (~500kb) quota limit
and writing a file one byte at a time until I hit the limit. The warnings
would all hit on umount.
The root cause is that we would reserve a block-sized range in both
the reservation and the quota in btrfs_check_data_free_space, but if we
encountered a problem (like e.g. EDQUOT), we would only release the single
byte in the qgroup reservation. That caused an iotree state split, which
increased the number of outstanding extents, in turn disallowing releasing
the metadata reservation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At this point we will have dropped extent entries from the file, so if we fail
to insert the new hole entries then we are leaving the fs in a corrupt state
(albeit an easily fixed one). Abort the transaciton if this happens so we can
avoid corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range. Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range. Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped. So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task. Fix this by only
setting drop_end to the last extent we did actually drop. This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we process the last item in the leaf and hit an I/O error while
reading the next leaf, we return -EIO without having adjusted the
position. Since we have emitted dirents, getdents() will return
the byte count to the user instead of the error. Subsequent callers
will emit the last successful dirent again, and return -EIO again,
with the same result. Callers loop forever.
Instead, if we always increment ctx->pos after emitting or skipping
the dirent, we'll be sure that we won't hit the same one again. When
we go to process the next leaf, we won't have emitted any dirents
and the -EIO will be returned to the user properly. We also don't
need to track if we've emitted a dirent already or if we've changed
the position yet.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 3de4586c52 (Btrfs: Allow subvolumes and snapshots anywhere
in the directory tree) introduced the current system of placing
snapshots in the directory tree. It also introduced the behavior of
creating the snapshot and then creating the directory entries for it.
We've kept this code around for compatibility reasons, but it turns
out that no file systems with the old tree_root based snapshots can
be mounted on newer (>= 2009) kernels anyway. About a month after the
above commit, commit 2a7108ad89 (Btrfs: rev the disk format for the
inode compat and csum selection changes) landed, changing the superblock
magic number.
As a result, we know that we'll never encounter tree_root-based dirents
or have to deal with skipping our own snapshot dirents. Since that
also means that we're now only iterating over DIR_INDEX items, which only
contain one directory entry per leaf item, we don't need to loop over
the leaf item contents anymore either.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If zlib_inflateInit2 fails, the input page is never unmapped.
Add a call to kunmap when it fails.
Signed-off-by: Nick Terrell <nickrterrell@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The balance status item contains currently known filter values, but the
stripes filter was unintentionally not among them. This would mean, that
interrupted and automatically restarted balance does not apply the
stripe filters.
Fixes: dee32d0ac3
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>
'btrfs_iget()' can not return NULL, so this test can be removed.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
csum member of struct btrfs_super_block has array type of u8. It makes
sense that function btrfs_csum_final should be also declared to accept
u8 *. I changed the declaration of method void btrfs_csum_final(u32 crc,
char *result); to void btrfs_csum_final(u32 crc, u8 *result);
Signed-off-by: Domagoj Tršan <domagoj.trsan@gmail.com>
[ changed cast to u8 at several call sites ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we have
|0--hole--4095||4096--preallocate--12287|
instead of using preallocated space, a 8K direct write will just
create a new 8K extent and it'll end up with
|0--new extent--8191||8192--preallocate--12287|
It's because we find a hole em and then go to create a new 8K
extent directly without adjusting @len.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to call kfree() if memdup_user() fails, as no memory
was allocated and the error in the error-valued pointer should be returned.
Signed-off-by: Shailendra Verma <shailendra.v@samsung.com>
[ edit subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
Using copy_extent_buffer is suitable for copying betwenn buffers from an
arbitrary offset and deals with page boundaries. This is not necessary
when doing a full extent_buffer-to-extent_buffer copy. We can utilize
the copy_page helper as well.
Signed-off-by: David Sterba <dsterba@suse.com>
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.
Signed-off-by: David Sterba <dsterba@suse.com>
The fsid and chunk tree uuid are always located in the first page,
we don't need the to use write_extent_buffer.
Signed-off-by: David Sterba <dsterba@suse.com>
During the time, the function has been shrunk to the point that it just
calls find_extent_buffer, just passing the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
We dereference fs_info several times, besides that post-mount functions
should never see a NULL fs_info.
Signed-off-by: David Sterba <dsterba@suse.com>
The lock is held, we make the same lookup that previously failed with
EEXIST and we don't insert NULL pointers.
Signed-off-by: David Sterba <dsterba@suse.com>
Originally, the eb and start were passed separately in case eb is NULL.
Since the readahead has been refactored in 4.6, this is not true anymore
and we can get rid of the parameter.
Signed-off-by: David Sterba <dsterba@suse.com>
'start' is not used since "btrfs: reada: Pass reada_extent into
__readahead_hook directly" (6e39dbe8b9).
Signed-off-by: David Sterba <dsterba@suse.com>
We can't touch the eb directly in case the function is called with a
non-zero error, so we can read the eb level when needed.
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers are not meant to be generic, the name is misleading. Convert
them to static inlines for type checking.
Signed-off-by: David Sterba <dsterba@suse.com>
They're not even documented anywhere, letting users with no recourse but
to RTFS. It's no big burden to output the bitfield as words.
Also, display unknown flags as hex.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My QEMU VM was seeing inexplicable I/O errors that I tracked down to
errors coming from the qcow2 virtual drive in the host system. The qcow2
file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
Every once in awhile, pread() or pwrite() would return EEXIST, which
makes no sense. This turned out to be a bug in btrfs_get_extent().
Commit 8dff9c8534 ("Btrfs: deal with duplciates during extent_map
insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
two threads race on adding the same extent map to an inode's extent map
tree. However, if the added em is merged with an adjacent em in the
extent tree, then we'll end up with an existing extent that is not
identical to but instead encompasses the extent we tried to add. When we
call merge_extent_mapping() to find the nonoverlapping part of the new
em, the arithmetic overflows because there is no such thing. We then end
up trying to add a bogus em to the em_tree, which results in a EEXIST
that can bubble all the way up to userspace.
Fix it by extending the identical extent map special case.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tickets_id's name may result in some misunderstandings, it just indicates
the next ticket will be handled and is not stored per ticket.
Fixes: ce12965 ("btrfs: introduce tickets_id to determine whether
asynchronous metadata reclaim work makes progress")
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations. But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags. This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"Some fixes that Dave Sterba collected. We held off on these last week
because I was focused on the memory corruption testing"
* 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix WARNING in btrfs_select_ref_head()
Btrfs: remove some no-op casts
btrfs: pass correct args to btrfs_async_run_delayed_refs()
btrfs: make file clone aware of fatal signals
btrfs: qgroup: Prevent qgroup->reserved from going subzero
Btrfs: kill BUG_ON in do_relocation
Pull btrfs fixes from Chris Mason:
"My patch fixes the btrfs list_head abuse that we tracked down during
Dave Jones' memory corruption investigation. With both Jens and my
patches in place, I'm no longer able to trigger problems.
Filipe is fixing a difficult old bug between snapshots, balance and
send. Dave is cooking a few more for the next rc, but these are tested
and ready"
* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix races on root_log_ctx lists
btrfs: fix incremental send failure caused by balance
btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the
list because it knows all of the waiters are patiently waiting for the
commit to finish.
But, there's a small race where btrfs_sync_log can remove itself from
the list if it finds a log commit is already done. Also, it uses
list_del_init() to remove itself from the list, but there's no way to
know if btrfs_remove_all_log_ctxs has already run, so we don't know for
sure if it is safe to call list_del_init().
This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and
just calls it with the proper locking.
This is part two of the corruption fixed by cbd60aa7cd. I should have
done this in the first place, but convinced myself the optimizations were
safe. A 12 hour run of dbench 2048 will eventually trigger a list debug
WARN_ON for the list_del_init() in btrfs_sync_log().
Fixes: d1433debe7
Reported-by: Dave Jones <davej@codemonkey.org.uk>
cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Chris Mason <clm@fb.com>
This issue was found when testing in-band dedupe enospc behaviour,
sometimes run_one_delayed_ref() may fail for enospc reason, then
__btrfs_run_delayed_refs()will return, but forget to add num_heads_read
back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in
btrfs_select_ref_head().
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate. The cast doesn't
matter in this case, the code works as intended. It causes a static
checker warning though so let's remove it.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
swap the arg2 and arg3 wrongly, fix this.
This bug just impacts asynchronous delayed refs handle when we truncate inodes.
In delayed_ref_async_start(), there is such codes:
trans = btrfs_join_transaction(async->root);
if (trans->transid > async->transid)
goto end;
ret = btrfs_run_delayed_refs(trans, async->root, async->count);
From this codes, we can see that this just influence whether can we handle
delayed refs or the number of delayed refs to handle, this may impact
performance, but will not result in missing delayed refs, all delayed refs will
be handled in btrfs_commit_transaction().
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>