Commit Graph

1502 Commits

Author SHA1 Message Date
Josef Bacik
bedc661760 btrfs: cleanup extent_op handling
The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer.  Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Josef Bacik
07c47775f4 btrfs: add cleanup_ref_head_accounting helper
We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Josef Bacik
d7baffdaf9 btrfs: add btrfs_delete_ref_head helper
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Filipe Manana
0e6ec385b5 Btrfs: allow clear_extent_dirty() to receive a cached extent state record
We can have a lot freed extents during the life span of transaction, so
the red black tree that keeps track of the ranges of each freed extent
(fs_info->freed_extents[]) can get quite big. When finishing a
transaction commit we find each range, process it (discard the extents,
unpin them) and then remove it from the red black tree.

We can use an extent state record as a cache when searching for a range,
so that when we clean the range we can use the cached extent state we
passed to the search function instead of iterating the red black tree
again. Doing things as fast as possible when finishing a transaction (in
state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
block another task that wants to commit the next transaction.

So change clear_extent_dirty() to allow an optional extent state record to
be passed as an argument, which will be passed down to __clear_extent_bit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:38 +01:00
Nikolay Borisov
de37aa5131 btrfs: Remove fsid/metadata_fsid fields from btrfs_info
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:37 +01:00
Nikolay Borisov
7239ff4b2b btrfs: Introduce support for FSID change without metadata rewrite
This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current UUID (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
UUIDs:

1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
   datastructures belonging to this file system.

When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.

Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.

This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates in comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:37 +01:00
Qu Wenruo
e72d79d6bc btrfs: Refactor find_free_extent loops update into find_free_extent_update_loop
We have a complex loop design for find_free_extent(), that has different
behavior for each loop, some even includes new chunk allocation.

Instead of putting such a long code into find_free_extent() and makes it
harder to read, just extract them into find_free_extent_update_loop().

With all the cleanups, the main find_free_extent() should be pretty
barebone:

find_free_extent()
|- Iterate through all block groups
|  |- Get a valid block group
|  |- Try to do clustered allocation in that block group
|  |- Try to do unclustered allocation in that block group
|  |- Check if the result is valid
|  |  |- If valid, then exit
|  |- Jump to next block group
|
|- Push harder to find free extents
   |- If not found, re-iterate all block groups

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
[ copy callchain from changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:26 +01:00
Qu Wenruo
e1a4184815 btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered()
This patch will extract unclsutered extent allocation code into
find_free_extent_unclustered().

And this helper function will use return value to indicate what to do
next.

This should make find_free_extent() a little easier to read.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
[Update merge conflict with fb5c39d7a8 ("btrfs: don't use ctl->free_space for max_extent_size")]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:26 +01:00
Qu Wenruo
d06e3bb690 btrfs: Refactor clustered extent allocation into find_free_extent_clustered
We have two main methods to find free extents inside a block group:

1) clustered allocation
2) unclustered allocation

This patch will extract the clustered allocation into
find_free_extent_clustered() to make it a little easier to read.

Instead of jumping between different labels in find_free_extent(), the
helper function will use return value to indicate different behavior.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:26 +01:00
Qu Wenruo
b4bd745d12 btrfs: Introduce find_free_extent_ctl structure for later rework
Instead of tons of different local variables in find_free_extent(),
extract them into find_free_extent_ctl structure, and add better
explanation for them.

Some modification may looks redundant, but will later greatly simplify
function parameter list during find_free_extent() refactor.

Also add two comments to co-operate with fb5c39d7a8 ("btrfs: don't use
ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size
update more reader-friendly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:26 +01:00
Lu Fengqi
e2907c1a6a btrfs: extent-tree: Detect bytes_pinned underflow earlier
Introduce a new wrapper update_bytes_pinned to replace open coded
bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned
get detected and reported.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:26 +01:00
Qu Wenruo
9f9b8e8d0e btrfs: extent-tree: Detect bytes_may_use underflow earlier
Although we have space_info::bytes_may_use underflow detection in
btrfs_free_reserved_data_space_noquota(), we have more callers who are
subtracting number from space_info::bytes_may_use.

So instead of doing underflow detection for every caller, introduce a
new wrapper update_bytes_may_use() to replace open coded bytes_may_use
modifiers.

This also introduce a macro to declare more wrappers, but currently
space_info::bytes_may_use is the mostly interesting one.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:25 +01:00
Josef Bacik
80ee54bfe8 btrfs: fix insert_reserved error handling
We were not handling the reserved byte accounting properly for data
references.  Metadata was fine, if it errored out the error paths would
free the bytes_reserved count and pin the extent, but it even missed one
of the error cases.  So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.

CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
fb5c39d7a8 btrfs: don't use ctl->free_space for max_extent_size
max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group.  We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to search any of the block groups for and
therefore wouldn't have a max_extent_size set.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
21a94f7acf btrfs: reset max_extent_size properly
If we use up our block group before allocating a new one we'll easily
get a max_extent_size that's set really really low, which will result in
a lot of fragmentation.  We need to make sure we're resetting the
max_extent_size when we add a new chunk or add new space.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Filipe Manana
5ce555578e Btrfs: fix deadlock when writing out free space caches
When writing out a block group free space cache we can end deadlocking
with ourselves on an extent buffer lock resulting in a warning like the
following:

  [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
  [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
    W I      4.16.8 #1
  [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
  [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
  [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
  [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
  [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
  [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
  [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
  [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
  [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
  [245043.408670] Call Trace:
  [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
  [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
  [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
  [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
  [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
  [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
  [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
  [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
  [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
  [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
  [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
  [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
  [245043.425538]  ? iput+0x72/0x1e0
  [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
  [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
  [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
  [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
  [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
  [245043.433341]  kthread+0xf0/0x130
  [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
  [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
  [245043.437236]  ret_from_fork+0x1f/0x30
  [245043.441054] ---[ end trace 15abaa2aaf36827f ]---

This is because at write_one_cache_group() when we are COWing a leaf from
the extent tree we end up allocating a new block group (chunk) and,
because we have hit a threshold on the number of bytes reserved for system
chunks, we attempt to finalize the creation of new block groups from the
current transaction, by calling btrfs_create_pending_block_groups().
However here we also need to modify the extent tree in order to insert
a block group item, and if the location for this new block group item
happens to be in the same leaf that we were COWing earlier, we deadlock
since btrfs_search_slot() tries to write lock the extent buffer that we
locked before at write_one_cache_group().

We have already hit similar cases in the past and commit d9a0540a79
("Btrfs: fix deadlock when finalizing block group creation") fixed some
of those cases by delaying the creation of pending block groups at the
known specific spots that could lead to a deadlock. This change reworks
that commit to be more generic so that we don't have to add similar logic
to every possible path that can lead to a deadlock. This is done by
making __btrfs_cow_block() disallowing the creation of new block groups
(setting the transaction's can_flush_pending_bgs to false) before it
attempts to allocate a new extent buffer for either the extent, chunk or
device trees, since those are the trees that pending block creation
modifies. Once the new extent buffer is allocated, it allows creation of
pending block groups to happen again.

This change depends on a recent patch from Josef which is not yet in
Linus' tree, named "btrfs: make sure we create all new block groups" in
order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().

Fixes: d9a0540a79 ("Btrfs: fix deadlock when finalizing block group creation")
CC: stable@vger.kernel.org # 4.4+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
Reported-by: E V <eliventer@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-17 17:46:24 +02:00
Lu Fengqi
7c8616278b btrfs: remove fs_info from btrfs_should_throttle_delayed_refs
The avg_delayed_ref_runtime can be referenced from the transaction
handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
af9b8a0e20 btrfs: remove fs_info from btrfs_check_space_for_delayed_refs
It can be referenced from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
9e920a6f03 btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().

No functional change.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
5637c74b01 btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_select_ref_head().  No
functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:40 +02:00
Josef Bacik
545e3366db btrfs: make sure we create all new block groups
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations.  So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Qu Wenruo
2cd86d309b btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
Reloc tree doesn't contribute to qgroup numbers, as we have accounted
them at balance time (see replace_path()).

Skipping the unneeded subtree tracing should reduce the overhead.

[[Benchmark]]
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

                     | v4.19-rc1    | w/ patchset    | diff (*)
---------------------------------------------------------------
relocated extents    | 22929        | 22900          | -0.1%
qgroup dirty extents | 227757       | 167139         | -26.6%
time (sys)           | 65.253s      | 50.123s        | -23.2%
time (real)          | 74.032s      | 52.551s        | -29.0%

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Nikolay Borisov
0110a4c434 btrfs: refactor __btrfs_run_delayed_refs loop
Refactor the delayed refs loop by using the newly introduced
btrfs_run_delayed_refs_for_head function. This greatly simplifies
__btrfs_run_delayed_refs and makes it more obvious what is happening.

We now have 1 loop which iterates the existing delayed_heads and then
each selected ref head is processed by the new helper. All existing
semantics of the code are preserved so no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
e726138676 btrfs: Factor out loop processing all refs of a head
This patch introduces a new helper encompassing the implicit inner loop
in __btrfs_run_delayed_refs which processes all the refs for a given
head. The code is mostly copy/paste, the only difference is that if we
detect a newer reference then -EAGAIN is returned so that callers can
react correctly.

Also, at the end of the loop the head is relocked and
btrfs_merge_delayed_refs is run again to retain the pre-refactoring
semantics.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
b1cdbcb53a btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs
This is in preparation to refactor the giant loop in
__btrfs_run_delayed_refs. As a first step define a new function
which implements acquiring a reference to a btrfs_delayed_refs_head and
use it. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
Liu Bo
e3d0396563 Btrfs: delayed-refs: use rb_first_cached for ref_tree
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href->ref_tree need to get the first entry, this
converts href->ref_tree to use rb_first_cached().

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
5c9d028b3b Btrfs: delayed-refs: use rb_first_cached for href_root
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href_root need to get the first entry, this
converts href_root to use rb_first_cached().

This patch is first in the sequenct of similar updates to other rbtrees
and this is analysis of the expected behaviour and improvements.

There's a common pattern:

while (node = rb_first) {
        entry = rb_entry(node)
        next = rb_next(node)
        rb_erase(node)
        cleanup(entry)
}

rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first.  That's a cached memory access vs looped pointer
dereference trade-off that IMHO has a clear winner.

Measurements show there's not much difference in a sample tree with
10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
caching and pointer chasing are unpredictable though.

Further optimzations can be done to avoid the expensive rb_erase step.
In some cases it's ok to process the nodes in any order, so the tree can
be traversed in post-order, not rebalancing the children nodes and just
calling free. Care must be taken regarding the next node.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog from mail discussions ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Josef Bacik
3aa7c7a31c btrfs: wait on caching when putting the bg cache
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row.  This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues.

This is because we aren't waiting for the caching threads to be done
before freeing everything up, so to fix this make sure we wait on any
outstanding caching that's being done before we free up the block group,
so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers().  This fixes the panic I was seeing
consistently in testing.

------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 btree_submit_bio_hook+0x8a/0xd0
 submit_one_bio+0x5d/0x80
 read_extent_buffer_pages+0x18a/0x320
 btree_read_extent_buffer_pages+0xbc/0x200
 ? alloc_extent_buffer+0x359/0x3e0
 read_tree_block+0x3d/0x60
 read_block_for_search.isra.30+0x1a5/0x360
 btrfs_search_slot+0x41b/0xa10
 btrfs_next_old_leaf+0x212/0x470
 caching_thread+0x323/0x490
 normal_work_helper+0xc5/0x310
 process_one_work+0x141/0x340
 worker_thread+0x44/0x3c0
 kthread+0xf8/0x130
 ? process_one_work+0x340/0x340
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40
RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Jeff Mahoney
fee7acc361 btrfs: keep trim from interfering with transaction commits
Commit 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
fixed free space trimming, but introduced latency when it was running.
This is due to it pinning the transaction using both a incremented
refcount and holding the commit root sem for the duration of a single
trim operation.

This was to ensure safety but it's unnecessary.  We already hold the the
chunk mutex so we know that the chunk we're using can't be allocated
while we're trimming it.

In order to check against chunks allocated already in this transaction,
we need to check the pending chunks list.  To to that safely without
joining the transaction (or attaching than then having to commit it) we
need to ensure that the dev root's commit root doesn't change underneath
us and the pending chunk lists stays around until we're done with it.

We can ensure the former by holding the commit root sem and the latter
by pinning the transaction.  We do this now, but the critical section
covers the trim operation itself and we don't need to do that.

This patch moves the pinning and unpinning logic into helpers and unpins
the transaction after performing the search and check for pending
chunks.

Limiting the critical section of the transaction pinning improves the
latency substantially on slower storage (e.g. image files over NFS).

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
0be88e367f btrfs: don't attempt to trim devices that don't support it
We check whether any device the file system is using supports discard in
the ioctl call, but then we attempt to trim free extents on every device
regardless of whether discard is supported.  Due to the way we mask off
EOPNOTSUPP, we can end up issuing the trim operations on each free range
on devices that don't support it, just wasting time.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
d4e329de5e btrfs: iterate all devices during trim, instead of fs_devices::alloc_list
btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
device_list_mutex.  The problem is that ->alloc_list is protected by the
chunk mutex.  We don't want to hold the chunk mutex over the trim of the
entire file system.  Fortunately, the ->dev_list list is protected by
the dev_list mutex and while it will give us all devices, including
read-only devices, we already just skip the read-only devices.  Then we
can continue to take and release the chunk mutex while scanning each
device.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Qu Wenruo
6ba9fc8e62 btrfs: Ensure btrfs_trim_fs can trim the whole filesystem
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.

[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).

While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
uses its logical address space, there is nothing limiting the location
where we put block groups.

For filesystem with frequent balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond
super->total_bytes.

In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e640 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Qu Wenruo
93bba24d4b btrfs: Enhance btrfs_trim_fs function to handle error better
Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.

The return value will only reflect the final error from device trimming.

This patch will fix such behavior by:

1) Recording the last error from block group or device trimming
   The return value will also reflect the last error during trimming.
   Make developer more aware of the problem.

2) Continuing trimming if possible
   If we failed to trim one block group or device, we could still try
   the next block group or device.

3) Report number of failures during block group and device trimming
   It would be less noisy, but still gives user a brief summary of
   what's going wrong.

Such behavior can avoid confusion for cases like failure to trim the
first block group and then only unallocated space is trimmed.

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Misono Tomohiro
380fd06640 btrfs: remove redundant variable from btrfs_cross_ref_exist
Since commit d7df2c796d ("Btrfs attach delayed ref updates to delayed
ref heads"), check_delayed_ref() won't return -ENOENT.

In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' are
originally used to handle -ENOENT error case.  Since the code is not
needed anymore, let's just remove 'ret2'.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Liu Bo
6aadd9eb74 Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved. The last change to the code in
commit 18513091af ("btrfs: update btrfs_space_info's
bytes_may_use timely") removed a conditional tracepoint and the logic
changed too but the tracepiont remained.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Qu Wenruo
b72c3aba09 btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock
[BUG]
For certain crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():

  WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
  CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
  Workqueue: btrfs-endio-write btrfs_endio_write_helper
  RIP: 0010:btrfs_tree_lock+0x3e2/0x400
  Call Trace:
   btrfs_alloc_tree_block+0x39f/0x770
   __btrfs_cow_block+0x285/0x9e0
   btrfs_cow_block+0x191/0x2e0
   btrfs_search_slot+0x492/0x1160
   btrfs_lookup_csum+0xec/0x280
   btrfs_csum_file_blocks+0x2be/0xa60
   add_pending_csums+0xaf/0xf0
   btrfs_finish_ordered_io+0x74b/0xc90
   finish_ordered_fn+0x15/0x20
   normal_work_helper+0xf6/0x500
   btrfs_endio_write_helper+0x12/0x20
   process_one_work+0x302/0x770
   worker_thread+0x81/0x6d0
   kthread+0x180/0x1d0
   ret_from_fork+0x35/0x40

[CAUSE]
That crafted image has missing backref for csum tree root leaf.  And
when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.

The extent tree of the image looks like:

  Normal image                      |       This fuzzed image
  ----------------------------------+--------------------------------
  BG 29360128                       | BG 29360128
   One empty slot                   |  One empty slot
  29364224: backref to UUID tree    | 29364224: backref to UUID tree
   Two empty slots                  |  Two empty slots
  29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
  29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
  ...                               | ...

Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.

And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.

By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees, the
next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
tree.

But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.

Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:

btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
   |- btrfs_set_lock_block()
   |  |- Now locks tree block 29376512 (old csum tree root)
   |- __btrfs_cow_block()
      |- btrfs_alloc_tree_block()
         |- btrfs_reserve_extent()
            | Now it returns tree block 29376512, which extent tree
            | shows its empty slot, but it's already hold by csum tree
            |- btrfs_init_new_buffer()
               |- btrfs_tree_lock()
                  | Triggers WARN_ON(eb->lock_owner == current->pid)
                  |- wait_event()
                     Wait lock owner to release the lock, but it's
                     locked by ourself, so it will deadlock

[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.

Since such problem can only happen in crafted image, we will still
trigger kernel warning for later aborted transaction, but with a little
more meaningful warning message.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Qu Wenruo
65c6e82bec btrfs: Handle owner mismatch gracefully when walking up tree
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:

  kernel BUG at fs/btrfs/extent-tree.c:8956!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
  RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
  RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
  Call Trace:
   walk_up_tree+0x172/0x1f0 [btrfs]
   btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
   merge_reloc_roots+0xe1/0x1d0 [btrfs]
   btrfs_recover_relocation+0x3ea/0x420 [btrfs]
   open_ctree+0x1af3/0x1dd0 [btrfs]
   btrfs_mount_root+0x66b/0x740 [btrfs]
   mount_fs+0x3b/0x16a
   vfs_kern_mount.part.9+0x54/0x140
   btrfs_mount+0x16d/0x890 [btrfs]
   mount_fs+0x3b/0x16a
   vfs_kern_mount.part.9+0x54/0x140
   do_mount+0x1fd/0xda0
   ksys_mount+0xba/0xd0
   __x64_sys_mount+0x21/0x30
   do_syscall_64+0x60/0x210
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
Extent tree corruption.  In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().

[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.

And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
zhong jiang
556f3ca88e btrfs: change btrfs_free_reserved_bytes to return void
btrfs_free_reserved_bytes uses the variable "ret" for return value,
but it is not modified after initialzation. Further, I find that any of
the callers do not handle the return value, so it is safe to drop the
unneeded "ret" and return void. There are no callees that would need the
function to handle or pass the value either.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Misono Tomohiro
4fd786e6c3 btrfs: Remove 'objectid' member from struct btrfs_root
There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.

They are both set to the same value in __setup_root():

  static void __setup_root(struct btrfs_root *root,
                           struct btrfs_fs_info *fs_info,
                           u64 objectid)
  {
    ...
    root->objectid = objectid;
    ...
    root->root_key.objectid = objecitd;
    ...
  }

and not changed to other value after initialization.

grep in btrfs directory shows both are used in many places:
  $ grep -rI "root->root_key.objectid" | wc -l
  133
  $ grep -rI "root->objectid" | wc -l
  55
 (4.17, inc. some noise)

It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.

Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
5a2cb25ab9 btrfs: remove a useless return statement in btrfs_block_rsv_add
Since ret must be 0 here, don't have to return.  No functional change
and code readability is not hurt.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
3a58417486 btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes
Using true and false here is closer to the expected semantic than using
0 and 1.  No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
a5b7f4295e btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.

Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.

Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Qu Wenruo
7ef49515fa btrfs: Verify that every chunk has corresponding block group at mount time
If a crafted image has missing block group items, it could cause
unexpected behavior and breaks the assumption of 1:1 chunk<->block group
mapping.

Although we have the block group -> chunk mapping check, we still need
chunk -> block group mapping check.

This patch will do extra check to ensure each chunk has its
corresponding block group.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Qu Wenruo
514c7dca85 btrfs: Check that each block group has corresponding chunk at mount time
A crafted btrfs image with incorrect chunk<->block group mapping will
trigger a lot of unexpected things as the mapping is essential.

Although the problem can be caught by block group item checker
added in "btrfs: tree-checker: Verify block_group_item", it's still not
sufficient.  A sufficiently valid block group item can pass the check
added by the mentioned patch but could fail to match the existing chunk.

This patch will add extra block group -> chunk mapping check, to ensure
we have a completely matching (start, len, flags) chunk for each block
group at mount time.

Here we reuse the original helper find_first_block_group(), which is
already doing the basic bg -> chunk checks, adding further checks of the
start/len and type flags.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Misono Tomohiro
85c3954819 btrfs: extent-tree: Remove unused __btrfs_free_block_rsv
There is no user of this function anymore.

This was forgotten to be removed in commit a575ceeb13
("Btrfs: get rid of unused orphan infrastructure").

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:01 +02:00
Lu Fengqi
ab9ce7d42b btrfs: Remove fs_info from btrfs_del_root
It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:00 +02:00
David Sterba
b5851021f1 btrfs: extent-tree: remove unused member walk_control::for_reloc
Leftover after fix e339a6b097 ("Btrfs: __btrfs_mod_ref should always
use no_quota"), that removed it from the function calls but not the
structure.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
Josef Bacik
4559b0a717 btrfs: don't leak ret from do_chunk_alloc
If we're trying to make a data reservation and we have to allocate a
data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if
it allocated a chunk.  Since the end of the function is the success path
just return 0.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
Nikolay Borisov
97aff912a2 btrfs: Remove fs_info from btrfs_finish_chunk_alloc
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Qu Wenruo
5e23a6fea6 btrfs: extent-tree: Remove dead alignment check
In find_free_extent() under checks: label, we have the following code:

		search_start = ALIGN(offset, fs_info->stripesize);
		/* move on to the next group */
		if (search_start + num_bytes >
		    block_group->key.objectid + block_group->key.offset) {
			btrfs_add_free_space(block_group, offset, num_bytes);
			goto loop;
		}
		if (offset < search_start)
			btrfs_add_free_space(block_group, offset,
					     search_start - offset);
		BUG_ON(offset > search_start);

However ALIGN() is rounding up, thus @search_start >= @offset and that
BUG_ON() will never be triggered.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
46df06b85e btrfs: refactor block group replication factor calculation to a helper
There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
deb4062743 btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree
The fs_info can be fetched from the transaction handle directly.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8d38d7eb7b btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_items
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Qu Wenruo
031f24da2c btrfs: Use btrfs_mark_bg_unused to replace open code
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Nikolay Borisov
2556fbb0be btrfs: Rewrite retry logic in do_chunk_alloc
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.

2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait a concurrent
allocation, introduce a pair of mutex_lock/mutex_unlock to act as a
barrier and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Ethan Lien
dec59fa3a7 btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.

[Test]
We tested the patch on a 4-cores x86 machine:

1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file

We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.

[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:            82             82
        0.21           0.61          29.46           0.36         298340
      635973           0.09          11.01      173476.25           0.27

patched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:             1              1
        0.62           0.62           0.62           0.62          13601
       31542           0.14           9.61       11016.90           0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
3ffbd68c48 btrfs: simplify pointer chasing of local fs_info variables
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.

If the transaction handle is available and not NULL it's used instead.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
6d8ff4e458 btrfs: annotate unlikely branches after V0 extent type removal
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Nikolay Borisov
ba3c2b196b btrfs: Add graceful handling of V0 extents
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.

In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Nikolay Borisov
a79865c680 btrfs: Remove V0 extent support
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.

This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Su Yue
af431dcb24 btrfs: return EUCLEAN if extent_inline_ref type is invalid
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Qu Wenruo
4379444654 btrfs: Don't remove block group that still has pinned down bytes
[BUG]
Under certain KVM load and LTP tests, it is possible to hit the
following calltrace if quota is enabled:

BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096

WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
 submit_extent_page+0x191/0x270 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 __do_readpage+0x2d2/0x810 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 __extent_read_full_page+0xe7/0x100 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
 read_tree_block+0x31/0x60 [btrfs]
 read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
 btrfs_search_slot+0x46b/0xa00 [btrfs]
 ? kmem_cache_alloc+0x1a8/0x510
 ? btrfs_get_token_32+0x5b/0x120 [btrfs]
 find_parent_nodes+0x11d/0xeb0 [btrfs]
 ? leaf_space_used+0xb8/0xd0 [btrfs]
 ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
 ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots+0x45/0x60 [btrfs]
 btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
 btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
 btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
 insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
 btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
 ? pick_next_task_fair+0x2cd/0x530
 ? __switch_to+0x92/0x4b0
 btrfs_worker_helper+0x81/0x300 [btrfs]
 process_one_work+0x1da/0x3f0
 worker_thread+0x2b/0x3f0
 ? process_one_work+0x3f0/0x3f0
 kthread+0x11a/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40

BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680

[CAUSE]
It's caused by race with block group auto removal:

- There is a meta block group X, which has only one tree block
  The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
  The tree block gets COWed, so the block group X is empty, and marked
  as unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
  Which will call btrfs_delete_unused_bgs() to remove unused block
  groups.
  So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
  Quota needs to get the original reference of the extent, which will
  read tree blocks of commit root of 257.
  Then since the chunk map gets removed, the above warning gets
  triggered.

[FIX]
Just let btrfs_delete_unused_bgs() skip block group which still has
pinned bytes.

However there is a minor side effect: currently we only queue empty
blocks at update_block_group(), and such empty block group with pinned
bytes won't go through update_block_group() again, such block group
won't be removed, until it gets new extent allocated and removed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Nikolay Borisov
bc877d285c btrfs: Deduplicate extent_buffer init code
When a new extent buffer is allocated there are a few mandatory fields
which need to be set in order for the buffer to be sane: level,
generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
is open coded in the callers of btrfs_alloc_tree_block, meaning it's
fairly high in the abstraction hierarchy of operations. This patch
solves this by simply moving this init code in btrfs_init_new_buffer,
since this is the function which initializes a newly allocated
extent buffer. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
43a7e99db6 btrfs: Remove fs_info from btrfs_force_chunk_alloc
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
c83488afc5 btrfs: Remove fs_info from btrfs_inc_block_group_ro
It can be referenced from the passed bg cache.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
61da2abfca btrfs: Remove fs_info from btrfs_alloc_logged_file_extent
It can be referenced from trans since the function is always called
within a valid transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
87cc7a8a2a btrfs: Remove fs_info from remove_extent_backref
It can be referenced directly from the transaction handle since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
5fac7f9ee1 btrfs: Remove fs_info from run_one_delayed_ref
It can be referenced from the passed transaction handle, since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
a639cdeba3 btrfs: Remove fs_info from insert_inline_extent_backref
It can be referenced from the passed transaction handle, since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
3c4da6574e btrfs: Remove fs_info from exclude_super_stripes
It can be referenced from the passed block group.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
9e715da860 btrfs: Remove fs_info from free_excluded_extents
It can be referenced from the passed block group.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
451a2c1303 btrfs: Remove fs_info from check_system_chunk
It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
c216b2039a btrfs: Remove fs_info from btrfs_alloc_chunk
It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
01458828bb btrfs: Remove fs_info from do_chunk_alloc
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
f97806f2ee btrfs: Remove fs_info from run_delayed_tree_ref
It can always be referneced from the passed transaction handle since
it's always valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
f9871eddd9 btrfs: Remove fs_info from cleanup_ref_head
fs_info can be refenreced from the transaction handle, since it's always
valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
c4d56d4a16 btrfs: Remove unused fs_info from cleanup_extent_op
The argument is no longer used so remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
20b9a2d670 btrfs: Remove fs_info from run_delayed_extent_op
This function is always called with a valid transaction handle so
fs_info can be referenced from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
2bf98ef35f btrfs: Remove fs_info from run_delayed_data_ref
This function is always called with a valid transaction from where
fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
2590d0f155 btrfs: Remove fs_info argument from __btrfs_inc_extent_ref
This function already takes a transaction which holds a reference to
the fs_info struct. Use that reference and remove the extra arg. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
ef89b8245b btrfs: Remove fs_info from alloc_reserved_file_extent
fs_info can be referenced from the transaction handle, which is always
valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
e72cb9235d btrfs: Remove fs_info from __btrfs_free_extent
This function is always called with a valid transaction handle so we
can reference the fs_info from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
5a98ec0141 btrfs: Remove fs_info from btrfs_remove_block_group
This function is always called with a valid transaction handle from
where we can reference fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
e7e02096d9 btrfs: Remove fs_info from btrfs_make_block_group
This function is always called with a valid transaction handle from
where we can reference the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
88a979c615 btrfs: Remove fs_info from btrfs_add_delayed_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
44e1c47d5c btrfs: Remove fs_info from btrfs_add_delayed_tree_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
fbe4801b26 btrfs: Remove fs_info from lookup_extent_backref
This argument is unused. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
bd1d53ef35 btrfs: Remove fs_info argument from lookup_extent_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
b8582eeabb btrfs: Remove fs_info argument from lookup_tree_block_ref
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
61a18f1c66 btrfs: Remove fs_info argument from update_inline_extent_backref
This function always uses the leaf's extent_buffer which already
contains a reference to the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
867cc1fbeb btrfs: Remove fs_info from lookup_inline_extent_backref
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
e9f6290d59 btrfs: Remove fs_info from remove_extent_data_ref
This function is always called with a valid transaction from where the
fs_info can be referenced. No functional change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
375934105c btrfs: Remove fs_info argument from insert_extent_backref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
62b895af40 btrfs: Remove fs_info from insert_extent_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. So remove the redundant argument.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
10728404c6 btrfs: Remove fs_info from insert_tree_block_ref
This function is always called with a valid transaction so there is no
need to duplicate the fs_info, we can reference it directly from the
trans handle. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
bece2e8239 btrfs: Fix misleading indentation reported by smatch
This patch avoids that building the BTRFS source code with smatch
triggers complaints about inconsistent indenting.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Nikolay Borisov
16d1c062c7 btrfs: Fix comment in lookup_inline_extent_backref
The comment wrongfully states that the owner parameter is the level of
the parent block. In fact owner is the level of the current block and
by adding 1 to it we can eventually get to the parent/root.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Nikolay Borisov
bd3c685ed9 btrfs: Document __btrfs_inc_extent_ref
Here is a doc-only patch which tires to deobfuscate the terra-incognita
that arguments for delayed refs are.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Su Yue
9132c4ff6f btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
The error code does not match the reason of failure and may confuse the
callers.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 17:33:58 +02:00
Gu JinXiang
c4c129db5d btrfs: drop unused parameter qgroup_reserved
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata.  So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi
4ca6168327 btrfs: drop unused space_info parameter from create_space_info
Since commit dc2d3005d2 ("btrfs: remove dead create_space_info
calls"), there is only one caller btrfs_init_space_info. However, it
doesn't need create_space_info to return space_info at all.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Gu Jinxiang
b89311efe6 btrfs: propagate failures of __exclude_logged_extent to upper caller
Function btrfs_exclude_logged_extents may call __exclude_logged_extent
which may fail.
Propagate the failures of __exclude_logged_extent to upper caller.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:58 +02:00
Nikolay Borisov
d4b20733d2 btrfs: Streamline shared ref check in alloc_reserved_tree_block
Instead of setting "parent" to ref->parent only when dealing with
a shared ref and subsequently performing another check to see
if (parent > 0), check the "node->type" directly and act accordingly.
This makes the code more streamline. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:57 +02:00
Nikolay Borisov
21ebfbe7e0 btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block
Instead of taking only specific member of this structure, which results
in 2 extra arguments, just take the delayed_extent_op struct and
reference the arguments inside the functions. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:57 +02:00
Nikolay Borisov
4e6bd4e0aa btrfs: Simplify alloc_reserved_tree_block interface
This function currently takes 7 parameters, most of which are proxies
for values from btrfs_delayed_ref_node struct which is not passed. This
patch simplifies the interface of the function by simply passing said
delayed ref node struct to the function. This enables us to:

1. Move locals variables and init code related to them from
   run_delayed_tree_ref which should only be used inside
   alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key,
   representing the extent being inserted. This removes the need for the
   "ins" argument. Instead, it's replaced by a local var with a more
   verbose name - extent_key.

2. Now that we have a reference to the node in alloc_reserved_tree_block
   the delayed_tree_ref struct can be referenced inside the function and
   this enable removing the "ref->level", "parent" and "ref_root"
   arguments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:53 +02:00
Nikolay Borisov
9dcdbe0144 btrfs: Remove fs_info argument from alloc_reserved_tree_block
This function already takes a transaction handle which contains a
reference to the fs_info. So use this and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:53 +02:00
Omar Sandoval
a575ceeb13 Btrfs: get rid of unused orphan infrastructure
Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can git rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:57 +02:00
Nikolay Borisov
c442793e67 btrfs: Remove stale comment about select_delayed_ref
select_delayed_ref really just gets the next delayed ref which has to
be processed - either an add ref or drop ref. We never go back for
anything. So the comment is actually bogus, just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:42 +02:00
David Sterba
093258e6eb btrfs: replace waitqueue_actvie with cond_wake_up
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:09 +02:00
Nikolay Borisov
e7355e501d btrfs: Remove fs_info argument from add_to_free_space_tree
This function takes a transaction handle which already contains a
reference to the fs_info. So use it and remove the extra function
argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:36 +02:00
Nikolay Borisov
25a356d3f6 btrfs: Remove fs_info argument from remove_from_free_space_tree
This function alreay takes a transaction handle which holds a reference
to the fs_info. Use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:35 +02:00
Nikolay Borisov
f3f7277995 btrfs: Remove fs_info parameter from remove_block_group_free_space
This function always takes a trans handle which contains a reference to
the fs_info. Use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:34 +02:00
Nikolay Borisov
4457c1c702 btrfs: Remove fs_info argument from add_new_free_space
This function also takes a btrfs_block_group_cache which contains a
referene to the fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:33 +02:00
Nikolay Borisov
e4e0711cd9 btrfs: Remove fs_info argument from add_block_group_free_space
We also pass in a transaction handle which has a reference to the
fs_info. Just remove the extraneous argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:33 +02:00
Nikolay Borisov
63a9c7b9ce btrfs: Remove devid parameter from btrfs_rmap_block
This function is used in only one place and devid argument is always
passed 0. So just remove it, similarly to how it was removed in the
userspace code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:29 +02:00
Nikolay Borisov
82b3e53b8d btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots
This parameter was introduced alongside the function in
eb73c1b7ce ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Qu Wenruo
4ed0a7a3b7 btrfs: trace: Add trace points for unused block groups
This patch will add the following trace events:
1) btrfs_remove_block_group
   For btrfs_remove_block_group() function.
   Triggered when a block group is really removed.

2) btrfs_add_unused_block_group
   Triggered which block group is added to unused_bgs list.

3) btrfs_skip_unused_block_group
   Triggered which unused block group is not deleted.

These trace events is pretty handy to debug case related to block group
auto remove.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Qu Wenruo
3dca5c942d btrfs: trace: Remove unnecessary fs_info parameter for btrfs__reserve_extent event class
fs_info can be extracted from btrfs_block_group_cache, and all
btrfs_block_group_cache is created by btrfs_create_block_group_cache()
with fs_info initialized, no need to worry about NULL pointer
dereference.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:27 +02:00
Anand Jain
41a6e8913c btrfs: move btrfs_raid_group values to btrfs_raid_attr table
Add a new member struct btrfs_raid_attr::bg_flag so that
btrfs_raid_array can maintain the bit map flag of the raid type, and
so we can drop btrfs_raid_group.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:27 +02:00
Anand Jain
ed23467b18 btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table
Add a new member struct btrfs_raid_attr::raid_name so that
btrfs_raid_array can maintain the name of the raid type, and so we can
drop btrfs_raid_type_names.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:27 +02:00
David Sterba
dccdb07bc9 btrfs: kill btrfs_fs_info::volume_mutex
Mutual exclusion of device add/rm and balance was done by the volume
mutex up to version 3.7. The commit 5ac00addc7 ("Btrfs: disallow
mutually exclusive admin operations from user mode") added a bit that
essentially tracked the same information.

The status bit has an advantage over a mutex that it can be set without
restrictions of function context, so it started to be used in the
mount-time resuming of balance or device replace.

But we don't really need to track the same information in two ways.

1) After the previous cleanups, the main ioctl handlers for
   add/del/resize copy the EXCL_OP bit next to the volume mutex, here
   it's clearly safe.

2) Resuming balance during mount or after rw remount will set only the
   EXCL_OP bit and the volume_mutex is held in the kernel thread that
   calls btrfs_balance.

3) Resuming device replace during mount or after rw remount is done
   after balance and is excluded by the EXCL_OP bit. It does not take
   the volume_mutex at all and completely relies on the EXCL_OP bit.

4) The resuming of balance and dev-replace cannot hapen at the same time
   as the ioctls cannot be started in parallel. Nevertheless, a crafted
   image could trigger that and a warning is printed.

5) Balance is normally excluded by EXCL_OP and also uses own mutex to
   protect against concurrent access to its status data. There's some
   trickery to maintain the right lock nesting in case we need to
   reexamine the status in btrfs_ioctl_balance. The volume_mutex is
   removed and the unlock/lock sequence is left in place as we might
   expect other waiters to proceed.

6) Similar to 5, the unlock/lock sequence is kept in
   btrfs_cancel_balance to allow waiters to continue.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:25 +02:00
Nikolay Borisov
be97f133b3 btrfs: Drop fs_info parameter from btrfs_merge_delayed_refs
It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:20 +02:00
Nikolay Borisov
57f1642ec3 btrfs: Consolidate error checking for btrfs_alloc_chunk
The second if is really a subcase of ret being less than 0. So
introduce a generic if (ret < 0) check, and inside have another if
which explicitly handles the -ENOSPC and any other errors. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:16 +02:00
Nikolay Borisov
1e7a14211b btrfs: Fix lock release order
Locks should generally be released in the oppposite order they are
acquired. Generally lock acquisiton ordering is used to ensure
deadlocks don't happen. However, as becomes more complicated it's
best to also maintain proper unlock order so as to avoid possible dead
locks. This was found by code inspection and doesn't necessarily lead
to a deadlock scenario.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:16 +02:00
Nikolay Borisov
41d0bd3b5e btrfs: Drop delayed_refs argument from btrfs_check_delayed_seq
It's used to print its pointer in a debug statement but doesn't really
bring any useful information to the error message.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 13:12:11 +02:00
Nikolay Borisov
29d2b84cf9 btrfs: Replace owner argument in add_pinned_bytes with a boolean
add_pinned_bytes really cares whether the bytes being pinned are either
data or metadata. To that effect it checks whether the 'owner' argument
is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because
owner can really have 2 types of values:

 a) For metadata extents it holds the level at which the parent is in
    the btree. This amounts to owner having the values 0-7

 b) In case of modifying data extents, owner is the inode number
    to which those extents belongs.

Let's make this more explicit byt converting the owner parameter to a
boolean value and either pass it directly when we know the type of
extents we are working with (i.e. in btrfs_free_tree_block). In cases
when the parent function can be called on both metadata/data extents
perform the check in the caller. This hopefully makes the interface
of add_pinned_bytes more intuitive.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 13:12:11 +02:00
Linus Torvalds
4148d3884a for-4.17-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrsbkEACgkQxWXV+ddt
 WDvm6Q/+KdFKJ7T8hBOc6o5EeULXCDF3FmMA7HvDC696WXKsckXJFKk52awvrSb6
 3wnIzfWmI3K+rwX3cKqLRKe6tMXtBrTjVWXyezfvx1SMcCO4hSQ+nWLqK08htaNf
 h7m3OC3y0xO8QHcFSkvHUov6KRWG3rH+4p46JsJjN7GTBtWmR6tsiyQQ9JMC3gNR
 8Jnl1YaQq/JDLFm8GmFfPqIK+MLnNJ+GOJC1pm2vJQFtjnDw9dic+dI2hGX2oh9M
 SSRAoJu7jUvTWSmQN9aJfbBUr4atzoKKGYsyAgx5qgXbzOnbUGTIhtyAZirRWWBy
 0pT2b/8XuqsIabwR5dR4UbL4Ke1h5DS4c6GFydwO4DeddTovHtDzbN0cPuPQABL1
 rwFzlnHhcM/qRu9SKXx1jRy7w4Vju8fVX9D4lyjLcyk24flkEAn1NlJCWEqSzPYR
 ikTTm71r/1/62XqE6AcOyugS8E6EYtYHo3PjrfFXr3fQCbctTLEaUKoegFkTezbX
 EZLRPy9KNGfuyUyh3eiSypNHZZ9WiT+W42BaLIcpbHJLnqB+A14Z0oZx6aHVhY0T
 VFLL91O//OmUvjpsZ7I99LyswvrzsmU0jKS41GXOQlLsLTxtGhcZbvRt4uX4UUie
 UOQrNCgO0Y2slGfv+uoDCLhH0tTCRziuEvmgu8bjPcfZE7tph+s=
 =Jdmp
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two regression fixes and one fix for stable"

* tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: send, fix missing truncate for inode with prealloc extent past eof
  btrfs: Take trans lock before access running trans in check_delayed_ref
  btrfs: Fix wrong first_key parameter in replace_path
2018-05-04 20:32:18 -10:00
ethanwu
998ac6d21c btrfs: Take trans lock before access running trans in check_delayed_ref
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.

When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.

After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.

Fix this by using trans_lock and increasing the use_count.

Fixes: e4c3b2dcd1 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-02 11:54:58 +02:00
Linus Torvalds
d54b5c1315 for-4.17-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrcop8ACgkQxWXV+ddt
 WDuT2Q/9HZo9b1N4f+dM38P/cyBwEt9VYT0MLzpIpOid6ogpdF+1E7F6rDnUOyEQ
 0Me+znhC7kgXLBnpUDS16/2uGbp9VgCZo9NSq8juLrowGYpe9asadWbCqfFBmAh6
 Pc6r03Z1Z1EhXXUSm0cqJwvNHpNN6WnfBP1ithcp4OAltjVZ7IU45xHCcymfTxi+
 R6/kvJlG56YXoT97on/CHF85pAidELJ7CysLBGpCHP7Yl/lkVkyPVxSZO9up6ZtS
 078l0nCG9PpzONbdUcB9QGICPDcNLbBDhrS8yzQ+oajiRBz0+Im4c2UUvmZEUiMb
 6EAo6QV58OFUskgsY1cHWgZGS3YIFwzpmdsRQrCyuI4Z+1KXVu+RWMv1hF1G1Dqw
 8fHDm6+V2oqcoK3lSnb40UUySGw2d7PwjGxUk5iGKWISdmrxQsLcgkY/ERbjZxar
 emelwAZbsPwUNYNJj8qbm1QZlmys8FACKzJqyCR6+7/n353uWE30Tnv1MgfJg0nQ
 7ejk3U96R+oynN/9WOBFCLpuvPCbYaZk4eA9+3EjcSLzN81r/WtbgQDntEWi9auO
 /OUADaHOXoIaUnTBNtFV9DTseDW21hQ/4NseTVVsjKGWjh2qWRoEwuqI7TEZD/4M
 kGgdJL+xvD0MzR6Tq7xbhnQTcCOwH6NQ7aI4rI3abDbU9ZPuDrY=
 =lgaO
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "This contains a few fixups to the qgroup patches that were merged this
  dev cycle, unaligned access fix, blockgroup removal corner case fix
  and a small debugging output tweak"

* tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: print-tree: debugging output enhancement
  btrfs: Fix race condition between delayed refs and blockgroup removal
  btrfs: fix unaligned access in readdir
  btrfs: Fix wrong btrfs_delalloc_release_extents parameter
  btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
  btrfs: qgroup: Use independent and accurate per inode qgroup rsv
  btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
2018-04-22 12:09:27 -07:00
Nikolay Borisov
5e388e9581 btrfs: Fix race condition between delayed refs and blockgroup removal
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called which (in case of deletion) obtains a
reference for the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted a race window emerges between removal of that bg and the
subsequent invocation of cleanup_ref_head. This can result in cache being null
and either a null pointer dereference or assertion failure.

	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	Call Trace:
	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
	 evict+0xc6/0x190
	 do_unlinkat+0x19c/0x300
	 do_syscall_64+0x74/0x140
	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
	RIP: 0033:0x7fbf589c57a7

To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows to decouple the
querying for the spaceinfo from querying the possibly deleted bg.

Fixes: d7eae3403f ("Btrfs: rework delayed ref total_bytes_pinned accounting")
CC: stable@vger.kernel.org # 4.14+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-20 19:17:25 +02:00
Qu Wenruo
ff6bc37eb7 btrfs: qgroup: Use independent and accurate per inode qgroup rsv
Unlike reservation calculation used in inode rsv for metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for the whole tree COW.

Qgroups care more about net change of the extent usage.
That's to say, if we're going to insert one file extent, it will mostly
find its place in COWed tree block, leaving no change in extent usage.
Or causing a leaf split, resulting in one new net extent and increasing
qgroup number by nodesize.
Or in an even more rare case, increase the tree level, increasing qgroup
number by 2 * nodesize.

So here instead of using the complicated calculation for extent
allocator, which cares more about accuracy and no error, qgroup doesn't
need that over-estimated reservation.

This patch will maintain 2 new members in btrfs_block_rsv structure for
qgroup, using much smaller calculation for qgroup rsv, reducing false
EDQUOT.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
2018-04-18 16:46:51 +02:00
Linus Torvalds
e37563bb6c for-4.17-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
 WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
 cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
 TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
 UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
 aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
 WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
 SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
 Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
 Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
 TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
 abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
 =I0y0
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs updates from David Sterba:
 "We have queued a few more fixes (error handling, log replay,
  softlockup) and the rest is SPDX updates that touche almost all files
  so the diffstat is long"

* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Only check first key for committed tree blocks
  btrfs: add SPDX header to Kconfig
  btrfs: replace GPL boilerplate by SPDX -- sources
  btrfs: replace GPL boilerplate by SPDX -- headers
  Btrfs: fix loss of prealloc extents past i_size after fsync log replay
  Btrfs: clean up resources during umount after trans is aborted
  btrfs: Fix possible softlock on single core machines
  Btrfs: bail out on error during replay_dir_deletes
  Btrfs: fix NULL pointer dereference in log_dir_items
2018-04-15 18:08:35 -07:00
David Sterba
c1d7c514f7 btrfs: replace GPL boilerplate by SPDX -- sources
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 16:29:51 +02:00
Nikolay Borisov
1e1c50a929 btrfs: Fix possible softlock on single core machines
do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.

The fix is to simply add the missing cond_resched.

Fixes: 6d74119f1a ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-05 19:22:35 +02:00
Linus Torvalds
94514bbe9e for-4.17-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrDg0UACgkQxWXV+ddt
 WDvtOQ//QCk0zH2EcPQnqOW6HqAfkkDc7D9P51sK1izNM3vBEYtbuPlY6wp3xnJr
 0hbPjGNU7vMC/4SIwSgEdXyAvlpOr1gm9n+w1GcxpkjLa8l4P+2wt9OX0BSzRUMu
 X7LQxqg2zmQibFy4b1MSmDGsO2dxB2eqvVUT/Ir4b56uqkdtValYRWY75APJIZ5l
 6w0Ja3HVvgOX3pVwSmadCpfMEonN4JE+mfHaP8RajAlTGQcUPq8If9w4BtEoWQRl
 QC7kUCCTmp+isnzH7u4EqEQC6XUqEqeuQH+Bli1pNYTvipHY+9EO1dSxHZoCelgk
 M9PpQz8x+N6ZMcMNtJQVifkfN6tAp/acdWBTZtlpqB8nZR4v5bBndS/5TNBMu3/v
 JfhMEsIUz5o2mWz9qUneyK80RoTRkL5SfdgOclx2Yd7K1fKuJsUjmdXaT08BzotS
 5cxTFZu7EcFJyg4eHemjdyRYr3cUkS19P2uIJle9nj3DAMpCvIyL1c1vn5eB/7MN
 3JeRME6AOQcD7sFgVNuYhGVdIBuwHU6kj4mf1WN27YKGdbsaMZsFz7/HH3SsBLOF
 E0p7Q25HcHSrAimcmLTifN+gol8Y5m6dT0Pjuf9y9QbWvHwh7oRq7DVQ3L8hJkGz
 j64b0jlv43P0QorWvsX6/VJBevWedhe1hZtrBhPFSE0CtJ3Ml4Q=
 =wkSk
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "There are a several user visible changes, the rest is mostly invisible
  and continues to clean up the whole code base.

  User visible changes:
   - new mount option nossd_spread (pair for ssd_spread)

   - mount option subvolid will detect junk after the number and fail
     the mount

   - add message after cancelled device replace

   - direct module dependency on libcrc32, removed own crc wrappers

   - removed user space transaction ioctls

   - use lighter locking when reading /proc/self/mounts, RCU instead of
     mutex to avoid unnecessary contention

  Enhancements:
   - skip writeback of last page when truncating file to same size

   - send: do not issue unnecessary truncate operations

   - mount option token specifiers: use %u for unsigned values, more
     validation

   - selftests: more tree block validations

  qgroups:
   - preparatory work for splitting reservation types for data and
     metadata, this should allow for more accurate tracking and fix some
     issues with underflows or do further enhancements

   - split metadata reservations for started and joined transaction so
     they do not get mixed up and are accounted correctly at commit time

   - with the above, it's possible to revert patch that potentially
     deadlocks when trying to make more space by explicitly committing
     when the quota limit is hit

   - fix root item corruption when multiple same source snapshots are
     created with quota enabled

  RAID56:
   - make sure target is identical to source when raid56 rebuild fails
     after dev-replace

   - faster rebuild during scrub, batch by stripes and not
     block-by-block

   - make more use of cached data when rebuilding from a missing device

  Fixes:
   - null pointer deref when device replace target is missing

   - fix fsync after hole punching when using no-holes feature

   - fix lockdep splat when allocating percpu data with wrong GFP flags

  Cleanups, refactoring, core changes:
   - drop redunant parameters from various functions

   - kill and opencode trivial helpers

   - __cold/__exit function annotations

   - dead code removal

   - continued audit and documentation of memory barriers

   - error handling: handle removal from uuid tree

   - error handling: remove handling of impossible condtitons

   - more debugging or error messages

   - updated tracepoints

   - one VLA use removal (and one still left)"

* tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits)
  btrfs: lift errors from add_extent_changeset to the callers
  Btrfs: print error messages when failing to read trees
  btrfs: user proper type for btrfs_mask_flags flags
  btrfs: split dev-replace locking helpers for read and write
  btrfs: remove stale comments about fs_mutex
  btrfs: use RCU in btrfs_show_devname for device list traversal
  btrfs: update barrier in should_cow_block
  btrfs: use lockdep_assert_held for mutexes
  btrfs: use lockdep_assert_held for spinlocks
  btrfs: Validate child tree block's level and first key
  btrfs: tests/qgroup: Fix wrong tree backref level
  Btrfs: fix copy_items() return value when logging an inode
  Btrfs: fix fsync after hole punching when using no-holes feature
  btrfs: use helper to set ulist aux from a qgroup
  Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"
  btrfs: qgroup: Update trace events for metadata reservation
  btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space
  btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item
  btrfs: qgroup: Use separate meta reservation type for delalloc
  btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS
  ...
2018-04-04 13:03:38 -07:00
David Sterba
a32bf9a302 btrfs: use lockdep_assert_held for mutexes
Using lockdep_assert_held is preferred, replace mutex_is_locked.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 02:01:06 +02:00
Qu Wenruo
581c176041 btrfs: Validate child tree block's level and first key
We have several reports about node pointer points to incorrect child
tree blocks, which could have even wrong owner and level but still with
valid generation and checksum.

Although btrfs check could handle it and print error message like:
leaf parent key incorrect 60670574592

Kernel doesn't have enough check on this type of corruption correctly.
At least add such check to read_tree_block() and btrfs_read_buffer(),
where we need two new parameters @level and @first_key to verify the
child tree block.

The new @level check is mandatory and all call sites are already
modified to extract expected level from its call chain.

While @first_key is optional, the following call sites are skipping such
check:
1) Root node/leaf
   As ROOT_ITEM doesn't contain the first key, skip @first_key check.
2) Direct backref
   Only parent bytenr and level is known and we need to resolve the key
   all by ourselves, skip @first_key check.

Another note of this verification is, it needs extra info from nodeptr
or ROOT_ITEM, so it can't fit into current tree-checker framework, which
is limited to node/leaf boundary.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 02:01:06 +02:00
Qu Wenruo
43b18595d6 btrfs: qgroup: Use separate meta reservation type for delalloc
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.

Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.

Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.

And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.

This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.

And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.

This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.

The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.

So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:14 +02:00
Qu Wenruo
733e03a0b2 btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans
Btrfs uses 2 different methods to reseve metadata qgroup space.

1) Reserve at btrfs_start_transaction() time
   This is quite straightforward, caller will use the trans handler
   allocated to modify b-trees.

   In this case, reserved metadata should be kept until qgroup numbers
   are updated.

2) Reserve by using block_rsv first, and later btrfs_join_transaction()
   This is more complicated, caller will reserve space using block_rsv
   first, and then later call btrfs_join_transaction() to get a trans
   handle.

   In this case, before we modify trees, the reserved space can be
   modified on demand, and after btrfs_join_transaction(), such reserved
   space should also be kept until qgroup numbers are updated.

Since these two types behave differently, split the original "META"
reservation type into 2 sub-types:

  META_PERTRANS:
    For above case 1)

  META_PREALLOC:
    For reservations that happened before btrfs_join_transaction() of
    case 2)

NOTE: This patch will only convert existing qgroup meta reservation
callers according to its situation, not ensuring all callers are at
correct timing.
Such fix will be added in later patches.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:14 +02:00
Jeff Mahoney
75cb379d26 btrfs: defer adding raid type kobject until after chunk relocation
Any time the first block group of a new type is created, we add a new
kobject to sysfs to hold the attributes for that type.  Kobject-internal
allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
While it appears as if this can occur any time a block group is created,
the only times the first block group of a new type can be created in
memory is at mount and when we create the first new block group during
raid conversion.

This patch adds a new list to track pending kobject additions and then
handles them after we do chunk relocation.  Between relocating the
target chunk (or forcing allocation of a new chunk in the case of data)
and removing the old chunk, we're in a safe place for fs-reclaim to
occur.  We're holding the volume mutex, which is already held across
page faults, and the delete_unused_bgs_mutex, which will only stall
the cleaner thread.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:12 +02:00
Jeff Mahoney
dc2d3005d2 btrfs: remove dead create_space_info calls
Since commit 2be12ef79 (btrfs: Separate space_info create/update), we've
separated out the creation and updating of the space info structures.
That commit was a straightforward refactoring of the two parts of
update_space_info, but we can go a step further.  Since commits
c59021f84 (Btrfs: fix OOPS of empty filesystem after balance) and
b742bb82f (Btrfs: Link block groups of different raid types), we know
that the space_info structures will be created at mount and there will
only ever be, at most, three of them.

This patch cleans out the create_space_info calls after __find_space_info
returns NULL since __find_space_info *can't* return NULL.

The initial cause for reviewing this was the kobject_add calls from
create_space_info occuring in sites where fs-reclaim wasn't allowed.  Now
we are certain they occur only early in the mount process and are safe.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:12 +02:00
Nikolay Borisov
0a1e458a1e btrfs: Drop fs_info parameter from __btrfs_run_delayed_refs
It's provided by transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:11 +02:00
Nikolay Borisov
5ead2dd02c btrfs: Drop fs_info parameter from btrfs_finish_extent_commit
It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:11 +02:00
Nikolay Borisov
c79a70b133 btrfs: drop fs_info parameter from btrfs_run_delayed_refs
It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:10 +02:00
Nikolay Borisov
39d7d09dc2 btrfs: Remove unused flush var in shrink_delalloc
Added by 08e007d2e5 ("Btrfs: improve the noflush reservation") and
made redundant by 17024ad0a0 ("Btrfs: fix early ENOSPC due to
delalloc").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:10 +02:00
Nikolay Borisov
101d2dc0b2 btrfs: Remove unused extent_root var from caching_thread
Added by b4570aa994 ("btrfs: fix compiling with CONFIG_BTRFS_DEBUG
enabled.") and obsoleted by 2ff7e61e0d ("btrfs: take an fs_info
directly when the root is not used otherwise").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:10 +02:00
Nikolay Borisov
6f47c706d9 btrfs: Document parameters of btrfs_reserve_extent
This function is the entry to the extent allocator and as such has
quite a number of parameters. Some of those have subtle effects on the
allocation algorithm. Document the parameters.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:56 +02:00
Nikolay Borisov
92e2f7e370 btrfs: Remove btrfs_fs_info::open_ioctl_trans
Since userspace transaction have been removed we no longer have use
for this field so delete it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:51 +02:00
Nikolay Borisov
9678c54388 btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.

Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:

1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.

I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.

Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.

The modinfo confirms that now all the module dependencies are there:

before:
depends:        zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

after:
depends:        libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Qu Wenruo
3e72ee8874 btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
   Double underline prefix is normally for internal functions, while the
   function is used by both extent-tree and volumes.

   Although the name is a little longer, but it should explain its usage
   quite well.

2) Move it to volumes.h and make it static inline
   Just several if-else branches, really no need to define it as a normal
   function.

   This also makes later code re-use between kernel and btrfs-progs
   easier.

3) Remove function get_block_group_index()
   Really no need to do such a simple thing as an exported function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00