Commit Graph

968538 Commits

Author SHA1 Message Date
Josef Bacik
bbb86a3717 btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
I got the following lockdep splat

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0+ #101 Not tainted
  ------------------------------------------------------
  btrfs-cleaner/3445 is trying to acquire lock:
  ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

  but task is already holding lock:
  ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
	 down_write+0x3d/0x70
	 btrfs_cache_block_group+0x2d5/0x510
	 find_free_extent+0xb6e/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&space_info->groups_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 find_free_extent+0x2ed/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0xb9/0x3d0
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x32/0x170
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x614/0x9d0
	 btrfs_find_root+0x35/0x1b0
	 btrfs_read_tree_root+0x61/0x120
	 btrfs_get_root_ref+0x14b/0x600
	 find_parent_nodes+0x3e6/0x1b30
	 btrfs_find_all_roots_safe+0xb4/0x130
	 btrfs_find_all_roots+0x60/0x80
	 btrfs_qgroup_trace_extent_post+0x27/0x40
	 btrfs_add_delayed_data_ref+0x3fd/0x460
	 btrfs_free_extent+0x42/0x100
	 __btrfs_mod_ref+0x1d7/0x2f0
	 walk_up_proc+0x11c/0x400
	 walk_up_tree+0xf0/0x180
	 btrfs_drop_snapshot+0x1c7/0x780
	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
	 cleaner_kthread+0xd4/0x140
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->commit_root_sem);
				 lock(&space_info->groups_sem);
				 lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  3 locks held by btrfs-cleaner/3445:
   #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
   #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
   #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  stack backtrace:
  CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   __lock_acquire+0x1167/0x2150
   ? __bfs+0x42/0x210
   lock_acquire+0xb9/0x3d0
   ? __btrfs_tree_read_lock+0x32/0x170
   down_read_nested+0x43/0x130
   ? __btrfs_tree_read_lock+0x32/0x170
   __btrfs_tree_read_lock+0x32/0x170
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x614/0x9d0
   ? find_held_lock+0x2b/0x80
   btrfs_find_root+0x35/0x1b0
   ? do_raw_spin_unlock+0x4b/0xa0
   btrfs_read_tree_root+0x61/0x120
   btrfs_get_root_ref+0x14b/0x600
   find_parent_nodes+0x3e6/0x1b30
   btrfs_find_all_roots_safe+0xb4/0x130
   btrfs_find_all_roots+0x60/0x80
   btrfs_qgroup_trace_extent_post+0x27/0x40
   btrfs_add_delayed_data_ref+0x3fd/0x460
   btrfs_free_extent+0x42/0x100
   __btrfs_mod_ref+0x1d7/0x2f0
   walk_up_proc+0x11c/0x400
   walk_up_tree+0xf0/0x180
   btrfs_drop_snapshot+0x1c7/0x780
   ? btrfs_clean_one_deleted_snapshot+0x73/0x110
   btrfs_clean_one_deleted_snapshot+0xfb/0x110
   cleaner_kthread+0xd4/0x140
   ? btrfs_alloc_root+0x50/0x50
   kthread+0x13a/0x150
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

while testing another lockdep fix.  This happens because we're using the
commit_root_sem to protect fs_info->caching_block_groups, which creates
a dependency on the groups_sem -> commit_root_sem, which is problematic
because we will allocate blocks while holding tree roots.  Fix this by
making the list itself protected by the fs_info->block_group_cache_lock.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
e747853cae btrfs: load free space cache asynchronously
While documenting the usage of the commit_root_sem, I noticed that we do
not actually take the commit_root_sem in the case of the free space
cache.  This is problematic because we're supposed to hold that sem
while we're reading the commit roots, which is what we do for the free
space cache.

The reason I did it inline when I originally wrote the code was because
there's the case of unpinning where we need to make sure that the free
space cache is loaded if we're going to use the free space cache.  But
we can accomplish the same thing by simply waiting for the cache to be
loaded.

Rework this code to load the free space cache asynchronously.  This
allows us to greatly cleanup the caching code because now it's all
shared by the various caching methods.  We also are now in a position to
have the commit_root semaphore held while we're loading the free space
cache.  And finally our modification of ->last_byte_to_unpin is removed
because it can be handled in the proper way on commit.

Some care must be taken when replaying the log, when we expect that the
free space cache will be read entirely before we start excluding space
to replay. This could lead to overwriting space during replay.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
4d7240f0ab btrfs: load the free space cache inode extents from commit root
Historically we've allowed recursive locking specifically for the free
space inode.  This is because we are only doing reads and know that it's
safe.  However we don't actually need this feature, we can get away with
reading the commit root for the extents.  In fact if we want to allow
asynchronous loading of the free space cache we have to use the commit
root, otherwise we will deadlock.

Switch to using the commit root for the file extents.  These are only
read at load time, and are replaced as soon as we start writing the
cache out to disk.  The cache is never read again, so this is
legitimate.  This matches what we do for the inode itself, as we read
that from the commit root as well.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
cd79909bc7 btrfs: load free space cache into a temporary ctl
The free space cache has been special in that we would load it right
away instead of farming the work off to a worker thread.  This resulted
in some weirdness that had to be taken into account for this fact,
namely that if we every found a block group being cached the fast way we
had to wait for it to finish, because we could get the cache before it
had been validated and we may throw the cache away.

To handle this particular case instead create a temporary
btrfs_free_space_ctl to load the free space cache into.  Then once we've
validated that it makes sense, copy it's contents into the actual
block_group->free_space_ctl.  This allows us to avoid the problems of
needing to wait for the caching to complete, we can clean up the discard
extent handling stuff in __load_free_space_cache, and we no longer need
to do the merge_space_tree() because the space is added one by one into
the real free_space_ctl.  This will allow further reworks of how we
handle loading the free space cache.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
66b53bae46 btrfs: cleanup btrfs_discard_update_discardable usage
This passes in the block_group and the free_space_ctl, but we can get
this from the block group itself.  Part of this is because we call it
from __load_free_space_cache, which can be called for the inode cache as
well.

Move that call into the block group specific load section, wrap it in
the right lock that we need for the assertion (but otherwise this is
safe without the lock because this happens in single-thread context).

Fix up the arguments to only take the block group.  Add a lockdep_assert
as well for good measure to make sure we don't mess up the locking
again.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
2ca08c56e8 btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
Currently unpin_extent_range happens in the transaction commit context,
so we are protected from ->last_byte_to_unpin changing while we're
unpinning, because any new transactions would have to wait for us to
complete before modifying ->last_byte_to_unpin.

However in the future we may want to change how this works, for instance
with async unpinning or other such TODO items.  To prepare for that
future explicitly protect ->last_byte_to_unpin with the commit_root_sem
so we are sure it won't change while we're doing our work.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
27d56e62e4 btrfs: update last_byte_to_unpin in switch_commit_roots
While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching.  Consider
the following scenario

 commit root
 +----+----+----+----+----+----+----+
 |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
 +----+----+----+----+----+----+----+
 0    1    2    3    4    5    6    7

 new commit root
 +----+----+----+----+----+----+----+
 |    |    |    |\\\\|    |    |\\\\|
 +----+----+----+----+----+----+----+
 0    1    2    3    4    5    6    7

Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots.  In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1.  Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it.  Now caching_ctl->progress == 3.  We swap out the commit root and
carry on to unpin.

The race can happen like:

  1) The caching thread was running using the old commit root when it
     found the extent for [2, 3);

  2) Then it released the commit_root_sem because it was in the last
     item of a leaf and the semaphore was contended, and set ->progress
     to 3 (value of 'last'), as the last extent item in the current leaf
     was for the extent for range [2, 3);

  3) Next time it gets the commit_root_sem, will start using the new
     commit root and search for a key with offset 3, so it never finds
     the hole for [2, 3).

  So the caching thread never saw [2, 3) as free space in any of the
  commit roots, and by the time finish_extent_commit() was called for
  the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
  subrange [0, 1) to the free space cache, skipping [2, 3).

In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3).  However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache.  This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).

Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.

CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
9076dbd5ee btrfs: do not shorten unpin len for caching block groups
While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space.  This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin, however we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage.  This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.

Fix this by using a local variable for the amount to add to free space
cache, and leave len untouched in this case.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
David Sterba
dc51616486 btrfs: reorder extent buffer members for better packing
After the rwsem replaced the tree lock implementation, the extent buffer
got smaller but leaving some holes behind. By changing log_index type
and reordering, we can squeeze the size further to 240 bytes, measured on
release config on x86_64. Log_index spans only 3 values and needs to be
signed.

Before:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        atomic_t                   io_pages;             /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        pid_t                      lock_owner;           /*    64     4 */
        bool                       lock_recursed;        /*    68     1 */

        /* XXX 3 bytes hole, try to pack */

        struct rw_semaphore        lock;                 /*    72    40 */
        short int                  log_index;            /*   112     2 */

        /* XXX 6 bytes hole, try to pack */

        struct page *              pages[16];            /*   120   128 */

        /* size: 248, cachelines: 4, members: 14 */
        /* sum members: 239, holes: 2, sum holes: 9 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

After:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        atomic_t                   io_pages;             /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        pid_t                      lock_owner;           /*    64     4 */
        bool                       lock_recursed;        /*    68     1 */
        s8                         log_index;            /*    69     1 */

        /* XXX 2 bytes hole, try to pack */

        struct rw_semaphore        lock;                 /*    72    40 */
        struct page *              pages[16];            /*   112   128 */

        /* size: 240, cachelines: 4, members: 14 */
        /* sum members: 238, holes: 1, sum holes: 2 */
        /* forced alignments: 1 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
b9729ce014 btrfs: locking: rip out path->leave_spinning
We no longer distinguish between blocking and spinning, so rip out all
this code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
ac5887c8e0 btrfs: locking: remove all the blocking helpers
Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning.  Remove these helpers and all the places they are
called.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
2ae0c2d80d btrfs: scrub: remove local copy of csum_size from context
The context structure unnecessarily stores copy of the checksum size,
that can be now easily obtained from fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
419b791ce7 btrfs: check integrity: remove local copy of csum_size
The state structure unnecessarily stores copy of the checksum size, that
can be now easily obtained from fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
713cebfb98 btrfs: remove unnecessary local variables for checksum size
Remove local variable that is then used just once and replace it with
fs_info::csum_size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:00 +01:00
David Sterba
223486c27b btrfs: switch cached fs_info::csum_size from u16 to u32
The fs_info value is 32bit, switch also the local u16 variables. This
leads to a better assembly code generated due to movzwl.

This simple change will shave some bytes on x86_64 and release config:

   text    data     bss     dec     hex filename
1090000   17980   14912 1122892  11224c pre/btrfs.ko
1089794   17980   14912 1122686  11217e post/btrfs.ko

DELTA: -206

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba
55fc29bed8 btrfs: use cached value of fs_info::csum_size everywhere
btrfs_get_16 shows up in the system performance profiles (helper to read
16bit values from on-disk structures). This is partially because of the
checksum size that's frequently read along with data reads/writes, other
u16 uses are from item size or directory entries.

Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba
fe5ecbe818 btrfs: precalculate checksums per leaf once
btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a
candidate for optimizations. After the 64bit division has been replaced
by shift, there's still a calculation done each time the function is
called: checksums per leaf.

As this is a constant value for the entire filesystem lifetime, we
can calculate it once at mount time and reuse. This also allows to
reduce the division to 64bit/32bit as we know the constant will always
fit the 32bit type.

Replace the open-coded rounding up with a macro that internally handles
the 64bit division and as it's now a short function, make it static
inline (slight code increase, slight stack usage reduction).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
22b6331d96 btrfs: store precalculated csum_size in fs_info
In many places we need the checksum size and it is inefficient to read
it from the raw superblock. Store the value into fs_info, actual use
will be in followup patches.  The size is u32 as it allows to generate
better assembly than with u16.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
265fdfa6ce btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits
The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
098e63082b btrfs: replace div_u64 by shift in free_space_bitmap_size
Change free_space_bitmap_size to take btrfs_fs_info so we can get the
sectorsize_bits to do calculations.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:57 +01:00
David Sterba
ab108d992b btrfs: use precalculated sectorsize_bits from fs_info
We do a lot of calculations where we divide or multiply by sectorsize.
We also know and make sure that sectorsize is a power of two, so this
means all divisions can be turned to shifts and avoid eg. expensive
u64/u32 divisions.

The type is u32 as it's more register friendly on x86_64 compared to u8
and the resulting assembly is smaller (movzbl vs movl).

There's also superblock s_blocksize_bits but it's usually one more
pointer dereference farther than fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:57 +01:00
Qu Wenruo
e940e9a7c7 btrfs: rename page_size to io_size in submit_extent_page
The variable @page_size in submit_extent_page() is not related to page
size.

It can already be smaller than PAGE_SIZE, so rename it to io_size to
reduce confusion, this is especially important for later subpage
support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:56 +01:00
Qu Wenruo
8b8bbd461e btrfs: only require sector size alignment for page read
If we're reading partial page, btrfs will warn about this as read/write
is always done in sector size, which now equals page size.

But for the upcoming subpage read-only support, our data read is only
aligned to sectorsize, which can be smaller than page size.

Thus here we change the warning condition to check it against
sectorsize, the behavior is not changed for regular sectorsize ==
PAGE_SIZE case, and won't report error for subpage read.

Also, pass the proper start/end with bv_offset for check_data_csum() to
handle.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:56 +01:00
Qu Wenruo
12e3360f74 btrfs: rename pages_locked in process_pages_contig()
Function process_pages_contig() does not only handle page locking but
also other operations.  Rename the local variable pages_locked to
pages_processed to reduce confusion.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:55 +01:00
Qu Wenruo
265d4ac03f btrfs: sink parameter start and len to check_data_csum
For check_data_csum(), the page we're using is directly from the inode
mapping, thus it has valid page_offset().

We can use (page_offset() + pg_off) to replace @start parameter
completely, while the @len should always be sectorsize.

Since we're here, also add some comment, as there are quite some
confusion in words like start/offset, without explaining whether it's
file_offset or logical bytenr.

This should not affect the existing behavior, as for current sectorsize
== PAGE_SIZE case, @pgoff should always be 0, and len is always
PAGE_SIZE (or sectorsize from the dio read path).

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:55 +01:00
Qu Wenruo
8896a08d8e btrfs: replace fs_info and private_data with inode in btrfs_wq_submit_bio
All callers of btrfs_wq_submit_bio() pass struct inode as @private_data,
so there is no need for it to be (void *), replace it with "struct inode
*inode".

While we can extract fs_info from struct inode, also remove the @fs_info
parameter.

Since we're here, also replace all the (void *private_data) into (struct
inode *inode).

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:54 +01:00
Qu Wenruo
3f6bb4aeb5 btrfs: sink the failed_start parameter to set_extent_bit
The @failed_start parameter is only paired with @exclusive_bits, and
those parameters are only used for EXTENT_LOCKED bit, which have their
own wrappers lock_extent_bits().

Thus for regular set_extent_bit() calls, the failed_start makes no
sense, just sink the parameter.

Also, since @failed_start and @exclusive_bits are used in pairs, add
an assert to make it obvious.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:54 +01:00
Qu Wenruo
03509b781a btrfs: update the comment for find_first_extent_bit
The pitfall here is, if the parameter @bits has multiple bits set, we
will return the first range which just has one of the specified bits
set.

This is a little tricky if we want an exact match.  Anyway, update the
comment to make that clear.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:53 +01:00
Qu Wenruo
a3efb2f0ba btrfs: fix the comment on lock_extent_buffer_for_io
The return value of that function is completely wrong.

That function only returns 0 if the extent buffer doesn't need to be
submitted.  The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".

And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.

While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.

The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.

So the comment is completely wrong.

Since we're here, also change the description a little.  The write bio
flushing won't be visible to the caller, thus it's not an major feature.
In the main description, only describe the locking part to make the
point more clear.

For reference, added in commit 2e3c25136a ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:53 +01:00
David Sterba
cc7c77146e btrfs: remove unnecessary casts in printk
Long time ago the explicit casts were necessary for u64 but we don't
need it.  Remove casts where the type matches, leaving only cases that
cast sector_t or loff_t.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:52 +01:00
David Sterba
c842268458 btrfs: add set/get accessors for root_item::drop_level
The drop_level member is used directly unlike all the other int types in
root_item. Add the definition and use it everywhere. The type is u8 so
there's no conversion necessary and the helpers are properly inlined,
this is for consistency.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:52 +01:00
David Sterba
f944d2cb20 btrfs: use root_item helpers for limit and flags in btrfs_create_tree
For consistency use the available helpers to set flags and limit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
3b5418fba3 btrfs: check-integrity: use proper helper to access btrfs_header
There's one raw use of le->cpu conversion but we have a helper to do
that for us, so use it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
09e3a28892 btrfs: send: use helpers to access root_item::ctransid
We have helpers to access the on-disk item members, use that for
root_item::ctransid instead of raw le64_to_cpu.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
ab1405aa25 btrfs: generate lockdep keyset names at compile time
The names in btrfs_lockdep_keysets are generated from a simple pattern
using snprintf but we can generate them directly with some macro magic
and remove the helpers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:50 +01:00
David Sterba
387824afd7 btrfs: use the right number of levels for lockdep keysets
BTRFS_MAX_LEVEL is 8 and the keyset table is supposed to have a key for
each level, but we'll never have more than 8 levels.  The values passed
to btrfs_set_buffer_lockdep_class are always derived from a valid extent
buffer.  Set the array sizes to the right value.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:50 +01:00
Goldwyn Rodrigues
ecfdc08b8c btrfs: remove dio iomap DSYNC workaround
This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:49 +01:00
Goldwyn Rodrigues
a42fa64316 btrfs: call iomap_dio_complete() without inode_lock
If direct writes are called with O_DIRECT | O_DSYNC, it will result in a
deadlock because iomap_dio_rw() is called under i_rwsem which calls:

  iomap_dio_complete()
    generic_write_sync()
      btrfs_sync_file()

btrfs_sync_file() requires i_rwsem, so call __iomap_dio_rw() with the
i_rwsem locked, and call iomap_dio_complete() after unlocking i_rwsem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:49 +01:00
Goldwyn Rodrigues
502756b380 btrfs: remove btrfs_inode::dio_sem
The inode dio_sem can be eliminated because all DIO synchronization is
now performed through inode->i_rwsem that provides the same guarantees.

This reduces btrfs_inode size by 40 bytes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:48 +01:00
Goldwyn Rodrigues
e9adabb971 btrfs: use shared lock for direct writes within EOF
Direct writes within EOF are safe to be performed with inode shared lock
to improve parallelization with other direct writes or reads because EOF
is not changed and there is no race with truncate().

Direct reads are already performed under shared inode lock.

This patch is precursor to removing btrfs_inode->dio_sem.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:48 +01:00
Goldwyn Rodrigues
c352370633 btrfs: push inode locking and unlocking into buffered/direct write
Push inode locking and unlocking closer to where we perform the I/O. For
this we need to move the write checks inside the respective functions as
well.

pos is evaluated after generic_write_checks because O_APPEND can change
iocb->ki_pos.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:48 +01:00
Goldwyn Rodrigues
a14b78ad06 btrfs: introduce btrfs_inode_lock()/unlock()
btrfs_inode_lock/unlock() are wrappers around inode locks, separating
the type of lock and actual locking.

- 0 - default, exclusive lock
- BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO
- BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence

The bits SHARED and TRY can be combined together.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:47 +01:00
Goldwyn Rodrigues
b8d8e1fd57 btrfs: introduce btrfs_write_check()
btrfs_write_check() checks write parameters in one place before
beginning a write. This does away with inode_unlock() after every check.
In the later patches, it will help push inode_lock/unlock() in buffered
and direct write functions respectively.

generic_write_checks needs to be called before as it could truncate
iov_iter and its return used as count.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:47 +01:00
Goldwyn Rodrigues
c86537a42f btrfs: check FS error state bit early during write
fs_info::fs_state is a filesystem bit check as opposed to inode and can
be performed before we begin with write checks. This eliminates inode
lock/unlock in case the error bit is set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:46 +01:00
Goldwyn Rodrigues
5e8b9ef303 btrfs: move pos increment and pagecache extension to btrfs_buffered_write
While we do this, correct the call to pagecache_isize_extended:

 - pagecache_isize_extended needs to be called to the start of the write
   as opposed to i_size

 - we don't need to check range before the call, this is done in the
   function

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:46 +01:00
Goldwyn Rodrigues
4e4cabece9 btrfs: split btrfs_direct_IO to read and write
The read and write DIO don't have anything in common except for the
call to iomap_dio_rw. Extract the write call into a new function to get
rid of conditional statements for direct write.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:45 +01:00
Anand Jain
3d8cc17a05 btrfs: sysfs: add per-fs attribute for read policy
Add

 /sys/fs/btrfs/UUID/read_policy

attribute so that the read policy for the raid1, raid1c34 and raid10 can
be tuned.

When this attribute is read, it will show all available policies, with
active policy in [ ]. The read_policy attribute can be written using one
of the items listed in there.

For example:
  $ cat /sys/fs/btrfs/UUID/read_policy
  [pid]
  $ echo pid > /sys/fs/btrfs/UUID/read_policy

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:45 +01:00
Anand Jain
33fd2f714c btrfs: create read policy framework
As of now, we use the pid method to read striped mirrored data, which
means process id determines the stripe id to read. This type of routing
typically helps in a system with many small independent processes tying
to read random data. On the other hand, the pid based read IO policy is
inefficient because if there is a single process trying to read a large
file, the overall disk bandwidth remains underutilized.

So this patch introduces a read policy framework so that we could add
more read policies, such as IO routing based on the device's wait-queue
or manual when we have a read-preferred device or a policy based on the
target storage caching.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:44 +01:00
Anand Jain
aaefed2078 btrfs: add helper for string match ignoring leading/trailing whitespace
Add a generic helper to match the string in a given buffer, and ignore
the leading and trailing whitespace.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename variables, add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:44 +01:00
Filipe Manana
88090ad36a btrfs: do not start and wait for delalloc on snapshot roots on transaction commit
We do not need anymore to start writeback for delalloc of roots that are
being snapshotted and wait for it to complete. This was done in commit
609e804d77 ("Btrfs: fix file corruption after snapshotting due to mix
of buffered/DIO writes") to fix a type of file corruption where files in a
snapshot end up having their i_size updated in a non-ordered way, leaving
implicit file holes, when buffered IO writes that increase a file's size
are followed by direct IO writes that also increase the file's size.

This is not needed anymore because we now have a more generic mechanism
to prevent a non-ordered i_size update since commit 9ddc959e80
("btrfs: use the file extent tree infrastructure"), which addresses this
scenario involving snapshots as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:43 +01:00