Commit Graph

1853 Commits

Author SHA1 Message Date
Johannes Thumshirn
c965d6402f btrfs: handle errors from async submission
Btrfs' async submit mechanism is able to handle errors in the submission
path and the meta-data async submit function correctly passes the error
code to the caller.

In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're
not handling the errors returned by btrfs_csum_one_bio() correctly though
and simply call BUG_ON(). This is unnecessary as the caller of these two
functions - run_one_async_start - correctly checks for the return values
and sets the status of the async_submit_bio. The actual bio submission
will be handled later on by run_one_async_done only if
async_submit_bio::status is 0, so the data won't be written if we
encountered an error in the checksum process.

Simply return the error from btrfs_csum_one_bio() to the async submitters,
like it's done in btree_submit_bio_start().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 18:26:14 +02:00
Pavel Machek
881a3a11c2 btrfs: fix return value mixup in btrfs_get_extent
btrfs_get_extent() sets variable ret, but out: error path expect error
to be in variable err so the error code is lost.

Fixes: 6bf9e4bd6a ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-11 12:30:17 +02:00
Qu Wenruo
1e6e238c30 btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
[BUG]
There is a bug report of NULL pointer dereference caused in
compress_file_extent():

  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
  NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
  LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
  Call Trace:
  [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
  [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
  [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
  [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
  [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
  [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
  [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
  ---[ end trace f16954aa20d822f6 ]---

[CAUSE]
For the following execution route of compress_file_range(), it's
possible to hit NULL pointer dereference:

 compress_file_extent()
 |- pages = NULL;
 |- start = async_chunk->start = 0;
 |- end = async_chunk = 4095;
 |- nr_pages = 1;
 |- inode_need_compress() == false; <<< Possible, see later explanation
 |  Now, we have nr_pages = 1, pages = NULL
 |- cont:
 |- 		ret = cow_file_range_inline();
 |- 		if (ret <= 0) {
 |-		for (i = 0; i < nr_pages; i++) {
 |-			WARN_ON(pages[i]->mapping);	<<< Crash

To enter above call execution branch, we need the following race:

    Thread 1 (chattr)     |            Thread 2 (writeback)
--------------------------+------------------------------
                          | btrfs_run_delalloc_range
                          | |- inode_need_compress = true
                          | |- cow_file_range_async()
btrfs_ioctl_set_flag()    |
|- binode_flags |=        |
   BTRFS_INODE_NOCOMPRESS |
                          | compress_file_range()
                          | |- inode_need_compress = false
                          | |- nr_page = 1 while pages = NULL
                          | |  Then hit the crash

[FIX]
This patch will fix it by checking @pages before doing accessing it.
This patch is only designed as a hot fix and easy to backport.

More elegant fix may make btrfs only check inode_need_compress() once to
avoid such race, but that would be another story.

Reported-by: Luciano Chavez <chavez@us.ibm.com>
Fixes: 4d3a800ebb ("btrfs: merge nr_pages input and output parameter in compress_pages")
CC: stable@vger.kernel.org # 4.14.x: cecc8d9038: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 18:57:59 +02:00
Filipe Manana
3ebac17ce5 btrfs: reduce contention on log trees when logging checksums
The possibility of extents being shared (through clone and deduplication
operations) requires special care when logging data checksums, to avoid
having a log tree with different checksum items that cover ranges which
overlap (which resulted in missing checksums after replaying a log tree).
Such problems were fixed in the past by the following commits:

commit 40e046acbd ("Btrfs: fix missing data checksums after replaying a
                      log tree")

commit e289f03ea7 ("btrfs: fix corrupt log due to concurrent fsync of
                      inodes with shared extents")

Test case generic/588 exercises the scenario solved by the first commit
(purely sequential and deterministic) while test case generic/457 often
triggered the case fixed by the second commit (not deterministic, requires
specific timings under concurrency).

The problems were addressed by deleting, from the log tree, any existing
checksums before logging the new ones. And also by doing the deletion and
logging of the cheksums while locking the checksum range in an extent io
tree (root->log_csum_range), to deal with the case where we have concurrent
fsyncs against files with shared extents.

That however causes more contention on the leaves of a log tree where we
store checksums (and all the nodes in the paths leading to them), even
when we do not have shared extents, or all the shared extents were created
by past transactions. It also adds a bit of contention on the spin lock of
the log_csums_range extent io tree of the log root.

This change adds a 'last_reflink_trans' field to the inode to keep track
of the last transaction where a new extent was shared between inodes
(through clone and deduplication operations). It is updated for both the
source and destination inodes of reflink operations whenever a new extent
(created in the current transaction) becomes shared by the inodes. This
field is kept in memory only, not persisted in the inode item, similar
to other existing fields (last_unlink_trans, logged_trans).

When logging checksums for an extent, if the value of 'last_reflink_trans'
is smaller then the current transaction's generation/id, we skip locking
the extent range and deletion of checksums from the log tree, since we
know we do not have new shared extents. This reduces contention on the
log tree's leaves where checksums are stored.

The following script, which uses fio, was used to measure the impact of
this change:

  $ cat test-fsync.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 3 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=write
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=64k
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  EOF

  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT
  fio /tmp/fio-job.ini
  umount $MNT

The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.

The obtained results were the following (the last line of fio's output was
pasted). Starting with 16 jobs is where a significant difference is
observable in this particular setup and hardware (differences highlighted
below). The very small differences for tests with less than 16 jobs are
possibly just noise and random.

    **** 1 job, file size 1G, fsync frequency 1 ****

before this change:

WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec

after this change:

WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec

    **** 2 jobs, file size 1G, fsync frequency 1 ****

before this change:

WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec

after this change:

WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec

    **** 4 jobs, file size 1G, fsync frequency 1 ****

before this change:

WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec

after this change:

WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec

    **** 8 jobs, file size 1G, fsync frequency 1 ****

before this change:

WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec

after this change:

WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec

    **** 16 jobs, file size 1G, fsync frequency 1 ****

before this change:

WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec

after this change:

WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
(+40.1% throughput, -28.5% runtime)

    **** 32 jobs, file size 1G, fsync frequency 1 ****

before this change:

WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec

after this change:

WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
(+29.4% throughput, -22.7% runtime)

    **** 64 jobs, file size 512M, fsync frequency 1 ****

before this change:

WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec

after this change:

WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
(+29.3% throughput, -22.7% runtime)

    **** 128 jobs, file size 256M, fsync frequency 1 ****

before this change:

WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec

after this change:

WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
(+121.2% throughput, -54.6% runtime)

    **** 256 jobs, file size 256M, fsync frequency 1 ****

before this change:

WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec

after this change:

WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
(+74.9% throughput, -42.8% runtime)

    **** 512 jobs, file size 128M, fsync frequency 1 ****

before this change:

WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec

after this change:

WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
(+65.1% throughput, -39.3% runtime)

    **** 1024 jobs, file size 128M, fsync frequency 1 ****

before this change:

WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec

after this change:

WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
(+48.7% throughput, -33.1% runtime)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:45 +02:00
Nikolay Borisov
814723e0a5 btrfs: increment device corruption error in case of checksum error
Now that btrfs_io_bio have access to btrfs_device we can safely
increment the device corruption counter on error. There is one notable
exception - repair bios for raid. Since those don't go through the
normal submit_stripe_bio callpath but through raid56_parity_recover thus
repair bios won't have their device set.

Scrub increments the corruption counter for checksum mismatch as well
but does not call this function.

Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:40 +02:00
Qu Wenruo
082b6c970f btrfs: free anon block device right after subvolume deletion
[BUG]
When a lot of subvolumes are created, there is a user report about
transaction aborted caused by slow anonymous block device reclaim:

  BTRFS: Transaction aborted (error -24)
  WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
  RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
  Call Trace:
   create_pending_snapshots+0x82/0xa0 [btrfs]
   btrfs_commit_transaction+0x275/0x8c0 [btrfs]
   btrfs_mksubvol+0x4b9/0x500 [btrfs]
   btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
   btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
   btrfs_ioctl+0x11a4/0x2da0 [btrfs]
   do_vfs_ioctl+0xa9/0x640
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x5a/0x110
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace 33f2f83f3d5250e9 ]---
  BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
  BTRFS info (device sda1): forced readonly
  BTRFS warning (device sda1): Skipping commit of aborted transaction.
  BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

[CAUSE]
The anonymous device pool is shared and its size is 1M. It's possible to
hit that limit if the subvolume deletion is not fast enough and the
subvolumes to be cleaned keep the ids allocated.

[WORKAROUND]
We can't avoid the anon device pool exhaustion but we can shorten the
time the id is attached to the subvolume root once the subvolume becomes
invisible to the user.

Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:38 +02:00
Nikolay Borisov
cfdd459215 btrfs: make btrfs_qgroup_check_reserved_leak take btrfs_inode
vfs_inode is used only for the inode number everything else requires
btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use btrfs_ino ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:37 +02:00
Nikolay Borisov
d90944141b btrfs: make btrfs_set_inode_last_trans take btrfs_inode
Instead of making multiple calls to BTRFS_I simply take btrfs_inode as
an input paramter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:37 +02:00
Nikolay Borisov
65d87f7918 btrfs: remove BTRFS_I calls in btrfs_writepage_fixup_worker
All of its children functions use btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
e5b7231e20 btrfs: make btrfs_delalloc_reserve_space take btrfs_inode
All of its children take btrfs_inode so bubble up this requirement to
btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
internally.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
36ea6f3e93 btrfs: make btrfs_check_data_free_space take btrfs_inode
Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
86d52921a2 btrfs: make btrfs_delalloc_release_space take btrfs_inode
It needs btrfs_inode so take it as a parameter directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
25ce28caaa btrfs: make btrfs_free_reserved_data_space take btrfs_inode
It only uses btrfs_inode internally so take it as a parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
9db5d510ac btrfs: make btrfs_free_reserved_data_space_noquota take btrfs_fs_info
No point in taking an inode only to get btrfs_fs_info from it, instead
take btrfs_fs_info directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:35 +02:00
Nikolay Borisov
c2566f2289 btrfs: make btrfs_set_extent_delalloc take btrfs_inode
Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:35 +02:00
Nikolay Borisov
9fc6f911a0 btrfs: make btrfs_new_extent_direct take btrfs_inode
This function really needs a btrfs_inode and not a generic vfs one. Take
it as a parameter and get rid of superfluous BTRFS_I() calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:34 +02:00
Nikolay Borisov
64f54188ea btrfs: make btrfs_create_dio_extent take btrfs_inode
Take btrfs_inode directly and stop using superfulous BTRFS_I calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:34 +02:00
Nikolay Borisov
c1e095202c btrfs: make btrfs_add_ordered_extent_dio take btrfs_inode
Simply forwards its argument so let's get rid of one extra BTRFS_I by
taking btrfs_inode directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:34 +02:00
Nikolay Borisov
98456b9c46 btrfs: make btrfs_run_delalloc_range take btrfs_inode
All children now take btrfs_inode so convert it to taking it as a
parameter as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:34 +02:00
Nikolay Borisov
0c4942258c btrfs: make need_force_cow take btrfs_inode
Gets rid of superfulous BTRFS_I() calls and prepare for converting
btrfs_run_delalloc_range to using btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:34 +02:00
Nikolay Borisov
808a129232 btrfs: make inode_need_compress take btrfs_inode
Simply gets rid of superfluous BTRFS_I() calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:33 +02:00
Nikolay Borisov
99c88dc71c btrfs: make inode_can_compress take btrfs_inode
Gets rid of superfluous BTRFS_I() calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:33 +02:00
Nikolay Borisov
64e1db566d btrfs: make btrfs_cleanup_ordered_extents take btrfs_inode
Preparation to converting btrfs_run_delalloc_range to using btrfs_inode
without BTRFS_I() calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:33 +02:00
Nikolay Borisov
b672b5c156 btrfs: make __endio_write_update_ordered take btrfs_inode
It really wants btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:33 +02:00
Nikolay Borisov
7095821ee1 btrfs: make btrfs_dec_test_first_ordered_pending take btrfs_inode
It doesn't really need vfs_inode but btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:33 +02:00
Nikolay Borisov
751b64318d btrfs: make cow_file_range_async take btrfs_inode
It only uses vfs inode for assigning it to the async_chunk function.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:32 +02:00
Nikolay Borisov
968322c8c6 btrfs: make run_delalloc_nocow take btrfs_inode
It only really uses btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:32 +02:00
Nikolay Borisov
8ba96f3dd6 btrfs: make fallback_to_cow take btrfs_inode
It really wants btrfs_inode and is prepration to converting
run_delalloc_nocow to taking btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:32 +02:00
Nikolay Borisov
c553f94df4 btrfs: make insert_reserved_file_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>c
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:32 +02:00
Nikolay Borisov
72b7d15bf1 btrfs: make btrfs_qgroup_release_data take btrfs_inode
It just forwards its argument to __btrfs_qgroup_release_data.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:32 +02:00
Nikolay Borisov
a0ff10dcc4 btrfs: make submit_compressed_extents take btrfs_inode
All but 3 uses require vfs_inode so convert the logic to have
btrfs_inode be the main inode struct.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:31 +02:00
Nikolay Borisov
c7ee1819dc btrfs: make btrfs_submit_compressed_write take btrfs_inode
Majority of its uses are for btrfs_inode so take it as an argument
directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:31 +02:00
Nikolay Borisov
4cc612090b btrfs: make btrfs_add_ordered_extent_compress take btrfs_inode
It simpy forwards its inode argument to __btrfs_add_ordered_extent which
already takes btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:31 +02:00
Nikolay Borisov
6e26c44223 btrfs: make cow_file_range take btrfs_inode
All its children functions take btrfs_inode so convert it to taking
btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:31 +02:00
Nikolay Borisov
e7fbf60453 btrfs: make btrfs_add_ordered_extent take btrfs_inode
Preparation to converting its callers to taking btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:31 +02:00
Nikolay Borisov
a0349401c1 btrfs: make cow_file_range_inline take btrfs_inode
It has only 2 uses for the vfs_inode - insert_inline_extent and
i_size_read.  On the flipside it will allow converting its callers to
btrfs_inode, so convert it to taking btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:30 +02:00
Nikolay Borisov
8b8a979f1f btrfs: make btrfs_qgroup_free_data take btrfs_inode
It passes btrfs_inode to its callee so change the interface.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:30 +02:00
Qu Wenruo
38d37aa9c3 btrfs: refactor btrfs_check_can_nocow() into two variants
The function btrfs_check_can_nocow() now has two completely different
call patterns.

For nowait variant, callers don't need to do any cleanup.  While for
wait variant, callers need to release the lock if they can do nocow
write.

This is somehow confusing, and is already a problem for the exported
btrfs_check_can_nocow().

So this patch will separate the different patterns into different
functions.
For nowait variant, the function will be called check_nocow_nolock().
For wait variant, the function pair will be btrfs_check_nocow_lock()
btrfs_check_nocow_unlock().

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:28 +02:00
Qu Wenruo
e4ecaf90bc btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent()
These two functions have extra conditions that their callers need to
meet, and some not-that-common parameters used for return value.

So adding some comments may save reviewers some time.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:28 +02:00
Qu Wenruo
6d4572a9d7 btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation
[BUG]
When the data space is exhausted, even if the inode has NOCOW attribute,
we will still refuse to truncate unaligned range due to ENOSPC.

The following script can reproduce it pretty easily:
  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  umount $dev &> /dev/null
  umount $mnt &> /dev/null

  mkfs.btrfs -f $dev -b 1G
  mount -o nospace_cache $dev $mnt
  touch $mnt/foobar
  chattr +C $mnt/foobar

  xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
  xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
  sync

  xfs_io -c "fpunch 0 2k" $mnt/foobar
  umount $mnt

Currently this will fail at the fpunch part.

[CAUSE]
Because btrfs_truncate_block() always reserves space without checking
the NOCOW attribute.

Since the writeback path follows NOCOW bit, we only need to bother the
space reservation code in btrfs_truncate_block().

[FIX]
Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
reserve data space first, and fall back to NOCOW check only when we
don't have enough space.

Such always-try-reserve is an optimization introduced in
btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.

This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
use it in btrfs_truncate_block() to fix the problem.

Reported-by: Martin Doucha <martin.doucha@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:28 +02:00
David Sterba
bab16e21e8 btrfs: don't use UAPI types for fiemap callback
The fiemap callback is not part of UAPI interface and the prototypes
don't have the __u64 types either.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:27 +02:00
Nikolay Borisov
906c448c3d btrfs: make __btrfs_drop_extents take btrfs_inode
It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
interface with the non  __ prefixed version. Will also makes converting
its callers to btrfs_inode easier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Nikolay Borisov
bd242a08a6 btrfs: make btrfs_csum_one_bio takae btrfs_inode
Will enable converting btrfs_submit_compressed_write to btrfs_inode more
easily.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Nikolay Borisov
ad7ff17b65 btrfs: make extent_clear_unlock_delalloc take btrfs_inode
It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
interface will allow seamless conversion of its callers.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Nikolay Borisov
4b67c11dd1 btrfs: make create_io_em take btrfs_inode
It really wants a btrfs_inode and will allow submit_compressed_extents
to be completely converted to btrfs_inode in follow up patches.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Nikolay Borisov
7bfa953501 btrfs: make btrfs_reloc_clone_csums take btrfs_inode
It really wants btrfs_inode and not a vfs inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Nikolay Borisov
c350437269 btrfs: make btrfs_lookup_ordered_extent take btrfs_inode
It doesn't use the generic vfs inode for anything use btrfs_inode
directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:25 +02:00
Nikolay Borisov
43c69849ae btrfs: make get_extent_allocation_hint take btrfs_inode
It doesn't use the vfs inode for anything, can just as easily take
btrfs_inode.  Follow up patches will convert callers as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:25 +02:00
Qu Wenruo
7dbeaad0af btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak
[BUG]
The following simple workload from fsstress can lead to qgroup reserved
data space leak:
  0/0: creat f0 x:0 0 0
  0/0: creat add id=0,parent=-1
  0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
  0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
  0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0

This would cause btrfs qgroup to leak 20480 bytes for data reserved
space.  If btrfs qgroup limit is enabled, such leak can lead to
unexpected early EDQUOT and unusable space.

[CAUSE]
When doing direct IO, kernel will try to writeback existing buffered
page cache, then invalidate them:
  generic_file_direct_write()
  |- filemap_write_and_wait_range();
  |- invalidate_inode_pages2_range();

However for btrfs, the bi_end_io hook doesn't finish all its heavy work
right after bio ends.  In fact, it delays its work further:

  submit_extent_page(end_io_func=end_bio_extent_writepage);
  end_bio_extent_writepage()
  |- btrfs_writepage_endio_finish_ordered()
     |- btrfs_init_work(finish_ordered_fn);

  <<< Work queue execution >>>
  finish_ordered_fn()
  |- btrfs_finish_ordered_io();
     |- Clear qgroup bits

This means, when filemap_write_and_wait_range() returns,
btrfs_finish_ordered_io() is not guaranteed to be executed, thus the
qgroup bits for related range are not cleared.

Now into how the leak happens, this will only focus on the overlapping
part of buffered and direct IO part.

1. After buffered write
   The inode had the following range with QGROUP_RESERVED bit:
   	596		616K
	|///////////////|
   Qgroup reserved data space: 20K

2. Writeback part for range [596K, 616K)
   Write back finished, but btrfs_finish_ordered_io() not get called
   yet.
   So we still have:
   	596K		616K
	|///////////////|
   Qgroup reserved data space: 20K

3. Pages for range [596K, 616K) get released
   This will clear all qgroup bits, but don't update the reserved data
   space.
   So we have:
   	596K		616K
	|		|
   Qgroup reserved data space: 20K
   That number doesn't match the qgroup bit range anymore.

4. Dio prepare space for range [596K, 700K)
   Qgroup reserved data space for that range, we got:
   	596K		616K			700K
	|///////////////|///////////////////////|
   Qgroup reserved data space: 20K + 104K = 124K

5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
   Qgroup free reserved space for that range, we got:
   	596K		616K			700K
	|		|///////////////////////|
   We need to free that range of reserved space.
   Qgroup reserved data space: 124K - 20K = 104K

6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
   However qgroup bit for range [596K, 616K) is already cleared in
   previous step, so we only free 84K for qgroup reserved space.
   	596K		616K			700K
	|		|			|
   We need to free that range of reserved space.
   Qgroup reserved data space: 104K - 84K = 20K

   Now there is no way to release that 20K unless disabling qgroup or
   unmounting the fs.

[FIX]
This patch will change the timing of btrfs_qgroup_release/free_data()
call.  Here it uses buffered COW write as an example.

	The new timing			|	The old timing
----------------------------------------+---------------------------------------
 btrfs_buffered_write()			| btrfs_buffered_write()
 |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
					|
 btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
 |- btrfs_add_ordered_extent()  	|
    |- btrfs_qgroup_release_data()	|
       The reserved is passed into	|
       btrfs_ordered_extent structure	|
					|
 btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
 |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
    btrfs_qgroup_record			|    The resereved space is passed
					|    to btrfs_qgroup_recrod
					|
 btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
 |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()

The point of such change is to ensure, when ordered extents are
submitted, the qgroup reserved space is already released, to keep the
timing aligned with file_write_and_wait_range().

So that qgroup data reserved space is all bound to btrfs_ordered_extent
and solve the timing mismatch.

Fixes: f695fdcef8 ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:24 +02:00
Qu Wenruo
9729f10a60 btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent()
This is to prepare for the incoming timing change of qgroup reserved
data space and ordered extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:24 +02:00