This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.
Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for refactoring pinned extents tracking.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_pin_reserved_extent is now only called with a valid transaction so
exploit the fact to take a transaction. This is preparation for tracking
pinned extents on a per-transaction basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for switching pinned extent tracking to a per-transaction
basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.
Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.
Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a report where objtool detects unreachable instructions, eg.:
fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction
This seems to be a false positive due to compiler version. The cause is
in the ASSERT macro implementation that does the conditional check as
IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.
To avoid that, use the ifdefs directly.
There are still 2 reports that aren't fixed:
fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep track of how much we are discarding and how often we are reusing
with async discard. The discard_*_bytes values don't need any special
protection because the work item provides the single threaded access.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.
This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Expose max_discard_size as a tunable via sysfs and switch the current
fixed maximum to the default value.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Provide the ability to rate limit based on kbps in addition to iops as
additional guides for the target discard rate. The delay used ends up
being max(kbps_delay, iops_delay).
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
An earlier patch keeps track of discardable_extents. These are
undiscarded extents managed by the free space cache. Here, we will use
this to dynamically calculate the discard delay interval.
There are 3 rate to consider. The first is the target convergence rate,
the rate to discard all discardable_extents over the
BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.
The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl. The current counter value
becomes the previous counter value after update.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Setup base sysfs directory for discard stats + tunables.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only allowed attributes to be exposed in debug/. Let's let other
groups be created by making debug its own kobject.
This also makes the per-fs debug options separate from the global
features mount attributes. This seems to be needed as
sysfs_create_files() requires const struct attribute * while
sysfs_create_group() can take struct attribute *. This seems nicer as
per file system, you'll probably use to_fs_info().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.
This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.
For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.
On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.
discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701
discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200
While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.
While we're here, let's document btrfs_get_extent().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can encode this in the offset parameter: -1 means use the page
offsets, anything else is a valid offset.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we have two wrappers for __btrfs_lookup_bio_sums():
btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and
btrfs_lookup_bio_sums(), which is used everywhere else. The only
difference is that the _dio variant looks up csums starting at the given
offset instead of using the page index, which isn't actually direct
I/O-specific. Let's clean up the signature and return value of
__btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get
rid of the trivial helpers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent now performs the actions of
btrfs_free_and_pin_reserved_extent. But this name is a bit of a
misnomer, since the extent is not really freed but just pinned. Reflect
this in the new name. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.
The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.
The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.
Signed-off-by: David Sterba <dsterba@suse.com>
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently all the checksum algorithms generate a fixed size digest size
and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.
Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.
The only place that uses the driver name is the crypto API setup.
Signed-off-by: David Sterba <dsterba@suse.com>
Export supported checksum algorithms via sysfs in the list of static
features:
/sys/fs/btrfs/features/supported_checksums
Space spearated list of checksum algorithm names.
Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .
The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch 32b593bfcb ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.
Fixes: 32b593bfcb ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a structure to encode the type and length for the known on-disk
checksums. This makes it easier to add new checksums later.
The structure and helpers are moved from ctree.h so they don't occupy
space in all headers including ctree.h. This save some space in the
final object.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Further simplifaction of the get/set helpers is possible when the token
is uniquely tied to an extent buffer. A condition and an assignment can
be avoided.
The initializations are moved closer to the first use when the extent
buffer is valid. There's one exception in __push_leaf_left where the
token is reused.
Signed-off-by: David Sterba <dsterba@suse.com>
There are helpers for all type widths defined via macro and optionally
can use a token which is a cached pointer to avoid repeated mapping of
the extent buffer.
The token value is known at compile time, when it's valid it's always
address of a local variable, otherwise it's NULL passed by the
token-less helpers.
This can be utilized to remove some branching as the helpers are used
frequenlty.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_extref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_name_in_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_ref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The other dev stats functions are already there and the helpers are not
used by anything else.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_ctl structure is used for free space management, and used only by
the v1 space cache code, but unfortunatlly the full definition is
required by block-group.h so it can't be moved to free-space-cache.c
without additional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Send is the only user of tree_compare, we can move it there along with
the other helpers and definitions.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work for code that will be moved out of ctree and uses this
function.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The file ctree.h serves as a header for everything and has become quite
bloated. Split some helpers that are generic and create a new file that
should be the catch-all for code that's not btrfs-specific.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, it has been made a relation with the use of BTRFS
filesystem and with SLUB_DEBUG turned on.
[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten
[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38
Commit 69d2480456 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But allthough bitmaps
have the size of a page, they were allocated with kzalloc().
Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return unaligned pointer.
On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are writen to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.
To fix it, allocate bitmaps through kmem_cache instead of using kzalloc()
The cache pool is created with PAGE_SIZE alignment constraint.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its
key->offset.
This patch will check the following contents:
- Key->objectid
Basic alignment check.
- Hash
Hash of each extent_data_ref item must match key->offset.
- Offset
Basic alignment check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this weird space flushing loop inside inode.c for evict where
we'll do the normal LIMIT flush, and then commit the transaction and
hope we get our space. This is super janky, and in fact there's really
nothing stopping us from using FLUSH_ALL except that we run delayed
iputs, which means we could deadlock. So introduce a new flush state
for eviction that does the normal priority flushing with all of the
states that are safe for eviction.
The nice side-effect of this is that we'll try harder for evictions.
Previously if (for example generic/269) you had a bunch of other
operations happening on the fs you could race with those reservations
when committing the transaction, and eventually miss getting a
reservation for the evict. With this code we'll have our ticket in
place through the transaction commit, so any pinned bytes will go to our
pending evictions first.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delayed iputs could very well free up enough space without needing to
commit the transaction, so make this step it's own step. This will
allow us to skip the step for evictions in a later patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This can now be easily migrated as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh on top of sysfs cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>
These feel more at home in block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, adjust btrfs_get_alloc_profile exports ]
Signed-off-by: David Sterba <dsterba@suse.com>
This feels more at home in block-group.c than in extent-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>i
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
This can be easily migrated over now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This can easily be moved now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
All of the prep work has been done so we can now cleanly move this chunk
over.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, add btrfs_get_alloc_profile export, comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is the removal code and the unused bgs code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, move clear_incompat_bg_bits ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can now just copy it over to block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kobject should be pulled in via sysfs.h and that needs to include it
because it needs various definitions like kobj_attribute or kobject.
Signed-off-by: David Sterba <dsterba@suse.com>
We'll need this to move the caching stuff around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are relatively straightforward as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another easy set to move over to block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these bits first as they are the easiest to move. Export two of
the helpers so they can be moved all at once.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is prep work for moving all of the block group cache code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
The switch to open coded set/get has happend long time ago in
962a298f35 ("btrfs: kill the key type accessor helpers"), remove the
stray helpers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Test case btrfs/156 fails since commit 302167c50b ("btrfs: don't end
the transaction for delayed refs in throttle") with ENOSPC.
[CAUSE]
The ENOSPC is reported from btrfs_can_relocate().
This function will check:
- If this block group is empty, we can relocate
- If we can enough free space, we can relocate
Above checks are valid but the following check is vague due to its
implementation:
- If and only if we can allocated a new block group to contain all the
used space, we can relocate
This design itself is OK, but the way to determine if we can allocate a
new block group is problematic.
btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
device.
However find_free_dev_extent() only searches commit root and excludes
dev extents allocated in current trans, this makes it unable to use dev
extent just freed in current transaction.
So for the following example, btrfs_can_relocate() will report ENOSPC:
The example block group layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
// = Used bg, consider all bg is 100% used for easy calculation.
And all block groups are SINGLE, on-disk bytenr is the same as the
logical bytenr.
1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
1M 129M 257M 385M 513M 550M
|///////| |//////////|/////////|
In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)
However transid 100 is not committed yet, so in dev commit tree, we
still have the old dev extents layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
2) Try to relocate bg [257M, 385M)
We goes into btrfs_can_relocate(), no free space in current bgs, so we
check if we can find large enough free dev extents.
The first slot is [385M, 513M), but that is already used by new bg at
[385M, 513M), so we continue search.
The remaining slot is [512M, 550M), smaller than the bg's length 128M.
So btrfs_can_relocate report ENOSPC.
However this is over killed, in fact if we just skip btrfs_can_relocate()
check, and go into regular relocation routine, at extent reservation time,
if we can't find free extent, then we fallback to commit transaction,
which will free up the dev extents and allow new block group to be created.
[FIX]
The fix here is to remove btrfs_can_relocate() completely.
If we hit the false ENOSPC case just like btrfs/156, extent allocator
will push harder by committing transaction and we will have space for
new block group, avoiding the false ENOSPC.
If we really ran out of space, we will hit ENOSPC at
relocate_block_group(), and btrfs will just reports the ENOSPC error as
usual.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unlikely in-band dedupe is going to land so just remove any
leftovers - dedupe.h header as well as the 'dedupe' parameter to
btrfs_set_extent_delalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b21 ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the 5.3 merge window, commit 7c7e301406 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:
[116648.059212] kernfs: can not remove 'total_bytes', no directory
[116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
[116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
[116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.080585] Call Trace:
[116648.081262] remove_files+0x31/0x70
[116648.081929] sysfs_remove_group+0x38/0x80
[116648.082596] sysfs_remove_groups+0x34/0x70
[116648.083258] kobject_del+0x20/0x60
[116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.084608] close_ctree+0x19a/0x380 [btrfs]
[116648.085278] generic_shutdown_super+0x6c/0x110
[116648.085951] kill_anon_super+0xe/0x30
[116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.087289] deactivate_locked_super+0x3a/0x70
[116648.087956] cleanup_mnt+0xb4/0x160
[116648.088620] task_work_run+0x7e/0xc0
[116648.089285] exit_to_usermode_loop+0xfa/0x100
[116648.089933] do_syscall_64+0x1cb/0x220
[116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.091197] RIP: 0033:0x7f9cdc073b37
(...)
[116648.100046] ---[ end trace 22e24db328ccadf8 ]---
[116648.100618] ------------[ cut here ]------------
[116648.101175] kernfs: can not remove 'used_bytes', no directory
[116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
[116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
[116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.116649] Call Trace:
[116648.117326] remove_files+0x31/0x70
[116648.117997] sysfs_remove_group+0x38/0x80
[116648.118671] sysfs_remove_groups+0x34/0x70
[116648.119342] kobject_del+0x20/0x60
[116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.120707] close_ctree+0x19a/0x380 [btrfs]
[116648.121396] generic_shutdown_super+0x6c/0x110
[116648.122057] kill_anon_super+0xe/0x30
[116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.123335] deactivate_locked_super+0x3a/0x70
[116648.123961] cleanup_mnt+0xb4/0x160
[116648.124586] task_work_run+0x7e/0xc0
[116648.125210] exit_to_usermode_loop+0xfa/0x100
[116648.125830] do_syscall_64+0x1cb/0x220
[116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.127080] RIP: 0033:0x7f9cdc073b37
(...)
[116648.135923] ---[ end trace 22e24db328ccadf9 ]---
These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().
We have this split raid kobject setup since commit 75cb379d26
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:
1) During device replace (or scrub) after adding a device a to the
filesystem. The replace procedure (and scrub) do calls to
btrfs_inc_block_group_ro() which can allocate a new block group
with a new raid profile (because we now have more devices). This
can be triggered by test cases btrfs/027 and btrfs/176.
2) During a degraded mount trough any write path. This can be triggered
by test case btrfs/124.
Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:
1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
anywhere in the replace/scrub path will cause a deadlock with reclaim
because if reclaim happens and a transaction commit is triggered,
the transaction commit path will block at btrfs_scrub_pause().
2) During degraded mounts it is essentially impossible to figure out where
to add extra calls to btrfs_add_raid_kobjects(), because allocation of
a block group with a new raid profile can happen anywhere, which means
we can't safely figure out which contextes are safe for reclaim, as
we can either hold a transaction handle or some lock needed by the
transaction commit path.
So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().
Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
Fixes: 75cb379d26 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file. inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move this into transaction.c with the rest of the transaction related
code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These belong with the delayed refs related code, not in extent-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prep work for separating out all of the block_rsv related code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to need this to move the metadata reservation stuff to
space_info.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Really we just need the enum, but as we break more things up it'll help
to have this external to extent-tree.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the messages printed after setting an incompat feature are
cryptis, we can easily make it better as the textual description is
passed to the helpers. Old:
setting 128 feature flag
updated:
setting incompat feature flag for RAID56 (0x80)
Signed-off-by: David Sterba <dsterba@suse.com>
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer' content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The filename:line format is commonly understood by editors and can be
copy&pasted more easily than the current format.
Signed-off-by: David Sterba <dsterba@suse.com>
Add boilerplate code for directly including the crypto framework. This
helps us flipping the switch for new algorithms.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums
is 4 bytes. While this is true for CRC32C, it is not for any other
checksum.
Change the data type to be a byte array and adjust loop index
calculation accordingly.
This includes moving the adjustment of 'index' by 'ins_size' in
btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
size, because before this patch the 'sums' member of 'struct
btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one
byte.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9678c54388 ("btrfs: Remove custom crc32c init code") removed
the btrfs_crc32c() function, because it was a duplicate of the crc32c()
library function we already have in the kernel.
Resurrect it as a shim wrapper over crc32c() to make following
transformations of the checksumming code in btrfs easier.
Also provide a btrfs_crc32_final() to ease following transformations.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info::mapping_tree is the physical<->logical mapping tree and uses
the same underlying structure as extents, but is embedded to another
structure. There are no other members and this indirection is useless.
No functional change.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, there's only check for fast crc32c implementation on X86,
based on the CPU flags. This is used to decide if checksumming should be
offloaded to worker threads or can be calculated by the caller.
As there are more architectures that implement a faster version of
crc32c (ARM, SPARC, s390, MIPS, PowerPC), also there are specialized hw
cards.
The detection is based on driver name, all generic C implementations
contain 'generic', while the specialized versions do not. Alternatively
the priority could be used, but this is not currently provided by the
crypto API.
The flag is set per-filesystem at mount time and used for the offloading
decisions.
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzQM7MACgkQxWXV+ddt
WDvrVw/+K0AElSuEfDFWd9HBqRAPlGaEP71xCGGle1tkzuY0DJVIBRZ72q8UR0YP
7yke7DU0oqXekGype83eTJUjDSLoOXrlVoQ+VqBdFteDk0W4BCG6Nw+N+wYBF7An
gXRXlGFaYzb2CqqjG92FbtkfxBzISR0XBCQBUN9CBqHNDu1EUQSbnTBkmTMN8MYh
PCoo37S6e5fR36uB/rOKbGNBJjsZEEg/2G6DprP52+eiQWV2h0avEUJrvv6xC4so
97QNgUNuuiUmyurqcYHdlaflZwIhuf5nQeNeu/UvMZmmRnBHPhSP7YPM7f7FftwA
y0d0p+AiEAO0he8nGFb5C6Avs4vuv1u65o1NbF5fqnmAyt+KXWem3LeG6etsXgU8
+eITgprJD3sNBMDLbLoA+wlhTps+w9tukVF5Zp2a8KgQLMMEyAYqUDWmSHvnO2Me
RCNPZLzeGXETgKun0WuMtl/CX2iBDnc0Kq5O6ks2ORl2TH6bg5lgEIwr6HP/Ewoy
w8twsmCOltrxiIptqyQHYD+kvNwqMVV9LSOQ8+EjbYd6BHsfjHjKObOBkhmJ7iqz
4MAIcZU++F9DLRv92H1kUYVNhAMCdXkEIWyxhZPwN1lUi5k9AhknY3FbheNc7ldl
LNPIgRxamWCq9oBmzfOcJ3eFOBtNN02fgA1GTXGd1/AgAilEep8=
=fEkD
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This time the majority of changes are cleanups, though there's still a
number of changes of user interest.
User visible changes:
- better read time and write checks to catch errors early and before
writing data to disk (to catch potential memory corruption on data
that get checksummed)
- qgroups + metadata relocation: last speed up patch int the series
to address the slowness, there should be no overhead comparing
balance with and without qgroups
- FIEMAP ioctl does not start a transaction unnecessarily, this can
result in a speed up and less blocking due to IO
- LOGICAL_INO (v1, v2) does not start transaction unnecessarily, this
can speed up the mentioned ioctl and scrub as well
- fsync on files with many (but not too many) hardlinks is faster,
finer decision if the links should be fsynced individually or
completely
- send tries harder to find ranges to clone
- trim/discard will skip unallocated chunks that haven't been touched
since the last mount
Fixes:
- send flushes delayed allocation before start, otherwise it could
miss some changes in case of a very recent rw->ro switch of a
subvolume
- fix fallocate with qgroups that could lead to space accounting
underflow, reported as a warning
- trim/discard ioctl honours the requested range
- starting send and dedupe on a subvolume at the same time will let
only one of them succeed, this is to prevent changes that send
could miss due to dedupe; both operations are restartable
Core changes:
- more tree-checker validations, errors reported by fuzzing tools:
- device item
- inode item
- block group profiles
- tracepoints for extent buffer locking
- async cow preallocates memory to avoid errors happening too deep in
the call chain
- metadata reservations for delalloc reworked to better adapt in
many-writers/low-space scenarios
- improved space flushing logic for intense DIO vs buffered workloads
- lots of cleanups
- removed unused struct members
- redundant argument removal
- properties and xattrs
- extent buffer locking
- selftests
- use common file type conversions
- many-argument functions reduction"
* tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (227 commits)
btrfs: Use kvmalloc for allocating compressed path context
btrfs: Factor out common extent locking code in submit_compressed_extents
btrfs: Set io_tree only once in submit_compressed_extents
btrfs: Replace clear_extent_bit with unlock_extent
btrfs: Make compress_file_range take only struct async_chunk
btrfs: Remove fs_info from struct async_chunk
btrfs: Rename async_cow to async_chunk
btrfs: Preallocate chunks in cow_file_range_async
btrfs: reserve delalloc metadata differently
btrfs: track DIO bytes in flight
btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop
btrfs: delete unused function btrfs_set_prop_trans
btrfs: start transaction in xattr_handler_set_prop
btrfs: drop local copy of inode i_mode
btrfs: drop old_fsflags in btrfs_ioctl_setflags
btrfs: modify local copy of btrfs_inode flags
btrfs: drop useless inode i_flags copy and restore
btrfs: start transaction in btrfs_ioctl_setflags()
btrfs: export btrfs_set_prop
btrfs: refactor btrfs_set_props to validate externally
...
When diagnosing a slowdown of generic/224 I noticed we were not doing
anything when calling into shrink_delalloc(). This is because all
writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
counter is 0, which short circuits most of the work inside of
shrink_delalloc(). However O_DIRECT writes still consume metadata
resources and generate ordered extents, which we can still wait on.
Fix this by tracking outstanding DIO write bytes, and use this as well
as the delalloc bytes counter to decide if we need to lookup and wait on
any ordered extents. If we have more DIO writes than delalloc bytes
we'll go ahead and wait on any ordered extents regardless of our flush
state as flushing delalloc is likely to not gain us anything.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ use dio instead of odirect in identifiers ]
Signed-off-by: David Sterba <dsterba@suse.com>
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>