When starting a fallocate zero range operation, before getting the first
extent map for the range, we make a call to inode_dio_wait().
This logic was needed in the past because direct IO writes within the
i_size boundary did not take the inode's VFS lock. This was because that
lock used to be a mutex, then some years ago it was switched to a rw
semaphore (by commit 9902af79c0 ("parallel lookups: actual switch to
rwsem")), and then btrfs was changed to take the VFS inode's lock in
shared mode for writes that don't cross the i_size boundary (done in
commit e9adabb971 ("btrfs: use shared lock for direct writes within
EOF")). The lockless direct IO writes could result in a race with the
zero range operation, resulting in the later getting a stale extent
map for the range.
So remove this no longer needed call to inode_dio_wait(), as fallocate
takes the inode's VFS lock in exclusive mode and direct IO writes within
i_size take that same lock in shared mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a plain fallocate, we always start by reserving an amount of data
space that matches the length of the range passed to fallocate. When we
already have extents allocated in that range, we may end up trying to
reserve a lot more data space then we need, which can result in several
undesired behaviours:
1) We fail with -ENOSPC. For example the passed range has a length
of 1G, but there's only one hole with a size of 1M in that range;
2) We temporarily reserve excessive data space that could be used by
other operations happening concurrently;
3) By reserving much more data space then we need, we can end up
doing expensive things like triggering dellaloc for other inodes,
waiting for the ordered extents to complete, trigger transaction
commits, allocate new block groups, etc.
Example:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f -b 1g $DEV
mount $DEV $MNT
# Create a file with a size of 600M and two holes, one at [200M, 201M[
# and another at [401M, 402M[
xfs_io -f -c "pwrite -S 0xab 0 200M" \
-c "pwrite -S 0xcd 201M 200M" \
-c "pwrite -S 0xef 402M 198M" \
$MNT/foobar
# Now call fallocate against the whole file range, see if it fails
# with -ENOSPC or not - it shouldn't since we only need to allocate
# 2M of data space.
xfs_io -c "falloc 0 600M" $MNT/foobar
umount $MNT
$ ./test.sh
(...)
wrote 209715200/209715200 bytes at offset 0
200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
wrote 209715200/209715200 bytes at offset 210763776
200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
wrote 207618048/207618048 bytes at offset 421527552
198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
fallocate: No space left on device
$
So fix this by not allocating an amount of data space that matches the
length of the range passed to fallocate. Instead allocate an amount of
data space that corresponds to the sum of the sizes of each hole found
in the range. This reservation now happens after we have locked the file
range, which is safe since we know at this point there's no delalloc
in the range because we've taken the inode's VFS lock in exclusive mode,
we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
delalloc and waited for all ordered extents in the range to complete.
This type of failure actually seems to happen in practice with systemd,
and we had at least one report about this in a very long thread which
is referenced by the Link tag below.
Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
According to the tree checker, "all xattrs with a given objectid follow
the inode with that objectid in the tree" is an invariant. This was
broken by the recent change "btrfs: move common inode creation code into
btrfs_create_new_inode()", which moved acl creation and property
inheritance (stored in xattrs) to before inode insertion into the tree.
As a result, under certain timings, the xattrs could be written to the
tree before the inode, causing the tree checker to report violation of
the invariant.
Move property inheritance and acl creation back to their old ordering
after the inode insertion.
Suggested-by: Omar Sandoval <osandov@osandov.com>
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: David Sterba <dsterba@suse.com>
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:
1. It reduces code duplication.
2. It allows us to set up the inode completely before inserting the
inode item, removing calls to btrfs_update_inode().
3. It fixes a leak of an inode on disk in some error cases. For example,
in btrfs_create(), if btrfs_new_inode() succeeds, then we have
inserted an inode item and its inode ref. However, if something after
that fails (e.g., btrfs_init_inode_security()), then we end the
transaction and then decrement the link count on the inode. If the
transaction is committed and the system crashes before the failed
inode is deleted, then we leak that inode on disk. Instead, this
refactoring aborts the transaction when we can't recover more
gracefully.
4. It exposes various ways that subvolume creation diverges from mkdir
in terms of inheriting flags, properties, permissions, and POSIX
ACLs, a lot of which appears to be accidental. This patch explicitly
does _not_ change the existing non-standard behavior, but it makes
those differences more clear in the code and documents them so that
we can discuss whether they should be changed.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The various inode creation code paths do not account for the compression
property, POSIX ACLs, or the parent inode item when starting a
transaction. Fix it by refactoring all of these code paths to use a new
function, btrfs_new_inode_prepare(), which computes the correct number
of items. To do so, it needs to know whether POSIX ACLs will be created,
so move the ACL creation into that function. To reduce the number of
arguments that need to be passed around for inode creation, define
struct btrfs_new_inode_args containing all of the relevant information.
btrfs_new_inode_prepare() will also be a good place to set up the
fscrypt context and encrypted filename in the future.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_{mknod,create,mkdir}() are now identical other than the inode
initialization and some inconsequential function call order differences.
Factor out the common code to reduce code duplication.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling new_inode() and inode_init_owner() inside of
btrfs_new_inode(), do it in the callers. This allows us to pass in just
the inode instead of the mnt_userns and mode and removes the need for
memalloc_nofs_{save,restores}() since we do it before starting a
transaction. In create_subvol(), it also means we no longer have to look
up the inode again to instantiate it. This also paves the way for some
more cleanups in later patches.
This also removes the comments about Smack checking i_op, which are no
longer true since commit 5d6c31910b ("xattr: Add
__vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
IOP_XATTR, which is set based on sb->s_xattr.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have btrfs_extent_buffer_leak_debug_check() (enabled by
CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have
some extent buffer leakage, it's just pr_err(), not noisy enough for
fstests to cache.
So here we trigger a WARN_ON() if the allocated_ebs list is not empty.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the function btrfs_dev_replace_finishing, we dereferenced
fs_info->fs_devices 6 times. Use keep local variable for that.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a common pattern when searching for a key in btrfs:
* Call btrfs_search_slot to find the slot for the key
* Enter an endless loop:
* If the found slot is larger than the no. of items in the current
leaf, check the next leaf
* If it's still not found in the next leaf, terminate the loop
* Otherwise do something with the found key
* Increment the current slot and continue
To reduce code duplication, we can replace this code pattern with an
iterator macro, similar to the existing for_each_X macros found
elsewhere in the kernel. This also makes the code easier to understand
for newcomers by putting a name to the encapsulated functionality.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the subpage support for scrub, one page no longer always represents
one sector, thus scrub_bio::pagev and scrub_bio::sector_count are no
longer accurate.
Rename them to scrub_bio::sectors and scrub_bio::sector_count respectively.
This also involves scrub_ctx::pages_per_bio and other macros involved.
Now the renaming of pages involved in scrub is be finished.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the subpage support of scrub, scrub_sector is in fact just
representing one sector.
Thus the name scrub_page is no longer correct, rename it to
scrub_sector.
This also involves the following renames:
- spage -> sector
Normally we would just replace "page" with "sector" and result
something like "ssector".
But the repeating 's' is not really eye friendly.
So here we just simple use "sector", as there is nothing from MM layer
called "sector" to cause any confusion.
- scrub_parity::spages -> sectors_list
Normally we use plural to indicate an array, not a list.
Rename it to @sectors_list to be more explicit on the list part.
- Also reformat and update comments that get changed
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following will be renamed in this patch:
- scrub_block::pagev -> sectors
- scrub_block::page_count -> sector_count
- SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK
- page_num -> sector_num to iterate scrub_block::sectors
For now scrub_page is not yet renamed to keep the patch reasonable and
it will be updated in a followup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_read_buffer() is useless, it just calls
btree_read_extent_buffer_pages() with exactly the same arguments.
So remove it and rename btree_read_extent_buffer_pages() to
btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_"
prefix (since it's used outside disk-io.c) and the name is clear enough
about what it does.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment at the top of read_block_for_search() is very outdated, as it
refers to the blocking versus spinning path locking modes. We no longer
have these two locking modes after we switched the btree locks from custom
code to rw semaphores. So update the comment to stop referring to the
blocking mode and put it more up to date.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reading a btree node (or leaf), at read_block_for_search(), if we
can't find its extent buffer in the cache (the fs_info->buffer_radix
radix tree), then we unlock all upper level nodes before reading the
btree node/leaf from disk, to prevent blocking other tasks for too long.
However if we find that the extent buffer is in the cache but it is not
up to date, we don't unlock upper level nodes before reading it from disk,
potentially blocking other tasks on upper level nodes for too long.
Fix this inconsistent behaviour by unlocking upper level nodes if we need
to read a node/leaf from disk because its in-memory extent buffer is not
up to date. If we unlocked upper level nodes then we must return -EAGAIN
to the caller, just like the case where the extent buffer is not cached in
memory. And like that case, we determine if upper level nodes are locked
by checking only if the parent node is locked - if it isn't, then no other
upper level nodes are locked.
This is actually a rare case, as if we have an extent buffer in memory,
it typically has the uptodate flag set and passes all the checks done by
btrfs_buffer_uptodate().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reading a btree node, at read_block_for_search(), if we don't find
the node's (or leaf) extent buffer in the cache, we will read it from
disk. Since that requires waiting on IO, we release all upper level nodes
from our path before reading the target node/leaf, and then return -EAGAIN
to the caller, which will make the caller restart the while btree search.
However we are causing the restart of btree search even for cases where
it is not necessary:
1) We have a path with ->skip_locking set to true, typically when doing
a search on a commit root, so we are never holding locks on any node;
2) We are doing a read search (the "ins_len" argument passed to
btrfs_search_slot() is 0), or we are doing a search to modify an
existing key (the "cow" argument passed to btrfs_search_slot() has
a value of 1 and "ins_len" is 0), in which case we never hold locks
for upper level nodes;
3) We are doing a search to insert or delete a key, in which case we may
or may not have upper level nodes locked. That depends on the current
minimum write lock levels at btrfs_search_slot(), if we had to split
or merge parent nodes, if we had to COW upper level nodes and if
we ever visited slot 0 of an upper level node. It's still common to
not have upper level nodes locked, but our current node must be at
least at level 1, for insertions, or at least at level 2 for deletions.
In these cases when we have locks on upper level nodes, they are always
write locks.
These cases where we are not holding locks on upper level nodes far
outweigh the cases where we are holding locks, so it's completely wasteful
to retry the whole search when we have no upper nodes locked.
So change the logic to not return -EAGAIN, and make the caller retry the
search, when we don't have the parent node locked - when it's not locked
it means no other upper level nodes are locked as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_new_inode() inherits the inode flags from the parent directory and
the mount options _after_ we fill the inode item. This works because all
of the callers of btrfs_new_inode() make further changes to the inode
and then call btrfs_update_inode(). It'd be better to fully initialize
the inode once to avoid the extra update, so as a first step, set the
inode flags _before_ filling the inode item.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Every call of btrfs_new_inode() is immediately preceded by a call to
btrfs_get_free_objectid(). Since getting an inode number is part of
creating a new inode, this is better off being moved into
btrfs_new_inode(). While we're here, get rid of the comment about
reclaiming inode numbers, since we only did that when using the ino
cache, which was removed by commit 5297199a8b ("btrfs: remove inode
number cache feature").
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For everything other than a subvolume root inode, we get the parent
objectid from the parent directory. For the subvolume root inode, the
parent objectid is the same as the inode's objectid. We can find this
within btrfs_new_inode() instead of passing it.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 4a8b34afa9 ("btrfs: handle ACLs on idmapped mounts") added this
parameter but didn't use it. __btrfs_set_acl() is the low-level helper
that writes an ACL to disk. The higher-level btrfs_set_acl() is the one
that translates the ACL based on the user namespace.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_new_inode() already returns an inode with nlink set to 1 (via
inode_init_always()). Get rid of the unnecessary set.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
new_inode() always returns an inode with i_blocks and i_bytes set to 0
(via inode_init_always()). Remove the unnecessary call to
inode_set_bytes() in btrfs_new_inode().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_new_inode() always returns an inode with i_size and disk_i_size
set to 0 (via inode_init_always() and btrfs_alloc_inode(),
respectively). Remove the unnecessary calls to btrfs_i_size_write() in
btrfs_mkdir() and btrfs_create_subvol_root().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a trivial wrapper around btrfs_add_link(). The only thing it
does other than moving arguments around is translating a > 0 return
value to -EEXIST. As far as I can tell, btrfs_add_link() won't return >
0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would
be broken). The check itself dates back to commit 2c90e5d658 ("Btrfs:
still corruption hunting"), so it's probably left over from debugging.
Let's just get rid of btrfs_add_nondir().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_qgroup_inherit(), btrfs_alloc_tree_block, or
btrfs_insert_root() fail in create_subvol(), we return without freeing
anon_dev. Reorganize the error handling in create_subvol() to fix this.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_rename() and btrfs_rename_exchange() don't account for enough
items. Replace the incorrect explanations with a specific breakdown of
the number of items and account them accurately.
Note that this glosses over RENAME_WHITEOUT because the next commit is
going to rework that, too.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_unlink_inode() calls btrfs_update_inode() on the parent
directory in order to update its size and sequence number. Make sure we
account for it.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Fix filesystem block deallocation for short writes.
- Stop using glock holder auto-demotion for now.
- Get rid of buffered writes inefficiencies due to page
faults being disabled.
- Minor other cleanups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmJ+wL4UHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqNkRAAuBc4oJ3Wkp/dkfF6MDY9DXqCzhQo
TodFIdEQvOUtncYBljE6DZQ9MT1sRvwxDe0nIjErFQzHYcW88zczILWOBRhrhlci
kANL6JtjSvtE3kecvR9I3nZX44aDETJliV3FX8n7vDSNMTIwjzW38d0XmDLX9t8F
bTEFv3rKsUzgcGaxpeQe7mzoQPi910WFPO/pos2ghuZwB1QEpdBrCESz4XB+OwKM
+V+8nooHvYp8T+2AzrBM6hBgsYelrBPRXlz6cYEjWY2FQuvQk/thX+zO2dvXCsPA
uNWJTCKJEsufWflPNI2ugZ9TVneIU1umGACoEHteeRvG6Qsg6K4Kf0EJEFf6Y0Tg
PKUJLUcdi0suS8UuUCBTAVAgCv9+ueLhIbbFeRkbHjxSjET7NQl2gbkfY2V1RsFt
yNFLMGU01xsb1YylncY4xQc9WVMDbPsdv1KGDF75njchuHZXhfg00ezPQys4Uy5R
0EUwqoPYNePV6VsoHLbU+kImf936VawP916yDiyflyz6UFSi5vNg7SwMqDrXpIxM
T8nNgrTC+Npv7T2xc8JhSFGv9T86nZXWjQTpDzV8onPvxdCLCT1cmm352aHqXd7+
wY9ZFJZ4iMoinYUkEzgySQW00+wK/AThQKQ6ImhLEvwxMUc6dJUnesVGkzLDGh9Z
KSfqgYzmlb2YdKY=
=FJq+
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.18-rc4-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"We've finally identified commit dc732906c2 ("gfs2: Introduce flag
for glock holder auto-demotion") to be the other cause of the
filesystem corruption we've been seeing. This feature isn't strictly
necessary anymore, so we've decided to stop using it for now.
With this and the gfs_iomap_end rounding fix you've already seen
("gfs2: Fix filesystem block deallocation for short writes" in this
pull request), we're corruption free again now.
- Fix filesystem block deallocation for short writes.
- Stop using glock holder auto-demotion for now.
- Get rid of buffered writes inefficiencies due to page faults being
disabled.
- Minor other cleanups"
* tag 'gfs2-v5.18-rc4-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Stop using glock holder auto-demotion for now
gfs2: buffered write prefaulting
gfs2: Align read and write chunks to the page cache
gfs2: Pull return value test out of should_fault_in_pages
gfs2: Clean up use of fault_in_iov_iter_{read,write}able
gfs2: Variable rename
gfs2: Fix filesystem block deallocation for short writes
We're having unresolved issues with the glock holder auto-demotion mechanism
introduced in commit dc732906c2. This mechanism was assumed to be essential
for avoiding frequent short reads and writes until commit 296abc0d91
("gfs2: No short reads or writes upon glock contention"). Since then,
when the inode glock is lost, it is simply re-acquired and the operation
is resumed. This means that apart from the performance penalty, we
might as well drop the inode glock before faulting in pages, and
re-acquire it afterwards.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_file_buffered_write, to increase the likelihood that all the
user memory we're trying to write will be resident in memory, carry out
the write in chunks and fault in each chunk of user memory before trying
to write it. Otherwise, some workloads will trigger frequent short
"internal" writes, causing filesystem blocks to be allocated and then
partially deallocated again when writing into holes, which is wasteful
and breaks reservations.
Neither the chunked writes nor any of the short "internal" writes are
user visible.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Align the chunks that reads and writes are carried out in to the page
cache rather than the user buffers. This will be more efficient in
general, especially for allocating writes. Optimizing the case that the
user buffer is gfs2 backed isn't very useful; we only need to make sure
we won't deadlock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pull the return value test of the previous read or write operation out
of should_fault_in_pages(). In a following patch, we'll fault in pages
before the I/O and there will be no return value to check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Instead of counting the number of bytes read from the filesystem,
functions gfs2_file_direct_read and gfs2_file_read_iter count the number
of bytes written into the user buffer. Conversely, functions
gfs2_file_direct_write and gfs2_file_buffered_write count the number of
bytes read from the user buffer. This is nothing but confusing, so
change the read functions to count how many bytes they have read, and
the write functions to count how many bytes they have written.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.
To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation. It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.
In addition, use round_up() for computing the start of the unused range.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>