There are only a few places that test SIGNAL_GROUP_EXIT and
are not also already testing SIGNAL_GROUP_COREDUMP.
This will not affect the callers of signal_group_exit as zap_process
also sets group_exit_task so signal_group_exit will continue to return
true at the same times.
This does not affect wait_task_zombie as the none of the threads
wind up in EXIT_ZOMBIE state during a coredump.
This does not affect oom_kill.c:__task_will_free_mem as
sig->core_state is tested and handled before SIGNAL_GROUP_EXIT is
tested for.
This does not affect complete_signal as signal->core_state is tested
for to ensure the coredump case is handled appropriately.
Link: https://lkml.kernel.org/r/20211213225350.27481-4-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The point of using set_child_tid to hold the kthread pointer was that
it already did what is necessary. There are now restrictions on when
set_child_tid can be initialized and when set_child_tid can be used in
schedule_tail. Which indicates that continuing to use set_child_tid
to hold the kthread pointer is a bad idea.
Instead of continuing to use the set_child_tid field of task_struct
generalize the pf_io_worker field of task_struct and use it to hold
the kthread pointer.
Rename pf_io_worker (which is a void * pointer) to worker_private so
it can be used to store kthreads struct kthread pointer. Update the
kthread code to store the kthread pointer in the worker_private field.
Remove the places where set_child_tid had to be dealt with carefully
because kthreads also used it.
Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We still only operate on a single page of data at a time due to using
kmap(). A more complex implementation would work on each page in a folio,
but it's not clear that such a complex implementation would be worthwhile.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
find_lock_entries() already only returned the head page of folios, so
convert it to return a folio_batch instead of a pagevec. That cascades
through converting truncate_inode_pages_range() to
delete_from_page_cache_batch() and page_cache_delete_batch().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
When a TCP connection gets reestablished by the sender in cifs_reconnect,
There is a chance for race condition with demultiplex thread waiting in
cifs_readv_from_socket on the old socket. It will now return -ECONNRESET.
This condition is handled by comparing socket pointer before and after
sock_recvmsg. If the socket pointer has changed, we should not call
cifs_reconnect again, but instead retry with new socket.
Also fixed another bug in my prev mchan commits.
We should always reestablish session (even if binding) on a channel
that needs reconnection.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If functions like cifs_negotiate_protocol, cifs_setup_session,
cifs_tree_connect are called in parallel on different channels,
each of these will be execute the requests. This maybe unnecessary
in some cases, and only the first caller may need to do the work.
This is achieved by having more states for the tcp/smb/tcon session
status fields. And tracking the state of reconnection based on the
state machine.
For example:
for tcp connections:
CifsNew/CifsNeedReconnect ->
CifsNeedNegotiate ->
CifsInNegotiate ->
CifsNeedSessSetup ->
CifsInSessSetup ->
CifsGood
for smb sessions:
CifsNew/CifsNeedReconnect ->
CifsGood
for tcon:
CifsNew/CifsNeedReconnect ->
CifsInFilesInvalidate ->
CifsNeedTcon ->
CifsInTcon ->
CifsGood
If any channel reconnect sees that it's in the middle of
transition to CifsGood, then they can skip the function.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Mount will hang if using SMB1 and DFS.
This is because every call to get_next_mid() will, unconditionally,
mark tcpStatus to CifsNeedReconnect before even establishing the
initial connect, because "reconnect" variable was not initialized.
Initializing "reconnect" to false fix this issue.
Fixes: 220c5bc25d87 ("cifs: take cifs_tcp_ses_lock for status checks")
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
While checking/updating status for tcp ses, smb ses or tcon,
we take GlobalMid_Lock. This doesn't make any sense.
Replaced it with cifs_tcp_ses_lock.
Ideally, we should take a spin lock per struct.
But since tcp ses, smb ses and tcon objects won't add up to a lot,
I think there should not be too much contention.
Also, in few other places, these are checked without locking.
Added locking for these.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Change the afs filesystem to support the new afs driver.
The following changes have been made:
(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.
(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.
For afs, I've made it render the volume name string as:
"afs,<cell>,<volume_id>"
and the coherency data is currently 0.
(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.
fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.
(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.
fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).
fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.
(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.
(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.
(7) fscache_note_page_release() is called from afs_release_page().
(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.
Render the parts of the cookie key for an afs inode cookie as big endian.
Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
Print extra information about how many dirty bytes an uncommitted
has at the end of mount.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we extended the size of a swapfile after its header was created (by the
mkswap utility) and then try to activate it, we will map the entire file
when activating the swap file, instead of limiting to the max size defined
in the swap file's header.
Currently test case generic/643 from fstests fails because we do not
respect that size limit defined in the swap file's header.
So fix this by not mapping file ranges beyond the max size defined in the
swap header.
This is the same type of bug that iomap used to have, and was fixed in
commit 36ca7943ac ("mm/swap: consider max pages in
iomap_swapfile_add_extent").
Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.
fs/btrfs/extent_io.c:3210: warning: Function parameter or member
'bio_ctrl' not described in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
description in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter
'prev_bio_flags' description in 'btrfs_bio_add_page'
fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
description in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1602: warning: Function parameter or member
'fs_info' not described in 'btrfs_reserve_metadata_bytes'
Note: this is fixing only the warnings regarding parameter list, the
first line is not strictly conforming to the kdoc format as the btrfs
codebase does not stick to that and keeps the first line more free form
(because it's only for internal use).
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_decompress_bio, the only caller of compression_decompress_bio gets
type from @cb and passes it to compression_decompress_bio.
However, compression_decompress_bio can get compression type directly
from @cb.
So remove the parameter and access it through @cb. No functional
change.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When code modifying extent-io-tree get modified and got that selftest
failed, it can take some time to pin down the cause.
To make it easier to expose the problem, dump the extent io tree if the
selftest failed.
This can save developers debug time, especially since the selftest we
can not use the trace events, thus have to manually add debug trace
points.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument list of btrfs_stripe() has similar problems of
scrub_chunk():
- Duplicated and ambiguous @base argument
Can be fetched from btrfs_block_group::bg.
- Ambiguous argument @length
It's again device extent length
- Ambiguous argument @num
The instinctive guess would be mirror number, but in fact it's stripe
index.
Fix it by:
- Remove @base parameter
- Rename @length to @dev_extent_len
- Rename @num to @stripe_index
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument list of scrub_chunk() has the following problems:
- Duplicated @chunk_offset
It is the same as btrfs_block_group::start.
- Confusing @length
The most instinctive guess is chunk length, and one may want to delete
it, but the truth is, it's the device extent length.
Fix this by:
- Remove @chunk_offset
Use btrfs_block_group::start instead.
- Rename @length to @dev_extent_len
Also rename the caller to remove the ambiguous naming.
- Rename @cache to @bg
The "_cache" suffix for btrfs_block_group has been removed for a while.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there is only one user for btrfs metadata readahead, and
that's scrub.
But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide. (Although it's pretty easy to add such
feature).
Despite this, there are some extra problems related to metadata
readahead:
- Duplicated feature with btrfs_path::reada
- Partly duplicated feature of btrfs_fs_info::buffer_radix
Btrfs already caches its metadata in buffer_radix, while readahead
tries to read the tree block no matter if it's already cached.
- Poor layer separation
Metadata readahead works kinda at device level.
This is definitely not the correct layer it should be, since metadata
is at btrfs logical address space, it should not bother device at all.
This brings extra chance for bugs to sneak in, while brings
unnecessary complexity.
- Dead code
In the very beginning of scrub.c we have #undef DEBUG, rendering all
the debug related code useless and unable to test.
Thus here I purpose to remove the metadata readahead mechanism
completely.
[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.
For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.
The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.
Old New Diff
SSD 455.3 466.332 +2.42%
HDD 103.927 98.012 -5.69%
Comprehensive test methodology is in the cover letter of the patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For scrub, we trigger two readaheads for two trees, extent tree to get
where to scrub, and csum tree to get the data checksum.
For csum tree we already trigger readahead in
btrfs_lookup_csums_range(), by setting path->reada.
But for extent tree we don't have any path based readahead.
Add the readahead for extent tree as well, so we can later remove the
btrfs_reada_add() based readahead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function scrub_stripe() we allocated two btrfs_path's, one @path for
extent tree search and another @ppath for full stripe extent tree search
for RAID56.
This is totally umncessary, as the @ppath usage is completely inside
scrub_raid56_parity(), thus we can move the path allocation into
scrub_raid56_parity() completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of this function is to unlock all nodes in a btrfs path
which are above 'lowest_unlock' and whose slot used is different than 0.
As such it used slightly awkward structure of 'if' as well as somewhat
cryptic "no_skip" control variable which denotes whether we should
check the current level of skipability or no.
This patch does the following (cosmetic) refactorings:
* Renames 'no_skip' to 'check_skip' and makes it a boolean. This
variable controls whether we are below the lowest_unlock/skip_level
levels.
* Consolidates the 2 conditions which warrant checking whether the
current level should be skipped under 1 common if (check_skip) branch,
this increase indentation level but is not critical.
* Consolidates the 'skip_level < i && i >= lowest_unlock' and
'i >= lowest_unlock && i > skip_level' condition into a common branch
since those are identical.
* Eliminates the local extent_buffer variable as in this case it doesn't
bring anything to function readability.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At ioctl.c:create_subvol(), when we fail to create a subvolume we always
commit the transaction. In most cases this is a no-op, since all the error
paths, except for one, abort the transaction - the only exception is when
we fail to insert the new root item into the root tree, in that case we
don't abort the transaction because we didn't do anything that is
irreversible - however we end up committing the transaction which although
is not a functional problem, it adds unnecessary rotation of the backup
roots in the superblock and unnecessary work.
So change that to commit a transaction only when no error happened,
otherwise just call btrfs_end_transaction() to release our reference on
the transaction.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ZNS specification defines a limit on the number of "active"
zones. That limit impose us to limit the number of block groups which
can be used for an allocation at the same time. Not to exceed the
limit, we reuse the existing active block groups as much as possible
when we can't activate any other zones without sacrificing an already
activated block group in commit a85f05e59b ("btrfs: zoned: avoid
chunk allocation if active block group has enough space").
However, the check is wrong in two ways. First, it checks the
condition for every raid index (ffe_ctl->index). Even if it reaches
the condition and "ffe_ctl->max_extent_size >=
ffe_ctl->min_alloc_size" is met, there can be other block groups
having enough space to hold ffe_ctl->num_bytes. (Actually, this won't
happen in the current zoned code as it only supports SINGLE
profile. But, it can happen once it enables other RAID types.)
Second, it checks the active zone availability depending on the
raid index. The raid index is just an index for
space_info->block_groups, so it has nothing to do with chunk allocation.
These mistakes are causing a faulty allocation in a certain
situation. Consider we are running zoned btrfs on a device whose
max_active_zone == 0 (no limit). And, suppose no block group have a
room to fit ffe_ctl->num_bytes but some room to meet
ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes >=
min_alloc_size).
In this situation, the following occur:
- With SINGLE raid_index, it reaches the chunk allocation checking
code
- The check returns true because we can activate a new zone (no limit)
- But, before allocating the chunk, it iterates to the next raid index
(RAID5)
- Since there are no RAID5 block groups on zoned mode, it again
reaches the check code
- The check returns false because of btrfs_can_activate_zone()'s "if
(raid_index != BTRFS_RAID_SINGLE)" part
- That results in returning -ENOSPC without allocating a new chunk
As a result, we end up hitting -ENOSPC too early.
Move the check to the right place in the can_allocate_chunk() hook,
and do the active zone check depending on the allocation flag, not on
the raid index.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new hook for an extent allocator policy. With the new
hook, a policy can decide to allocate a new block group or not. If
not, it will return -ENOSPC, so btrfs_reserve_extent() will cut the
allocation size in half and retry the allocation if min_alloc_size is
large enough.
The hook has a place holder and will be replaced with the real
implementation in the next patch.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allocating an extent from a block group can fail for various reasons.
When an allocation from a dedicated block group (for tree-log or
relocation data) fails, we need to unregister it as a dedicated one so
that we can allocate a new block group for the dedicated one.
However, we are returning early when the block group in case it is
read-only, fully used, or not be able to activate the zone. As a result,
we keep the non-usable block group as a dedicated one, leading to
further allocation failure. With many block groups, the allocator will
iterate hopeless loop to find a free extent, results in a hung task.
Fix the issue by delaying the return and doing the proper cleanups.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to
check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the
bio's bio_op.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sink zone check into btrfs_repair_one_zone() so we don't need to do it
in all callers.
Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
a boolean function and return false in case it got called on a non-zoned
filesystem and true on a zoned filesystem.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_check_meta_write_pointer() will always be called with a NULL
'cache_ret' argument.
As there's no need to check if we have a valid block_group passed in
remove these checks.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Encapsulate the inode lock needed for serializing the data relocation
writes on a zoned filesystem into a helper.
This streamlines the code reading flow and hides special casing for
zoned filesystems.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the case of the seed device, the fsid can be different from the mounted
sprout fsid. The userland has to read the device superblock to know the
fsid but, that idea fails if the device is missing. So add a sysfs
interface devinfo/<devid>/fsid to show the fsid of the device.
For example:
$ cd /sys/fs/btrfs/b10b02a5-f9de-4276-b9e8-2bfd09a578a8
$ cat devinfo/1/fsid
c44d771f-639d-4df3-99ec-5bc7ad2af93b
$ cat devinfo/3/fsid
b10b02a5-f9de-4276-b9e8-2bfd09a578a8
Though it's related to seeding, the name of the sysfs file is plain fsid as it
matches what blkid says. A path to the device's fsid will aid scripting.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported a problem where sometimes he'd get an ENOSPC abort when
running delayed refs with generic/619 and the free space tree enabled.
This is partly because we do not reserve space for modifying the free
space tree, nor do we have a block rsv associated with that tree.
The delayed_refs_rsv tracks the amount of space required to run delayed
refs. This means 1 modification means 1 change to the extent root.
With the free space tree this turns into 2 changes, because modifying 1
extent means updating the extent tree and potentially updating the free
space tree to either remove that entry or add the free space. Thus if
we have the FST enabled, simply double the reservation size for our
modification.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported a problem where generic/619 was failing with an ENOSPC
abort while running delayed refs, like the following
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs]
CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G W 5.16.0-rc2-btrfs-next-106 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs]
RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff
RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4
R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90
FS: 0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__btrfs_free_extent+0x516/0x950 [btrfs]
__btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
btrfs_run_delayed_refs+0x86/0x210 [btrfs]
flush_space+0x403/0x630 [btrfs]
? call_rcu_tasks_generic+0x50/0x80
? lock_release+0x223/0x4a0
? btrfs_get_alloc_profile+0xb5/0x290 [btrfs]
? do_raw_spin_unlock+0x4b/0xa0
btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
process_one_work+0x24c/0x5b0
worker_thread+0x55/0x3c0
? process_one_work+0x5b0/0x5b0
kthread+0x17c/0x1a0
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
There's a couple of reasons for this, but in generic/619's case the
largest reason is because it is a very small file system, ad we do not
reserve enough space for the global reserve.
With the free space tree we now have the free space tree that we need to
modify when running delayed refs. This means we need the global reserve
to take this into account when it calculates the minimum size it needs
to be. This is especially important for very small file systems.
Fix this by adjusting the minimum global block rsv size math to include
the size of the free space tree when calculating the size.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two values were introduced in commit ff023aac31 ("Btrfs: add code
to scrub to copy read data to another disk") as an optimization.
But the truth is, block layer scheduler can do whatever it wants to
merge/split bios to improve performance.
Doing such "optimization" is not really going to affect much, especially
considering how good current block layer optimizations are doing.
Remove such old and immature optimization from our code.
Since we're here, also change BUG_ON()s using these two macros to use
ASSERT()s.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to
calculate this value.
And remove one stale comment on the value, in fact with recent subpage
support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond
BTRFS_STRIPE_LEN, just we don't use the full page.
Also since we're here, update the BUG_ON() related to
SCRUB_MAX_PAGES_PER_BLOCK to ASSERT().
As those ASSERT() are really only for developers to catch early obvious
bugs, not to let end users suffer.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only throttle the btrfs_truncate_inode_items if the root is
SHAREABLE, which isn't set on the log root, which means this loop is
unnecessary.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We reset this bool on every loop through the truncate loop, make this
variable local to the loop.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have
if (del_item)
// do something
else
// something else
if (del_item)
// do yet another thing
else
// something else entirely
back to back in btrfs_truncate_inode_items, collapse these two sets of
if statements into one.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a logic correctness check, convert it into an ASSERT() instead
of a BUG().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a correctness BUG_ON() in btrfs_truncate_inode_items to make
sure that we're always using min_type == BTRFS_EXTENT_DATA_KEY if
new_size is > 0. Convert this to an ASSERT.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we're going to want to use btrfs_truncate_inode_items
without looking up the associated inode. In order to accommodate this
add the inode to btrfs_truncate_control and handle the case where
control->inode is NULL appropriately. This is fairly straightforward,
we simply need to add a helper for the trace points, as the file extent
map update is controlled by a flag on btrfs_truncate_control.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to want to truncate inode items without
needing to have an btrfs_inode to pass in, so add ino to the
btrfs_truncate_control and use that to look up the inode items to
truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only care about updating the file extent range when we are doing a
normal truncation. We skip this for tree logging currently, but we can
also skip this for eviction as well. Using a flag makes it more
explicit when we want to do this work.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've had weird bugs in the past where we forgot to adjust the truncate
path to deal with the fact that we can be called by the tree log path.
Instead of checking if our root is a LOG_ROOT use a flag on the
btrfs_truncate_control to indicate that we don't want to do extent
reference updates during this truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently have a bunch of awkward checks to make sure we only update
the inode i_bytes if we're truncating the real inode. Instead keep
track of the number of bytes we need to sub in the
btrfs_truncate_control, and then do the appropriate adjustment in the
truncate paths that care.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently will update the i_size of the inode as we truncate it down,
however we skip this if we're calling btrfs_truncate_inode_items from
the tree log code. However we also don't care about this in the case of
evict. Instead keep track of this value in the btrfs_truncate_control
and then have btrfs_truncate() and the free space cache truncate path
both do the i_size update themselves.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I'm going to be adding more arguments and counters to
btrfs_truncate_inode_items, so add a control struct to handle all of the
extra arguments to make it easier to follow.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only set this if we find a normal file extent, del_item == 1, and the
file extent points to a real extent and isn't a hole extent. We can use
del_item == 1 && extent_start != 0 to get the same information that
found_extent provides, so remove this variable and use the other
variables instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a special case in btrfs_truncate_inode_items() to call
btrfs_kill_delayed_inode_items() if min_type == 0, which is only called
during evict.
Instead move this out into evict proper, and add some comments because I
erroneously attempted to remove this code altogether without
understanding what we were doing.
Evict is updating the inode only because we only care about making sure
the i_nlink count has hit disk. If we had pending deletions we don't
want to process those via the delayed inode updates, we simply want to
drop all of them and reclaim the reserved metadata space. Then from
there the btrfs_truncate_inode_items() will do the work to remove all of
the items as appropriate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer have inode cache feature, so this check is extraneous as
the only inode cache is in the tree_root, which is not marked as
SHAREABLE.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we are locking the extent and dropping the extent cache for
any inodes we truncate, unless they're in the tree log. We call this
helper from:
- truncate
- evict
- tree log
- free space cache truncation
For evict we've already dropped all of the extent cache for this inode
once we've gotten here, and we're the only one accessing this inode, so
this step is unnecessary.
For the tree log code we already skip this part.
Pull this work into the truncate path and the free space cache
truncation path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an inode item related manipulation with a few vfs related
adjustments. I'm going to remove the vfs related code from this helper
and simplify it a lot, but I want those changes to be easily seen via
git blame, so move this function now and then the simplification work
can be done.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few helpers in inode-item.c, and I'm going to make a few
changes to how we do truncate in the future, so break out these
definitions into their own header file to trim down ctree.h some and
make it easier to do the work on truncate in the future.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment refers to the old extent buffer locking code, where we used to
have custom locks that had blocking and spinning behaviour modes. That is
not the case anymore, since we have transitioned to rw semaphores, so the
comment does not offer any value anymore. Remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After calling split_leaf() we BUG_ON() if the returned value is greater
than zero. However split_leaf() only returns 0, in case of success, or a
negative value in case of an error.
The reason for the BUG_ON() is that if we ever get a positive return
value from split_leaf(), we can not simply propagate it to the callers
of btrfs_search_slot(), as that would be interpreted as "key not found"
and not as an error. That means it could result in callers ending up
causing some potential silent corruption.
So change the BUG_ON() to an ASSERT(), and in case assertions are
disabled, produce a warning and set the return value to an error, to make
it not possible to get into a silent corruption and having the error not
noticed.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's quite a significant amount of code for doing the key search for a
leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus
btrfs_search_slot() is already big enough.
So move the logic that does the key search on a leaf into a new helper
function. This makes it better organized, removing the need for the labels
and the gotos, as well as reducing the indentation level and the size of
btrfs_search_slot().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a key, we check if the write_lock_level is less than 1,
and if so we set it to 1, release the path and retry the tree traversal.
However that is unnecessary, because when ins_len is greater than 0, we
know that write_lock_level can never be less than 1.
The logic to retry is also buggy, because in case ins_len was decremented,
due to an exact key match and the search is not meant for item extension
(path->search_for_extension is 0), we retry without incrementing ins_len,
which would make the next retry decrement it again by the same amount.
So remove the check for write_lock_level being less than 1 and add an
assertion to assert it's always >= 1.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a new key, we release the write lock on the leaf's parent
only after doing the binary search on the leaf. This is because if the
key ends up at slot 0, we will have to update the key at slot 0 of the
parent node. The same reasoning applies to any other upper level nodes
when their slot is 0. We also need to keep the parent locked in case the
leaf does not have enough free space to insert the new key/item, because
in that case we will split the leaf and we will need to add a new key to
the parent due to a new leaf resulting from the split operation.
However if the leaf has enough space for the new key and the key does not
end up at slot 0 of the leaf we could release our write lock on the parent
before doing the binary search on the leaf to figure out the destination
slot. That leads to reducing the amount of time other tasks are blocked
waiting to lock the parent, therefore increasing parallelism when there
are other tasks that are trying to access other leaves accessible through
the same parent. This also applies to other upper nodes besides the
immediate parent, when their slot is 0, since we keep locks on them until
we figure out if the leaf slot is slot 0 or not.
In fact, having the key ending at up slot 0 when is rare. Typically it
only happens when the key is less than or equals to the smallest, the
"left most", key of the entire btree, during a split attempt when we try
to push to the right sibling leaf or when the caller just wants to update
the item of an existing key. It's also very common that a leaf has enough
space to insert a new key, since after a split we move about half of the
keys from one into the new leaf.
So unlock the parent, and any other upper level nodes, when during a key
insertion we notice the key is greater then the first key in the leaf and
the leaf has enough free space. After unlocking the upper level nodes, do
the binary search using a low boundary of slot 1 and not slot 0, to figure
out the slot where the key will be inserted (or where the key already is
in case it exists and the caller wants to modify its item data).
This extra comparison, with the first key, is cheap and the key is very
likely already in a cache line because it immediately follows the header
of the extent buffer and we have recently read the level field of the
header (which in fact is the last field of the header).
The following fs_mark test was run on a non-debug kernel (debian's default
kernel config), with a 12 cores intel CPU, and using a NVMe device:
$ cat run-fsmark.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
FILES=100000
THREADS=$(nproc --all)
FILE_SIZE=0
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this change:
FSUse% Count Size Files/sec App Overhead
0 1200000 0 165273.6 5958381
0 2400000 0 190938.3 6284477
0 3600000 0 181429.1 6044059
0 4800000 0 173979.2 6223418
0 6000000 0 139288.0 6384560
0 7200000 0 163000.4 6520083
1 8400000 0 57799.2 5388544
1 9600000 0 66461.6 5552969
2 10800000 0 49593.5 5163675
2 12000000 0 57672.1 4889398
After this change:
FSUse% Count Size Files/sec App Overhead
0 1200000 0 167987.3 (+1.6%) 6272730
0 2400000 0 198563.9 (+4.0%) 6048847
0 3600000 0 197436.6 (+8.8%) 6163637
0 4800000 0 202880.7 (+16.6%) 6371771
1 6000000 0 167275.9 (+20.1%) 6556733
1 7200000 0 204051.2 (+25.2%) 6817091
1 8400000 0 69622.8 (+20.5%) 5525675
1 9600000 0 69384.5 (+4.4%) 5700723
1 10800000 0 61454.1 (+23.9%) 5363754
3 12000000 0 61908.7 (+7.3%) 5370196
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now generic_bin_search() always uses a low boundary slot of 0, but
in the next patch we'll want to often skip slot 0 when searching for a
key. So make generic_bin_search() have the low boundary slot specified
as an argument, and move the check for the extent buffer level from
btrfs_bin_search() to generic_bin_search() to avoid adding another
wrapper around generic_bin_search().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we clear the extent buffer uptodate if we fail to write it out
we need to check to see if our root node is uptodate before we search
down it. Otherwise we could return stale data (or potentially corrupt
data that was caught by the write verification step) and think that the
path is OK to search down.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently paused balance precludes adding a device since they are both
considered exclusive ops and we can have at most one running at a time.
This is problematic in case a filesystem encounters an ENOSPC situation
while balance is running, in this case the only thing the user can do
is mount the fs with "skip_balance" which pauses balance and delete some
data to free up space for balance. However, it should be possible to add
a new device when balance is paused.
Fix this by allowing device add to proceed when balance is paused.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is needed to enable device add to work in cases when a file system
has been mounted with 'skip_balance' mount option.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current set of exclusive operation states is not sufficient to handle
all practical use cases. In particular there is a need to be able to add
a device to a filesystem that have paused balance. Currently there is no
way to distinguish between a running and a paused balance. Fix this by
introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2
occasions:
1. When a filesystem is mounted with skip_balance and there is an
unfinished balance it will now be into BALANCE_PAUSED instead of
simply BALANCE state.
2. When a running balance is paused.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e6 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it. However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.
Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit(). Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.
To get around this, I added the following:
(1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
inode to indicate that we have a usage count on the cookie caching
that inode.
(2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
when __writeback_single_inode() clears the last dirty page from
i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
done atomically with the check of PAGECACHE_TAG_DIRTY that clears
I_DIRTY_PAGES.
(3) A function, fscache_set_page_dirty(), which if it is not set, sets
I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
resources.
(4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
to unuse the cookie.
(5) A function, fscache_clear_inode_writeback(), to be called when the
inode is evicted, before clear_inode() is called. This cleans up any
lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.
For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
Provide a higher-level function than fscache_write() to perform a write
from an inode's pagecache to the cache, whilst fending off concurrent
writes by means of the PG_fscache mark on a page:
void fscache_write_to_cache(struct fscache_cookie *cookie,
struct address_space *mapping,
loff_t start,
size_t len,
loff_t i_size,
netfs_io_terminated_t term_func,
void *term_func_priv,
bool caching);
If caching is false, this function does nothing except call (*term_func)()
if given. It assumes that, in such a case, PG_fscache will not have been
set on the pages.
Otherwise, if caching is true, this function requires the source pages to
have had PG_fscache set on them before calling. start and len define the
region of the file to be modified and i_size indicates the new file size.
The source pages are extracted from the mapping.
term_func and term_func_priv work as for fscache_write(). The PG_fscache
marks will be cleared at the end of the operation, before term_func is
called or the function otherwise returns.
There is an additonal helper function to clear the PG_fscache bits from a
range of pages:
void fscache_clear_page_bits(struct fscache_cookie *cookie,
struct address_space *mapping,
loff_t start, size_t len,
bool caching);
If caching is true, the pages to be managed are expected to be located on
mapping in the range defined by start and len. If caching is false, it
does nothing.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819614155.215744.5528123235123721230.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906916346.143852.15632773570362489926.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967123599.1823006.12946816026724657428.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021522672.640689.4381958316198807813.stgit@warthog.procyon.org.uk/ # v4
Provide a function to begin a read operation:
int fscache_begin_read_operation(
struct netfs_cache_resources *cres,
struct fscache_cookie *cookie)
This is primarily intended to be called by network filesystems on behalf of
netfslib, but may also be called to use the I/O access functions directly.
It attaches the resources required by the cache to cres struct from the
supplied cookie.
This holds access to the cache behind the cookie for the duration of the
operation and forces cache withdrawal and cookie invalidation to perform
synchronisation on the operation. cres->inval_counter is set from the
cookie at this point so that it can be compared at the end of the
operation.
Note that this does not guarantee that the cache state is fully set up and
able to perform I/O immediately; looking up and creation may be left in
progress in the background. The operations intended to be called by the
network filesystem, such as reading and writing, are expected to wait for
the cookie to move to the correct state.
This will, however, potentially sleep, waiting for a certain minimum state
to be set or for operations such as invalidate to advance far enough that
I/O can resume.
Also provide a function for the cache to call to wait for the cache object
to get to a state where it can be used for certain things:
bool fscache_wait_for_operation(struct netfs_cache_resources *cres,
enum fscache_want_stage stage);
This looks at the cache resources provided by the begin function and waits
for them to get to an appropriate stage. There's a choice of wanting just
some parameters (FSCACHE_WANT_PARAM) or the ability to do I/O
(FSCACHE_WANT_READ or FSCACHE_WANT_WRITE).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819603692.215744.146724961588817028.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906910672.143852.13856103384424986357.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967110245.1823006.2239170567540431836.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021513617.640689.16627329360866150606.stgit@warthog.procyon.org.uk/ # v4
Provide a pair of functions to count the number of users of a cookie (open
files, writeback, invalidation, resizing, reads, writes), to obtain and pin
resources for the cookie and to prevent culling for the whilst there are
users.
The first function marks a cookie as being in use:
void fscache_use_cookie(struct fscache_cookie *cookie,
bool will_modify);
The caller should indicate the cookie to use and whether or not the caller
is in a context that may modify the cookie (e.g. a file open O_RDWR).
If the cookie is not already resourced, fscache will ask the cache backend
in the background to do whatever it needs to look up, create or otherwise
obtain the resources necessary to access data. This is pinned to the
cookie and may not be culled, though it may be withdrawn if the cache as a
whole is withdrawn.
The second function removes the in-use mark from a cookie and, optionally,
updates the coherency data:
void fscache_unuse_cookie(struct fscache_cookie *cookie,
const void *aux_data,
const loff_t *object_size);
If non-NULL, the aux_data buffer and/or the object_size will be saved into
the cookie and will be set on the backing store when the object is
committed.
If this removes the last usage on a cookie, the cookie is placed onto an
LRU list from which it will be removed and closed after a couple of seconds
if it doesn't get reused. This prevents resource overload in the cache -
in particular it prevents it from holding too many files open.
Changes
=======
ver #2:
- Fix fscache_unuse_cookie() to use atomic_dec_and_lock() to avoid a
potential race if the cookie gets reused before it completes the
unusement.
- Added missing transition to LRU_DISCARDING state.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819600612.215744.13678350304176542741.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906907567.143852.16979631199380722019.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967106467.1823006.6790864931048582667.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021511674.640689.10084988363699111860.stgit@warthog.procyon.org.uk/ # v4
Add a number of helper functions to manage access to a cookie, pinning the
cache object in place for the duration to prevent cache withdrawal from
removing it:
(1) void fscache_init_access_gate(struct fscache_cookie *cookie);
This function initialises the access count when a cache binds to a
cookie. An extra ref is taken on the access count to prevent wakeups
while the cache is active. We're only interested in the wakeup when a
cookie is being withdrawn and we're waiting for it to quiesce - at
which point the counter will be decremented before the wait.
The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
track of the extra ref in order to handle a race between
relinquishment and withdrawal both trying to drop the extra ref.
(2) bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function attempts to begin access upon a cookie, pinning it in
place if it's cached. If successful, it returns true and leaves a the
access count incremented.
(3) void fscache_end_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function drops the access count obtained by (2), permitting
object withdrawal to take place when it reaches zero.
A tracepoint is provided to track changes to the access counter on a
cookie.
Changes
=======
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819595085.215744.1706073049250505427.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906895313.143852.10141619544149102193.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095980.1823006.1133648159424418877.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021503063.640689.8870918985269528670.stgit@warthog.procyon.org.uk/ # v4
Add a pair of helper functions to manage access to a volume, pinning the
volume in place for the duration to prevent cache withdrawal from removing
it:
bool fscache_begin_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
void fscache_end_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
The way the access gate on the volume works/will work is:
(1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
then we return false to indicate access was not permitted.
(2) If the cache tests as live, then we increment the volume's n_accesses
count and then recheck the cache liveness, ending the access if it
ceased to be live.
(3) When we end the access, we decrement the volume's n_accesses and wake
up the any waiters if it reaches 0.
(4) Whilst the cache is caching, the volume's n_accesses is kept
artificially incremented to prevent wakeups from happening.
(5) When the cache is taken offline, the state is changed to prevent new
accesses, the volume's n_accesses is decremented and we wait for it to
become 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819594158.215744.8285859817391683254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906894315.143852.5454793807544710479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095028.1823006.9173132503876627466.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021501546.640689.9631510472149608443.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow data file cookies to be acquired
and relinquished by the network filesystem. It is intended that the
filesystem will create such cookies per-inode under a volume.
To request a cookie, the filesystem should call:
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_volume *volume,
u8 advice,
const void *index_key,
size_t index_key_len,
const void *aux_data,
size_t aux_data_len,
loff_t object_size)
The filesystem must first have created a volume cookie, which is passed in
here. If it passes in NULL then the function will just return a NULL
cookie.
A binary key should be passed in index_key and is of size index_key_len.
This is saved in the cookie and is used to locate the associated data in
the cache.
A coherency data buffer of size aux_data_len will be allocated and
initialised from the buffer pointed to by aux_data. This is used to
validate cache objects when they're opened and is stored on disk with them
when they're committed. The data is stored in the cookie and will be
updateable by various functions in later patches.
The object_size must also be given. This is also used to perform a
coherency check and to size the backing storage appropriately.
This function disallows a cookie from being acquired twice in parallel,
though it will cause the second user to wait if the first is busy
relinquishing its cookie.
When a network filesystem has finished with a cookie, it should call:
void
fscache_relinquish_cookie(struct fscache_volume *volume,
bool retire)
If retire is true, any backing data will be discarded immediately.
Changes
=======
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[1].
- Add a check to see if the cookie is still hashed at the point of
freeing.
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
- Remove the unused cookie pointer field from the fscache_acquire
tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/163819590658.215744.14934902514281054323.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906891983.143852.6219772337558577395.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967088507.1823006.12659006350221417165.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021498432.640689.12743483856927722772.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow volumes to be acquired and
relinquished by the network filesystem. A volume is an index of data
storage cache objects. A volume is represented by a volume cookie in the
API. A filesystem would typically create a volume for a superblock and
then create per-inode cookies within it.
To request a volume, the filesystem calls:
struct fscache_volume *
fscache_acquire_volume(const char *volume_key,
const char *cache_name,
const void *coherency_data,
size_t coherency_len)
The volume_key is a printable string used to match the volume in the cache.
It should not contain any '/' characters. For AFS, for example, this would
be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".
The cache_name can be NULL, but if not it should be a string indicating the
name of the cache to use if there's more than one available.
The coherency data, if given, is an arbitrarily-sized blob that's attached
to the volume and is compared when the volume is looked up. If it doesn't
match, the old volume is judged to be out of date and it and everything
within it is discarded.
Acquiring a volume twice concurrently is disallowed, though the function
will wait if an old volume cookie is being relinquishing.
When a network filesystem has finished with a volume, it should return the
volume cookie by calling:
void
fscache_relinquish_volume(struct fscache_volume *volume,
const void *coherency_data,
bool invalidate)
If invalidate is true, the entire volume will be discarded; if false, the
volume will be synced and the coherency data will be updated.
Changes
=======
ver #4:
- Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[2].
- Make the coherency data an arbitrary blob rather than a u64, but don't
store it for the moment.
ver #2:
- Fix error check[1].
- Make a fscache_acquire_volume() return errors, including EBUSY if a
conflicting volume cookie already exists. No error is printed now -
that's left to the netfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203095608.GC2480@kili/ [1]
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/20211220224646.30e8205c@canb.auug.org.au/ [3]
Link: https://lore.kernel.org/r/163819588944.215744.1629085755564865996.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906890630.143852.13972180614535611154.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967086836.1823006.8191672796841981763.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021495816.640689.4403156093668590217.stgit@warthog.procyon.org.uk/ # v4
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
the kernel hasn't always validated FSSETXATTR correctly. Flag this as
something for the sysadmin to check out.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Online fsck depends on callers holding ILOCK_EXCL from the time they
decide to update a block mapping until after they've updated the reverse
mapping records to guarantee the stability of both mapping records.
Unfortunately, the quota code drops ILOCK_EXCL at the first transaction
roll in the dquot allocation process, which breaks that assertion. This
leads to sporadic failures in the online rmap repair code if the repair
code grabs the AGF after bmapi_write maps a new block into the quota
file's data fork but before it can finish the deferred rmap update.
Fix this by rewriting the function to hold the ILOCK until after the
transaction commit like all other bmap updates do, and get rid of the
dqread wrapper that does nothing but complicate the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
mp is being initialized to log->l_mp but this is never read
as record is overwritten later on. Remove the redundant
assignment.
Cleans up the following clang-analyzer warning:
fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Oh, let me count the ways that the kvmalloc API sucks dog eggs.
The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:
- 49.73% xlog_cil_commit
- 31.62% kvmalloc_node
- 29.96% __kmalloc_node
- 29.38% kmalloc_large_node
- 29.33% __alloc_pages
- 24.33% __alloc_pages_slowpath.constprop.0
- 18.35% __alloc_pages_direct_compact
- 17.39% try_to_compact_pages
- compact_zone_order
- 15.26% compact_zone
5.29% __pageblock_pfn_to_page
3.71% PageHuge
- 1.44% isolate_migratepages_block
0.71% set_pfnblock_flags_mask
1.11% get_pfnblock_flags_mask
- 0.81% get_page_from_freelist
- 0.59% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 3.24% try_to_free_pages
- 3.14% shrink_node
- 2.94% shrink_slab.constprop.0
- 0.89% super_cache_count
- 0.66% xfs_fs_nr_cached_objects
- 0.65% xfs_reclaim_inodes_count
0.55% xfs_perag_get_tag
0.58% kfree_rcu_shrink_count
- 2.09% get_page_from_freelist
- 1.03% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 4.88% get_page_from_freelist
- 3.66% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 1.63% __vmalloc_node
- __vmalloc_node_range
- 1.10% __alloc_pages_bulk
- 0.93% __alloc_pages
- 0.92% get_page_from_freelist
- 0.89% rmqueue_bulk
- 0.69% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
13.73% memcpy_erms
- 2.22% kvfree
On this workload, that's almost a dozen CPUs all trying to compact
and reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page
and shrinker reclaim that direct reclaim is doing. Copying all the
metadata is taking far less CPU time than allocating the storage!
Direct reclaim should be considered extremely harmful.
This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using
kvmalloc to allow memory allocation to avoid doing lots of work to
try to do contiguous allocations.
Except it still does *lots of costly work* that is unnecessary.
Worse: the only way to avoid the slowpath page allocation trying to
do compaction on costly allocations is to turn off direct reclaim
(i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This
cuts off the vmalloc fallback, and this leads to almost instant OOM
problems which ends up in filesystems deadlocks, shutdowns and/or
kernel crashes.
I want some basic kvmalloc behaviour:
- kmalloc for a contiguous range with fail fast semantics - no
compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
The really, really stupid part about this is these kvmalloc() calls
are run under memalloc_nofs task context, so all the allocations are
always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in. IOWs, we're already telling
kvmalloc to behave differently to the gfp flags we pass in, but it
still won't allow vmalloc to be run with anything other than
GFP_KERNEL.
So, this patch open codes the kvmalloc() in the commit path to have
the above described behaviour. The result is we more than halve the
CPU time spend doing kvmalloc() in this path and transaction commits
with 64kB objects in them more than doubles. i.e. we get ~5x
reduction in CPU usage per costly-sized kvmalloc() invocation and
the profile looks like this:
- 37.60% xlog_cil_commit
16.01% memcpy_erms
- 8.45% __kmalloc
- 8.04% kmalloc_order_trace
- 8.03% kmalloc_order
- 7.93% alloc_pages
- 7.90% __alloc_pages
- 4.05% __alloc_pages_slowpath.constprop.0
- 2.18% get_page_from_freelist
- 1.77% wake_all_kswapds
....
- __wake_up_common_lock
- 0.94% _raw_spin_lock_irqsave
- 3.72% get_page_from_freelist
- 2.43% _raw_spin_lock_irqsave
- 5.72% vmalloc
- 5.72% __vmalloc_node_range
- 4.81% __get_vm_area_node.constprop.0
- 3.26% alloc_vmap_area
- 2.52% _raw_spin_lock
- 1.46% _raw_spin_lock
0.56% __alloc_pages_bulk
- 4.66% kvfree
- 3.25% vfree
- __vfree
- 3.23% __vunmap
- 1.95% remove_vm_area
- 1.06% free_vmap_area_noflush
- 0.82% _raw_spin_lock
- 0.68% _raw_spin_lock
- 0.92% _raw_spin_lock
- 1.40% kfree
- 1.36% __free_pages
- 1.35% __free_pages_ok
- 1.02% _raw_spin_lock_irqsave
It's worth noting that over 50% of the CPU time spent allocating
these shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct
reclaim from kmalloc, and could probably be made even less costly if
vmalloc() didn't use global spinlocks to protect it's structures.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the xfs sysfs code to use default_groups field which has
been the preferred way since aa30f47cf6 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When the kernel is locked down the kernel allows reading only debugfs
files with mode 444. Mode 400 is also valid but is not allowed.
Make the 444 into a mask.
Fixes: 5496197f9b ("debugfs: Restrict debugfs when the kernel is locked down")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Link: https://lore.kernel.org/r/20220104170505.10248-1-msuchanek@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Move the request list macros to the header file that defines that struct
they operate on.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-2-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Compress page will invalidate in truncate block process too, so remove
redunant invalidate compress pages in f2fs_evict_inode.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix the following coccicheck warning:
./fs/f2fs/sysfs.c:491:41-46: WARNING: conversion to bool not needed here
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For compressed inode, in .{invalidate,release}page, we will call
f2fs_invalidate_compress_pages() to drop all compressed page cache of
current inode.
But we don't need to drop compressed page cache synchronously in
.invalidatepage, because, all trancation paths of compressed physical
block has been covered with f2fs_invalidate_compress_page().
And also we don't need to drop compressed page cache synchronously
in .releasepage, because, if there is out-of-memory, we can count
on page cache reclaim on sbi->compress_inode.
BTW, this patch may fix the issue reported below:
https://lore.kernel.org/linux-f2fs-devel/20211202092812.197647-1-changfengnan@vivo.com/T/#u
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
https://bugzilla.kernel.org/show_bug.cgi?id=204137
With below script, we will hit panic during new segment allocation:
DISK=bingo.img
MOUNT_DIR=/mnt/f2fs
dd if=/dev/zero of=$DISK bs=1M count=105
mkfs.f2fe -a 1 -o 19 -t 1 -z 1 -f -q $DISK
mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"
for (( i = 0; i < 4096; i++ )); do
name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
mkdir $MOUNT_DIR/$name
done
umount $MOUNT_DIR
rm $DISK
--- Core dump ---
Call Trace:
allocate_segment_by_default+0x9d/0x100 [f2fs]
f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
do_write_page+0x62/0x110 [f2fs]
f2fs_outplace_write_data+0x43/0xc0 [f2fs]
f2fs_do_write_data_page+0x386/0x560 [f2fs]
__write_data_page+0x706/0x850 [f2fs]
f2fs_write_cache_pages+0x267/0x6a0 [f2fs]
f2fs_write_data_pages+0x19c/0x2e0 [f2fs]
do_writepages+0x1c/0x70
__filemap_fdatawrite_range+0xaa/0xe0
filemap_fdatawrite+0x1f/0x30
f2fs_sync_dirty_inodes+0x74/0x1f0 [f2fs]
block_operations+0xdc/0x350 [f2fs]
f2fs_write_checkpoint+0x104/0x1150 [f2fs]
f2fs_sync_fs+0xa2/0x120 [f2fs]
f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
do_writepages+0x1c/0x70
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x273/0x5c0
wb_writeback+0xff/0x2e0
wb_workfn+0xa1/0x370
process_one_work+0x138/0x350
worker_thread+0x4d/0x3d0
kthread+0x109/0x140
ret_from_fork+0x25/0x30
The root cause here is, with IO alignment feature enables, in worst
case, we need F2FS_IO_SIZE() free blocks space for single one 4k write
due to IO alignment feature will fill dummy pages to make IO being
aligned.
So we will easily run out of free segments during non-inline directory's
data writeback, even in process of foreground GC.
In order to fix this issue, I just propose to reserve additional free
space for IO alignment feature to handle worst case of free space usage
ratio during FGGC.
Fixes: 0a595ebaaa ("f2fs: support IO alignment for DATA and NODE writes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, nat_bit area may be persisted across boundary of CP area during
nat_bit rebuilding.
Fixes: 94c821fb28 ("f2fs: rebuild nat_bits during umount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs: support fault injection for f2fs_trylock_op()
This patch supports to inject fault into f2fs_trylock_op().
Usage:
a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=65536 <dev> <mountpoint>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Wenqing Liu reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215235
- Overview
page fault in f2fs_setxattr() when mount and operate on corrupted image
- Reproduce
tested on kernel 5.16-rc3, 5.15.X under root
1. unzip tmp7.zip
2. ./single.sh f2fs 7
Sometimes need to run the script several times
- Kernel dump
loop0: detected capacity change from 0 to 131072
F2FS-fs (loop0): Found nat_bits in checkpoint
F2FS-fs (loop0): Mounted with checkpoint version = 7548c2ee
BUG: unable to handle page fault for address: ffffe47bc7123f48
RIP: 0010:kfree+0x66/0x320
Call Trace:
__f2fs_setxattr+0x2aa/0xc00 [f2fs]
f2fs_setxattr+0xfa/0x480 [f2fs]
__f2fs_set_acl+0x19b/0x330 [f2fs]
__vfs_removexattr+0x52/0x70
__vfs_removexattr_locked+0xb1/0x140
vfs_removexattr+0x56/0x100
removexattr+0x57/0x80
path_removexattr+0xa3/0xc0
__x64_sys_removexattr+0x17/0x20
do_syscall_64+0x37/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is in __f2fs_setxattr(), we missed to do sanity check on
last xattr entry, result in out-of-bound memory access during updating
inconsistent xattr data of target inode.
After the fix, it can detect such xattr inconsistency as below:
F2FS-fs (loop11): inode (7) has invalid last xattr entry, entry_size: 60676
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has invalid last xattr entry, entry_size: 47736
Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to mitigate lock contention between f2fs_write_checkpoint and
f2fs_get_node_info along with nat_tree_lock.
The idea is, if checkpoint is currently running, other threads that try to grab
nat_tree_lock would be better to wait for checkpoint.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Get rid of old erofs_get_meta_page() within zmap operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Finally, erofs_get_meta_page() is useless. Get rid of it!
Link: https://lore.kernel.org/r/20220102040017.51352-6-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within xattr operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102040017.51352-5-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within super operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102081317.109797-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within inode operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102040017.51352-3-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In order to support subpage and folio for all uncompressed files,
introduce meta buffer descriptors, which can be effectively stored
on stack, in place of meta page operations.
This converts the uncompressed data path to meta buffers.
Link: https://lore.kernel.org/r/20220102040017.51352-2-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch prints the cluster node address if a non-cluster node
(according to the dlm config setting) tries to connect. The current
hexdump call will print in a different loglevel and only available if
dynamic debug is enabled. Additional we using the ip address format
strings to print an IETF ip4/6 string represenation.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
btrfs_free_space_ctl::private is either unset or it always points to
struct btrfs_block_group when it is set. So there's no point in keeping
the unhelpful 'private' name and keeping it an untyped pointer. Change
both the type and name to be self-describing. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in the function taking an fs_info and a
btrfs_free_space because the ctl passed always belongs to the block
group. Furthermore fs_info can be referenced from the block group. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only difference between the two is whether btrfs_free_space::bytes
is adjusted. Instead of having 2 separate functions control this
behavior via an additional parameter and make them one function instead.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only difference is the former adjusts btrfs_free_space::bytes
member. Consolidate the two function into 1 and add a bool parameter
which controls whether the adjustment is made or not. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to have multiple copies of these trees. To
facilitate this we need a way to lookup the different roots we are
looking for. Handle this by adding a global root rb tree that is
indexed on the root->root_key. Then instead of loading the roots at
mount time with individually targeted keys, simply search the tree_root
for anything with the specific objectid we want. This will make it
straightforward to support both old style and new style file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't set SHAREABLE on the extent root, we don't need to have this
safety check here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so adjust
all the users of the free space root to use a helper to access the root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places where we skip doing csums if we mounted with one of
the rescue options that ignores bad csum roots. In the future when
there are multiple csum roots it'll be costly to check and see if there
are any missing csum roots, so simply add a flag to indicate the fs
should skip loading csums in case of errors.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we may have multiple csum roots, so simply check the
objectid is for a csum root instead of checking against ->csum_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we start having multiple extent roots we'll need to use a helper to
get to the correct extent_root. Rename fs_info->extent_root to
_extent_root and convert all of the users of the extent root to using
the btrfs_extent_root() helper. This will allow us to easily clean up
the remaining direct accesses in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we're going to have multiple csum and extent root trees,
so init the roots block_rsv at setup_root time based on their root key
objectid.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only need the root to start a transaction, and since it's a global
root we can pick anything, change to the tree_root as we'll have a lot
of extent roots in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have many extent_roots soon, and we don't need a root
here necessarily as we're not modifying anything, we're just getting the
trans handle so we can have an accurate view of references, so use the
tree_root here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're just using the extent_root to set the chunk owner to
root_key->objectid, which is BTRFS_EXTENT_TREE_OBJECTID, so use that
directly instead of using the root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only defrag leaves on roots that have SHAREABLE set, so we don't need
to check if we're the extent root as it doesn't have SHAREABLE set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a leftover from when we used to independently swap the extent
root's commit root and the fs tree commit roots. At the time I simply
changed the helper to a list_add. There's actually no reason to not add
the extent root to the switch commit root at this point, we don't care
about the order we do the switching since it's all done under the
commit_root_sem.
If we re-mark the extent root dirty after adding it to the
switch_commits list we'll see that BTRFS_ROOT_DIRTY isn't set and then
list_move it back onto the dirty list, and then we'll redo the tree
update and everything will be ok.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're only using this to start the transaction with to possibly allocate
a chunk. It doesn't really matter which root to use, but with extent
tree v2 we'll need a bytenr to look up a extent root which makes the
usage of the extent_root awkward here. Simply change it to the
chunk_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we'll have a different extent root based on where
the bytenr is located, so adjust the remove_extent_backref() helper and
it's helpers to pass the extent_root around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we will have a separate root to hold the block group
items. Add a btrfs_block_group_root() that will return the appropriate
root given the flags of the fs, and convert all functions that need to
modify block group items to use the helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're looking for leafs that point to a data extent we want to record
the extent items that point at our bytenr. At this point we have the
reference and we know for a fact that this leaf should have a reference
to our bytenr. However if there's some sort of corruption we may not
find any references to our leaf, and thus could end up with eie == NULL.
Replace this BUG_ON() with an ASSERT() and then return -EUCLEAN for the
mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We search for an extent entry with .offset = -1, which shouldn't be a
thing, but corruption happens. Add an ASSERT() for the developers,
return -EUCLEAN for mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We define __TRANS_DUMMY always, so this extra ifdef stuff is not needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This comment was much closer to the related code when it was originally
added, but has slowly migrated north far from its ancestral lands. Move
it back down with its people.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass in the path, but use btrfs_next_item() using the root we
searched with. Pass the root down to add_keyed_refs() instead of the
fs_info so we can continue to use the same root we searched with.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nobody is using this anymore, remove it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root on the trans->root can be anything, and generally we're
committing from the transaction kthread so it's usually the tree_root.
Change this to just take an fs_info, and to maintain compatibility
simply put the ROOT_TREE_OBJECTID as the root objectid for the
tracepoint. This will allow use to remove trans->root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we do this awful thing where we get another ref on a trans
handle, async off that handle and commit the transaction from that work.
Because we do this we have to mess with current->journal_info and the
freeze counting stuff.
We already have an async thing to kick for the transaction commit, the
transaction kthread. Replace this work struct with a flag on the
fs_info to tell the kthread to go ahead and commit even if it's before
our timeout. Then we can drastically simplify the async transaction
commit path.
Note: this can be simplified and functionality based on the pending
operation COMMIT.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is no longer used, the -o nobarrier is handled by
BTRFS_MOUNT_NOBARRIER. Remove the flag.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reshuffle the code inside the first loop of tree_search_offset so that
one if() is eliminated and the becomes more linear.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When debugging calc_bio_boundaries(), I found that even for RAID1
metadata, we're following stripe length to calculate stripe boundary.
# mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12]
# mount /dev/test/scratch /mnt/btrfs
# xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file
# umount
Above very basic operations will make calc_bio_boundaries() to report
the following result:
submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152
submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536
...
submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384
submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384
submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536
submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536
submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152
submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768
Where "r/i" is the rootid and inode, 1/1 means they metadata.
The remaining names match the member used in kernel.
Even all data/metadata are using RAID1, we're still following stripe
length.
[CAUSE]
This behavior is caused by a wrong condition in btrfs_get_io_geometry():
if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
/* Fill using stripe_len */
len = min_t(u64, em->len - offset, max_len);
} else {
len = em->len - offset;
}
This means, only for SINGLE we will not follow stripe_len.
However for profiles like RAID1*, DUP, they don't need to bother
stripe_len.
This can lead to unnecessary bio split for RAID1*/DUP profiles, and can
even be a blockage for future zoned RAID support.
[FIX]
Introduce one single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and
change the condition to only calculate the length using stripe length
for stripe based profiles.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a small optimisation since the currently 'entry' is already
checked in the if () {} else if {} construct above the loop. In essence
the first iteration of the final while loop is redundant. To eliminate
this extra check simply get the next entry at the beginning of the loop.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I noticed a few corner cases when looking at my bytes_index patch for
obvious bugs, so add a bunch of tests to validate proper behavior of the
bytes_index tree. A couple of basic tests to make sure it puts things
in the correct order, and then more complicated tests to make sure it
re-arranges bitmap entries properly and does the right thing when we try
to make allocations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we index free space on offset only, because usually we have a
hint from the allocator that we want to honor for locality reasons.
However if we fail to use this hint we have to go back to a brute force
search through the free space entries to find a large enough extent.
With sufficiently fragmented free space this becomes quite expensive, as
we have to linearly search all of the free space entries to find if we
have a part that's long enough.
To fix this add a cached rb tree to index based on free space entry
bytes. This will allow us to quickly look up the largest chunk in the
free space tree for this block group, and stop searching once we've
found an entry that is too small to satisfy our allocation. We simply
choose to use this tree if we're searching from the beginning of the
block group, as we know we do not care about locality at that point.
I wrote an allocator test that creates a 10TiB ram backed null block
device and then fallocates random files until the file system is full.
I think go through and delete all of the odd files. Then I spawn 8
threads that fallocate 64MiB files (1/2 our extent size cap) until the
file system is full again. I use bcc's funclatency to measure the
latency of find_free_extent. The baseline results are
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 10356 |**** |
512 -> 1023 : 58242 |************************* |
1024 -> 2047 : 74418 |******************************** |
2048 -> 4095 : 90393 |****************************************|
4096 -> 8191 : 79119 |*********************************** |
8192 -> 16383 : 35614 |*************** |
16384 -> 32767 : 13418 |***** |
32768 -> 65535 : 12811 |***** |
65536 -> 131071 : 17090 |******* |
131072 -> 262143 : 26465 |*********** |
262144 -> 524287 : 40179 |***************** |
524288 -> 1048575 : 55469 |************************ |
1048576 -> 2097151 : 48807 |********************* |
2097152 -> 4194303 : 26744 |*********** |
4194304 -> 8388607 : 35351 |*************** |
8388608 -> 16777215 : 13918 |****** |
16777216 -> 33554431 : 21 | |
avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690
And the patch results are
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 6883 |** |
512 -> 1023 : 54346 |********************* |
1024 -> 2047 : 79170 |******************************** |
2048 -> 4095 : 98890 |****************************************|
4096 -> 8191 : 81911 |********************************* |
8192 -> 16383 : 27075 |********** |
16384 -> 32767 : 14668 |***** |
32768 -> 65535 : 13251 |***** |
65536 -> 131071 : 15340 |****** |
131072 -> 262143 : 26715 |********** |
262144 -> 524287 : 43274 |***************** |
524288 -> 1048575 : 53870 |********************* |
1048576 -> 2097151 : 55368 |********************** |
2097152 -> 4194303 : 41036 |**************** |
4194304 -> 8388607 : 24927 |********** |
8388608 -> 16777215 : 33 | |
16777216 -> 33554431 : 9 | |
avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042
There's a little variation in the amount of calls done because of timing
of the threads with metadata requirements, but the avg, total, and
count's are relatively consistent between runs (usually within 2-5% of
each other). As you can see here we have around a 30% decrease in
average latency with a 30% decrease in overall time spent in
find_free_extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While adding self tests for my space index change I was hitting a
problem where the space indexed tree wasn't returning the expected
->max_extent_size. This is because we will skip searching any entry
that doesn't have ->bytes >= the amount of bytes we want. However we'll
still set the max_extent_size based on that entry. The problem is if we
don't search the bitmap we won't have ->max_extent_size set properly, so
we can't really trust it.
This doesn't really result in a problem per-se, it can just result in us
not finding contiguous area that may exist. Fix the max_extent_size
helper to return ->bytes if ->max_extent_size isn't set, and add a big
comment explaining why we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use @nr_written to record how many pages have been started by
btrfs_run_delalloc_range().
Currently there are only two cases that would populate @nr_written:
- Inline extent creation
- Compressed write
But both cases will also set @page_started to one.
In fact, in writepage_delalloc() we have the following code, showing
that @nr_written is really only utilized for above two cases:
/* did the fill delalloc function already unlock and start
* the IO?
*/
if (page_started) {
/*
* we've unlocked the page, so we can't update
* the mapping's writeback index, just update
* nr_to_write.
*/
wbc->nr_to_write -= nr_written;
return 1;
}
But for such cases, writepage_delalloc() will return 1, and exit
__extent_writepage() without going through __extent_writepage_io().
Thus this means, inside __extent_writepage_io(), we always get
@nr_written as 0.
So this patch is going to remove the unnecessary parameter from the
following functions:
- writepage_delalloc()
As @nr_written passed in is always the initial value 0.
Although inside that function, we still need a local @nr_written
to update wbc->nr_to_write.
- __extent_writepage_io()
As explained above, @nr_written passed in can only be 0.
This also means we can remove one update_nr_written() call.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to need the root for btrfs_reserve_metadata_bytes to check the
orphan cleanup state, but we no longer need that, we simply need the
fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and
change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
same as they simply call btrfs_reserve_metadata_bytes() and then
manipulate the block_rsv that is being used.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we don't care about the stage of the orphan_cleanup_state,
simply replace it with a bit on ->state to make sure we don't call the
orphan cleanup every time we wander into this root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is very old code before we were stealing from the global reserve
during evict. We have proper ways to steal from the global reserve
while we're evicting, so rip out this code as it's no longer necessary.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I forgot to convert this over when I introduced the global reserve
stealing code to the space flushing code. Evict was simply trying to
make its reservation and then if it failed it would steal from the
global rsv, which is racey because it's outside of the normal ticketing
code.
Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
and then make the priority flushing path do the steal for us.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to use this helper in the priority flushing loop, move this
check into the helper to simplify the logic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we're dropping locks before we enter the priority flushing loops
we could have had our ticket granted before we got the space_info->lock.
So add this check to avoid doing some extra flushing in the priority
flushing cases.
The case in priority_reclaim_metadata_space is an optimization. Think
we came in to reserve, we didn't have the space, we added our ticket to
the list. But at the same time somebody was waiting on the space_info
lock to add space and do btrfs_try_granting_ticket(), so we drop the
lock, get satisfied, come in to do our loop, and we have been
satisfied.
This is the priority reclaim path, so to_reclaim could be !0 still
because we may have only satisfied the priority tickets and still left
non priority tickets on the list. We would then have to_reclaim but
->bytes == 0.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note about the optimization ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the error case for the priority tickets is handled where we
deal with all of the tickets, priority and non-priority. This is OK in
general, but it makes for some awkward locking. We take and drop the
space_info->lock back to back because of these different types of
tickets.
Rework the code to handle priority ticket failures in their respective
helpers. This allows us to be less wonky with our space_info->lock
usage, and means that the main handler simply has to check
ticket->error, as the ticket is guaranteed to be off any list and
completely handled by the time it exits one of the handlers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting a device, we are reporting the zones twice: once for
checking the zone attributes in btrfs_get_dev_zone_info and once for
loading block groups' zone info in
btrfs_load_block_group_zone_info(). With a lot of block groups, that
leads to a lot of REPORT ZONE commands and slows down the mount
process.
This patch introduces a zone info cache in struct
btrfs_zoned_device_info. The cache is populated while in
btrfs_get_dev_zone_info() and used for
btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
commands. The zone cache is then released after loading the block
groups, as it will not be much effective during the run time.
Benchmark: Mount an HDD with 57,007 block groups
Before patch: 171.368 seconds
After patch: 64.064 seconds
While it still takes a minute due to the slowness of loading all the
block groups, the patch reduces the mount time by 1/3.
Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
Fixes: 5b31646898 ("btrfs: get zone information of zoned block devices")
CC: stable@vger.kernel.org
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit ba8a9d0795 ("Btrfs: delete the entire async bio submission
framework") removed submit workqueues, the parameter fs_devices is not used
anymore.
Remove it, no functional changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the transaction commit path we are acquiring the tree log mutex too
early and we have a stale comment because:
1) It mentions a function named btrfs_commit_tree_roots(), which does not
exists anymore, it was the old name of commit_cowonly_roots(), renamed
a very long time ago by commit 5d4f98a28c ("Btrfs: Mixed back
reference (FORWARD ROLLING FORMAT CHANGE)"));
2) It mentions that we need to acquire the tree log mutex at that point
to ensure we have no running log writers. That is not correct anymore,
for many years at least, since we are guaranteed that we do not have
any log writers at that point simply because we have set the state of
the transaction to TRANS_STATE_COMMIT_DOING and have waited for all
writers to complete - meaning no one can log until we change the state
of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the
transaction or start a new one will block until we do that state
transition;
3) The comment mentions a "trans mutex" which doesn't exists since 2011,
commit a4abeea41a ("Btrfs: kill trans_mutex") removed it;
4) The current use of the tree log mutex is to ensure proper serialization
of super block writes - if someone started a new transaction and uses it
for logging, it will wait for the previous transaction to write its
super block before writing the super block when attempting to sync the
log.
So acquire the tree log mutex only when it's absolutely needed, before
setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the
stale comment, add some assertions and new comments where appropriate.
Also, this has no effect on concurrency or performance, since the new
start of the critical section is still when the transaction is in the
state TRANS_STATE_COMMIT_DOING.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
so that its parent function btrfs_init_new_device() can add the new sprout
device to fs_info->fs_devices.
Both btrfs_prepare_sprout() and btrfs_init_new_device() need
device_list_mutex. But they are holding it separately, thus create a
small race window. Close it and hold device_list_mutex across both
functions btrfs_init_new_device() and btrfs_prepare_sprout().
Split btrfs_prepare_sprout() into btrfs_init_sprout() and
btrfs_setup_sprout(). This split is essential because device_list_mutex
must not be held for allocations in btrfs_init_sprout() but must be held
for btrfs_setup_sprout(). So now a common device_list_mutex can be used
between btrfs_init_new_device() and btrfs_setup_sprout().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Declare int seeding_dev as a bool. Also, move its declaration a line
below to adjust packing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Again, I don't think this was ever used since iterate_dir_item() is only
used for xattrs. No functional change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As far as I can tell, this was never used. No functional change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually
the offset of the end of the data the item points to. In fact all of
the helpers that we use btrfs_item_end_nr() use data in their name, like
BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end()
to make it clear what this helper is giving us.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can
be collapsed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last remaining place where we have the pattern of
item = btrfs_item_nr(slot)
<do something with the item>
are the token helpers. Handle this by introducing token helpers that
will do the btrfs_item_nr() work inside of the helper itself, and then
convert all users of the btrfs_item token helpers to the new _nr()
variants.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of getting the btrfs_item for this, simply pass in the slot of
the item and then use the btrfs_item_size_nr() helper inside of
btrfs_file_extent_inline_item_len().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the pattern of
item = btrfs_item_nr(slot);
btrfs_set_item_*(leaf, item);
in a bunch of places in our code. Fix this by adding
btrfs_set_item_*_nr() helpers which will do the appropriate work, and
replace those calls with
btrfs_set_item_*_nr(leaf, slot);
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this pattern in a lot of places
item = btrfs_item_nr(slot);
btrfs_item_size(leaf, item);
when we could simply use
btrfs_item_size(leaf, slot);
Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
_nr variation of the helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we log only dir index keys when logging a directory, we no longer
need to deal with dir item keys in the log replay code for replaying
directory deletes. This is also true for the case when we replay a log
tree created by a kernel that still logs dir items.
So remove the remaining code of the replay of directory deletes algorithm
that deals with dir item keys.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, when logging a directory, we copy both dir items and dir index
items from the fs/subvolume tree to the log tree. Both items have exactly
the same data (same struct btrfs_dir_item), the difference lies in the key
values, where a dir index key contains the index number of a directory
entry while the dir item key does not, as it's used for doing fast lookups
of an entry by name, while the former is used for sorting entries when
listing a directory.
We can exploit that and log only the dir index items, since they contain
all the information needed to correctly add, replace and delete directory
entries when replaying a log tree. Logging only the dir index items is
also backward and forward compatible: an unpatched kernel (without this
change) can correctly replay a log tree generated by a patched kernel
(with this patch), and a patched kernel can correctly replay a log tree
generated by an unpatched kernel.
The backward compatibility is ensured because:
1) For inserting a new dentry: a dentry is only inserted when we find a
new dir index key - we can only insert if we know the dir index offset,
which is encoded in the dir index key's offset;
2) For deleting dentries: during log replay, before adding or replacing
dentries, we first replay dentry deletions. Whenever we find a dir item
key or a dir index key in the subvolume/fs tree that is not logged in
a range for which the log tree is authoritative, we do the unlink of
the dentry, which removes both the existing dir item key and the dir
index key. Therefore logging just dir index keys is enough to ensure
dentry deletions are correctly replayed;
3) For dentry replacements: they work when we log only dir index keys
and this is mostly due to a combination of 1) and 2). If we replace a
dentry with name "foobar" to point from inode A to inode B, then we
know the dir index key for the new dentry is different from the old
one, as it has an index number (key offset) larger than the old one.
This results in replaying a deletion, through replay_dir_deletes(),
that causes the old dentry to be removed, both the dir item key and
the dir index key, as mentioned at 2). Then when processing the new
dir index key, we add the new dentry, adding both a new dir item key
and a new index key pointing to inode B, as stated in 1).
The forward compatibility, the ability for a patched kernel to replay a
log created by an older, unpatched kernel, comes from the changes required
for making sure we are able to replay a log that only contains dir index
keys - we simply ignore every dir item key we find.
So modify directory logging to log only dir index items, and modify the
log replay process to ignore dir item keys, from log trees created by an
unpatched kernel, and process only with dir index keys. This reduces the
amount of logged metadata by about half, and therefore the time spent
logging or fsyncing large directories (less CPU time and less IO).
The following test script was used to measure this change:
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_NEW_FILES=1000000
NUM_FILE_DELETES=10000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
# sync to force transaction commit and wipeout the log.
sync
del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The tests were run on a physical machine, with a non-debug kernel (Debian's
default kernel config), for different values of $NUM_NEW_FILES and
$NUM_FILE_DELETES, and the results were the following:
** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
dir fsync took 8412 ms after adding 1000000 files
dir fsync took 500 ms after deleting 10000 files
** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
dir fsync took 4252 ms after adding 1000000 files (-49.5%)
dir fsync took 269 ms after deleting 10000 files (-46.2%)
** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 745 ms after adding 100000 files
dir fsync took 59 ms after deleting 1000 files
** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 404 ms after adding 100000 files (-45.8%)
dir fsync took 31 ms after deleting 1000 files (-47.5%)
** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 67 ms after adding 10000 files
dir fsync took 9 ms after deleting 1000 files
** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 36 ms after adding 10000 files (-46.3%)
dir fsync took 5 ms after deleting 1000 files (-44.4%)
** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
dir fsync took 9 ms after adding 1000 files
dir fsync took 4 ms after deleting 100 files
** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
dir fsync took 7 ms after adding 1000 files (-22.2%)
dir fsync took 3 ms after deleting 100 files (-25.0%)
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since both unused block groups and reclaim bgs lists are protected by
unused_bgs_lock then free them in the same critical section without
doing an extra unlock/lock pair.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>