Commit Graph

483 Commits

Author SHA1 Message Date
Darrick J. Wong
69115f775f xfs: teach scrub to check for sole ownership of metadata objects
Strengthen online scrub's checking even further by enabling us to check
that a range of blocks are owned solely by a given owner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:15 -07:00
Darrick J. Wong
efc0845f5d xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results
Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be allocated ondisk
inodes; or there might be a mix of the two.

Unfortunately, sparse inodes adds to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result.  On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:15 -07:00
Darrick J. Wong
bc0f3b5546 xfs: directly cross-reference the inode btrees with each other
Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:14 -07:00
Darrick J. Wong
c01868b60e xfs: clean up broken eearly-exit code in the inode btree scrubber
Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state.  While we're at it, collapse trivial
helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:13 -07:00
Darrick J. Wong
7ac14fa2bd xfs: ensure that all metadata and data blocks are not cow staging extents
Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents.  The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:12 -07:00
Darrick J. Wong
7ad9ea6398 xfs: check the reference counts of gaps in the refcount btree
Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings.  We don't
currently check this, so make the refcount scrubber more complete.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:12 -07:00
Darrick J. Wong
6abc7aef85 xfs: replace xfs_btree_has_record with a general keyspace scanner
The current implementation of xfs_btree_has_record returns true if it
finds /any/ record within the given range.  Unfortunately, that's not
sufficient for scrub.  We want to be able to tell if a range of keyspace
for a btree is devoid of records, is totally mapped to records, or is
somewhere in between.  By forcing this to be a boolean, we conflated
sparseness and fullness, which caused scrub to return incorrect results.
Fix the API so that we can tell the caller which of those three is the
current state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:10 -07:00
Darrick J. Wong
bd7e795108 xfs: refactor ->diff_two_keys callsites
Create wrapper functions around ->diff_two_keys so that we don't have to
remember what the return values mean, and adjust some of the code
comments to reflect the longtime code behavior.  We're going to
introduce more uses of ->diff_two_keys in the next patch, so reduce the
cognitive load for readers by doing this refactoring now.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:10 -07:00
Darrick J. Wong
2bea8df0a5 xfs: always scrub record/key order of interior records
In commit d47fef9342, we removed the firstrec and firstkey fields of
struct xchk_btree because Christoph thought they were unnecessary
because we could use the record index in the btree cursor.  This is
incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
position within a specific btree block, not within the entire level.

The end result is that scrub no longer detects situations where the
rightmost record of a block is identical to the leftmost record of that
block's right sibling.  Fix this regression by reintroducing record
validity booleans so that order checking skips *only* the leftmost
record/key in each level.

Fixes: d47fef9342 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:09 -07:00
Darrick J. Wong
c99f99fa3e xfs: check btree keys reflect the child block
When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:08 -07:00
Darrick J. Wong
38384569a2 xfs: detect unwritten bit set in rmapbt node block keys
In the last patch, we changed the rmapbt code to remove the UNWRITTEN
bit when creating an rmapbt key from an rmapbt record, and we changed
the rmapbt key comparison code to start considering the ATTR and BMBT
flags during lookup.  This brought the behavior of the rmapbt
implementation in line with its specification.

However, there may exist filesystems that have the unwritten bit still
set in the rmapbt keys.  We should detect these situations and flag the
rmapbt as one that would benefit from optimization.  Eventually, online
repair will be able to do something in response to this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:07 -07:00
Darrick J. Wong
de1a9ce225 xfs: hoist inode record alignment checks from scrub
Move the inobt record alignment checks from xchk_iallocbt_rec into
xfs_inobt_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:06 -07:00
Darrick J. Wong
7d7d6d2fd0 xfs: hoist rmap record flag checks from scrub
Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:05 -07:00
Darrick J. Wong
69010fe3ac xfs: standardize ondisk to incore conversion for bmap btrees
Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:04 -07:00
Darrick J. Wong
c4e34172da xfs: standardize ondisk to incore conversion for rmap btrees
Create a xfs_rmap_check_irec function to detect corruption in btree
records.  Fix all xfs_rmap_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:03 -07:00
Darrick J. Wong
39ab26d59f xfs: return a failure address from xfs_rmap_irec_offset_unpack
Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
Change this function to return the code address of a failed conversion
in preparation for the next patch, which standardizes localized record
checking and reporting code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:02 -07:00
Darrick J. Wong
2b30cc0bf0 xfs: standardize ondisk to incore conversion for refcount btrees
Create a xfs_refcount_check_irec function to detect corruption in btree
records.  Fix all xfs_refcount_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:02 -07:00
Darrick J. Wong
366a0b8d49 xfs: standardize ondisk to incore conversion for inode btrees
Create a xfs_inobt_check_irec function to detect corruption in btree
records.  Fix all xfs_inobt_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:01 -07:00
Darrick J. Wong
35e3b9a117 xfs: standardize ondisk to incore conversion for free space btrees
Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
an incore record, and a xfs_alloc_check_irec function to detect
corruption.  Replace all the open-coded logic with calls to the new
helpers and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:01 -07:00
Darrick J. Wong
88accf1722 xfs: scrub should use ECHRNG to signal that the drain is needed
In the previous patch, we added jump labels to the intent drain code so
that regular filesystem operations need not pay the price of checking
for someone (scrub) waiting on intents to drain from some part of the
filesystem when that someone isn't running.

However, I observed that xfs/285 now spends a lot more time pushing the
AIL from the inode btree scrubber than it used to.  This is because the
inobt scrubber will try push the AIL to try to get logged inode cores
written to the filesystem when it sees a weird discrepancy between the
ondisk inode and the inobt records.  This AIL push is triggered when the
setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
return is initiated when the discrepancy is seen.

The solution to this performance slow down is to use a different result
code (ECHRNG) for scrub code to signal that it needs to wait for
deferred intent work items to drain out of some part of the filesystem.
When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
setup functions will activate the jump label.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:00 -07:00
Darrick J. Wong
466c525d6d xfs: minimize overhead of drain wakeups by using jump labels
To reduce the runtime overhead even further when online fsck isn't
running, use a static branch key to decide if we call wake_up on the
drain.  For compilers that support jump labels, the call to wake_up is
replaced by a nop sled when nobody is waiting for intents to drain.

From my initial microbenchmarking, every transition of the static key
between the on and off states takes about 22000ns to complete; this is
paid entirely by the xfs_scrub process.  When the static key is off
(which it should be when fsck isn't running), the nop sled adds an
overhead of approximately 0.36ns to runtime code.  The post-atomic
lockless waiter check adds about 0.03ns, which is basically free.

For the few compilers that don't support jump labels, runtime code pays
the cost of calling wake_up on an empty waitqueue, which was observed to
be about 30ns.  However, most architectures that have sufficient memory
and CPU capacity to run XFS also support jump labels, so this is not
much of a worry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:59 -07:00
Darrick J. Wong
3f64c718d0 xfs: clean up scrub context if scrub setup returns -EDEADLOCK
It has been a longstanding convention that online scrub and repair
functions can return -EDEADLOCK to signal that they weren't able to
obtain some necessary resource.  When this happens, the scrub framework
is supposed to release all resources attached to the scrub context, set
the TRY_HARDER flag in the scrub context flags, and try again.  In this
context, individual scrub functions are supposed to take all the
resources they (incorrectly) speculated were not necessary.

We're about to make it so that the functions that lock and wait for a
filesystem AG can also return EDEADLOCK to signal that we need to try
again with the drain waiters enabled.  Therefore, refactor
xfs_scrub_metadata to support this behavior for ->setup() functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:59 -07:00
Darrick J. Wong
d5c88131db xfs: allow queued AG intents to drain before scrubbing
When a writer thread executes a chain of log intent items, the AG header
buffer locks will cycle during a transaction roll to get from one intent
item to the next in a chain.  Although scrub takes all AG header buffer
locks, this isn't sufficient to guard against scrub checking an AG while
that writer thread is in the middle of finishing a chain because there's
no higher level locking primitive guarding allocation groups.

When there's a collision, cross-referencing between data structures
(e.g. rmapbt and refcountbt) yields false corruption events; if repair
is running, this results in incorrect repairs, which is catastrophic.

Fix this by adding to the perag structure the count of active intents
and make scrub wait until it has both AG header buffer locks and the
intent counter reaches zero.

One quirk of the drain code is that deferred bmap updates also bump and
drop the intent counter.  A fundamental decision made during the design
phase of the reverse mapping feature is that updates to the rmapbt
records are always made by the same code that updates the primary
metadata.  In other words, callers of bmapi functions expect that the
bmapi functions will queue deferred rmap updates.

Some parts of the reflink code queue deferred refcount (CUI) and bmap
(BUI) updates in the same head transaction, but the deferred work
manager completely finishes the CUI before the BUI work is started.  As
a result, the CUI drops the intent count long before the deferred rmap
(RUI) update even has a chance to bump the intent count.  The only way
to keep the intent count elevated between the CUI and RUI is for the BUI
to bump the counter until the RUI has been created.

A second quirk of the intent drain code is that deferred work items must
increment the intent counter as soon as the work item is added to the
transaction.  When a BUI completes and queues an RUI, the RUI must
increment the counter before the BUI decrements it.  The only way to
accomplish this is to require that the counter be bumped as soon as the
deferred work item is created in memory.

In the next patches we'll improve on this facility, but this patch
provides the basic functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:58 -07:00
Darrick J. Wong
9014890304 xfs: add a tracepoint to report incorrect extent refcounts
Add a new tracepoint so that I can see exactly what and where we failed
the refcount check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:58 -07:00
Darrick J. Wong
ecc73f8a58 xfs: update copyright years for scrub/ files
Update the copyright years in the scrub/ source code files.  This isn't
required, but it's helpful to remind myself just how long it's taken to
develop this feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:57 -07:00
Darrick J. Wong
739a2fe042 xfs: fix author and spdx headers on scrub/ files
Fix the spdx tags to match current practice, and update the author
contact information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:56 -07:00
Darrick J. Wong
b2ccab3199 xfs: pass per-ag references to xfs_free_extent
Pass a reference to the per-AG structure to xfs_free_extent.  Most
callers already have one, so we can eliminate unnecessary lookups.  The
one exception to this is the EFI code, which the next patch will fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:53 -07:00
Dave Chinner
5f36b2ce79 xfs: introduce xfs_alloc_vextent_exact_bno()
Two of the callers to xfs_alloc_vextent_this_ag() actually want
exact block number allocation, not anywhere-in-ag allocation. Split
this out from _this_ag() as a first class citizen so no external
extent allocation code needs to care about args->type anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner
74c36a8689 xfs: use xfs_alloc_vextent_this_ag() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
callers, too, so that callers with active references don't need
to do new lookups just for an allocation in a context that already
has a perag reference.

The only remaining caller that does single AG allocation through
xfs_alloc_vextent() is xfs_bmap_btalloc() with
XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
it can be converted cleanly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner
7ac2ff8bb3 xfs: perags need atomic operational state
We currently don't have any flags or operational state in the
xfs_perag except for the pagf_init and pagi_init flags. And the
agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
flags, too.

For controlling per-ag operations, we are going to need some atomic
state flags. Hence add an opstate field similar to what we already
have in the mount and log, and convert all these state flags across
to atomic bit operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner
bab8b79518 xfs: inobt can use perags in many more places than it does
Lots of code in the inobt infrastructure is passed both xfs_mount
and perags. We only need perags for the per-ag inode allocation
code, so reduce the duplication by passing only the perags as the
primary object.

This ends up reducing the code size by a bit:

	   text    data     bss     dec     hex filename
orig	1138878  323979     548 1463405  16546d (TOTALS)
patched	1138709  323979     548 1463236  1653c4 (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner
498f0adbcd xfs: convert xfs_imap() to take a perag
Callers have referenced perags but they don't pass it into
xfs_imap() so it takes it's own reference. Fix that so we can change
inode allocation over to using active references.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner
c4d5660afb xfs: active perag reference counting
We need to be able to dynamically remove instantiated AGs from
memory safely, either for shrinking the filesystem or paging AG
state in and out of memory (e.g. supporting millions of AGs). This
means we need to be able to safely exclude operations from accessing
perags while dynamic removal is in progress.

To do this, introduce the concept of active and passive references.
Active references are required for high level operations that make
use of an AG for a given operation (e.g. allocation) and pin the
perag in memory for the duration of the operation that is operating
on the perag (e.g. transaction scope). This means we can fail to get
an active reference to an AG, hence callers of the new active
reference API must be able to handle lookup failure gracefully.

Passive references are used in low level code, where we might need
to access the perag structure for the purposes of completing high
level operations. For example, buffers need to use passive
references because:
- we need to be able to do metadata IO during operations like grow
  and shrink transactions where high level active references to the
  AG have already been blocked
- buffers need to pin the perag until they are reclaimed from
  memory, something that high level code has no direct control over.
- unused cached buffers should not prevent a shrink from being
  started.

Hence we have active references that will form exclusion barriers
for operations to be performed on an AG, and passive references that
will prevent reclaim of the perag until all objects with passive
references have been reclaimed themselves.

This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API
for active AG reference functionality. We also need to convert the
for_each_perag*() iterators to use active references, which will
start the process of converting high level code over to using active
references. Conversion of non-iterator based code to active
references will be done in followup patches.

Note that the implementation using reference counting is really just
a development vehicle for the API to ensure we don't have any leaks
in the callers. Once we need to remove perag structures from memory
dyanmically, we will need a much more robust per-ag state transition
mechanism for preventing new references from being taken while we
wait for existing references to drain before removal from memory can
occur....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:42 +11:00
Darrick J. Wong
f36b954a1f xfs: check inode core when scrubbing metadata files
Metadata files (e.g. realtime bitmaps and quota files) do not show up in
the bulkstat output, which means that scrub-by-handle does not work;
they can only be checked through a specific scrub type.  Therefore, each
scrub type calls xchk_metadata_inode_forks to check the metadata for
whatever's in the file.

Unfortunately, that function doesn't actually check the inode record
itself.  Refactor the function a bit to make that happen.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 16:11:51 -08:00
Darrick J. Wong
bd5ab5f987 xfs: don't warn about files that are exactly s_maxbytes long
We can handle files that are exactly s_maxbytes bytes long; we just
can't handle anything larger than that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 16:11:51 -08:00
Darrick J. Wong
5eef46358f xfs: teach scrub to flag non-extents format cow forks
CoW forks only exist in memory, which means that they can only ever have
an incore extent tree.  Hence they must always be FMT_EXTENTS, so check
this when we're scrubbing them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:05 -08:00
Darrick J. Wong
3178553701 xfs: check that CoW fork extents are not shared
Ensure that extents in an inode's CoW fork are not marked as shared in
the refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:04 -08:00
Darrick J. Wong
f23c40443d xfs: check quota files for unwritten extents
Teach scrub to flag quota files containing unwritten extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:04 -08:00
Darrick J. Wong
830ffa09fb xfs: block map scrub should handle incore delalloc reservations
Enhance the block map scrubber to check delayed allocation reservations.
Though there are no physical space allocations to check, we do need to
make sure that the range of file offsets being mapped are correct, and
to bump the lastoff cursor so that key order checking works correctly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:04 -08:00
Darrick J. Wong
6a5777865e xfs: teach scrub to check for adjacent bmaps when rmap larger than bmap
When scrub is checking file fork mappings against rmap records and
the rmap record starts before or ends after the bmap record, check the
adjacent bmap records to make sure that they're adjacent to the one
we're checking.  This helps us to detect cases where the rmaps cover
territory that the bmaps do not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:04 -08:00
Darrick J. Wong
033985b6fe xfs: fix perag loop in xchk_bmap_check_rmaps
sparse complains that we can return an uninitialized error from this
function and that pag could be uninitialized.  We know that there are no
zero-AG filesystems and hence we had to call xchk_bmap_check_ag_rmaps at
least once, so this is not actually possible, but I'm too worn out from
automated complaints from unsophisticated AIs so let's just fix this and
move on to more interesting problems, eh?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:04 -08:00
Darrick J. Wong
e74331d6fa xfs: online checking of the free rt extent count
Teach the summary count checker to count the number of free realtime
extents and compare that to the superblock copy.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:03 -08:00
Darrick J. Wong
11f97e6845 xfs: skip fscounters comparisons when the scan is incomplete
If any part of the per-AG summary counter scan loop aborts without
collecting all of the data we need, the scrubber's observation data will
be invalid.  Set the incomplete flag so that we abort the scrub without
reporting false corruptions.  Document the data dependency here too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:03 -08:00
Darrick J. Wong
93b0c58ed0 xfs: don't return -EFSCORRUPTED from repair when resources cannot be grabbed
If we tried to repair something but the repair failed with -EDEADLOCK,
that means that the repair function couldn't grab some resource it
needed and wants us to try again.  If we try again (with TRY_HARDER) but
still can't get all the resources we need, the repair fails and errors
remain on the filesystem.

Right now, repair returns the -EDEADLOCK to the caller as -EFSCORRUPTED,
which results in XFS_SCRUB_OFLAG_CORRUPT being passed out to userspace.
This is not correct because repair has not determined that anything is
corrupt.  If the repair had been invoked on an object that could be
optimized but wasn't corrupt (OFLAG_PREEN), the inability to grab
resources will be reported to userspace as corrupt metadata, and users
will be unnecessarily alarmed that their suboptimal metadata turned into
a corruption.

Fix this by returning zero so that the results of the actual scrub will
be copied back out to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:03 -08:00
Darrick J. Wong
6bf2f87915 xfs: don't retry repairs harder when EAGAIN is returned
Repair functions will not return EAGAIN -- if they were not able to
obtain resources, they should return EDEADLOCK (like the rest of online
fsck) to signal that we need to grab all the resources and try again.
Hence we don't need to deal with this case except as a debugging
assertion.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:02 -08:00
Darrick J. Wong
0a713bd41e xfs: fix return code when fatal signal encountered during dquot scrub
If the scrub process is sent a fatal signal while we're checking dquots,
the predicate for this will set the error code to -EINTR.  Don't then
squash that into -ECANCELED, because the wrong errno turns up in the
trace output.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:02 -08:00
Darrick J. Wong
a7a0f9a550 xfs: return EINTR when a fatal signal terminates scrub
If the program calling online fsck is terminated with a fatal signal,
bail out to userspace by returning EINTR, not EAGAIN.  EAGAIN is used by
scrubbers to indicate that we should try again with more resources
locked, and not to indicate that the operation was cancelled.  The
miswiring is mostly harmless, but it shows up in the trace data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:02 -08:00
Darrick J. Wong
306195f355 xfs: pivot online scrub away from kmem.[ch]
Convert all the online scrub code to use the Linux slab allocator
functions directly instead of going through the kmem wrappers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:02 -08:00
Darrick J. Wong
fcd2a43488 xfs: initialize the check_owner object fully
Initialize the check_owner list head so that we don't corrupt the list.
Reduce the scope of the object pointer.

Fixes: 858333dcf0 ("xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:02 -08:00
Darrick J. Wong
48ff40458f xfs: standardize GFP flags usage in online scrub
Memory allocation usage is the same throughout online fsck -- we want
kernel memory, we have to be able to back out if we can't allocate
memory, and we don't want to spray dmesg with memory allocation failure
reports.  Standardize the GFP flag usage and document these requirements.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:01 -08:00
Darrick J. Wong
b255fab0f8 xfs: make AGFL repair function avoid crosslinked blocks
Teach the AGFL repair function to check each block of the proposed AGFL
against the rmap btree.  If the rmapbt finds any mappings that are not
OWN_AG, strike that block from the list.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:01 -08:00
Darrick J. Wong
3e59c0103e xfs: log the AGI/AGF buffers when rolling transactions during an AG repair
Currently, the only way to lock an allocation group is to hold the AGI
and AGF buffers.  If a repair needs to roll the transaction while
repairing some AG metadata, it maintains that lock by holding the two
buffers across the transaction roll and joins them afterwards.

However, repair is not like other parts of XFS that employ the bhold -
roll - bjoin sequence because it's possible that the AGI or AGF buffers
are not actually dirty before the roll.  This presents two problems --
First, we need to redirty those buffers to keep them moving along in the
log to avoid pinning the log tail.  Second, a clean buffer log item can
detach from the buffer.  If this happens, the buffer type state is
discarded along with the bli and must be reattached before the next time
the buffer is logged.   If it is not, the logging code will complain and
log recovery will not work properly.

An earlier version of this patch tried to fix the second problem by
re-setting the buffer type in the bli after joining the buffer to the
new transaction, but that looked weird and didn't solve the first
problem.  Instead, solve both problems by logging the buffer before
rolling the transaction.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:01 -08:00
Darrick J. Wong
be1317fdb8 xfs: don't track the AGFL buffer in the scrub AG context
While scrubbing an allocation group, we don't need to hold the AGFL
buffer as part of the scrub context.  All that is necessary to lock an
AG is to hold the AGI and AGF buffers, so fix all the existing users of
the AGFL buffer to grab them only when necessary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:01 -08:00
Darrick J. Wong
9a48b4a6fd xfs: fully initialize xfs_da_args in xchk_directory_blocks
While running the online fsck test suite, I noticed the following
assertion in the kernel log (edited for brevity):

XFS: Assertion failed: 0, file: fs/xfs/xfs_health.c, line: 571
------------[ cut here ]------------
WARNING: CPU: 3 PID: 11667 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 3 PID: 11667 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
 <TASK>
 xfs_dir2_isblock+0xcc/0xe0
 xchk_directory_blocks+0xc7/0x420
 xchk_directory+0x53/0xb0
 xfs_scrub_metadata+0x2b6/0x6b0
 xfs_scrubv_metadata+0x35e/0x4d0
 xfs_ioc_scrubv_metadata+0x111/0x160
 xfs_file_ioctl+0x4ec/0xef0
 __x64_sys_ioctl+0x82/0xa0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

This assertion triggers in xfs_dirattr_mark_sick when the caller passes
in a whichfork value that is neither of XFS_{DATA,ATTR}_FORK.  The cause
of this is that xchk_directory_blocks only partially initializes the
xfs_da_args structure that is passed to xfs_dir2_isblock.  If the data
fork is not correct, the XFS_IS_CORRUPT clause will trigger.  My
development branch reports this failure to the health monitoring
subsystem, which accesses the uninitialized args->whichfork field,
leading the the assertion tripping.  We really shouldn't be passing
random stack contents around, so the solution here is to force the
compiler to zero-initialize the struct.

Found by fuzzing u3.bmx[0].blockcount = middlebit on xfs/1554.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-16 15:25:01 -08:00
Darrick J. Wong
f62ac3e0ac xfs: check record domain when accessing refcount records
Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.

The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31 08:58:21 -07:00
Darrick J. Wong
f492135df0 xfs: refactor domain and refcount checking
Create a helper function to ensure that CoW staging extent records have
a single refcount and that shared extent records have more than 1
refcount.  We'll put this to more use in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31 08:58:21 -07:00
Darrick J. Wong
9a50ee4f8d xfs: track cow/shared record domains explicitly in xfs_refcount_irec
Just prior to committing the reflink code into upstream, the xfs
maintainer at the time requested that I find a way to shard the refcount
records into two domains -- one for records tracking shared extents, and
a second for tracking CoW staging extents.  The idea here was to
minimize mount time CoW reclamation by pushing all the CoW records to
the right edge of the keyspace, and it was accomplished by setting the
upper bit in rc_startblock.  We don't allow AGs to have more than 2^31
blocks, so the bit was free.

Unfortunately, this was a very late addition to the codebase, so most of
the refcount record processing code still treats rc_startblock as a u32
and pays no attention to whether or not the upper bit (the cow flag) is
set.  This is a weakness is theoretically exploitable, since we're not
fully validating the incoming metadata records.

Fuzzing demonstrates practical exploits of this weakness.  If the cow
flag of a node block key record is corrupted, a lookup operation can go
to the wrong record block and start returning records from the wrong
cow/shared domain.  This causes the math to go all wrong (since cow
domain is still implicit in the upper bit of rc_startblock) and we can
crash the kernel by tricking xfs into jumping into a nonexistent AG and
tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.

To fix this, start tracking the domain as an explicit part of struct
xfs_refcount_irec, adjust all refcount functions to check the domain
of a returned record, and alter the function definitions to accept them
where necessary.

Found by fuzzing keys[2].cowflag = add in xfs/464.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31 08:58:21 -07:00
Darrick J. Wong
5a8c345ca8 xfs: refactor refcount record usage in xchk_refcountbt_rec
Consolidate the open-coded xfs_refcount_irec fields into an actual
struct and use the existing _btrec_to_irec to decode the ondisk record.
This will reduce code churn in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31 08:58:21 -07:00
Darrick J. Wong
b65e08f83b xfs: create a predicate to verify per-AG extents
Create a predicate function to verify that a given agbno/blockcount pair
fit entirely within a single allocation group and don't suffer
mathematical overflows.  Refactor the existng open-coded logic; we're
going to add more calls to this function in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31 08:58:20 -07:00
Linus Torvalds
60bb8154d1 xfs: changes for 6.1-rc1
This update contains:
 - fixes for filesystem shutdown procedure during a DAX memory
   failure notification
 - bug fixes
 - logic cleanups
 - log message cleanups
 - updates to use vfs{g,u}id_t helpers where appropriate
 
 Signed-off-by: Dave Chinner <david@fromorbit.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmNEjOoUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h2UNg/+Ib1V1XSL6g+sidIPlm5/J3U2sWTh
 gRPgd5f5U25T50TEuor93RcOBMXTEww5tsRkQLmekzzgRiCcXu24VyzfCsbx9u4o
 JrWt7po+NXPtJW8VedNdHVlOiMBQsf1u3ZY54nmv63EW69J/BEK9jTUeGy3rK0DY
 +A/wVvVDipp8VZZ5zh/SwQh1pp3CSSElwuVdlcRl5cJiKiD2vg+Z/NvHnrp+1u+9
 F6rOW6RFjU9PqfNGhx9RjC+pYVlmVrDUwHj680ReDsdgDOWzbnW05ft74JpRdGfC
 tEy9vxjQ8/3/7vTHspXCI4RIn9LrBjNke2eRgsdRqVcjHa2KQ+hUvG6v8sxV+Wms
 7N1oIS2IKtLhUGOZyCwgUCLKFQ1blfkF/XyKx9DFumsnMLzmG2ret9DuOBPccr+c
 o7e1ArIlgnJpre2nvPhF5EcM+dArVMuZGPG03vL1iS7A79Ak0/e8Jivee1ScdupP
 yNJZzBYYbCkpIUVD9wzNeziSwwSXgW9j1nr6HEOLpBAa0/v0OS5iZfjWVvSLK24Q
 OcstRkY7cV5LkU2weIK6UQ6KZF4lJxUQ9j50OeeTuy3fou3utUt0vgVEGIiZ/edp
 H0gnQ15wT/RFjOL3i9zMhh24Yy25E/df1ugYmrOwszOQxU21KuZhvzhhehD5pC+r
 3MyQQ/e99VoDzEw=
 =2o9v
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.1-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Dave Chinner:
 "There are relatively few updates this cycle; half the cycle was eaten
  by a grue, the other half was eaten by a tricky data corruption issue
  that I still haven't entirely solved.

  Hence there's no major changes in this cycle and it's largely just
  minor cleanups and small bug fixes:

   - fixes for filesystem shutdown procedure during a DAX memory failure
     notification

   - bug fixes

   - logic cleanups

   - log message cleanups

   - updates to use vfs{g,u}id_t helpers where appropriate"

* tag 'xfs-6.1-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: on memory failure, only shut down fs after scanning all mappings
  xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx
  xfs: trim the mapp array accordingly in xfs_da_grow_inode_int
  xfs: do not need to check return value of xlog_kvmalloc()
  xfs: port to vfs{g,u}id_t and associated helpers
  xfs: remove xfs_setattr_time() declaration
  xfs: Remove the unneeded result variable
  xfs: missing space in xfs trace log
  xfs: simplify if-else condition in xfs_reflink_trim_around_shared
  xfs: simplify if-else condition in xfs_validate_new_dalign
  xfs: replace unnecessary seq_printf with seq_puts
  xfs: clean up "%Ld/%Lu" which doesn't meet C standard
  xfs: remove redundant else for clean code
  xfs: remove the redundant word in comment
2022-10-10 20:32:10 -07:00
Shida Zhang
c098576f5f xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx
xfs_dir2_isleaf is used to see if the directory is a single-leaf
form directory instead, as commented right above the function.

Besides getting rid of the broken comment, we rearrange the logic by
converting everything over to standard formatting and conventions,
at the same time, to make it easier to understand and self documenting.

Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-10-04 16:39:58 +11:00
Al Viro
25885a35a7 Change calling conventions for filldir_t
filldir_t instances (directory iterators callbacks) used to return 0 for
"OK, keep going" or -E... for "stop".  Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at emit_dir() and friends).

So let's just return bool ("should we keep going?") - it's less confusing
that way.  The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
	do something for everything in directory, until we run into problem
and
	find an entry in directory and do something to it.

The former tended to use 0/-E... conventions - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value did they get.

"true means stop" would be more natural for the second group; "true
means keep going" - for the first one.  I tried both variants and
the things like
	if allocation failed
		something = -ENOMEM;
		return true;
just looked unnatural and asking for trouble.

[folded suggestion from Matthew Wilcox <willy@infradead.org>]
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-17 17:25:04 -04:00
sunliming
1a53d3d426 xfs: fix for variable set but not used warning
Fix below kernel warning:

fs/xfs/scrub/repair.c:539:19: warning: variable 'agno' set but not used [-Wunused-but-set-variable]

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: sunliming <sunliming@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-20 16:40:39 -07:00
Darrick J. Wong
6d200bdc01 xfs: make attr forks permanent
This series fixes a use-after-free bug that syzbot uncovered.  The UAF
 itself is a result of a race condition between getxattr and removexattr
 because callers to getxattr do not necessarily take any sort of locks
 before calling into the filesystem.
 
 Although the race condition itself can be fixed through clever use of a
 memory barrier, further consideration of the use cases of extended
 attributes shows that most files always have at least one attribute, so
 we might as well make them permanent.
 
 v2: Minor tweaks suggested by Dave, and convert some more macros to
 helper functions.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmLQRsAACgkQ+H93GTRK
 tOseOw/+JdSH2MU2xx+XoE5M/fStzGpw0UsoOqDo8kUPKDt3z12NwuczlL4OAiuw
 XFrN/1IAnxBsTD9YxFYbqoCPNi/VR81ajfWV7JqD2B1Joj0aATsxGDdNUYJnxCdU
 HMlMqP5o77XvArwkxFbgxYi7UGdAeNwXxqUJcJ8FopQo2lb8+SG6XzpLgGKnyrKT
 FRNKXNObplhtzOs/Bv8qYAxJulmmjkktFJXhK2OAUJlIDjFwFY9Wo2T4QuOVe9w+
 NXFOYyu0BqWLpDZQkYKWoCnF0GNsUavS8DP6zZYW3qJ6mX/f1jmtfbOLAkHNhlh8
 uHy/3k3SeQhKztTqM28sPioe6mdj2xocorDCCVBgGXgWxVF6aWeM/PS4tMTWN/Bg
 TWd1egERpeVC0Ymkm0LTCIDNuLqxsknK1G6sxXhwrQ8cw/70Gl08ePwgdyZ6hpD9
 Q61iJQofcI7MJX189a2VSbbHRzFnZIA+uVK4oyhmdEkQVKTHgmzwHVP660oAv9bD
 Y5hqkWEoyKTaaCsOWRAPVXG3k03lq+TNcaGggZgwFx11Gw4oMEx5hMUztoP54xX4
 aEXb1HWcCmMxy8llnFY/82baW98ucwl8FwWF1qhNIPT40mYx9IobDmvkCtNrAoOC
 41U7O8CxxPlt7XKxoRhafQOAhzp0ZzuhCdbaFIUENV7pTAJtq5Q=
 =W3Ad
 -----END PGP SIGNATURE-----

Merge tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.20-mergeB

xfs: make attr forks permanent

This series fixes a use-after-free bug that syzbot uncovered.  The UAF
itself is a result of a race condition between getxattr and removexattr
because callers to getxattr do not necessarily take any sort of locks
before calling into the filesystem.

Although the race condition itself can be fixed through clever use of a
memory barrier, further consideration of the use cases of extended
attributes shows that most files always have at least one attribute, so
we might as well make them permanent.

v2: Minor tweaks suggested by Dave, and convert some more macros to
helper functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: replace inode fork size macros with functions
  xfs: replace XFS_IFORK_Q with a proper predicate function
  xfs: use XFS_IFORK_Q to determine the presence of an xattr fork
  xfs: make inode attribute forks a permanent part of struct xfs_inode
  xfs: convert XFS_IFORK_PTR to a static inline helper
2022-07-14 09:46:37 -07:00
Darrick J. Wong
35c5a09f53 xfs: lockless buffer cache lookups
Current work to merge the XFS inode life cycle with the VFS inode
 life cycle is finding some interesting issues. If we have a path
 that hits buffer trylocks fairly hard (e.g. a non-blocking
 background inode freeing function), we end up hitting massive
 contention on the buffer cache hash locks:
 
 -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
    - 92.67% xfs_inodegc_worker
       - 92.13% xfs_inode_unlink
          - 91.52% xfs_inactive_ifree
             - 85.63% xfs_read_agi
                - 85.61% xfs_trans_read_buf_map
                   - 85.59% xfs_buf_read_map
                      - xfs_buf_get_map
                         - 85.55% xfs_buf_find
                            - 72.87% _raw_spin_lock
                               - do_raw_spin_lock
                                    71.86% __pv_queued_spin_lock_slowpath
                            - 8.74% xfs_buf_rele
                               - 7.88% _raw_spin_lock
                                  - 7.88% do_raw_spin_lock
                                       7.63% __pv_queued_spin_lock_slowpath
                            - 1.70% xfs_buf_trylock
                               - 1.68% down_trylock
                                  - 1.41% _raw_spin_lock_irqsave
                                     - 1.39% do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
                            - 0.76% _raw_spin_unlock
                                 0.75% do_raw_spin_unlock
 
 This is basically hammering the pag->pag_buf_lock from lots of CPUs
 doing trylocks at the same time. Most of the buffer trylock
 operations ultimately fail after we've done the lookup, so we're
 really hammering the buf hash lock whilst making no progress.
 
 We can also see significant spinlock traffic on the same lock just
 under normal operation when lots of tasks are accessing metadata
 from the same AG, so let's avoid all this by creating a lookup fast
 path which leverages the rhashtable's ability to do RCU protected
 lookups.
 
 Signed-off-by: Dave Chinner <dchinner@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLPvngUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h0gTw/9EK1gj31QpurgGziYsL0JFI1Uq33Z
 2rB/yTJXzxe+J7cE6B2RYuSj4EK7YI1aZXTRC5De5A8TqbFaNztrigqxNNpm3jh0
 T0AbVQoY7XzjbvMHQ0VFPBcJGcVbQypA+rabSlLHfU9zfN3t4EnM+BmuaFqygGZj
 1A6ZjkVChmEprGjd16846sgvMqdLa4yJ4/9Jsu5WlI+vPZj9gJX/7Mjc580Zljb5
 gg9Cf8ziW78gpHzj3ufSWv2jBcWcMdyHpyCF/fNceROUaxmZKsMUDKcsia9TyQhB
 yJXxw9Rnb3F23VJSYMJIcf4+RTd7iqd88GhEEFYxj41gI/jQxqRovlS1ljk2l20R
 3i4TUs7yF24sLLQdL8YkJiGCOEvRqPPcNd4xfGwdioRwXwoEqB7L/vYpUheQ8qSZ
 Tnn4vmGm+GQHNnQNhxiF8KkAd9gwcUslN36ZJn+h3zjvfgAFQFChsk+3CoFoxsth
 BpbFT3lo4Hc6xJBDCp7Z3Gxurxq5fQ2CGYHxCBT4feNkZS5YOLd/Os2hIZVId8XA
 jp66ZyELd8zj+CxMp4ZyYqsFETIao13B8KPEqvI2/obEDE6p/++olP8aqKIP1C8d
 ASOjxP8KqWEHLe3or4W3m2WSDa5fp1b3G/mjS7r/jDKqIuTMZXYw4CJx1x3rTr4F
 nXAnlWoGVq7HjWc=
 =8UYp
 -----END PGP SIGNATURE-----

Merge tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB

xfs: lockless buffer cache lookups

Current work to merge the XFS inode life cycle with the VFS inode
life cycle is finding some interesting issues. If we have a path
that hits buffer trylocks fairly hard (e.g. a non-blocking
background inode freeing function), we end up hitting massive
contention on the buffer cache hash locks:

-   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
   - 92.67% xfs_inodegc_worker
      - 92.13% xfs_inode_unlink
         - 91.52% xfs_inactive_ifree
            - 85.63% xfs_read_agi
               - 85.61% xfs_trans_read_buf_map
                  - 85.59% xfs_buf_read_map
                     - xfs_buf_get_map
                        - 85.55% xfs_buf_find
                           - 72.87% _raw_spin_lock
                              - do_raw_spin_lock
                                   71.86% __pv_queued_spin_lock_slowpath
                           - 8.74% xfs_buf_rele
                              - 7.88% _raw_spin_lock
                                 - 7.88% do_raw_spin_lock
                                      7.63% __pv_queued_spin_lock_slowpath
                           - 1.70% xfs_buf_trylock
                              - 1.68% down_trylock
                                 - 1.41% _raw_spin_lock_irqsave
                                    - 1.39% do_raw_spin_lock
                                         __pv_queued_spin_lock_slowpath
                           - 0.76% _raw_spin_unlock
                                0.75% do_raw_spin_unlock

This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.

We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by creating a lookup fast
path which leverages the rhashtable's ability to do RCU protected
lookups.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: lockless buffer lookup
  xfs: remove a superflous hash lookup when inserting new buffers
  xfs: reduce the number of atomic when locking a buffer after lookup
  xfs: merge xfs_buf_find() and xfs_buf_get_map()
  xfs: break up xfs_buf_find() into individual pieces
  xfs: rework xfs_buf_incore() API
2022-07-14 09:22:14 -07:00
Darrick J. Wong
c01147d929 xfs: replace inode fork size macros with functions
Replace the shouty macros here with typechecked helper functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-12 11:17:27 -07:00
Darrick J. Wong
932b42c66c xfs: replace XFS_IFORK_Q with a proper predicate function
Replace this shouty macro with a real C function that has a more
descriptive name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-12 11:17:27 -07:00
Darrick J. Wong
732436ef91 xfs: convert XFS_IFORK_PTR to a static inline helper
We're about to make this logic do a bit more, so convert the macro to a
static inline function for better typechecking and fewer shouty macros.
No functional changes here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-09 15:17:21 -07:00
Dave Chinner
85c73bf726 xfs: rework xfs_buf_incore() API
Make it consistent with the other buffer APIs to return a error and
the buffer is placed in a parameter.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 22:05:18 +10:00
Dave Chinner
36029dee38 xfs: make is_log_ag() a first class helper
We check if an ag contains the log in many places, so make this
a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and
renaming it xfs_ag_contains_log(). The convert all the places that
check if the AG contains the log to use this helper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:21 +10:00
Dave Chinner
3829c9a10f xfs: replace xfs_ag_block_count() with perag accesses
Many of the places that call xfs_ag_block_count() have a perag
available. These places can just read pag->block_count directly
instead of calculating the AG block count from first principles.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:17 +10:00
Dave Chinner
2d6ca8321c xfs: Pre-calculate per-AG agino geometry
There is a lot of overhead in functions like xfs_verify_agino() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.

In the case of xfs_verify_agino(), we now always have a perag
context handy, so we can store the minimum and maximum agino values
in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.

xfs_verify_agino_or_null() gets the same perag treatment.

xfs_agino_range() is moved to xfs_ag.c as it's not really a type
function, and it's use is largely restricted as the first and last
aginos can be grabbed straight from the perag in most cases.

Note that we leave the original xfs_verify_agino in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agino() to indicate it takes both an agno
and an agino to differentiate it from new function.

$ size --totals fs/xfs/built-in.a
	   text    data     bss     dec     hex filename
before	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
after	1481937	 329588	    572	1812097	 1ba681	(TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:10 +10:00
Dave Chinner
0800169e3e xfs: Pre-calculate per-AG agbno geometry
There is a lot of overhead in functions like xfs_verify_agbno() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.

In the case of xfs_verify_agbno(), we now always have a perag
context handy, so we can store the AG length and the minimum valid
block in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.

Move xfs_ag_block_count() to xfs_ag.c because it's really a
per-ag function and not an XFS type function. We need a little
bit of rework that is specific to xfs_initialise_perag() to allow
growfs to calculate the new perag sizes before we've updated the
primary superblock during the grow (chicken/egg situation).

Note that we leave the original xfs_verify_agbno in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
and an agbno to differentiate it from new function.

Future commits will make similar changes for other per-ag geometry
validation functions.

Further:

$ size --totals fs/xfs/built-in.a
	   text    data     bss     dec     hex filename
before	1483006	 329588	    572	1813166	 1baaae	(TOTALS)
after	1482185	 329588	    572	1812345	 1ba779	(TOTALS)

This rework reduces the binary size by ~820 bytes, indicating
that much less work is being done to bounds check the agbno values
against on per-ag geometry information.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:02 +10:00
Dave Chinner
cec7bb7d58 xfs: pass perag to xfs_alloc_read_agfl
We have the perag in most places we call xfs_alloc_read_agfl, so
pass the perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:08:15 +10:00
Dave Chinner
8c392eb27f xfs: pass perag to xfs_alloc_put_freelist
It's available in all callers, so pass it in so that the perag can
be passed further down the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:08:08 +10:00
Dave Chinner
49f0d84ec1 xfs: pass perag to xfs_alloc_get_freelist
It's available in all callers, so pass it in so that the perag can
be passed further down the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:08:01 +10:00
Dave Chinner
08d3e84fee xfs: pass perag to xfs_alloc_read_agf()
xfs_alloc_read_agf() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Whilst modifying the xfs_reflink_find_shared() function definition,
declare it static and remove the extern declaration as it is an
internal function only these days.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:40 +10:00
Dave Chinner
99b13c7f0b xfs: pass perag to xfs_ialloc_read_agi()
xfs_ialloc_read_agi() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:24 +10:00
Darrick J. Wong
df5660cf63 xfs: implement per-mount warnings for scrub and shrink usage
Currently, we don't have a consistent story around logging when an
EXPERIMENTAL feature gets turned on at runtime -- online fsck and shrink
log a message once per day across all mounts, and the recently merged
LARP mode only ever does it once per insmod cycle or reboot.

Because EXPERIMENTAL tags are supposed to go away eventually, convert
the existing daily warnings into state flags that travel with the mount,
and warn once per mount.  Making this an opstate flag means that we'll
be able to capture the experimental usage in the ftrace output too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:31:34 +10:00
Allison Henderson
fd92000878 xfs: Set up infrastructure for log attribute replay
Currently attributes are modified directly across one or more
transactions. But they are not logged or replayed in the event of an
error. The goal of log attr replay is to enable logging and replaying
of attribute operations using the existing delayed operations
infrastructure.  This will later enable the attributes to become part of
larger multi part operations that also must first be recorded to the
log.  This is mostly of interest in the scheme of parent pointers which
would need to maintain an attribute containing parent inode information
any time an inode is moved, created, or removed.  Parent pointers would
then be of interest to any feature that would need to quickly derive an
inode path from the mount point. Online scrub, nfs lookups and fs grow
or shrink operations are all features that could take advantage of this.

This patch adds two new log item types for setting or removing
attributes as deferred operations.  The xfs_attri_log_item will log an
intent to set or remove an attribute.  The corresponding
xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
freed once the transaction is done.  Both log items use a generic
xfs_attr_log_format structure that contains the attribute name, value,
flags, inode, and an op_flag that indicates if the operations is a set
or remove.

[dchinner: added extra little bits needed for intent whiteouts]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-04 12:41:02 +10:00
Darrick J. Wong
5b7ca8b313 xfs: simplify xfs_rmap_lookup_le call sites
Most callers of xfs_rmap_lookup_le will retrieve the btree record
immediately if the lookup succeeds.  The overlapped version of this
function (xfs_rmap_lookup_le_range) will return the record if the lookup
succeeds, so make the regular version do it too.  Get rid of the useless
len argument, since it's not part of the lookup key.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-27 10:47:19 -07:00
Dave Chinner
a44a027a8b Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs-5.19-for-next
xfs: Large extent counters

The commit xfs: fix inode fork extent count overflow
(3f8a4f1d87) mentions that 10 billion
data fork extents should be possible to create. However the
corresponding on-disk field has a signed 32-bit type. Hence this
patchset extends the per-inode data fork extent counter to 64 bits
(out of which 48 bits are used to store the extent count).

Also, XFS has an attribute fork extent counter which is 16 bits
wide. A workload that,
1. Creates 1 million 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to insert 400,000 new 255-byte sized xattrs
   causes the xattr extent counter to overflow.

Dave tells me that there are instances where a single file has more
than 100 million hardlinks. With parent pointers being stored in
xattrs, we will overflow the signed 16-bits wide attribute extent
counter when large number of hardlinks are created. Hence this
patchset extends the on-disk field to 32-bits.

The following changes are made to accomplish this,
1. A 64-bit inode field is carved out of existing di_pad and
   di_flushiter fields to hold the 64-bit data fork extent counter.
2. The existing 32-bit inode data fork extent counter will be used to
   hold the attribute fork extent counter.
3. A new incompat superblock flag to prevent older kernels from mounting
   the filesystem.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21 16:46:17 +10:00
Darrick J. Wong
f34061f554 xfs: pass explicit mount pointer to rtalloc query functions
Pass an explicit xfs_mount pointer to the rtalloc query functions so
that they can support transactionless queries.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-12 06:49:41 +10:00
Chandan Babu R
dd95a6ce31 xfs: Introduce xfs_dfork_nextents() helper
This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
counter fields will add more logic to this helper.

This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
with calls to xfs_dfork_nextents().

No functional changes have been made.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2022-04-11 04:11:18 +00:00
Chandan Babu R
bb1d50494c xfs: Use xfs_extnum_t instead of basic data types
xfs_extnum_t is the type to use to declare variables which have values
obtained from xfs_dinode->di_[a]nextents. This commit replaces basic
types (e.g. uint32_t) with xfs_extnum_t for such variables.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2022-04-11 04:11:17 +00:00
Chandan Babu R
95f0b95e2b xfs: Define max extent length based on on-disk format definition
The maximum extent length depends on maximum block count that can be stored in
a BMBT record. Hence this commit defines MAXEXTLEN based on
BMBT_BLOCKCOUNT_BITLEN.

While at it, the commit also renames MAXEXTLEN to XFS_MAX_BMBT_EXTLEN.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2022-04-11 04:11:17 +00:00
Gustavo A. R. Silva
5224f79096 treewide: Replace zero-length arrays with flexible-array members
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

This code was transformed with the help of Coccinelle:
(next-20220214$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch)

@@
identifier S, member, array;
type T1, T2;
@@

struct S {
  ...
  T1 member;
  T2 array[
- 0
  ];
};

UAPI and wireless changes were intentionally excluded from this patch
and will be sent out separately.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2022-02-17 07:00:39 -06:00
Darrick J. Wong
4a9bca8680 xfs: fix online fsck handling of v5 feature bits on secondary supers
While I was auditing the code in xfs_repair that adds feature bits to
existing V5 filesystems, I decided to have a look at how online fsck
handles feature bits, and I found a few problems:

1) ATTR2 is added to the primary super when an xattr is set to a file,
but that isn't consistently propagated to secondary supers.  This isn't
a corruption, merely a discrepancy that repair will fix if it ever has
to restore the primary from a secondary.  Hence, if we find a mismatch
on a secondary, this is a preen condition, not a corruption.

2) There are more compat and ro_compat features now than there used to
be, but we mask off the newer features from testing.  This means we
ignore inconsistencies in the INOBTCOUNT and BIGTIME features, which is
wrong.  Get rid of the masking and compare directly.

3) NEEDSREPAIR, when set on a secondary, is ignored by everyone.  Hence
a mismatch here should also be flagged for preening, and online repair
should clear the flag.  Right now we ignore it due to (2).

4) log_incompat features are ephemeral, since we can clear the feature
bit as soon as the log no longer contains live records for a particular
log feature.  As such, the only copy we care about is the one in the
primary super.  If we find any bits set in the secondary super, we
should flag that for preening, and clear the bits if the user elects to
repair it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-12 09:45:21 -08:00
Darrick J. Wong
7e937bb3cb xfs: warn about inodes with project id of -1
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
the kernel hasn't always validated FSSETXATTR correctly.  Flag this as
something for the sysadmin to check out.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06 10:43:30 -08:00
Darrick J. Wong
e5d1802c70 xfs: fix a bug in the online fsck directory leaf1 bestcount check
When xfs_scrub encounters a directory with a leaf1 block, it tries to
validate that the leaf1 block's bestcount (aka the best free count of
each directory data block) is the correct size.  Previously, this author
believed that comparing bestcount to the directory isize (since
directory data blocks are under isize, and leaf/bestfree blocks are
above it) was sufficient.

Unfortunately during testing of online repair, it was discovered that it
is possible to create a directory with a hole between the last directory
block and isize.  The directory code seems to handle this situation just
fine and xfs_repair doesn't complain, which effectively makes this quirk
part of the disk format.

Fix the check to work properly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
59d7fab2df xfs: fix quotaoff mutex usage now that we don't support disabling it
Prior to commit 40b52225e5 ("xfs: remove support for disabling quota
accounting on a mounted file system"), we used the quotaoff mutex to
protect dquot operations against quotaoff trying to pull down dquots as
part of disabling quota.

Now that we only support turning off quota enforcement, the quotaoff
mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
protect dquots, which means we can remove it from setqlimits and the
dquot scrub code.  While we're at it, fix the function that forces
quotacheck, since it should have been taking the quotaoff mutex.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
7cb3efb4cf xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount.  Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees.  This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong
6ca444cfd6 xfs: prepare xfs_btree_cur for dynamic cursor heights
Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a VLA.  Files with huge data forks
(and in the future, the realtime rmap btree) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create per-btree type cursor caches to conserve memory
for the more common case.

Note that a subsequent patch actually introduces dynamic cursor heights.
This one merely rearranges the structure to prepare for that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong
eae5db476f xfs: dynamically allocate btree scrub context structure
Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have.  This will
enable us to use memory more efficiently once we start adding very tall
btree types.  Right-size the lastkey array to match the number of *node*
levels in the tree so that we stop wasting space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong
d47fef9342 xfs: don't track firstrec/firstkey separately in xchk_btree
The btree scrubbing code checks that the records (or keys) that it finds
in a btree block are all in order by calling the btree cursor's
->recs_inorder function.  This of course makes no sense for the first
item in the block, so we switch that off with a separate variable in
struct xchk_btree.

Christoph helped me figure out that the variable is unnecessary, since
we just accessed bc_ptrs[level] and can compare that against zero.  Use
that, and save ourselves some memory space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong
94a14cfd3b xfs: fix incorrect decoding in xchk_btree_cur_fsbno
During review of subsequent patches, Dave and I noticed that this
function doesn't work quite right -- accessing cur->bc_ino depends on
the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
because block 0 of an AG is never part of a btree.

Note: This changes the btree scrubber tracepoints behavior -- if the
cursor has no buffer for a certain level, it will always report
NULLFSBLOCK.  It is assumed that anyone tracing the online fsck code
will also be tracing xchk_start/xchk_done or otherwise be aware of what
exactly is being scrubbed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:13 -07:00
Darrick J. Wong
1ba6fd34ca xfs: stricter btree height checking when scanning for btree roots
When we're scanning for btree roots to rebuild the AG headers, make sure
that the proposed tree does not exceed the maximum height for that btree
type (and not just XFS_BTREE_MAXLEVELS).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-14 09:19:32 -07:00
Darrick J. Wong
f4585e8234 xfs: stricter btree height checking when looking for errors
Since each btree type has its own precomputed maxlevels variable now,
use them instead of the generic XFS_BTREE_MAXLEVELS to check the level
of each per-AG btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-14 09:19:32 -07:00
Darrick J. Wong
510a28e195 xfs: don't allocate scrub contexts on the stack
Convert the on-stack scrub context, btree scrub context, and da btree
scrub context into a heap allocation so that we reduce stack usage and
gain the ability to handle tall btrees without issue.

Specifically, this saves us ~208 bytes for the dabtree scrub, ~464 bytes
for the btree scrub, and ~200 bytes for the main scrub context.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-10-14 09:19:32 -07:00
Darrick J. Wong
61e0d0cc51 xfs: fix perag structure refcounting error when scrub fails
The kernel test robot found the following bug when running xfs/355 to
scrub a bmap btree:

XFS: Assertion failed: !sa->pag, file: fs/xfs/scrub/common.c, line: 412
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 1415 Comm: xfs_scrub Not tainted 5.14.0-rc4-00021-g48c6615cc557 #1
Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
RIP: 0010:assfail+0x23/0x28 [xfs]
RSP: 0018:ffffc9000aacb890 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc9000aacbcc8 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc09e7dcd
RBP: ffffc9000aacbc80 R08: ffff8881fdf17d50 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
R13: ffff88820c7ed000 R14: 0000000000000001 R15: ffffc9000aacb980
FS:  00007f185b955700(0000) GS:ffff8881fdf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7f6ef43000 CR3: 000000020de38002 CR4: 00000000001706e0
Call Trace:
 xchk_ag_read_headers+0xda/0x100 [xfs]
 xchk_ag_init+0x15/0x40 [xfs]
 xchk_btree_check_block_owner+0x76/0x180 [xfs]
 xchk_btree_get_block+0xd0/0x140 [xfs]
 xchk_btree+0x32e/0x440 [xfs]
 xchk_bmap_btree+0xd4/0x140 [xfs]
 xchk_bmap+0x1eb/0x3c0 [xfs]
 xfs_scrub_metadata+0x227/0x4c0 [xfs]
 xfs_ioc_scrub_metadata+0x50/0xc0 [xfs]
 xfs_file_ioctl+0x90c/0xc40 [xfs]
 __x64_sys_ioctl+0x83/0xc0
 do_syscall_64+0x3b/0xc0

The unusual handling of errors while initializing struct xchk_ag is the
root cause here.  Since the beginning of xfs_scrub, the goal of
xchk_ag_read_headers has been to read all three AG header buffers and
attach them both to the xchk_ag structure and the scrub transaction.
Corruption errors on any of the three headers doesn't necessarily
trigger an immediate return to userspace, because xfs_scrub can also
tell us to /fix/ the problem.

In other words, it's possible for the xchk_ag init functions to return
an error code and a partially filled out structure so that scrub can use
however much information it managed to pull.  Before 5.15, it was
sufficient to cancel (or commit) the scrub transaction on the way out of
the scrub code to release the buffers.

Ccommit 48c6615cc5 added a reference to the perag structure to struct
xchk_ag.  Since perag structures are not attached to transactions like
buffers are, this adds the requirement that the perag ref be released
explicitly.  The scrub teardown function xchk_teardown was amended to do
this for the xchk_ag embedded in struct xfs_scrub.

Unfortunately, I forgot that certain parts of the scrub code probe
multiple AGs and therefore handle the initialization and cleanup on
their own.  Specifically, the bmbt scrubber will initialize it long
enough to cross-reference AG metadata for btree blocks and for the
extent mappings in the bmbt.

If one of the AG headers is corrupt, the init function returns with a
live perag structure reference and some of the AG header buffers.  If an
error occurs, the cross referencing will be noted as XCORRUPTion and
skipped, but the main scrub process will move on to the next record.
It is now necessary to release the perag reference before we try to
analyze something from a different AG, or else we'll trip over the
assertion noted above.

Fixes: 48c6615cc5 ("xfs: grab active perag ref when reading AG headers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-20 13:20:33 -07:00