.. SPDX-License-Identifier: GPL-2.0

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
structures and indexes on disk and the algorithms for iterating them is
adequate for supporting PB scale filesystems with billions of inodes; however,
it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so the more analysis there is to
do, the more likely it is that the cause will be lost in the noise. Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify whether it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes, it
is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding how things like cross linked block lists came about (e.g. sibling
pointers in a btree that end up with loops in them) is the key to understanding
what went wrong, but it is impossible to tell what order the blocks were linked
into each other or written to disk after the fact.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable. Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.

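To make the integrity check concrete, the following is a minimal userspace
sketch of a bitwise CRC32c using the reflected Castagnoli polynomial
(0x82F63B78). The name ``crc32c_sketch`` is hypothetical, and this is not how
the kernel computes the checksum: the kernel uses table driven and hardware
accelerated routines. It is only meant to show how little is involved in the
check itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32c over @len bytes of @buf. Slow but easy to audit: one bit
 * is processed per inner loop iteration. (Hypothetical helper name.)
 */
uint32_t
crc32c_sketch(
    const void      *buf,
    size_t          len)
{
    const uint8_t   *p = buf;
    uint32_t        crc = 0xFFFFFFFFu;      /* standard initial value */
    size_t          i;
    int             bit;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;                            /* standard final inversion */
}
```

Flipping any single bit of the input changes the result, which is the property
the verifiers rely on to catch single bit errors.
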
Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, that it is valid
and/or corrupted, and how long ago it was last modified. Knowing the owner
of the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at. The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, if a metadata object is still referenced by its owner, we can
determine whether it is supposed to be free space or still allocated by
comparing when the free space btree block that contains the block was last
written with when the metadata object itself was last written. If the free space
block is more recent than the object and the object's owner, then there is a
very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block has the Log Sequence
Number (LSN) of the most recent transaction that modified it written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem. Further, by use of
the LSN we can tell if the corrupted metadata all belonged to the same log
checkpoint and hence have some idea of how much modification occurred between
the first and last instance of corrupt metadata on disk and, further, how much
modification occurred between the corruption being written and when it was
detected.

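The free space example above comes down to ordering LSNs. The following
userspace sketch shows one way to express that comparison. It assumes that an
LSN packs a log cycle number into the upper 32 bits and a log block number into
the lower 32 bits; the type ``lsn_t`` and the helpers ``lsn_cycle``,
``lsn_block``, ``lsn_cmp`` and ``reference_is_suspect`` are hypothetical names
used for illustration only, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsn_t;      /* hypothetical stand-in for an on-disk LSN */

/* Assumed layout: upper 32 bits are the log cycle, lower 32 the block. */
uint32_t lsn_cycle(lsn_t lsn) { return (uint32_t)((uint64_t)lsn >> 32); }
uint32_t lsn_block(lsn_t lsn) { return (uint32_t)lsn; }

/* Returns <0, 0 or >0 if @a is older than, equal to or newer than @b. */
int
lsn_cmp(
    lsn_t   a,
    lsn_t   b)
{
    assert(a >= 0 && b >= 0);

    if (lsn_cycle(a) != lsn_cycle(b))
        return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
    if (lsn_block(a) != lsn_block(b))
        return lsn_block(a) < lsn_block(b) ? -1 : 1;
    return 0;
}

/*
 * True if the free space btree block was written after both the object
 * and the object's owner, i.e. the owner's reference is probably stale.
 */
bool
reference_is_suspect(
    lsn_t   freesp,
    lsn_t   object,
    lsn_t   owner)
{
    return lsn_cmp(freesp, object) > 0 && lsn_cmp(freesp, owner) > 0;
}
```

A result of true is only a hint for the person doing the analysis; it narrows
down where to look rather than proving which block is wrong.
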
Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

- immediately after a successful read from disk
- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

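The ordering on both sides can be sketched in userspace as follows. This is an
illustration under stated assumptions rather than the kernel implementation:
``struct foo_hdr``, ``FOO_MAGIC``, ``foo_cksum``, ``foo_write_prepare`` and
``foo_read_check`` are hypothetical, the header is cut down to four fields, the
checksum is assumed to cover the whole buffer with the CRC field treated as
zero, byte order conversion is omitted, and a bitwise CRC32c stands in for the
kernel's accelerated routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOO_MAGIC   0x464f4f31u     /* hypothetical magic number, "FOO1" */

struct foo_hdr {                    /* hypothetical header, host byte order */
    uint32_t    magic;
    uint32_t    crc;
    uint64_t    blkno;
    uint64_t    lsn;
};

/* Feed @len bytes into a running (not yet inverted) CRC32c value. */
uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC32c of the whole buffer with the 4-byte CRC field read as zero. */
uint32_t foo_cksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    const size_t off = offsetof(struct foo_hdr, crc);
    const uint32_t zero = 0;
    uint32_t crc = 0xFFFFFFFFu;

    assert(len >= sizeof(struct foo_hdr));

    crc = crc32c_update(crc, p, off);
    crc = crc32c_update(crc, &zero, sizeof(zero));
    crc = crc32c_update(crc, p + off + sizeof(zero),
                        len - off - sizeof(zero));
    return ~crc;
}

/* Write side: verify the contents, then stamp the LSN, then the CRC. */
bool foo_write_prepare(void *buf, size_t len, uint64_t lsn)
{
    struct foo_hdr *hdr = buf;

    if (hdr->magic != FOO_MAGIC)
        return false;
    hdr->lsn = lsn;
    hdr->crc = foo_cksum(buf, len);
    return true;
}

/* Read side: magic number first, then the CRC, then the location. */
bool foo_read_check(const void *buf, size_t len, uint64_t blkno)
{
    const struct foo_hdr *hdr = buf;

    if (hdr->magic != FOO_MAGIC)
        return false;
    if (hdr->crc != foo_cksum(buf, len))
        return false;
    return hdr->blkno == blkno;
}
```

Computing the CRC last on the write side matters: the LSN is part of the
checksummed data, so stamping it after the CRC would make every block fail
verification on the next read.
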
Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
            __be32  magic;          /* magic number */
            __be32  crc;            /* CRC, not logged */
            uuid_t  uuid;           /* filesystem identifier */
            __be64  owner;          /* parent object */
            __be64  blkno;          /* location on disk */
            __be64  lsn;            /* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
  number for location. The two of these combined provide the same
  information as @owner and @blkno in the above structure, but using 8
  bytes less space on disk.

- directory/attribute node blocks have a 16 bit magic number, and the
  header that contains the magic number has other information in it as
  well. Hence the additional metadata headers change the overall format
  of the metadata.

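The space saving in the first case can be checked with a small userspace
sketch. Both layouts below are hypothetical illustrations of the fields named
above, not the real btree block headers, and fixed-width host integers stand in
for the big-endian on-disk types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct long_form_hdr {          /* hypothetical: 64 bit owner and blkno */
    uint32_t    magic;
    uint32_t    crc;
    uint8_t     uuid[16];
    uint64_t    owner;
    uint64_t    blkno;
    uint64_t    lsn;
};

struct short_form_hdr {         /* hypothetical: 32 bit owner and blkno */
    uint32_t    magic;
    uint32_t    crc;
    uint8_t     uuid[16];
    uint32_t    owner;          /* AG number */
    uint32_t    blkno;          /* block number for location */
    uint64_t    lsn;
};

/* Bytes saved per block by narrowing the owner and blkno fields. */
size_t
hdr_bytes_saved(void)
{
    /* Guard against unsigned wraparound in the subtraction below. */
    assert(sizeof(struct long_form_hdr) > sizeof(struct short_form_hdr));

    return sizeof(struct long_form_hdr) - sizeof(struct short_form_hdr);
}
```

Narrowing two 64 bit fields to 32 bits each is where the 8 bytes come from; the
magic number, CRC, UUID and LSN are unchanged between the two forms.
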
A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF         offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
            struct xfs_buf  *bp)
    {
            struct xfs_mount *mp = bp->b_mount;

            if ((xfs_sb_version_hascrc(&mp->m_sb) &&
                 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
                                   XFS_FOO_CRC_OFF)) ||
                !xfs_foo_verify(bp)) {
                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                    xfs_buf_ioerror(bp, EFSCORRUPTED);
            }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
            struct xfs_buf          *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_ondisk_hdr   *hdr = bp->b_addr;

            if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                    return false;

            /* the identifying fields only exist when CRCs are enabled */
            if (xfs_sb_version_hascrc(&mp->m_sb)) {
                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                            return false;
                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
                            return false;
                    if (hdr->owner == 0)
                            return false;
            }

            /* object specific verification checks here */

            return true;
    }

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
            struct xfs_buf          *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_ondisk_hdr   *hdr = bp->b_addr;

            if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                            return false;
                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
                            return false;
                    if (hdr->owner == 0)
                            return false;
            } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                    return false;

            /* object specific verification checks here */

            return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order to the read verifiers. A typical write verifier::

    static void
    xfs_foo_write_verify(
            struct xfs_buf  *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_buf_log_item *bip = bp->b_fspriv;

            if (!xfs_foo_verify(bp)) {
                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                    xfs_buf_ioerror(bp, EFSCORRUPTED);
                    return;
            }

            if (!xfs_sb_version_hascrc(&mp->m_sb))
                    return;

            if (bip) {
                    struct xfs_ondisk_hdr   *hdr = bp->b_addr;
                    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
            }
            xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that they contain inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.

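A per-buffer identification check of this kind might look like the following
userspace sketch. It assumes fixed-size objects packed back to back, each
starting with a 16 bit big-endian magic number. The function
``buf_has_magic_everywhere`` and its parameters are hypothetical; the value
0x494e (the ASCII bytes "IN") is the on-disk inode magic number, here given the
illustrative name ``DINODE_MAGIC_SKETCH``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DINODE_MAGIC_SKETCH 0x494eu     /* "IN", stored big-endian */

/*
 * Basic identification only: check that every object slot in the buffer
 * starts with the expected magic number. Per-object CRC and content
 * checks are deliberately left to the per-object code.
 */
bool
buf_has_magic_everywhere(
    const uint8_t   *buf,
    size_t          buflen,
    size_t          objsize)
{
    size_t          off;

    assert(objsize >= 2 && buflen % objsize == 0);

    for (off = 0; off < buflen; off += objsize) {
        uint16_t    magic = (uint16_t)((buf[off] << 8) | buf[off + 1]);

        if (magic != DINODE_MAGIC_SKETCH)
            return false;
    }
    return true;
}
```

Keeping the buffer-level check this shallow means a single damaged object can
still be reported individually when it is read, instead of failing the whole
buffer on a per-object CRC mismatch.
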
The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_inode_from_disk() when the inode
is first read out of the buffer and the struct xfs_inode is instantiated. The
inode is already extensively verified during writeback in xfs_iflush_int(), so
the only addition here is to add the LSN and CRC to the inode as it is copied
back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.