xfs: design documentation for online fsck, part 2 [v13.2 15/16]

This series updates the design documentation for online fsck to reflect
 the final design of the parent pointers feature as well as the
 implementation of online fsck for the new metadata.
 
 This has been running on the djcloud for months with no problems.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Merge tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  docs: describe xfs directory tree online fsck
  docs: update offline parent pointer repair strategy
  docs: update online directory and parent pointer repair sections
  docs: update the parent pointers documentation to the final version
Chandan Babu R 2024-04-16 12:50:19 +05:30
commit f910defd38


@ -4465,10 +4465,10 @@ reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.
XFS parent pointers include the dirent name and location of the entry within
the parent directory.
XFS parent pointers contain the information needed to identify the
corresponding directory entry in the parent directory.
In other words, child files use extended attributes to store pointers to
parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
@ -4476,8 +4476,6 @@ each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
**Note**: The ondisk format of parent pointers is not yet finalized.
+--------------------------------------------------------------------------+
| **Historical Sidebar**: |
+--------------------------------------------------------------------------+
@ -4519,8 +4517,58 @@ Both online and offline repair can use this strategy.
| Chandan increased the maximum extent counts of both data and attribute |
| forks, thereby ensuring that the extended attribute structure can grow |
| to handle the maximum hardlink count of any file. |
| |
| For this second effort, the ondisk parent pointer format as originally |
| proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. |
| The format was changed during development to eliminate the requirement |
| of repair tools needing to ensure that the ``dirent_pos`` field |
| always matched when reconstructing a directory. |
| |
| There were a few other ways to have solved that problem: |
| |
| 1. The field could be designated advisory, since the other three values |
| are sufficient to find the entry in the parent. |
| However, this makes indexed key lookup impossible while repairs are |
| ongoing. |
| |
| 2. We could allow creating directory entries at specified offsets, which |
| solves the referential integrity problem but runs the risk that |
| dirent creation will fail due to conflicts with the free space in the |
| directory. |
| |
| These conflicts could be resolved by appending the directory entry |
| and amending the xattr code to support updating an xattr key and |
| reindexing the dabtree, though this would have to be performed with |
| the parent directory still locked. |
| |
| 3. Same as above, but remove the old parent pointer entry and add a new |
| one atomically. |
| |
| 4. Change the ondisk xattr format to |
| ``(parent_inum, name) → (parent_gen)``, which would provide the attr |
| name uniqueness that we require, without forcing repair code to |
| update the dirent position. |
| Unfortunately, this requires changes to the xattr code to support |
| attr names as long as 263 bytes. |
| |
| 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → |
| (name, parent_gen)``. |
| If the hash is sufficiently resistant to collisions (e.g. sha256) |
| then this should provide the attr name uniqueness that we require. |
| Names shorter than 247 bytes could be stored directly. |
| |
| 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino, |
| parent_gen)``. This format doesn't require any of the complicated |
| nested name hashing of the previous suggestions. However, it was |
| discovered that multiple hardlinks to the same inode with the same |
| filename caused performance problems with hashed xattr lookups, so |
| the parent inumber is now xor'd into the hash index. |
| |
| In the end, it was decided that solution #6 was the most compact and the |
| most performant. A new hash function was designed for parent pointers. |
+--------------------------------------------------------------------------+
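The effect of xor'ing the parent inumber into the hash index (solution #6 in the sidebar) can be illustrated with a small sketch.
Everything here is an assumption made for illustration: ``toy_name_hash`` is a stand-in rather than the real directory entry name hash, and folding the 64-bit inode number down to 32 bits by xor'ing its two halves is one plausible way to mix it in.

```python
def toy_name_hash(name: bytes) -> int:
    """Stand-in 32-bit name hash (hypothetical, NOT the XFS dirent hash)."""
    h = 0
    for byte in name:
        # Rotate the running hash left by 7 bits, then mix in the next byte.
        h = ((h << 7) | (h >> 25)) & 0xFFFFFFFF
        h ^= byte
    return h


def toy_pptr_hash(name: bytes, parent_ino: int) -> int:
    """Mix the parent inode number into the name hash (illustrative only)."""
    # Fold the 64-bit inode number into 32 bits by xor'ing its halves.
    folded = (parent_ino ^ (parent_ino >> 32)) & 0xFFFFFFFF
    return toy_name_hash(name) ^ folded
```

With a plain name hash, a file hardlinked under the same name from many directories puts all of its parent pointers on one hash value; mixing in the parent inumber spreads them out, for example ``toy_pptr_hash(b"a", 128)`` and ``toy_pptr_hash(b"a", 129)`` differ.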
Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -4528,8 +4576,9 @@ Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:
1. Set up a temporary directory for generating the new directory structure,
an xfblob for storing entry names, and an xfarray for stashing directory
updates.
an xfblob for storing entry names, and an xfarray for stashing the fixed
size fields involved in a directory update: ``(child inumber, add vs.
remove, name cookie, ftype)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@ -4538,73 +4587,36 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
pointer references the directory of interest.
If so:
a. Stash an addname entry for this dirent in the xfarray for later.
a. Stash the parent pointer name and an addname entry for this dirent in the
xfblob and xfarray, respectively.
b. When finished scanning that file, flush the stashed updates to the
temporary directory.
b. When finished scanning that file or the kernel memory consumption exceeds
a threshold, flush the stashed updates to the temporary directory.
4. For each live directory update received via the hook, decide if the child
has already been scanned.
If so:
a. Stash an addname or removename entry for this dirent update in the
xfarray for later.
a. Stash the parent pointer name and an addname or removename entry for this
dirent update in the xfblob and xfarray for later.
We cannot write directly to the temporary directory because hook
functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed updates to the temporary directory.
5. When the scan is complete, atomically exchange the contents of the temporary
5. When the scan is complete, replay any stashed entries in the xfarray.
6. When the scan is complete, atomically exchange the contents of the temporary
directory and the directory being repaired.
The temporary directory now contains the damaged directory structure.
6. Reap the temporary directory.
7. Update the dirent position field of parent pointers as necessary.
This may require the queuing of a substantial number of xattr log intent
items.
7. Reap the temporary directory.
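The stash-and-replay pattern in the steps above can be sketched in userspace as follows.
This is a minimal model under stated assumptions, not the kernel implementation: ``NameBlob``, ``DirUpdate``, and ``DirRebuilder`` are hypothetical names, a Python list stands in for the xfarray, a dict stands in for the temporary directory, and the flush threshold counts entries rather than measuring kernel memory consumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirUpdate:
    """Fixed-size fields of one stashed directory update."""
    child_ino: int
    is_add: bool       # add vs. remove
    name_cookie: int   # where the name lives in the blob store
    ftype: int


class NameBlob:
    """Toy xfblob: stores variable-length names and hands back a cookie."""

    def __init__(self):
        self._names = []

    def store(self, name: bytes) -> int:
        self._names.append(name)
        return len(self._names) - 1

    def load(self, cookie: int) -> bytes:
        return self._names[cookie]


class DirRebuilder:
    def __init__(self, flush_threshold: int = 1024):
        self.blob = NameBlob()
        self.stash = []      # toy xfarray of DirUpdate records
        self.tempdir = {}    # toy temporary directory: name -> (ino, ftype)
        self.flush_threshold = flush_threshold

    def stash_update(self, child_ino, name, ftype, is_add=True):
        """Called by the scanner or the live update hook; never touches tempdir."""
        cookie = self.blob.store(name)
        self.stash.append(DirUpdate(child_ino, is_add, cookie, ftype))

    def stash_full(self) -> bool:
        return len(self.stash) >= self.flush_threshold

    def replay(self):
        """Called only by the scanner thread: apply stashed updates in order."""
        for upd in self.stash:
            name = self.blob.load(upd.name_cookie)
            if upd.is_add:
                self.tempdir[name] = (upd.child_ino, upd.ftype)
            else:
                self.tempdir.pop(name, None)
        self.stash.clear()
```

Keeping the variable-length name out of the fixed-size record is what lets the stash be a flat array; the hook only appends, and all modification of the temporary directory happens in ``replay``.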
The proposed patchset is the
`parent pointers directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.
**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
match in the reconstructed directory?
*Answer*: There are a few ways to solve this problem:
1. The field could be designated advisory, since the other three values are
sufficient to find the entry in the parent.
However, this makes indexed key lookup impossible while repairs are ongoing.
2. We could allow creating directory entries at specified offsets, which solves
the referential integrity problem but runs the risk that dirent creation
will fail due to conflicts with the free space in the directory.
These conflicts could be resolved by appending the directory entry and
amending the xattr code to support updating an xattr key and reindexing the
dabtree, though this would have to be performed with the parent directory
still locked.
3. Same as above, but remove the old parent pointer entry and add a new one
atomically.
4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
which would provide the attr name uniqueness that we require, without
forcing repair code to update the dirent position.
Unfortunately, this requires changes to the xattr code to support attr
names as long as 263 bytes.
5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
(name, parent_gen)``.
If the hash is sufficiently resistant to collisions (e.g. sha256) then
this should provide the attr name uniqueness that we require.
Names shorter than 247 bytes could be stored directly.
Discussion is ongoing under the `parent pointers patch deluge
<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -4612,8 +4624,9 @@ Online reconstruction of a file's parent pointer information works similarly to
directory reconstruction:
1. Set up a temporary file for generating a new extended attribute structure,
an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for
stashing parent pointer updates.
an xfblob for storing parent pointer names, and an xfarray for stashing the
fixed size fields involved in a parent pointer update: ``(parent inumber,
parent generation, add vs. remove, name cookie)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@ -4622,34 +4635,36 @@ directory reconstruction:
dirent references the file of interest.
If so:
a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
for later.
a. Stash the dirent name and an addpptr entry for this parent pointer in the
xfblob and xfarray, respectively.
b. When finished scanning the directory, flush the stashed updates to the
temporary directory.
b. When finished scanning the directory or the kernel memory consumption
exceeds a threshold, flush the stashed updates to the temporary file.
4. For each live directory update received via the hook, decide if the parent
has already been scanned.
If so:
a. Stash an addpptr or removepptr entry for this dirent update in the
xfarray for later.
a. Stash the dirent name and an addpptr or removepptr entry for this dirent
update in the xfblob and xfarray for later.
We cannot write parent pointers directly to the temporary file because
hook functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed parent pointer updates to the temporary file.
5. Copy all non-parent pointer extended attributes to the temporary file.
5. When the scan is complete, replay any stashed entries in the xfarray.
6. When the scan is complete, atomically exchange the mappings of the attribute
6. Copy all non-parent pointer extended attributes to the temporary file.
7. When the scan is complete, atomically exchange the mappings of the attribute
forks of the temporary file and the file being repaired.
The temporary file now contains the damaged extended attribute structure.
7. Reap the temporary file.
8. Reap the temporary file.
The proposed patchset is the
`parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.
Digression: Offline Checking of Parent Pointers
@ -4660,26 +4675,56 @@ files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:
1. After the set of surviving files has been established (i.e. phase 6),
1. After the set of surviving files has been established (phase 6),
walk the surviving directories of each AG in the filesystem.
This is already performed as part of the connectivity checks.
2. For each directory entry found, record the name in an xfblob, and store
``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
per-AG in-memory slab.
2. For each directory entry found,
a. If the name has already been stored in the xfblob, then use that cookie
and skip the next step.
b. Otherwise, record the name in an xfblob, and remember the xfblob cookie.
Unique mappings are critical for
1. Deduplicating names to reduce memory usage, and
2. Creating a stable sort key for the parent pointer indexes so that the
parent pointer validation described below will work.
c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
referenced in this section is the regular directory entry name hash, not
the specialized one used for parent pointer xattrs.
3. For each AG in the filesystem,
a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
dirent_pos.
a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
``name_hash``, and ``name_cookie``.
Having a single ``name_cookie`` for each ``name`` is critical for
handling the uncommon case of a directory containing multiple hardlinks
to the same file where all the names hash to the same value.
b. For each inode in the AG,
1. Scan the inode for parent pointers.
Record the names in a per-file xfblob, and store ``(parent_inum,
parent_gen, dirent_pos)`` tuples in a per-file slab.
For each parent pointer found,
2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
a. Validate the ondisk parent pointer.
If validation fails, move on to the next parent pointer in the
file.
b. If the name has already been stored in the xfblob, then use that
cookie and skip the next step.
c. Record the name in a per-file xfblob, and remember the xfblob
cookie.
d. Store ``(parent_inum, parent_gen, name_hash, name_len,
name_cookie)`` tuples in a per-file slab.
2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
and ``name_cookie``.
3. Position one slab cursor at the start of the inode's records in the
per-AG tuple slab.
@ -4688,28 +4733,37 @@ connectivity checks:
4. Position a second slab cursor at the start of the per-file tuple slab.
5. Iterate the two cursors in lockstep, comparing the parent_ino and
dirent_pos fields of the records under each cursor.
5. Iterate the two cursors in lockstep, comparing the ``parent_ino``,
``name_hash``, and ``name_cookie`` fields of the records under each
cursor:
a. Tuples in the per-AG list but not the per-file list are missing and
need to be written to the inode.
a. If the per-AG cursor is at a lower point in the keyspace than the
per-file cursor, then the per-AG cursor points to a missing parent
pointer.
Add the parent pointer to the inode and advance the per-AG
cursor.
b. Tuples in the per-file list but not the per-AG list are dangling
and need to be removed from the inode.
b. If the per-file cursor is at a lower point in the keyspace than
the per-AG cursor, then the per-file cursor points to a dangling
parent pointer.
Remove the parent pointer from the inode and advance the per-file
cursor.
c. For tuples in both lists, update the parent_gen and name components
of the parent pointer if necessary.
c. Otherwise, both cursors point at the same parent pointer.
Update the parent_gen component if necessary.
Advance both cursors.
4. Move on to examining link counts, as we do today.
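The lockstep comparison of the two sorted tuple sets in step 3b5 is essentially a merge join.
The sketch below assumes each record is a plain tuple ``(parent_ino, name_hash, name_cookie, parent_gen)`` and that both lists are already sorted by their first three fields; ``reconcile`` and the action labels are hypothetical names, and name cookies are assumed to be comparable between the two lists.

```python
def reconcile(per_ag, per_file):
    """Compare dirent-derived records (per_ag) against ondisk parent
    pointers (per_file) and return the list of repair actions."""
    actions = []
    i = j = 0
    while i < len(per_ag) or j < len(per_file):
        # The comparison key excludes parent_gen, which is only a payload.
        ag_key = per_ag[i][:3] if i < len(per_ag) else None
        file_key = per_file[j][:3] if j < len(per_file) else None

        if file_key is None or (ag_key is not None and ag_key < file_key):
            # Dirent exists but the file lacks the parent pointer.
            actions.append(("add", per_ag[i]))
            i += 1
        elif ag_key is None or file_key < ag_key:
            # Parent pointer exists but no dirent backs it.
            actions.append(("remove", per_file[j]))
            j += 1
        else:
            # Same parent pointer on both sides; fix the generation if stale.
            if per_ag[i][3] != per_file[j][3]:
                actions.append(("update_gen", per_ag[i]))
            i += 1
            j += 1
    return actions
```

Because both inputs are sorted on the same key, each record is visited once and the comparison needs no lookups, which is why a stable sort key (one cookie per unique name) matters.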
The proposed patchset is the
`offline parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
series.
Rebuilding directories from parent pointers in offline repair is very
challenging because it currently uses a single-pass scan of the filesystem
during phase 3 to decide which files are corrupt enough to be zapped.
Rebuilding directories from parent pointers in offline repair would be very
challenging because xfs_repair currently uses two single-pass scans of the
filesystem during phases 3 and 4 to decide which files are corrupt enough to be
zapped.
This scan would have to be converted into a multi-pass scan:
1. The first pass of the scan zaps corrupt inodes, forks, and attributes
@ -4731,6 +4785,130 @@ This scan would have to be converted into a multi-pass scan:
This code has not yet been constructed.
.. _dirtree:
Case Study: Directory Tree Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned earlier, the filesystem directory tree is supposed to be a
directed acyclic graph structure.
However, each node in this graph is a separate ``xfs_inode`` object with its
own locks, which makes validating the tree qualities difficult.
Fortunately, non-directories are allowed to have multiple parents and cannot
have children, so only directories need to be scanned.
Directories typically constitute 5-10% of the files in a filesystem, which
reduces the amount of work dramatically.
If the directory tree could be frozen, it would be easy to discover cycles and
disconnected regions by running a depth (or breadth) first search downwards
from the root directory and marking a bitmap for each directory found.
At any point in the walk, trying to set an already set bit means there is a
cycle.
After the scan completes, XORing the marked inode bitmap with the inode
allocation bitmap reveals disconnected inodes.
However, one of online repair's design goals is to avoid locking the entire
filesystem unless it's absolutely necessary.
Directory tree updates can move subtrees across the scanner wavefront on a live
filesystem, so the bitmap algorithm cannot be applied.
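For reference, the frozen-tree bitmap algorithm that online repair cannot use looks roughly like the sketch below.
It assumes the whole directory tree is available as an in-memory mapping from each directory to its child directories, and it uses Python sets in place of bitmaps; ``walk_frozen_tree`` and its parameters are hypothetical names.

```python
def walk_frozen_tree(root, children, all_dirs):
    """Depth-first walk from the root over a frozen directory tree.

    children maps a directory inode number to its child directories;
    all_dirs is the set of allocated directory inode numbers.
    Returns (has_cycle, disconnected_dirs).
    """
    marked = set()
    has_cycle = False
    stack = [root]
    while stack:
        d = stack.pop()
        if d in marked:
            # Reaching a directory twice means the structure is not a tree.
            has_cycle = True
            continue
        marked.add(d)
        stack.extend(children.get(d, []))
    # XOR of the two "bitmaps": allocated directories that were never reached.
    return has_cycle, set(all_dirs) ^ marked
```

The walk is only correct if ``children`` cannot change underneath it, which is exactly the guarantee a live filesystem does not provide.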
Directory parent pointers enable an incremental approach to validation of the
tree structure.
Instead of using one thread to scan the entire filesystem, multiple threads can
walk from individual subdirectories upwards towards the root.
For this to work, all directory entries and parent pointers must be internally
consistent, each directory entry must have a parent pointer, and the link
counts of all directories must be correct.
Each scanner thread must be able to take the IOLOCK of an alleged parent
directory while holding the IOLOCK of the child directory to prevent either
directory from being moved within the tree.
This is not possible since the VFS does not take the IOLOCK of a child
subdirectory when moving that subdirectory, so instead the scanner stabilizes
the parent -> child relationship by taking the ILOCKs and installing a dirent
update hook to detect changes.
The scanning process uses a dirent hook to detect changes to the directories
mentioned in the scan data.
The scan works as follows:
1. For each subdirectory in the filesystem,
a. For each parent pointer of that subdirectory,
1. Create a path object for that parent pointer, and mark the
subdirectory inode number in the path object's bitmap.
2. Record the parent pointer name and inode number in a path structure.
3. If the alleged parent is the subdirectory being scrubbed, the path is
a cycle.
Mark the path for deletion and repeat step 1a with the next
subdirectory parent pointer.
4. Try to mark the alleged parent inode number in a bitmap in the path
object.
If the bit is already set, then there is a cycle in the directory
tree.
Mark the path as a cycle and repeat step 1a with the next subdirectory
parent pointer.
5. Load the alleged parent.
If the alleged parent is not a linked directory, abort the scan
because the parent pointer information is inconsistent.
6. For each parent pointer of this alleged ancestor directory,
a. Record the parent pointer name and inode number in the path object
if no parent has been set for that level.
b. If an ancestor has more than one parent, mark the path as corrupt.
Repeat step 1a with the next subdirectory parent pointer.
c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
This repeats until the directory tree root is reached or no parents
are found.
7. If the walk terminates at the root directory, mark the path as ok.
8. If the walk terminates without reaching the root, mark the path as
disconnected.
2. If the directory entry update hook triggers, check all paths already found
by the scan.
If the entry matches part of a path, mark that path and the scan stale.
When the scanner thread sees that the scan has been marked stale, it deletes
all scan data and starts over.
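The upward walk for a single path can be sketched as follows.
This is a simplified model: it assumes parent pointers have been loaded into a dict mapping each directory to the list of its parents, uses a set in place of the path bitmap, and omits locking, the check that each alleged parent is a linked directory, and the live update hook. The function name and the returned labels are hypothetical.

```python
def classify_path(subdir, first_parent, parents_of, root):
    """Walk upwards from one parent pointer of subdir and classify the path."""
    if first_parent == subdir:
        # The alleged parent is the subdirectory itself.
        return "delete"

    seen = {subdir}            # stands in for the path object's bitmap
    cur = first_parent
    while True:
        if cur in seen:
            return "cycle"     # bit already set: we looped back
        seen.add(cur)
        if cur == root:
            return "ok"
        ancestors = parents_of.get(cur, [])
        if len(ancestors) > 1:
            return "corrupt"   # a directory may have only one parent
        if not ancestors:
            return "disconnected"
        cur = ancestors[0]
```

In the real scan, any dirent update touching an inode recorded in a path would mark that path stale and force the walk to restart; this sketch assumes a quiescent tree for the duration of the call.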
Repairing the directory tree works as follows:
1. Walk each path of the target subdirectory.
a. Corrupt paths and cycle paths are counted as suspect.
b. Paths already marked for deletion are counted as bad.
c. Paths that reached the root are counted as good.
2. If the subdirectory is either the root directory or has zero link count,
delete all incoming directory entries in the immediate parents.
Repairs are complete.
3. If the subdirectory has exactly one path, set the dotdot entry to the
parent and exit.
4. If the subdirectory has at least one good path, delete all the other
incoming directory entries in the immediate parents.
5. If the subdirectory has no good paths and more than one suspect path, delete
all the other incoming directory entries in the immediate parents.
6. If the subdirectory has zero paths, attach it to the lost and found.
The proposed patches are in the
`directory tree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
series.
.. _orphanage:
The Orphanage