8a98ec7c7b
Move the ext4 data structures book to Documentation/filesystems/ext4/ since the administrative information moved elsewhere. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
427 lines
12 KiB
ReStructuredText
427 lines
12 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
Directory Entries
|
|
-----------------
|
|
|
|
In an ext4 filesystem, a directory is more or less a flat file that maps
|
|
an arbitrary byte string (usually ASCII) to an inode number on the
|
|
filesystem. There can be many directory entries across the filesystem
|
|
that reference the same inode number--these are known as hard links, and
|
|
that is why hard links cannot reference files on other filesystems. As
|
|
such, directory entries are found by reading the data block(s)
|
|
associated with a directory file for the particular directory entry that
|
|
is desired.
|
|
|
|
Linear (Classic) Directories
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
By default, each directory lists its entries in an “almost-linear”
|
|
array. I write “almost” because it's not a linear array in the memory
|
|
sense because directory entries are not split across filesystem blocks.
|
|
Therefore, it is more accurate to say that a directory is a series of
|
|
data blocks and that each block contains a linear array of directory
|
|
entries. The end of each per-block array is signified by reaching the
|
|
end of the block; the last entry in the block has a record length that
|
|
takes it all the way to the end of the block. The end of the entire
|
|
directory is of course signified by reaching the end of the file. Unused
|
|
directory entries are signified by inode = 0. By default the filesystem
|
|
uses ``struct ext4_dir_entry_2`` for directory entries unless the
|
|
“filetype” feature flag is not set, in which case it uses
|
|
``struct ext4_dir_entry``.
|
|
|
|
The original directory entry format is ``struct ext4_dir_entry``, which
|
|
is at most 263 bytes long, though on disk you'll need to reference
|
|
``dirent.rec_len`` to know for sure.
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- \_\_le32
|
|
- inode
|
|
- Number of the inode that this directory entry points to.
|
|
* - 0x4
|
|
- \_\_le16
|
|
- rec\_len
|
|
- Length of this directory entry. Must be a multiple of 4.
|
|
* - 0x6
|
|
- \_\_le16
|
|
- name\_len
|
|
- Length of the file name.
|
|
* - 0x8
|
|
- char
|
|
- name[EXT4\_NAME\_LEN]
|
|
- File name.
|
|
|
|
Since file names cannot be longer than 255 bytes, the new directory
|
|
entry format shortens the rec\_len field and uses the space for a file
|
|
type flag, probably to avoid having to load every inode during directory
|
|
tree traversal. This format is ``ext4_dir_entry_2``, which is at most
|
|
263 bytes long, though on disk you'll need to reference
|
|
``dirent.rec_len`` to know for sure.
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- \_\_le32
|
|
- inode
|
|
- Number of the inode that this directory entry points to.
|
|
* - 0x4
|
|
- \_\_le16
|
|
- rec\_len
|
|
- Length of this directory entry.
|
|
* - 0x6
|
|
- \_\_u8
|
|
- name\_len
|
|
- Length of the file name.
|
|
* - 0x7
|
|
- \_\_u8
|
|
- file\_type
|
|
- File type code, see ftype_ table below.
|
|
* - 0x8
|
|
- char
|
|
- name[EXT4\_NAME\_LEN]
|
|
- File name.
|
|
|
|
.. _ftype:
|
|
|
|
The directory file type is one of the following values:
|
|
|
|
.. list-table::
|
|
:widths: 16 64
|
|
:header-rows: 1
|
|
|
|
* - Value
|
|
- Description
|
|
* - 0x0
|
|
- Unknown.
|
|
* - 0x1
|
|
- Regular file.
|
|
* - 0x2
|
|
- Directory.
|
|
* - 0x3
|
|
- Character device file.
|
|
* - 0x4
|
|
- Block device file.
|
|
* - 0x5
|
|
- FIFO.
|
|
* - 0x6
|
|
- Socket.
|
|
* - 0x7
|
|
- Symbolic link.
|
|
|
|
In order to add checksums to these classic directory blocks, a phony
|
|
``struct ext4_dir_entry`` is placed at the end of each leaf block to
|
|
hold the checksum. The directory entry is 12 bytes long. The inode
|
|
number and name\_len fields are set to zero to fool old software into
|
|
ignoring an apparently empty directory entry, and the checksum is stored
|
|
in the place where the name normally goes. The structure is
|
|
``struct ext4_dir_entry_tail``:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- \_\_le32
|
|
- det\_reserved\_zero1
|
|
- Inode number, which must be zero.
|
|
* - 0x4
|
|
- \_\_le16
|
|
- det\_rec\_len
|
|
- Length of this directory entry, which must be 12.
|
|
* - 0x6
|
|
- \_\_u8
|
|
- det\_reserved\_zero2
|
|
- Length of the file name, which must be zero.
|
|
* - 0x7
|
|
- \_\_u8
|
|
- det\_reserved\_ft
|
|
- File type, which must be 0xDE.
|
|
* - 0x8
|
|
- \_\_le32
|
|
- det\_checksum
|
|
- Directory leaf block checksum.
|
|
|
|
The leaf directory block checksum is calculated against the FS UUID, the
|
|
directory's inode number, the directory's inode generation number, and
|
|
the entire directory entry block up to (but not including) the fake
|
|
directory entry.
|
|
|
|
Hash Tree Directories
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
A linear array of directory entries isn't great for performance, so a
|
|
new feature was added to ext3 to provide a faster (but peculiar)
|
|
balanced tree keyed off a hash of the directory entry name. If the
|
|
EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
|
|
hashed btree (htree) to organize and find directory entries. For
|
|
backwards read-only compatibility with ext2, this tree is actually
|
|
hidden inside the directory file, masquerading as “empty” directory data
|
|
blocks! It was stated previously that the end of the linear directory
|
|
entry table was signified with an entry pointing to inode 0; this is
|
|
(ab)used to fool the old linear-scan algorithm into thinking that the
|
|
rest of the directory block is empty so that it moves on.
|
|
|
|
The root of the tree always lives in the first data block of the
|
|
directory. By ext2 custom, the '.' and '..' entries must appear at the
|
|
beginning of this first block, so they are put here as two
|
|
``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
|
|
the root node contains metadata about the tree and finally a hash->block
|
|
map to find nodes that are lower in the htree. If
|
|
``dx_root.info.indirect_levels`` is non-zero then the htree has two
|
|
levels; the data block pointed to by the root node's map is an interior
|
|
node, which is indexed by a minor hash. Interior nodes in this tree
|
|
contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
|
|
minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear
|
|
array of all ``struct ext4_dir_entry_2``; all of these entries
|
|
(presumably) hash to the same value. If there is an overflow, the
|
|
entries simply overflow into the next leaf node, and the
|
|
least-significant bit of the hash (in the interior node map) that gets
|
|
us to this next leaf node is set.
|
|
|
|
To traverse the directory as a htree, the code calculates the hash of
|
|
the desired file name and uses it to find the corresponding block
|
|
number. If the tree is flat, the block is a linear array of directory
|
|
entries that can be searched; otherwise, the minor hash of the file name
|
|
is computed and used against this second block to find the corresponding
|
|
third block number. That third block number will be a linear array of
|
|
directory entries.
|
|
|
|
To traverse the directory as a linear array (such as the old code does),
|
|
the code simply reads every data block in the directory. The blocks used
|
|
for the htree will appear to have no entries (aside from '.' and '..')
|
|
and so only the leaf nodes will appear to have any interesting content.
|
|
|
|
The root of the htree is in ``struct dx_root``, which is the full length
|
|
of a data block:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Type
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- \_\_le32
|
|
- dot.inode
|
|
- inode number of this directory.
|
|
* - 0x4
|
|
- \_\_le16
|
|
- dot.rec\_len
|
|
- Length of this record, 12.
|
|
* - 0x6
|
|
- u8
|
|
- dot.name\_len
|
|
- Length of the name, 1.
|
|
* - 0x7
|
|
- u8
|
|
- dot.file\_type
|
|
- File type of this entry, 0x2 (directory) (if the feature flag is set).
|
|
* - 0x8
|
|
- char
|
|
- dot.name[4]
|
|
- “.\\0\\0\\0”
|
|
* - 0xC
|
|
- \_\_le32
|
|
- dotdot.inode
|
|
- inode number of parent directory.
|
|
* - 0x10
|
|
- \_\_le16
|
|
- dotdot.rec\_len
|
|
- block\_size - 12. The record length is long enough to cover all htree
|
|
data.
|
|
* - 0x12
|
|
- u8
|
|
- dotdot.name\_len
|
|
- Length of the name, 2.
|
|
* - 0x13
|
|
- u8
|
|
- dotdot.file\_type
|
|
- File type of this entry, 0x2 (directory) (if the feature flag is set).
|
|
* - 0x14
|
|
- char
|
|
- dotdot\_name[4]
|
|
- “..\\0\\0”
|
|
* - 0x18
|
|
- \_\_le32
|
|
- struct dx\_root\_info.reserved\_zero
|
|
- Zero.
|
|
* - 0x1C
|
|
- u8
|
|
- struct dx\_root\_info.hash\_version
|
|
- Hash type, see dirhash_ table below.
|
|
* - 0x1D
|
|
- u8
|
|
- struct dx\_root\_info.info\_length
|
|
- Length of the tree information, 0x8.
|
|
* - 0x1E
|
|
- u8
|
|
- struct dx\_root\_info.indirect\_levels
|
|
- Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
|
|
feature is set; cannot be larger than 2 otherwise.
|
|
* - 0x1F
|
|
- u8
|
|
- struct dx\_root\_info.unused\_flags
|
|
-
|
|
* - 0x20
|
|
- \_\_le16
|
|
- limit
|
|
- Maximum number of dx\_entries that can follow this header, plus 1 for
|
|
the header itself.
|
|
* - 0x22
|
|
- \_\_le16
|
|
- count
|
|
- Actual number of dx\_entries that follow this header, plus 1 for the
|
|
header itself.
|
|
* - 0x24
|
|
- \_\_le32
|
|
- block
|
|
- The block number (within the directory file) that goes with hash=0.
|
|
* - 0x28
|
|
- struct dx\_entry
|
|
- entries[0]
|
|
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
|
|
|
|
.. _dirhash:
|
|
|
|
The directory hash is one of the following values:
|
|
|
|
.. list-table::
|
|
:widths: 16 64
|
|
:header-rows: 1
|
|
|
|
* - Value
|
|
- Description
|
|
* - 0x0
|
|
- Legacy.
|
|
* - 0x1
|
|
- Half MD4.
|
|
* - 0x2
|
|
- Tea.
|
|
* - 0x3
|
|
- Legacy, unsigned.
|
|
* - 0x4
|
|
- Half MD4, unsigned.
|
|
* - 0x5
|
|
- Tea, unsigned.
|
|
|
|
Interior nodes of an htree are recorded as ``struct dx_node``, which is
|
|
also the full length of a data block:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Type
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- \_\_le32
|
|
- fake.inode
|
|
- Zero, to make it look like this entry is not in use.
|
|
* - 0x4
|
|
- \_\_le16
|
|
- fake.rec\_len
|
|
- The size of the block, in order to hide all of the dx\_node data.
|
|
* - 0x6
|
|
- u8
|
|
- name\_len
|
|
- Zero. There is no name for this “unused” directory entry.
|
|
* - 0x7
|
|
- u8
|
|
- file\_type
|
|
- Zero. There is no file type for this “unused” directory entry.
|
|
* - 0x8
|
|
- \_\_le16
|
|
- limit
|
|
- Maximum number of dx\_entries that can follow this header, plus 1 for
|
|
the header itself.
|
|
* - 0xA
|
|
- \_\_le16
|
|
- count
|
|
- Actual number of dx\_entries that follow this header, plus 1 for the
|
|
header itself.
|
|
* - 0xE
|
|
- \_\_le32
|
|
- block
|
|
- The block number (within the directory file) that goes with the lowest
|
|
hash value of this block. This value is stored in the parent block.
|
|
* - 0x12
|
|
- struct dx\_entry
|
|
- entries[0]
|
|
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
|
|
|
|
The hash maps that exist in both ``struct dx_root`` and
|
|
``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
|
|
long:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Type
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- \_\_le32
|
|
- hash
|
|
- Hash code.
|
|
* - 0x4
|
|
- \_\_le32
|
|
- block
|
|
- Block number (within the directory file, not filesystem blocks) of the
|
|
next node in the htree.
|
|
|
|
(If you think this is all quite clever and peculiar, so does the
|
|
author.)
|
|
|
|
If metadata checksums are enabled, the last 8 bytes of the directory
|
|
block (precisely the length of one dx\_entry) are used to store a
|
|
``struct dx_tail``, which contains the checksum. The ``limit`` and
|
|
``count`` entries in the dx\_root/dx\_node structures are adjusted as
|
|
necessary to fit the dx\_tail into the block. If there is no space for
|
|
the dx\_tail, the user is notified to run e2fsck -D to rebuild the
|
|
directory index (which will ensure that there's space for the checksum.
|
|
The dx\_tail structure is 8 bytes long and looks like this:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Type
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- u32
|
|
- dt\_reserved
|
|
- Zero.
|
|
* - 0x4
|
|
- \_\_le32
|
|
- dt\_checksum
|
|
- Checksum of the htree directory block.
|
|
|
|
The checksum is calculated against the FS UUID, the htree index header
|
|
(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
|
|
use, and the tail block (dx\_tail).
|