ext4: import directory layout chapter from wiki page
Import the chapter about directory layout from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This commit is contained in:
parent
b4becd48b7
commit
60edae3a04
426
Documentation/filesystems/ext4/ondisk/directory.rst
Normal file
426
Documentation/filesystems/ext4/ondisk/directory.rst
Normal file
@ -0,0 +1,426 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Directory Entries
|
||||
-----------------
|
||||
|
||||
In an ext4 filesystem, a directory is more or less a flat file that maps
|
||||
an arbitrary byte string (usually ASCII) to an inode number on the
|
||||
filesystem. There can be many directory entries across the filesystem
|
||||
that reference the same inode number--these are known as hard links, and
|
||||
that is why hard links cannot reference files on other filesystems. As
|
||||
such, directory entries are found by reading the data block(s)
|
||||
associated with a directory file for the particular directory entry that
|
||||
is desired.
|
||||
|
||||
Linear (Classic) Directories
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
By default, each directory lists its entries in an “almost-linear”
|
||||
array. I write “almost” because it's not a linear array in the memory
|
||||
sense because directory entries are not split across filesystem blocks.
|
||||
Therefore, it is more accurate to say that a directory is a series of
|
||||
data blocks and that each block contains a linear array of directory
|
||||
entries. The end of each per-block array is signified by reaching the
|
||||
end of the block; the last entry in the block has a record length that
|
||||
takes it all the way to the end of the block. The end of the entire
|
||||
directory is of course signified by reaching the end of the file. Unused
|
||||
directory entries are signified by inode = 0. By default the filesystem
|
||||
uses ``struct ext4_dir_entry_2`` for directory entries unless the
|
||||
“filetype” feature flag is not set, in which case it uses
|
||||
``struct ext4_dir_entry``.
|
||||
|
||||
The original directory entry format is ``struct ext4_dir_entry``, which
|
||||
is at most 263 bytes long, though on disk you'll need to reference
|
||||
``dirent.rec_len`` to know for sure.
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Size
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- \_\_le32
|
||||
- inode
|
||||
- Number of the inode that this directory entry points to.
|
||||
* - 0x4
|
||||
- \_\_le16
|
||||
- rec\_len
|
||||
- Length of this directory entry. Must be a multiple of 4.
|
||||
* - 0x6
|
||||
- \_\_le16
|
||||
- name\_len
|
||||
- Length of the file name.
|
||||
* - 0x8
|
||||
- char
|
||||
- name[EXT4\_NAME\_LEN]
|
||||
- File name.
|
||||
|
||||
Since file names cannot be longer than 255 bytes, the new directory
|
||||
entry format shortens the rec\_len field and uses the space for a file
|
||||
type flag, probably to avoid having to load every inode during directory
|
||||
tree traversal. This format is ``ext4_dir_entry_2``, which is at most
|
||||
263 bytes long, though on disk you'll need to reference
|
||||
``dirent.rec_len`` to know for sure.
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Size
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- \_\_le32
|
||||
- inode
|
||||
- Number of the inode that this directory entry points to.
|
||||
* - 0x4
|
||||
- \_\_le16
|
||||
- rec\_len
|
||||
- Length of this directory entry.
|
||||
* - 0x6
|
||||
- \_\_u8
|
||||
- name\_len
|
||||
- Length of the file name.
|
||||
* - 0x7
|
||||
- \_\_u8
|
||||
- file\_type
|
||||
- File type code, see ftype_ table below.
|
||||
* - 0x8
|
||||
- char
|
||||
- name[EXT4\_NAME\_LEN]
|
||||
- File name.
|
||||
|
||||
.. _ftype:
|
||||
|
||||
The directory file type is one of the following values:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 79
|
||||
:header-rows: 1
|
||||
|
||||
* - Value
|
||||
- Description
|
||||
* - 0x0
|
||||
- Unknown.
|
||||
* - 0x1
|
||||
- Regular file.
|
||||
* - 0x2
|
||||
- Directory.
|
||||
* - 0x3
|
||||
- Character device file.
|
||||
* - 0x4
|
||||
- Block device file.
|
||||
* - 0x5
|
||||
- FIFO.
|
||||
* - 0x6
|
||||
- Socket.
|
||||
* - 0x7
|
||||
- Symbolic link.
|
||||
|
||||
In order to add checksums to these classic directory blocks, a phony
|
||||
``struct ext4_dir_entry`` is placed at the end of each leaf block to
|
||||
hold the checksum. The directory entry is 12 bytes long. The inode
|
||||
number and name\_len fields are set to zero to fool old software into
|
||||
ignoring an apparently empty directory entry, and the checksum is stored
|
||||
in the place where the name normally goes. The structure is
|
||||
``struct ext4_dir_entry_tail``:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Size
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- \_\_le32
|
||||
- det\_reserved\_zero1
|
||||
- Inode number, which must be zero.
|
||||
* - 0x4
|
||||
- \_\_le16
|
||||
- det\_rec\_len
|
||||
- Length of this directory entry, which must be 12.
|
||||
* - 0x6
|
||||
- \_\_u8
|
||||
- det\_reserved\_zero2
|
||||
- Length of the file name, which must be zero.
|
||||
* - 0x7
|
||||
- \_\_u8
|
||||
- det\_reserved\_ft
|
||||
- File type, which must be 0xDE.
|
||||
* - 0x8
|
||||
- \_\_le32
|
||||
- det\_checksum
|
||||
- Directory leaf block checksum.
|
||||
|
||||
The leaf directory block checksum is calculated against the FS UUID, the
|
||||
directory's inode number, the directory's inode generation number, and
|
||||
the entire directory entry block up to (but not including) the fake
|
||||
directory entry.
|
||||
|
||||
Hash Tree Directories
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A linear array of directory entries isn't great for performance, so a
|
||||
new feature was added to ext3 to provide a faster (but peculiar)
|
||||
balanced tree keyed off a hash of the directory entry name. If the
|
||||
EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
|
||||
hashed btree (htree) to organize and find directory entries. For
|
||||
backwards read-only compatibility with ext2, this tree is actually
|
||||
hidden inside the directory file, masquerading as “empty” directory data
|
||||
blocks! It was stated previously that the end of the linear directory
|
||||
entry table was signified with an entry pointing to inode 0; this is
|
||||
(ab)used to fool the old linear-scan algorithm into thinking that the
|
||||
rest of the directory block is empty so that it moves on.
|
||||
|
||||
The root of the tree always lives in the first data block of the
|
||||
directory. By ext2 custom, the '.' and '..' entries must appear at the
|
||||
beginning of this first block, so they are put here as two
|
||||
``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
|
||||
the root node contains metadata about the tree and finally a hash->block
|
||||
map to find nodes that are lower in the htree. If
|
||||
``dx_root.info.indirect_levels`` is non-zero then the htree has two
|
||||
levels; the data block pointed to by the root node's map is an interior
|
||||
node, which is indexed by a minor hash. Interior nodes in this tree
|
||||
contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
|
||||
minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear
|
||||
array of all ``struct ext4_dir_entry_2``; all of these entries
|
||||
(presumably) hash to the same value. If there is an overflow, the
|
||||
entries simply overflow into the next leaf node, and the
|
||||
least-significant bit of the hash (in the interior node map) that gets
|
||||
us to this next leaf node is set.
|
||||
|
||||
To traverse the directory as a htree, the code calculates the hash of
|
||||
the desired file name and uses it to find the corresponding block
|
||||
number. If the tree is flat, the block is a linear array of directory
|
||||
entries that can be searched; otherwise, the minor hash of the file name
|
||||
is computed and used against this second block to find the corresponding
|
||||
third block number. That third block number will be a linear array of
|
||||
directory entries.
|
||||
|
||||
To traverse the directory as a linear array (such as the old code does),
|
||||
the code simply reads every data block in the directory. The blocks used
|
||||
for the htree will appear to have no entries (aside from '.' and '..')
|
||||
and so only the leaf nodes will appear to have any interesting content.
|
||||
|
||||
The root of the htree is in ``struct dx_root``, which is the full length
|
||||
of a data block:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Type
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- \_\_le32
|
||||
- dot.inode
|
||||
- inode number of this directory.
|
||||
* - 0x4
|
||||
- \_\_le16
|
||||
- dot.rec\_len
|
||||
- Length of this record, 12.
|
||||
* - 0x6
|
||||
- u8
|
||||
- dot.name\_len
|
||||
- Length of the name, 1.
|
||||
* - 0x7
|
||||
- u8
|
||||
- dot.file\_type
|
||||
- File type of this entry, 0x2 (directory) (if the feature flag is set).
|
||||
* - 0x8
|
||||
- char
|
||||
- dot.name[4]
|
||||
- “.\\0\\0\\0”
|
||||
* - 0xC
|
||||
- \_\_le32
|
||||
- dotdot.inode
|
||||
- inode number of parent directory.
|
||||
* - 0x10
|
||||
- \_\_le16
|
||||
- dotdot.rec\_len
|
||||
- block\_size - 12. The record length is long enough to cover all htree
|
||||
data.
|
||||
* - 0x12
|
||||
- u8
|
||||
- dotdot.name\_len
|
||||
- Length of the name, 2.
|
||||
* - 0x13
|
||||
- u8
|
||||
- dotdot.file\_type
|
||||
- File type of this entry, 0x2 (directory) (if the feature flag is set).
|
||||
* - 0x14
|
||||
- char
|
||||
- dotdot\_name[4]
|
||||
- “..\\0\\0”
|
||||
* - 0x18
|
||||
- \_\_le32
|
||||
- struct dx\_root\_info.reserved\_zero
|
||||
- Zero.
|
||||
* - 0x1C
|
||||
- u8
|
||||
- struct dx\_root\_info.hash\_version
|
||||
- Hash type, see dirhash_ table below.
|
||||
* - 0x1D
|
||||
- u8
|
||||
- struct dx\_root\_info.info\_length
|
||||
- Length of the tree information, 0x8.
|
||||
* - 0x1E
|
||||
- u8
|
||||
- struct dx\_root\_info.indirect\_levels
|
||||
- Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
|
||||
feature is set; cannot be larger than 2 otherwise.
|
||||
* - 0x1F
|
||||
- u8
|
||||
- struct dx\_root\_info.unused\_flags
|
||||
-
|
||||
* - 0x20
|
||||
- \_\_le16
|
||||
- limit
|
||||
- Maximum number of dx\_entries that can follow this header, plus 1 for
|
||||
the header itself.
|
||||
* - 0x22
|
||||
- \_\_le16
|
||||
- count
|
||||
- Actual number of dx\_entries that follow this header, plus 1 for the
|
||||
header itself.
|
||||
* - 0x24
|
||||
- \_\_le32
|
||||
- block
|
||||
- The block number (within the directory file) that goes with hash=0.
|
||||
* - 0x28
|
||||
- struct dx\_entry
|
||||
- entries[0]
|
||||
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
|
||||
|
||||
.. _dirhash:
|
||||
|
||||
The directory hash is one of the following values:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 79
|
||||
:header-rows: 1
|
||||
|
||||
* - Value
|
||||
- Description
|
||||
* - 0x0
|
||||
- Legacy.
|
||||
* - 0x1
|
||||
- Half MD4.
|
||||
* - 0x2
|
||||
- Tea.
|
||||
* - 0x3
|
||||
- Legacy, unsigned.
|
||||
* - 0x4
|
||||
- Half MD4, unsigned.
|
||||
* - 0x5
|
||||
- Tea, unsigned.
|
||||
|
||||
Interior nodes of an htree are recorded as ``struct dx_node``, which is
|
||||
also the full length of a data block:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Type
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- \_\_le32
|
||||
- fake.inode
|
||||
- Zero, to make it look like this entry is not in use.
|
||||
* - 0x4
|
||||
- \_\_le16
|
||||
- fake.rec\_len
|
||||
- The size of the block, in order to hide all of the dx\_node data.
|
||||
* - 0x6
|
||||
- u8
|
||||
- name\_len
|
||||
- Zero. There is no name for this “unused” directory entry.
|
||||
* - 0x7
|
||||
- u8
|
||||
- file\_type
|
||||
- Zero. There is no file type for this “unused” directory entry.
|
||||
* - 0x8
|
||||
- \_\_le16
|
||||
- limit
|
||||
- Maximum number of dx\_entries that can follow this header, plus 1 for
|
||||
the header itself.
|
||||
* - 0xA
|
||||
- \_\_le16
|
||||
- count
|
||||
- Actual number of dx\_entries that follow this header, plus 1 for the
|
||||
header itself.
|
||||
* - 0xE
|
||||
- \_\_le32
|
||||
- block
|
||||
- The block number (within the directory file) that goes with the lowest
|
||||
hash value of this block. This value is stored in the parent block.
|
||||
* - 0x12
|
||||
- struct dx\_entry
|
||||
- entries[0]
|
||||
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
|
||||
|
||||
The hash maps that exist in both ``struct dx_root`` and
|
||||
``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
|
||||
long:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Type
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- \_\_le32
|
||||
- hash
|
||||
- Hash code.
|
||||
* - 0x4
|
||||
- \_\_le32
|
||||
- block
|
||||
- Block number (within the directory file, not filesystem blocks) of the
|
||||
next node in the htree.
|
||||
|
||||
(If you think this is all quite clever and peculiar, so does the
|
||||
author.)
|
||||
|
||||
If metadata checksums are enabled, the last 8 bytes of the directory
|
||||
block (precisely the length of one dx\_entry) are used to store a
|
||||
``struct dx_tail``, which contains the checksum. The ``limit`` and
|
||||
``count`` entries in the dx\_root/dx\_node structures are adjusted as
|
||||
necessary to fit the dx\_tail into the block. If there is no space for
|
||||
the dx\_tail, the user is notified to run e2fsck -D to rebuild the
|
||||
directory index (which will ensure that there's space for the checksum.
|
||||
The dx\_tail structure is 8 bytes long and looks like this:
|
||||
|
||||
.. list-table::
|
||||
:widths: 1 1 1 77
|
||||
:header-rows: 1
|
||||
|
||||
* - Offset
|
||||
- Type
|
||||
- Name
|
||||
- Description
|
||||
* - 0x0
|
||||
- u32
|
||||
- dt\_reserved
|
||||
- Zero.
|
||||
* - 0x4
|
||||
- \_\_le32
|
||||
- dt\_checksum
|
||||
- Checksum of the htree directory block.
|
||||
|
||||
The checksum is calculated against the FS UUID, the htree index header
|
||||
(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
|
||||
use, and the tail block (dx\_tail).
|
@ -8,3 +8,4 @@ allocated to files.
|
||||
|
||||
.. include:: inodes.rst
|
||||
.. include:: ifork.rst
|
||||
.. include:: directory.rst
|
||||
|
Loading…
Reference in New Issue
Block a user