Move the ext4 data structures book to Documentation/filesystems/ext4/ since the administrative information moved elsewhere. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
		
			
				
	
	
		
| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| Directory Entries
 | |
| -----------------
 | |
| 
 | |
| In an ext4 filesystem, a directory is more or less a flat file that maps
 | |
| an arbitrary byte string (usually ASCII) to an inode number on the
 | |
| filesystem. There can be many directory entries across the filesystem
 | |
| that reference the same inode number--these are known as hard links, and
 | |
| that is why hard links cannot reference files on other filesystems. As
 | |
| such, directory entries are found by reading the data block(s)
 | |
| associated with a directory file for the particular directory entry that
 | |
| is desired.
 | |
| 
 | |
| Linear (Classic) Directories
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| By default, each directory lists its entries in an “almost-linear”
 | |
| array. I write “almost” because it's not a linear array in the memory
 | |
| sense: directory entries are not split across filesystem blocks.
 | |
| Therefore, it is more accurate to say that a directory is a series of
 | |
| data blocks and that each block contains a linear array of directory
 | |
| entries. The end of each per-block array is signified by reaching the
 | |
| end of the block; the last entry in the block has a record length that
 | |
| takes it all the way to the end of the block. The end of the entire
 | |
| directory is of course signified by reaching the end of the file. Unused
 | |
| directory entries are signified by inode = 0. By default the filesystem
 | |
| uses ``struct ext4_dir_entry_2`` for directory entries unless the
 | |
| “filetype” feature flag is not set, in which case it uses
 | |
| ``struct ext4_dir_entry``.
 | |
| 
 | |
| The original directory entry format is ``struct ext4_dir_entry``, which
 | |
| is at most 263 bytes long, though on disk you'll need to reference
 | |
| ``dirent.rec_len`` to know for sure.
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Size
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - \_\_le32
 | |
|      - inode
 | |
|      - Number of the inode that this directory entry points to.
 | |
|    * - 0x4
 | |
|      - \_\_le16
 | |
|      - rec\_len
 | |
|      - Length of this directory entry. Must be a multiple of 4.
 | |
|    * - 0x6
 | |
|      - \_\_le16
 | |
|      - name\_len
 | |
|      - Length of the file name.
 | |
|    * - 0x8
 | |
|      - char
 | |
|      - name[EXT4\_NAME\_LEN]
 | |
|      - File name.
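The relationship between ``name_len`` and the smallest legal ``rec_len`` can be sketched as follows. This is a minimal illustration, assuming an 8-byte fixed header and the multiple-of-4 rule from the table above; ``min_rec_len`` is a hypothetical helper name, not a kernel function.

```python
HEADER_LEN = 8  # inode (4 bytes) + rec_len (2 bytes) + name_len (2 bytes)


def min_rec_len(name_len: int) -> int:
    """Smallest rec_len that holds the header plus a name of name_len bytes.

    Assumption: the record is padded up to the next multiple of 4.
    """
    if not 0 <= name_len <= 255:
        raise ValueError("name_len must be in 0..255")
    return (HEADER_LEN + name_len + 3) & ~3


print(min_rec_len(1))    # header + 1 byte, padded to a multiple of 4
print(min_rec_len(255))  # the longest possible name
```

Under these assumptions a 255-byte name yields a 264-byte record, which is the 263-byte structure rounded up to the alignment. The on-disk ``rec_len`` may be larger than this minimum, for example in the last entry of a block.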
 | |
| 
 | |
| Since file names cannot be longer than 255 bytes, the new directory
 | |
| entry format shortens the name\_len field and uses the space for a file
 | |
| type flag, probably to avoid having to load every inode during directory
 | |
| tree traversal. This format is ``ext4_dir_entry_2``, which is at most
 | |
| 263 bytes long, though on disk you'll need to reference
 | |
| ``dirent.rec_len`` to know for sure.
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Size
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - \_\_le32
 | |
|      - inode
 | |
|      - Number of the inode that this directory entry points to.
 | |
|    * - 0x4
 | |
|      - \_\_le16
 | |
|      - rec\_len
 | |
|      - Length of this directory entry.
 | |
|    * - 0x6
 | |
|      - \_\_u8
 | |
|      - name\_len
 | |
|      - Length of the file name.
 | |
|    * - 0x7
 | |
|      - \_\_u8
 | |
|      - file\_type
 | |
|      - File type code, see ftype_ table below.
 | |
|    * - 0x8
 | |
|      - char
 | |
|      - name[EXT4\_NAME\_LEN]
 | |
|      - File name.
 | |
| 
 | |
| .. _ftype:
 | |
| 
 | |
| The directory file type is one of the following values:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 16 64
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Value
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - Unknown.
 | |
|    * - 0x1
 | |
|      - Regular file.
 | |
|    * - 0x2
 | |
|      - Directory.
 | |
|    * - 0x3
 | |
|      - Character device file.
 | |
|    * - 0x4
 | |
|      - Block device file.
 | |
|    * - 0x5
 | |
|      - FIFO.
 | |
|    * - 0x6
 | |
|      - Socket.
 | |
|    * - 0x7
 | |
|      - Symbolic link.
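Walking one block of ``struct ext4_dir_entry_2`` records might look like the sketch below. It assumes little-endian fields laid out as in the table above and a block small enough that ``rec_len`` holds the length directly; ``walk_dir_block`` and ``make_entry`` are hypothetical helpers used only for illustration.

```python
import struct

FILE_TYPES = {
    0x0: "unknown", 0x1: "regular file", 0x2: "directory",
    0x3: "character device", 0x4: "block device",
    0x5: "FIFO", 0x6: "socket", 0x7: "symbolic link",
}


def walk_dir_block(block: bytes):
    """Yield (inode, file type, name) for every in-use entry in one block."""
    offset = 0
    while offset + 8 <= len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from(
            "<IHBB", block, offset)
        # rec_len must cover the header and name and stay inside the block.
        if (rec_len < 8 or rec_len % 4 or offset + rec_len > len(block)
                or name_len > rec_len - 8):
            raise ValueError(f"corrupt entry at offset {offset}")
        if inode != 0:  # inode == 0 marks an unused entry
            name = block[offset + 8:offset + 8 + name_len]
            yield inode, FILE_TYPES.get(file_type, "unknown"), name
        offset += rec_len  # the last rec_len reaches the end of the block


def make_entry(inode: int, name: bytes, file_type: int, rec_len: int) -> bytes:
    """Build one entry, zero-padded to rec_len (test data only)."""
    header = struct.pack("<IHBB", inode, rec_len, len(name), file_type)
    return (header + name).ljust(rec_len, b"\0")


block = (make_entry(2, b".", 0x2, 12)
         + make_entry(2, b"..", 0x2, 12)
         + make_entry(12, b"hello.txt", 0x1, 1024 - 24))
for entry in walk_dir_block(block):
    print(entry)
```

The loop never looks for a terminator entry; it stops when the record lengths have carried it to the end of the block.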
 | |
| 
 | |
| In order to add checksums to these classic directory blocks, a phony
 | |
| ``struct ext4_dir_entry`` is placed at the end of each leaf block to
 | |
| hold the checksum. The directory entry is 12 bytes long. The inode
 | |
| number and name\_len fields are set to zero to fool old software into
 | |
| ignoring an apparently empty directory entry, and the checksum is stored
 | |
| in the place where the name normally goes. The structure is
 | |
| ``struct ext4_dir_entry_tail``:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Size
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - \_\_le32
 | |
|      - det\_reserved\_zero1
 | |
|      - Inode number, which must be zero.
 | |
|    * - 0x4
 | |
|      - \_\_le16
 | |
|      - det\_rec\_len
 | |
|      - Length of this directory entry, which must be 12.
 | |
|    * - 0x6
 | |
|      - \_\_u8
 | |
|      - det\_reserved\_zero2
 | |
|      - Length of the file name, which must be zero.
 | |
|    * - 0x7
 | |
|      - \_\_u8
 | |
|      - det\_reserved\_ft
 | |
|      - File type, which must be 0xDE.
 | |
|    * - 0x8
 | |
|      - \_\_le32
 | |
|      - det\_checksum
 | |
|      - Directory leaf block checksum.
 | |
| 
 | |
| The leaf directory block checksum is calculated against the FS UUID, the
 | |
| directory's inode number, the directory's inode generation number, and
 | |
| the entire directory entry block up to (but not including) the fake
 | |
| directory entry.
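Recognising the phony tail entry at the end of a leaf block can be sketched as below. The code only checks the fixed fields from the ``struct ext4_dir_entry_tail`` table and returns the stored checksum; it does not recompute the checksum. ``read_dir_tail`` is a hypothetical helper name.

```python
import struct

TAIL_LEN = 12          # det_rec_len must be 12
TAIL_FILE_TYPE = 0xDE  # det_reserved_ft must be 0xDE


def read_dir_tail(block: bytes):
    """Return the stored checksum if the block ends in a tail, else None."""
    if len(block) < TAIL_LEN:
        return None
    zero1, rec_len, zero2, file_type, checksum = struct.unpack_from(
        "<IHBBI", block, len(block) - TAIL_LEN)
    if (zero1 == 0 and rec_len == TAIL_LEN and zero2 == 0
            and file_type == TAIL_FILE_TYPE):
        return checksum
    return None


# A 1024-byte block whose last 12 bytes hold a tail with a made-up checksum.
with_tail = bytes(1012) + struct.pack("<IHBBI", 0, 12, 0, 0xDE, 0x12345678)
print(read_dir_tail(with_tail) == 0x12345678)
print(read_dir_tail(bytes(1024)))
```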
 | |
| 
 | |
| Hash Tree Directories
 | |
| ~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| A linear array of directory entries isn't great for performance, so a
 | |
| new feature was added to ext3 to provide a faster (but peculiar)
 | |
| balanced tree keyed off a hash of the directory entry name. If the
 | |
| EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
 | |
| hashed btree (htree) to organize and find directory entries. For
 | |
| backwards read-only compatibility with ext2, this tree is actually
 | |
| hidden inside the directory file, masquerading as “empty” directory data
 | |
| blocks! It was stated previously that unused directory entries are
 | |
| signified by inode = 0; this is
 | |
| (ab)used to fool the old linear-scan algorithm into thinking that the
 | |
| rest of the directory block is empty so that it moves on.
 | |
| 
 | |
| The root of the tree always lives in the first data block of the
 | |
| directory. By ext2 custom, the '.' and '..' entries must appear at the
 | |
| beginning of this first block, so they are put here as two
 | |
| ``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
 | |
| the root node contains metadata about the tree and finally a hash->block
 | |
| map to find nodes that are lower in the htree. If
 | |
| ``dx_root.info.indirect_levels`` is non-zero then the htree has at least two
 | |
| levels; the data block pointed to by the root node's map is an interior
 | |
| node, which is indexed by a minor hash. Interior nodes in this tree
 | |
| contain a zeroed out ``struct ext4_dir_entry_2`` followed by a
 | |
| minor\_hash->block map to find leaf nodes. Leaf nodes contain a linear
 | |
| array of all ``struct ext4_dir_entry_2``; all of these entries
 | |
| (presumably) hash to the same value. If there is an overflow, the
 | |
| entries simply overflow into the next leaf node, and the
 | |
| least-significant bit of the hash (in the interior node map) that gets
 | |
| us to this next leaf node is set.
 | |
| 
 | |
| To traverse the directory as an htree, the code calculates the hash of
 | |
| the desired file name and uses it to find the corresponding block
 | |
| number. If the tree is flat, the block is a linear array of directory
 | |
| entries that can be searched; otherwise, the minor hash of the file name
 | |
| is computed and used against this second block to find the corresponding
 | |
| third block number. That third block will be a linear array of
 | |
| directory entries.
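The lookup step that is repeated at each level of the tree can be sketched as a search over a sorted hash->block map. This sketch assumes that the map is sorted by hash, that its first slot stands for hash 0, and that the wanted block belongs to the last entry whose hash does not exceed the target; ``find_block`` is a hypothetical helper, and the low-bit overflow flag is ignored here.

```python
from bisect import bisect_right


def find_block(entries, target_hash):
    """Pick the block for target_hash from a sorted list of (hash, block).

    Assumption: entries[0] has hash 0, so every target matches some entry.
    """
    hashes = [entry_hash for entry_hash, _ in entries]
    index = bisect_right(hashes, target_hash) - 1
    return entries[index][1]


# A made-up root map covering three hash ranges.
root_map = [(0x00000000, 1), (0x40000000, 2), (0x80000000, 3)]
print(find_block(root_map, 0x12345678))
print(find_block(root_map, 0x40000000))
print(find_block(root_map, 0xFFFFFFFF))
```

In a flat tree the returned block is already a leaf; otherwise the same search is run again on the map stored in the interior node.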
 | |
| 
 | |
| To traverse the directory as a linear array (such as the old code does),
 | |
| the code simply reads every data block in the directory. The blocks used
 | |
| for the htree will appear to have no entries (aside from '.' and '..')
 | |
| and so only the leaf nodes will appear to have any interesting content.
 | |
| 
 | |
| The root of the htree is in ``struct dx_root``, which is the full length
 | |
| of a data block:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Type
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - \_\_le32
 | |
|      - dot.inode
 | |
|      - inode number of this directory.
 | |
|    * - 0x4
 | |
|      - \_\_le16
 | |
|      - dot.rec\_len
 | |
|      - Length of this record, 12.
 | |
|    * - 0x6
 | |
|      - u8
 | |
|      - dot.name\_len
 | |
|      - Length of the name, 1.
 | |
|    * - 0x7
 | |
|      - u8
 | |
|      - dot.file\_type
 | |
|      - File type of this entry, 0x2 (directory) (if the feature flag is set).
 | |
|    * - 0x8
 | |
|      - char
 | |
|      - dot.name[4]
 | |
|      - “.\\0\\0\\0”
 | |
|    * - 0xC
 | |
|      - \_\_le32
 | |
|      - dotdot.inode
 | |
|      - inode number of parent directory.
 | |
|    * - 0x10
 | |
|      - \_\_le16
 | |
|      - dotdot.rec\_len
 | |
|      - block\_size - 12. The record length is long enough to cover all htree
 | |
|        data.
 | |
|    * - 0x12
 | |
|      - u8
 | |
|      - dotdot.name\_len
 | |
|      - Length of the name, 2.
 | |
|    * - 0x13
 | |
|      - u8
 | |
|      - dotdot.file\_type
 | |
|      - File type of this entry, 0x2 (directory) (if the feature flag is set).
 | |
|    * - 0x14
 | |
|      - char
 | |
|      - dotdot\_name[4]
 | |
|      - “..\\0\\0”
 | |
|    * - 0x18
 | |
|      - \_\_le32
 | |
|      - struct dx\_root\_info.reserved\_zero
 | |
|      - Zero.
 | |
|    * - 0x1C
 | |
|      - u8
 | |
|      - struct dx\_root\_info.hash\_version
 | |
|      - Hash type, see dirhash_ table below.
 | |
|    * - 0x1D
 | |
|      - u8
 | |
|      - struct dx\_root\_info.info\_length
 | |
|      - Length of the tree information, 0x8.
 | |
|    * - 0x1E
 | |
|      - u8
 | |
|      - struct dx\_root\_info.indirect\_levels
 | |
|      - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
 | |
|        feature is set; cannot be larger than 2 otherwise.
 | |
|    * - 0x1F
 | |
|      - u8
 | |
|      - struct dx\_root\_info.unused\_flags
 | |
|      -
 | |
|    * - 0x20
 | |
|      - \_\_le16
 | |
|      - limit
 | |
|      - Maximum number of dx\_entries that can follow this header, plus 1 for
 | |
|        the header itself.
 | |
|    * - 0x22
 | |
|      - \_\_le16
 | |
|      - count
 | |
|      - Actual number of dx\_entries that follow this header, plus 1 for the
 | |
|        header itself.
 | |
|    * - 0x24
 | |
|      - \_\_le32
 | |
|      - block
 | |
|      - The block number (within the directory file) that goes with hash=0.
 | |
|    * - 0x28
 | |
|      - struct dx\_entry
 | |
|      - entries[0]
 | |
|      - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
 | |
| 
 | |
| .. _dirhash:
 | |
| 
 | |
| The directory hash is one of the following values:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 16 64
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Value
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - Legacy.
 | |
|    * - 0x1
 | |
|      - Half MD4.
 | |
|    * - 0x2
 | |
|      - Tea.
 | |
|    * - 0x3
 | |
|      - Legacy, unsigned.
 | |
|    * - 0x4
 | |
|      - Half MD4, unsigned.
 | |
|    * - 0x5
 | |
|      - Tea, unsigned.
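Reading the htree metadata out of the first directory block might look like the sketch below. It takes the offsets from the ``struct dx_root`` table above and assumes little-endian fields; ``parse_dx_root`` is a hypothetical helper and the sample block is fabricated for illustration.

```python
import struct

HASH_NAMES = {
    0x0: "legacy", 0x1: "half MD4", 0x2: "tea",
    0x3: "legacy, unsigned", 0x4: "half MD4, unsigned", 0x5: "tea, unsigned",
}


def parse_dx_root(block: bytes) -> dict:
    """Decode dx_root_info, the limit/count header, and the hash->block map."""
    _zero, hash_version, info_length, indirect_levels, _flags = (
        struct.unpack_from("<IBBBB", block, 0x18))
    limit, count, block0 = struct.unpack_from("<HHI", block, 0x20)
    entries = [(0, block0)]  # the header slot carries the block for hash=0
    for i in range(count - 1):  # count includes the header slot itself
        entries.append(struct.unpack_from("<II", block, 0x28 + 8 * i))
    return {
        "hash": HASH_NAMES.get(hash_version, "unknown"),
        "info_length": info_length,
        "indirect_levels": indirect_levels,
        "limit": limit,
        "count": count,
        "entries": entries,
    }


# A fabricated 1024-byte root block with one dx_entry after the header.
root = bytearray(1024)
struct.pack_into("<IBBBB", root, 0x18, 0, 0x1, 0x8, 0, 0)
struct.pack_into("<HHI", root, 0x20, 124, 2, 1)
struct.pack_into("<II", root, 0x28, 0x80000000, 2)
info = parse_dx_root(bytes(root))
print(info["hash"], info["entries"])
```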
 | |
| 
 | |
| Interior nodes of an htree are recorded as ``struct dx_node``, which is
 | |
| also the full length of a data block:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Type
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - \_\_le32
 | |
|      - fake.inode
 | |
|      - Zero, to make it look like this entry is not in use.
 | |
|    * - 0x4
 | |
|      - \_\_le16
 | |
|      - fake.rec\_len
 | |
|      - The size of the block, in order to hide all of the dx\_node data.
 | |
|    * - 0x6
 | |
|      - u8
 | |
|      - name\_len
 | |
|      - Zero. There is no name for this “unused” directory entry.
 | |
|    * - 0x7
 | |
|      - u8
 | |
|      - file\_type
 | |
|      - Zero. There is no file type for this “unused” directory entry.
 | |
|    * - 0x8
 | |
|      - \_\_le16
 | |
|      - limit
 | |
|      - Maximum number of dx\_entries that can follow this header, plus 1 for
 | |
|        the header itself.
 | |
|    * - 0xA
 | |
|      - \_\_le16
 | |
|      - count
 | |
|      - Actual number of dx\_entries that follow this header, plus 1 for the
 | |
|        header itself.
 | |
|    * - 0xC
 | |
|      - \_\_le32
 | |
|      - block
 | |
|      - The block number (within the directory file) that goes with the lowest
 | |
|        hash value of this block. This value is stored in the parent block.
 | |
|    * - 0x10
 | |
|      - struct dx\_entry
 | |
|      - entries[0]
 | |
|      - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
 | |
| 
 | |
| The hash maps that exist in both ``struct dx_root`` and
 | |
| ``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
 | |
| long:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Type
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - \_\_le32
 | |
|      - hash
 | |
|      - Hash code.
 | |
|    * - 0x4
 | |
|      - \_\_le32
 | |
|      - block
 | |
|      - Block number (within the directory file, not filesystem blocks) of the
 | |
|        next node in the htree.
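An interior node can be decoded in much the same way. The sketch below assumes the fields are packed back to back: an 8-byte fake directory entry, then the 16-bit ``limit`` and ``count``, then the 32-bit ``block``, then the 8-byte ``struct dx_entry`` records. ``parse_dx_node`` is a hypothetical helper, and the hash for the first slot is reported as ``None`` because it is stored in the parent block.

```python
import struct


def parse_dx_node(block: bytes):
    """Return the list of (hash, block) pairs held in one interior node."""
    inode, rec_len, name_len, _file_type = struct.unpack_from("<IHBB", block, 0)
    # The fake entry must look unused and must span the whole block.
    if inode != 0 or rec_len != len(block) or name_len != 0:
        raise ValueError("block does not start with a fake empty entry")
    limit, count, block0 = struct.unpack_from("<HHI", block, 8)
    if not 1 <= count <= limit:
        raise ValueError("count is outside 1..limit")
    entries = [(None, block0)]
    for i in range(count - 1):  # count includes the header slot itself
        entries.append(struct.unpack_from("<II", block, 16 + 8 * i))
    return entries


# A fabricated 1024-byte interior node with two dx_entry records.
node = bytearray(1024)
struct.pack_into("<IHBB", node, 0, 0, 1024, 0, 0)
struct.pack_into("<HHI", node, 8, 127, 3, 5)
struct.pack_into("<II", node, 16, 0x1000, 6)
struct.pack_into("<II", node, 24, 0x2000, 7)
print(parse_dx_node(bytes(node)))
```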
 | |
| 
 | |
| (If you think this is all quite clever and peculiar, so does the
 | |
| author.)
 | |
| 
 | |
| If metadata checksums are enabled, the last 8 bytes of the directory
 | |
| block (precisely the length of one dx\_entry) are used to store a
 | |
| ``struct dx_tail``, which contains the checksum. The ``limit`` and
 | |
| ``count`` entries in the dx\_root/dx\_node structures are adjusted as
 | |
| necessary to fit the dx\_tail into the block. If there is no space for
 | |
| the dx\_tail, the user is notified to run e2fsck -D to rebuild the
 | |
| directory index (which will ensure that there's space for the checksum).
 | |
| The dx\_tail structure is 8 bytes long and looks like this:
 | |
| 
 | |
| .. list-table::
 | |
|    :widths: 8 8 24 40
 | |
|    :header-rows: 1
 | |
| 
 | |
|    * - Offset
 | |
|      - Type
 | |
|      - Name
 | |
|      - Description
 | |
|    * - 0x0
 | |
|      - u32
 | |
|      - dt\_reserved
 | |
|      - Zero.
 | |
|    * - 0x4
 | |
|      - \_\_le32
 | |
|      - dt\_checksum
 | |
|      - Checksum of the htree directory block.
 | |
| 
 | |
| The checksum is calculated against the FS UUID, the htree index header
 | |
| (dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
 | |
| use, and the tail block (dx\_tail).
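The effect of the tail on ``limit`` can be worked through with a little arithmetic. This is a sketch under the assumption that the map area begins at the ``limit`` field (0x20 in ``dx_root``, 0x8 in ``dx_node``) and that each slot, like the tail, is 8 bytes long; the function names are hypothetical.

```python
DX_ENTRY_LEN = 8  # size of one struct dx_entry
DX_TAIL_LEN = 8   # size of struct dx_tail


def dx_root_limit(block_size: int, metadata_csum: bool) -> int:
    """Slots available in the root block, counting the header slot."""
    space = block_size - 0x20
    if metadata_csum:
        space -= DX_TAIL_LEN  # give up the last slot to the checksum
    return space // DX_ENTRY_LEN


def dx_node_limit(block_size: int, metadata_csum: bool) -> int:
    """Slots available in an interior node, counting the header slot."""
    space = block_size - 0x8
    if metadata_csum:
        space -= DX_TAIL_LEN
    return space // DX_ENTRY_LEN


for csum in (False, True):
    print(csum, dx_root_limit(4096, csum), dx_node_limit(4096, csum))
```

With 4096-byte blocks this gives 508 and 511 slots without checksums, and one slot fewer in each case once the tail is present.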
 |