ext4: import high level design chapter from wiki page
Import the chapter about high level design from the on-disk format wiki
page into the kernel documentation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
parent
b2e60723c1
commit
c09f3bac6d
56
Documentation/filesystems/ext4/ondisk/allocators.rst
Normal file
@ -0,0 +1,56 @@
.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk IO. On an SSD there are of course no moving parts,
but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until all
the dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.

The third trick that ext4 (and ext3) use is to try to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related, therefore it
is useful to try to keep them all together.

The fifth trick is that the disk volume is cut up into 128MiB block
groups; these mini-containers are used as outlined above to try to
maintain data locality. However, there is a deliberate quirk -- when a
directory is created in the root directory, the inode allocator scans
the block groups and puts that directory into the least heavily loaded
block group that it can find. This encourages directories to spread out
over a disk; as the top-level directory/file blobs fill up one block
group, the allocators simply move on to the next block group. Allegedly
this scheme evens out the loading on the block groups, though the author
suspects that the directories which are so unlucky as to land towards
the end of a spinning drive get a raw deal performance-wise.

Of course if all of these mechanisms fail, one can always use e4defrag
to defragment files.
22
Documentation/filesystems/ext4/ondisk/bigalloc.rst
Normal file
@ -0,0 +1,22 @@
.. SPDX-License-Identifier: GPL-2.0

Bigalloc
--------

At the moment, the default size of a block is 4KiB, which is a commonly
supported page size on most MMU-capable hardware. This is fortunate, as
ext4 code is not prepared to handle the case where the block size
exceeds the page size. However, for a filesystem of mostly huge files,
it is desirable to be able to allocate disk blocks in units of multiple
blocks to reduce both fragmentation and metadata overhead. The
bigalloc feature provides exactly this ability. The
administrator can set a block cluster size at mkfs time (which is stored
in the s\_log\_cluster\_size field in the superblock); from then on, the
block bitmaps track clusters, not individual blocks. This means that
block groups can be several gigabytes in size (instead of just 128MiB);
however, the minimum allocation unit becomes a cluster, not a block,
even for directories. TaoBao had a patchset to extend the “use units of
clusters instead of blocks” to the extent tree, though it is not clear
where those patches went; they eventually morphed into “extent tree v2”
but that code has not landed as of May 2015.
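
As a rough sketch of why block groups grow under bigalloc: if a group is still described by one bitmap block (8 bits per byte) but each bit now covers a whole cluster, the group size scales with the cluster size. The function name and the 16-blocks-per-cluster ratio are illustrative assumptions, not values taken from mkfs.

```python
def bigalloc_group_bytes(block_size: int, blocks_per_cluster: int) -> int:
    """Bytes covered by one block group when the bitmap tracks clusters.

    Assumes a single bitmap block per group (8 bits per byte); the
    blocks_per_cluster ratio is a hypothetical mkfs-time choice.
    """
    clusters_per_group = 8 * block_size
    cluster_bytes = blocks_per_cluster * block_size
    return clusters_per_group * cluster_bytes


# One block per cluster is the non-bigalloc case: 128 MiB per group.
print(bigalloc_group_bytes(4096, 1) // 2**20, "MiB")
# Sixteen 4 KiB blocks (64 KiB) per cluster: the same bitmap spans 2 GiB.
print(bigalloc_group_bytes(4096, 16) // 2**30, "GiB")
```

The trade-off named in the text is visible here: the larger the cluster, the fewer bits are needed per gigabyte, but every allocation consumes at least one full cluster.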
135
Documentation/filesystems/ext4/ondisk/blockgroup.rst
Normal file
@ -0,0 +1,135 @@
.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock will start at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if for some reason the block size is
1024 bytes, then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.
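
The rule above reduces to a single division. The following is a minimal sketch; the helper name is made up for illustration.

```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of the filesystem


def superblock_location(block_size: int) -> tuple:
    """Return (block number, byte offset inside that block) of the
    primary superblock for a given block size in bytes."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


for size in (1024, 2048, 4096):
    block, offset = superblock_location(size)
    print(f"block size {size}: superblock in block {block} at offset {offset}")
```

Only the 1KiB case pushes the superblock into block 1; for every larger block size it sits 1024 bytes into block 0.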

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
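
To make the size formula concrete, here is a small sketch that converts it into a block count. Rounding up to whole blocks is an assumption of this example, and the sample numbers (8,192 inodes of 256 bytes each, 4KiB blocks) are illustrative rather than read from a real superblock.

```python
def inode_table_blocks(inodes_per_group: int, inode_size: int,
                       block_size: int) -> int:
    """Blocks needed to hold one group's inode table.

    Implements s_inodes_per_group * s_inode_size, rounded up to a whole
    number of blocks (the rounding is an assumption of this sketch).
    """
    table_bytes = inodes_per_group * inode_size
    return -(-table_bytes // block_size)  # ceiling division


print(inode_table_blocks(8192, 256, 4096))
```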

As for the ordering of items in a block group, it is generally
established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex\_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex\_bg). In a flex\_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex\_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex\_bg. For example,
if the flex\_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block metadata close together for faster loading, and to enable large
files to be contiguous on disk. Backup copies of the superblock and
group descriptors are always at the beginning of block groups, even if
flex\_bg is enabled. The number of block groups that make up a flex\_bg
is given by 2 ^ ``sb.s_log_groups_per_flex``.
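
The power-of-two relationship makes flex\_bg membership a matter of bit shifts. The sketch below uses made-up helper names; it only encodes the grouping described above and says nothing about where the allocator actually places data.

```python
def groups_per_flex(s_log_groups_per_flex: int) -> int:
    """Number of block groups tied together in one flex_bg."""
    return 1 << s_log_groups_per_flex


def flex_group_of(block_group: int, s_log_groups_per_flex: int) -> int:
    """Index of the flex_bg that a block group belongs to."""
    return block_group >> s_log_groups_per_flex


def first_group_of_flex(flex_group: int, s_log_groups_per_flex: int) -> int:
    """Block group that hosts the packed bitmaps and inode tables."""
    return flex_group << s_log_groups_per_flex


log_size = 2  # flex_bg size of 4, as in the example in the text
for group in (0, 3, 4, 9):
    flex = flex_group_of(group, log_size)
    print(group, "-> flex_bg", flex,
          "metadata in group", first_group_of_flex(flex, log_size))
```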

Meta Block Groups
-----------------

Without the META\_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 \* 2^27 = 2^48 bytes or 256TiB.
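
The arithmetic behind this limit can be checked directly. This is only a restatement of the numbers in the paragraph above; the variable names are illustrative.

```python
GROUP_BYTES = 1 << 27   # default 128 MiB block group
DESC_BYTES = 64         # size of one group descriptor

# Every descriptor must fit inside the first block group.
max_groups = GROUP_BYTES // DESC_BYTES
max_fs_bytes = max_groups * GROUP_BYTES

print(f"max block groups: 2^{max_groups.bit_length() - 1}")
print(f"max filesystem size: {max_fs_bytes // 2**40} TiB")
```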

The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
META\_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with a 4KiB block size and 64-byte group descriptors, a
single metablock group partition includes 64 block groups, or 8GiB of
disk space. The metablock group
feature moves the location of the group descriptors from the congested
first block group of the whole filesystem into the first group of each
metablock group itself. The backups are in the second and last group of
each metablock group. This increases the 2^21 maximum block groups limit
to the hard limit 2^32, allowing support for a 512PiB filesystem.

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 32
bytes (when the 64bit feature is not enabled), a meta-block group
contains 32 block groups for filesystems with
a 1KiB block size, and 128 block groups for filesystems with a 4KiB
block size. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized online, and
the field s\_first\_meta\_bg in the superblock will indicate the first
block group using this new layout.
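
A short sketch of the two calculations in this section: how many block groups one descriptor block can describe, and which groups of a meta-block group carry a descriptor block. Function names are hypothetical, and the numbering of meta-block groups from zero is an assumption of the example.

```python
def groups_per_meta_bg(block_size: int, desc_size: int) -> int:
    """Block groups described by a single descriptor block."""
    return block_size // desc_size


def descriptor_copy_groups(meta_bg: int, groups_in_meta_bg: int) -> list:
    """First, second, and last block group of a meta-block group."""
    first = meta_bg * groups_in_meta_bg
    return [first, first + 1, first + groups_in_meta_bg - 1]


print(groups_per_meta_bg(1024, 32))   # 32-byte descriptors, 1 KiB blocks
print(groups_per_meta_bg(4096, 32))   # 32-byte descriptors, 4 KiB blocks
print(groups_per_meta_bg(4096, 64))   # 64-byte descriptors, 4 KiB blocks
print(descriptor_copy_groups(2, 64))
```

With 64 groups of 128MiB each, one meta-block group spans 64 \* 128MiB = 8GiB, matching the figure quoted for 4KiB blocks.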

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags that
enable mkfs to skip initializing other parts of the block group
metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE\_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
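
A minimal sketch of decoding these flags from a group descriptor's flags field. The numeric values are not given in this section; they are quoted here from the kernel's ``fs/ext4/ext4.h`` as best recalled and should be verified against the source tree before being relied on. The helper name is hypothetical.

```python
# Assumed values (check fs/ext4/ext4.h): EXT4_BG_* flag bits.
EXT4_BG_INODE_UNINIT = 0x0001  # inode bitmap can be calculated, not read
EXT4_BG_BLOCK_UNINIT = 0x0002  # block bitmap can be calculated, not read
EXT4_BG_INODE_ZEROED = 0x0004  # inode table has been zeroed on disk

_FLAG_NAMES = [
    (EXT4_BG_INODE_UNINIT, "INODE_UNINIT"),
    (EXT4_BG_BLOCK_UNINIT, "BLOCK_UNINIT"),
    (EXT4_BG_INODE_ZEROED, "INODE_ZEROED"),
]


def describe_bg_flags(bg_flags: int) -> list:
    """Return the names of the lazy-initialization flags that are set."""
    return [name for bit, name in _FLAG_NAMES if bg_flags & bit]


# A freshly formatted, still-empty group would typically look like this:
print(describe_bg_flags(EXT4_BG_INODE_UNINIT | EXT4_BG_BLOCK_UNINIT))
```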

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note that the feature flag is RO\_COMPAT\_GDT\_CSUM,
but the dumpe2fs output prints this as “uninit\_bg”. They are the same
thing.
142
Documentation/filesystems/ext4/ondisk/blocks.rst
Normal file
@ -0,0 +1,142 @@
.. SPDX-License-Identifier: GPL-2.0

Blocks
------

ext4 allocates storage space in units of “blocks”. A block is a group of
sectors between 1KiB and 64KiB, and the number of sectors must be an
integral power of 2. Blocks are in turn grouped into larger units called
block groups. Block size is specified at mkfs time and typically is
4KiB. You may experience mounting problems if block size is greater than
page size (e.g. 64KiB blocks on an i386, which only has 4KiB memory
pages). By default a filesystem can contain 2^32 blocks; if the '64bit'
feature is enabled, then a filesystem can have 2^64 blocks.

For 32-bit filesystems, limits are as follows:

.. list-table::
   :widths: 1 1 1 1 1
   :header-rows: 1

   * - Item
     - 1KiB
     - 2KiB
     - 4KiB
     - 64KiB
   * - Blocks
     - 2^32
     - 2^32
     - 2^32
     - 2^32
   * - Inodes
     - 2^32
     - 2^32
     - 2^32
     - 2^32
   * - File System Size
     - 4TiB
     - 8TiB
     - 16TiB
     - 256TiB
   * - Blocks Per Block Group
     - 8,192
     - 16,384
     - 32,768
     - 524,288
   * - Inodes Per Block Group
     - 8,192
     - 16,384
     - 32,768
     - 524,288
   * - Block Group Size
     - 8MiB
     - 32MiB
     - 128MiB
     - 32GiB
   * - Blocks Per File, Extents
     - 2^32
     - 2^32
     - 2^32
     - 2^32
   * - Blocks Per File, Block Maps
     - 16,843,020
     - 134,480,396
     - 1,074,791,436
     - 4,398,314,962,956 (really 2^32 due to field size limitations)
   * - File Size, Extents
     - 4TiB
     - 8TiB
     - 16TiB
     - 256TiB
   * - File Size, Block Maps
     - 16GiB
     - 256GiB
     - 4TiB
     - 256TiB
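
Several rows of the 32-bit table follow from the block size alone. The sketch below recomputes them under two assumptions: the block count is capped at 2^32, and a block group holds as many blocks as one bitmap block has bits. The function name is illustrative.

```python
def limits_32bit(block_size: int) -> dict:
    """Recompute size-related limits for a filesystem without '64bit'."""
    max_blocks = 1 << 32
    blocks_per_group = 8 * block_size  # bits in one bitmap block
    return {
        "fs_size_bytes": max_blocks * block_size,
        "blocks_per_group": blocks_per_group,
        "group_size_bytes": blocks_per_group * block_size,
    }


for kib in (1, 2, 4):
    lim = limits_32bit(kib * 1024)
    print(f"{kib}KiB blocks: "
          f"{lim['fs_size_bytes'] // 2**40}TiB filesystem, "
          f"{lim['blocks_per_group']} blocks per group, "
          f"{lim['group_size_bytes'] // 2**20}MiB per group")
```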

For 64-bit filesystems, limits are as follows:

.. list-table::
   :widths: 1 1 1 1 1
   :header-rows: 1

   * - Item
     - 1KiB
     - 2KiB
     - 4KiB
     - 64KiB
   * - Blocks
     - 2^64
     - 2^64
     - 2^64
     - 2^64
   * - Inodes
     - 2^32
     - 2^32
     - 2^32
     - 2^32
   * - File System Size
     - 16ZiB
     - 32ZiB
     - 64ZiB
     - 1YiB
   * - Blocks Per Block Group
     - 8,192
     - 16,384
     - 32,768
     - 524,288
   * - Inodes Per Block Group
     - 8,192
     - 16,384
     - 32,768
     - 524,288
   * - Block Group Size
     - 8MiB
     - 32MiB
     - 128MiB
     - 32GiB
   * - Blocks Per File, Extents
     - 2^32
     - 2^32
     - 2^32
     - 2^32
   * - Blocks Per File, Block Maps
     - 16,843,020
     - 134,480,396
     - 1,074,791,436
     - 4,398,314,962,956 (really 2^32 due to field size limitations)
   * - File Size, Extents
     - 4TiB
     - 8TiB
     - 16TiB
     - 256TiB
   * - File Size, Block Maps
     - 16GiB
     - 256GiB
     - 4TiB
     - 256TiB

Note: Files not using extents (i.e. files using block maps) must be
placed within the first 2^32 blocks of a filesystem. Files with extents
must be placed within the first 2^48 blocks of a filesystem. It's not
clear what happens with larger filesystems.
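
The “Blocks Per File, Block Maps” rows in the tables above can be reproduced from the classic block map layout: 12 direct pointers plus one single-, one double-, and one triple-indirect block, with each indirect block holding 4-byte block numbers. That layout is background knowledge rather than something stated in this section, so the sketch below should be read with that assumption in mind.

```python
def max_block_map_blocks(block_size: int) -> int:
    """Blocks addressable through a block-mapped inode."""
    ptrs = block_size // 4  # 32-bit block numbers per indirect block
    return 12 + ptrs + ptrs**2 + ptrs**3


def max_block_map_file_bytes(block_size: int) -> int:
    """File size limit, capped at 2^32 blocks by the field size."""
    return min(max_block_map_blocks(block_size), 1 << 32) * block_size


for kib in (1, 2, 4, 64):
    size = kib * 1024
    print(f"{kib}KiB: {max_block_map_blocks(size):,} blocks")
```

The 64KiB column is the only one where the 2^32 cap applies, which is why its file size limit equals the extent-based limit.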
73
Documentation/filesystems/ext4/ondisk/checksums.rst
Normal file
@ -0,0 +1,73 @@
.. SPDX-License-Identifier: GPL-2.0

Checksums
---------

Starting in early 2012, metadata checksums were added to all major ext4
and jbd2 data structures. The associated feature flag is metadata\_csum.
The desired checksum algorithm is indicated in the superblock, though as
of October 2012 the only supported algorithm is crc32c. Some data
structures did not have space to fit a full 32-bit checksum, so only the
lower 16 bits are stored. Enabling the 64bit feature increases the data
structure size so that full 32-bit checksums can be stored for many data
structures. However, existing 32-bit filesystems cannot be extended to
enable 64bit mode, at least not without the experimental resize2fs
patches to do so.

Existing filesystems can have checksumming added by running
``tune2fs -O metadata_csum`` against the underlying device. If tune2fs
encounters directory blocks that lack sufficient empty space to add a
checksum, it will request that you run ``e2fsck -D`` to have the
directories rebuilt with checksums. This has the added benefit of
removing slack space from the directory files and rebalancing the htree
indexes. If you *ignore* this step, your directories will not be
protected by a checksum!

The following table describes the data elements that go into each type
of checksum. The checksum function is whatever the superblock describes
(crc32c as of October 2013) unless noted otherwise.

.. list-table::
   :widths: 1 1 4
   :header-rows: 1

   * - Metadata
     - Length
     - Ingredients
   * - Superblock
     - \_\_le32
     - The entire superblock up to the checksum field. The UUID lives inside
       the superblock.
   * - MMP
     - \_\_le32
     - UUID + the entire MMP block up to the checksum field.
   * - Extended Attributes
     - \_\_le32
     - UUID + the entire extended attribute block. The checksum field is set to
       zero.
   * - Directory Entries
     - \_\_le32
     - UUID + inode number + inode generation + the directory block up to the
       fake entry enclosing the checksum field.
   * - HTREE Nodes
     - \_\_le32
     - UUID + inode number + inode generation + all valid extents + HTREE tail.
       The checksum field is set to zero.
   * - Extents
     - \_\_le32
     - UUID + inode number + inode generation + the entire extent block up to
       the checksum field.
   * - Bitmaps
     - \_\_le32 or \_\_le16
     - UUID + the entire bitmap. Checksums are stored in the group descriptor,
       and truncated if the group descriptor size is 32 bytes (i.e. ^64bit).
   * - Inodes
     - \_\_le32
     - UUID + inode number + inode generation + the entire inode. The checksum
       field is set to zero. Each inode has its own checksum.
   * - Group Descriptors
     - \_\_le16
     - If metadata\_csum, then UUID + group number + the entire descriptor;
       else if gdt\_csum, then crc16(UUID + group number + the entire
       descriptor). In all cases, only the lower 16 bits are stored.
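
To illustrate the “lower 16 bits” truncation, the sketch below implements the standard CRC-32C (Castagnoli) bit by bit and masks the result. It is not the kernel's checksum routine: the seeding, finalization, and exact concatenation of UUID and other ingredients used by ext4 are not reproduced here, so the output will not match on-disk values. The function names are illustrative.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Standard CRC-32C (reflected polynomial 0x82F63B78), bitwise."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def truncate_to_le16(checksum: int) -> int:
    """Keep only the lower 16 bits, as done for small checksum fields."""
    return checksum & 0xFFFF


full = crc32c(b"123456789")  # conventional CRC test vector
print(hex(full), hex(truncate_to_le16(full)))
```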
18
Documentation/filesystems/ext4/ondisk/eainode.rst
Normal file
@ -0,0 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0

Large Extended Attribute Values
-------------------------------

To enable ext4 to store extended attribute values that do not fit in the
inode or in the single extended attribute block attached to an inode,
the EA\_INODE feature allows us to store the value in the data blocks of
a regular file inode. This “EA inode” is linked only from the extended
attribute name index and must not appear in a directory entry. The
inode's i\_atime field is used to store a checksum of the xattr value,
and i\_ctime/i\_version store a 64-bit reference count, which enables
sharing of large xattr values between multiple owning inodes. For
backward compatibility with older versions of this feature, the
i\_mtime/i\_generation *may* store a back-reference to the inode number
and i\_generation of the **one** owning inode (in cases where the EA
inode is not referenced by multiple inodes) to verify that the EA inode
is the correct one being accessed.
@ -4,3 +4,4 @@
Data Structures and Algorithms
==============================
.. include:: about.rst
.. include:: overview.rst
37
Documentation/filesystems/ext4/ondisk/inlinedata.rst
Normal file
@ -0,0 +1,37 @@
.. SPDX-License-Identifier: GPL-2.0

Inline Data
-----------

The inline data feature was designed to handle the case where a file's
data is so tiny that it readily fits inside the inode, which
(theoretically) reduces disk block consumption and reduces seeks. If the
file is smaller than 60 bytes, then the data are stored inline in
``inode.i_block``. If the rest of the file would fit inside the extended
attribute space, then it might be found as an extended attribute
“system.data” within the inode body (“ibody EA”). This of course
constrains the number of extended attributes one can attach to an inode.
If the data size increases beyond i\_block + ibody EA, a regular block
is allocated and the contents moved to that block.
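
The placement rule can be sketched as a simple split. In this example ``ibody_ea_capacity`` is a hypothetical parameter standing in for whatever room is left in the in-inode extended attribute area, and the function name is made up; the real limit depends on the inode size and on the other attributes already stored.

```python
I_BLOCK_BYTES = 60  # size of inode.i_block


def place_inline_data(data: bytes, ibody_ea_capacity: int):
    """Split file data between i_block and the "system.data" EA.

    Returns (i_block part, EA part), or None when the data no longer
    fits inline and a regular block would have to be allocated.
    """
    if len(data) > I_BLOCK_BYTES + ibody_ea_capacity:
        return None
    return data[:I_BLOCK_BYTES], data[I_BLOCK_BYTES:]


print(place_inline_data(b"x" * 40, 100))          # fits in i_block alone
print(place_inline_data(b"x" * 200, 100) is None)  # too large to stay inline
```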

Pending a change to compact the extended attribute key used to store
inline data, one ought to be able to store 160 bytes of data in a
256-byte inode (as of June 2015, when i\_extra\_isize is 28). Prior to
that, the limit was 156 bytes due to inefficient use of inode space.

The inline data feature requires the presence of an extended attribute
for “system.data”, even if the attribute value is zero length.

Inline Directories
~~~~~~~~~~~~~~~~~~

The first four bytes of i\_block are the inode number of the parent
directory. Following that is a 56-byte space for an array of directory
entries; see ``struct ext4_dir_entry``. If there is a “system.data”
attribute in the inode body, the EA value is an array of
``struct ext4_dir_entry`` as well. Note that for inline directories, the
i\_block and EA space are treated as separate dirent blocks; directory
entries cannot span the two.

Inline directory entries are not checksummed, as the inode checksum
should protect all inline data contents.
26
Documentation/filesystems/ext4/ondisk/overview.rst
Normal file
@ -0,0 +1,26 @@
.. SPDX-License-Identifier: GPL-2.0

High Level Design
=================

An ext4 file system is split into a series of block groups. To reduce
performance difficulties due to fragmentation, the block allocator tries
very hard to keep each file's blocks within the same group, thereby
reducing seek times. The size of a block group is specified in
``sb.s_blocks_per_group`` blocks, though it can also be calculated as 8 \*
``block_size_in_bytes``. With the default block size of 4KiB, each group
will contain 32,768 blocks, for a length of 128MiB. The number of block
groups is the size of the device divided by the size of a block group.
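
A minimal sketch of this geometry calculation. Rounding the group count up, so that a trailing partial group is counted, is an assumption of the example, as are the function name and the 1TiB device size.

```python
def block_group_geometry(block_size: int, device_size: int) -> dict:
    """Derive default block group geometry from the block size."""
    blocks_per_group = 8 * block_size  # bits in one bitmap block
    group_bytes = blocks_per_group * block_size
    # Assumption: a final partial group still counts as a group.
    group_count = -(-device_size // group_bytes)
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": group_bytes,
        "group_count": group_count,
    }


geometry = block_group_geometry(block_size=4096, device_size=1 << 40)
print(geometry)
```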

All fields in ext4 are written to disk in little-endian order. HOWEVER,
all fields in jbd2 (the journal) are written to disk in big-endian
order.
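
The practical consequence is that the same four bytes decode differently depending on which structure they came from. A small sketch using Python's ``struct`` module follows; the helper names and sample bytes are illustrative.

```python
import struct


def read_ext4_u32(raw: bytes) -> int:
    """Decode a 32-bit ext4 field (little-endian on disk)."""
    return struct.unpack("<I", raw)[0]


def read_jbd2_u32(raw: bytes) -> int:
    """Decode a 32-bit jbd2 journal field (big-endian on disk)."""
    return struct.unpack(">I", raw)[0]


raw = bytes([0x12, 0x34, 0x56, 0x78])
print(hex(read_ext4_u32(raw)))
print(hex(read_jbd2_u32(raw)))
```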

.. include:: blocks.rst
.. include:: blockgroup.rst
.. include:: special_inodes.rst
.. include:: allocators.rst
.. include:: checksums.rst
.. include:: bigalloc.rst
.. include:: inlinedata.rst
.. include:: eainode.rst
38
Documentation/filesystems/ext4/ondisk/special_inodes.rst
Normal file
@ -0,0 +1,38 @@
.. SPDX-License-Identifier: GPL-2.0

Special inodes
--------------

ext4 reserves some inodes for special features, as follows:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - inode Number
     - Purpose
   * - 0
     - Doesn't exist; there is no inode 0.
   * - 1
     - List of defective blocks.
   * - 2
     - Root directory.
   * - 3
     - User quota.
   * - 4
     - Group quota.
   * - 5
     - Boot loader.
   * - 6
     - Undelete directory.
   * - 7
     - Reserved group descriptors inode. (“resize inode”)
   * - 8
     - Journal inode.
   * - 9
     - The “exclude” inode, for snapshots(?)
   * - 10
     - Replica inode, used for some non-upstream feature?
   * - 11
     - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock.