s/seperate/separate Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
| 
 | |
| The LogFS Flash Filesystem
 | |
| ==========================
 | |
| 
 | |
| Specification
 | |
| =============
 | |
| 
 | |
| Superblocks
 | |
| -----------
 | |
| 
 | |
| Two superblocks exist at the beginning and end of the filesystem.
 | |
| Each superblock is 256 Bytes large, with another 3840 Bytes reserved
 | |
| for future purposes, making a total of 4096 Bytes.
 | |
| 
 | |
| Superblock locations may differ for MTD and block devices.  On MTD the
 | |
| first non-bad block contains a superblock in the first 4096 Bytes and
 | |
| the last non-bad block contains a superblock in the last 4096 Bytes.
 | |
| On block devices, the first 4096 Bytes of the device contain the first
 | |
| superblock and the last aligned 4096 Byte-block contains the second
 | |
| superblock.
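As a rough illustration of the block device case above, the following
Python sketch computes where the two 4096-byte superblock areas would
sit.  The function name is hypothetical, and reading "last aligned
4096 Byte-block" as "the last block that starts on a 4096-byte
boundary and fits entirely on the device" is an assumption.

```python
SUPERBLOCK_AREA = 4096  # 256 bytes of superblock plus 3840 reserved bytes


def bdev_superblock_offsets(device_size: int) -> tuple[int, int]:
    """Return the byte offsets of the first and second superblock area.

    Hypothetical helper, not taken from the LogFS sources.
    """
    if device_size < 2 * SUPERBLOCK_AREA:
        raise ValueError("device too small for two superblock areas")
    first = 0
    # Count the complete 4096-byte blocks on the device and take the
    # start of the last one.
    second = (device_size // SUPERBLOCK_AREA - 1) * SUPERBLOCK_AREA
    return first, second
```

A device whose size is not a multiple of 4096 Bytes would leave a few
unused bytes after the second superblock under this reading.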
 | |
| 
 | |
| For the most part, the superblocks can be considered read-only.  They
 | |
| are written only to correct errors detected within the superblocks,
 | |
| move the journal and change the filesystem parameters through tunefs.
 | |
| As a result, the superblock does not contain any fields that require
 | |
| constant updates, like the amount of free space, etc.
 | |
| 
 | |
| Segments
 | |
| --------
 | |
| 
 | |
| The space in the device is split up into equal-sized segments.
 | |
| Segments are the primary write unit of LogFS.  Within each segment,
 | |
| writes happen from front (low addresses) to back (high addresses).  If
 | |
| only a partial segment has been written, the segment number, the
 | |
| current position within and optionally a write buffer are stored in
 | |
| the journal.
 | |
| 
 | |
| Segments are erased as a whole.  Therefore Garbage Collection may be
 | |
| required to completely free a segment before doing so.
 | |
| 
 | |
| Journal
 | |
| -------
 | |
| 
 | |
| The journal contains all global information about the filesystem that
 | |
| is subject to frequent change.  At mount time, it has to be scanned
 | |
| for the most recent commit entry, which contains a list of pointers to
 | |
| all currently valid entries.
 | |
| 
 | |
| Object Store
 | |
| ------------
 | |
| 
 | |
| All space except for the superblocks and journal is part of the object
 | |
| store.  Each segment contains a segment header and a number of
 | |
| objects, each consisting of the object header and the payload.
 | |
| Objects are either inodes, directory entries (dentries), file data
 | |
| blocks or indirect blocks.
 | |
| 
 | |
| Levels
 | |
| ------
 | |
| 
 | |
| Garbage collection (GC) may fail if all data is written
 | |
| indiscriminately.  One requirement of GC is that data is separated
 | |
| roughly according to the distance between the tree root and the data.
 | |
| Effectively that means all file data is on level 0, indirect blocks
 | |
| are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
 | |
| respectively.  Inode file data is on level 6 for the inodes and 7-11
 | |
| for indirect blocks.
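The level assignment described above can be sketched in a few lines of
Python.  The function name and its parameters are hypothetical; only
the level numbers come from the text.

```python
def gc_level(indirection: int, in_inode_file: bool = False) -> int:
    """Map a block to its level.

    indirection is 0 for leaf blocks (file data, or inodes in the
    inode file) and 1..5 for 1x..5x indirect blocks.
    """
    if not 0 <= indirection <= 5:
        raise ValueError("indirection must be between 0 and 5")
    # Blocks of the inode file use levels 6-11, all others levels 0-5.
    return indirection + (6 if in_inode_file else 0)
```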
 | |
| 
 | |
| Each segment contains objects of a single level only.  As a result,
 | |
| each level requires its own separate segment to be open for writing.
 | |
| 
 | |
| Inode File
 | |
| ----------
 | |
| 
 | |
| All inodes are stored in a special file, the inode file.  The single
 | |
| exception is the inode file's inode (master inode) which for obvious
 | |
| reasons is stored in the journal instead.  Instead of data blocks, the
 | |
| leaf nodes of the inode file are inodes.
 | |
| 
 | |
| Aliases
 | |
| -------
 | |
| 
 | |
| Writes in LogFS are done by means of a wandering tree.  A naïve
 | |
| implementation would require that for each write of a block, all
 | |
| parent blocks are written as well, since the block pointers have
 | |
| changed.  Such an implementation would not be very efficient.
 | |
| 
 | |
| In LogFS, the block pointer changes are cached in the journal by means
 | |
| of alias entries.  Each alias consists of its logical address - inode
 | |
| number, block index, level and child number (index into block) - and
 | |
| the changed data.  Any 8-byte word can be changed in this manner.
 | |
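To make the alias idea more concrete, here is a minimal Python sketch
of an alias entry and of applying it to a cached block.  The class,
the field names and the big-endian byte order are assumptions made for
illustration; they are not taken from the on-medium format.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    ino: int          # inode number
    block_index: int  # block index within the file
    level: int        # level of the block being patched
    child_no: int     # index of the 8-byte word within the block
    value: int        # the changed 8-byte word


def apply_alias(block: bytearray, alias: Alias) -> None:
    """Overwrite one 8-byte word of a block with the aliased value."""
    offset = alias.child_no * 8
    # ">Q" packs an unsigned 64-bit integer in big-endian order
    # (the byte order is an assumption of this sketch).
    struct.pack_into(">Q", block, offset, alias.value)
```

Only the single word named by child_no is touched, so the parent block
itself does not have to be rewritten until the alias is written back.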
| 
 | |
| Currently aliases are used for block pointers, file size, file used
 | |
| bytes and the height of an inode's indirect tree.
 | |
| 
 | |
| Segment Aliases
 | |
| ---------------
 | |
| 
 | |
| Related to regular aliases, these are used to handle bad blocks.
 | |
| Initially, bad blocks are handled by moving the affected segment
 | |
| content to a spare segment and noting this move in the journal with a
 | |
| segment alias, a simple (to, from) tuple.  GC will later empty this
 | |
| segment and the alias can be removed again.  This is used on MTD only.
 | |
| 
 | |
| Vim
 | |
| ---
 | |
| 
 | |
| By cleverly predicting the lifetime of data, it is possible to
 | |
| separate long-living data from short-living data and thereby reduce
 | |
| the GC overhead later.  Each type of distinct life expectancy (vim) can
 | |
| have a separate segment open for writing.  Each (level, vim) tuple can
 | |
| be open just once.  If an open segment with unknown vim is encountered
 | |
| at mount time, it is closed and ignored henceforth.
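The mount-time rule above might look roughly like the following Python
sketch.  The set of known vims, the entry layout and the function name
are all hypothetical.

```python
KNOWN_VIMS = {0, 1}  # hypothetical set of vims this implementation knows


def restore_open_segments(entries):
    """Rebuild the table of open segments from journal entries.

    entries is an iterable of (segment number, level, vim) tuples.
    """
    open_segments = {}
    for segno, level, vim in entries:
        if vim not in KNOWN_VIMS:
            # Unknown vim: treat the segment as closed and ignore it.
            continue
        key = (level, vim)
        if key in open_segments:
            raise ValueError("(level, vim) tuple opened twice")
        open_segments[key] = segno
    return open_segments
```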
 | |
| 
 | |
| Indirect Tree
 | |
| -------------
 | |
| 
 | |
| Inodes in LogFS are similar to FFS-style filesystems with direct and
 | |
| indirect block pointers.  One difference is that LogFS uses a single
 | |
| indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
 | |
| A height field in the inode defines the height of the indirect tree
 | |
| and thereby the indirection of the pointer.
 | |
| 
 | |
| Another difference is the addressing of indirect blocks.  In LogFS,
 | |
| the first 16 pointers in the first indirect block are left empty,
 | |
| corresponding to the 16 direct pointers in the inode.  In ext2 (maybe
 | |
| others as well) the first pointer in the first indirect block
 | |
| corresponds to logical block 12, skipping the 12 direct pointers.
 | |
| So where ext2 is using arithmetic to better utilize space, LogFS keeps
 | |
| arithmetic simple and uses compression to save space.
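The difference in addressing can be shown with a small Python sketch.
The function names are hypothetical, and the pointers-per-block
defaults (512 for LogFS, 1024 for ext2) are assumptions based on
4096-byte blocks with 8-byte and 4-byte pointers respectively.

```python
LOGFS_DIRECT = 16  # direct pointers in a LogFS inode
EXT2_DIRECT = 12   # direct pointers in an ext2 inode


def logfs_first_indirect_slot(logical_block, pointers_per_block=512):
    """Slot in the first indirect block; slots 0-15 stay empty."""
    if not LOGFS_DIRECT <= logical_block < pointers_per_block:
        raise ValueError("block is not in the first indirect block")
    return logical_block


def ext2_first_indirect_slot(logical_block, pointers_per_block=1024):
    """Slot in the first indirect block; slot 0 is logical block 12."""
    if not EXT2_DIRECT <= logical_block < EXT2_DIRECT + pointers_per_block:
        raise ValueError("block is not in the first indirect block")
    return logical_block - EXT2_DIRECT
```

In the LogFS case the slot number is simply the logical block number,
at the cost of 16 unused pointers that compression is expected to
shrink away.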
 | |
| 
 | |
| Compression
 | |
| -----------
 | |
| 
 | |
| Both file data and metadata can be compressed.  Compression for file
 | |
| data can be enabled with chattr +c and disabled with chattr -c.  Doing
 | |
| so has no effect on existing data, but new data will be stored
 | |
| accordingly.  New inodes will inherit the compression flag of the
 | |
| parent directory.
 | |
| 
 | |
| Metadata is always compressed.  However, the space accounting ignores
 | |
| this and charges for the uncompressed size.  Failing to do so could
 | |
| result in GC failures when, after moving some data, indirect blocks
 | |
| compress worse than previously.  Even on a 100% full medium, GC may
 | |
| not consume any extra space, so the compression gains are lost space
 | |
| to the user.
 | |
| 
 | |
| However, they are not lost space to the filesystem internals.  By
 | |
| cheating the user out of those bytes, the filesystem gained some slack
 | |
| space and GC will run less often and faster.
 | |
| 
 | |
| Garbage Collection and Wear Leveling
 | |
| ------------------------------------
 | |
| 
 | |
| Garbage collection is invoked whenever the number of free segments
 | |
| falls below a threshold.  The best (known) candidate is picked based
 | |
| on the least amount of valid data contained in the segment.  All
 | |
| remaining valid data is copied elsewhere, thereby invalidating it.
 | |
| 
 | |
| The GC code also checks for aliases and writes them back if their
 | |
| number gets too large.
 | |
| 
 | |
| Wear leveling is done by occasionally picking a suboptimal segment for
 | |
| garbage collection.  If a stale segment's erase count is significantly
 | |
| lower than the active segments' erase counts, it will be picked.  Wear
 | |
| leveling is rate limited, so it will never monopolize the device for
 | |
| more than one segment worth at a time.
 | |
| 
 | |
| Values for "occasionally" and "significantly lower" are compile-time
 | |
| constants.
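The candidate selection and the wear leveling rule could be sketched
as follows in Python.  This is an illustration under stated
assumptions, not the actual GC code: the two constants, the Segment
class and the function are hypothetical, and the constant values are
made up.

```python
from dataclasses import dataclass

WEAR_LEVEL_INTERVAL = 8  # hypothetical value for "occasionally"
ERASE_COUNT_GAP = 100    # hypothetical value for "significantly lower"


@dataclass
class Segment:
    segno: int
    valid_bytes: int
    erase_count: int


def pick_gc_candidate(segments, gc_run, active_erase_count):
    """Pick the next segment to garbage collect."""
    if not segments:
        raise ValueError("no candidate segments")
    if gc_run % WEAR_LEVEL_INTERVAL == 0:
        # Occasionally prefer a rarely erased segment over the
        # cheapest one, so that it rejoins the rotation.
        coldest = min(segments, key=lambda s: s.erase_count)
        if coldest.erase_count + ERASE_COUNT_GAP < active_erase_count:
            return coldest
    # Normal case: the segment with the least valid data to copy.
    return min(segments, key=lambda s: s.valid_bytes)
```

Returning at most one wear leveling candidate per call mirrors the
rate limit of one segment at a time.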
 | |
| 
 | |
| Hashed directories
 | |
| ------------------
 | |
| 
 | |
| To satisfy efficient lookup(), directory entries are hashed and
 | |
| located based on the hash.  In order to both support large directories
 | |
| and not be overly inefficient for small directories, several hash
 | |
| tables of increasing size are used.  For each table, the hash value
 | |
| modulo the table size gives the table index.
 | |
| 
 | |
| Table sizes are chosen to limit the number of indirect blocks with a
 | |
| fully populated table to 0, 1, 2 or 3 respectively.  So the first
 | |
| table contains 16 entries, the second 512-16, etc.
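A minimal Python sketch of the lookup positions follows.  It uses only
the two table sizes stated above and assumes, purely for illustration,
that the tables are laid out back to back in the directory's block
index space.  The names are hypothetical.

```python
TABLE_SIZES = [16, 512 - 16]  # only the two sizes given in the text


def candidate_positions(hash_value, table_sizes=TABLE_SIZES):
    """Return one candidate position per table for a name hash."""
    positions = []
    base = 0  # first position of the current table
    for size in table_sizes:
        # The hash modulo the table size gives the index in the table.
        positions.append(base + hash_value % size)
        base += size
    return positions
```

A lookup would probe these positions in order, so small directories
never need to touch more than the first table.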
 | |
| 
 | |
| The last table is special in several ways.  First, its size depends on
 | |
| the effective 32bit limit on telldir/seekdir cookies.  Since logfs
 | |
| uses the upper half of the address space for indirect blocks, the size
 | |
| is limited to 2^31.  Second, the table contains hash buckets with 16
 | |
| entries each.
 | |
| 
 | |
| Using single-entry buckets would result in birthday "attacks".  At
 | |
| just 2^16 used entries, hash collisions would be likely (P >= 0.5).
 | |
| My math skills are insufficient to do the combinatorics for the 17x
 | |
| collisions necessary to overflow a bucket, but testing showed that in
 | |
| 10,000 runs the lowest directory fill before a bucket overflow was
 | |
| 188,057,130 entries with an average of 315,149,915 entries.  So for
 | |
| directory sizes of up to a million, bucket overflows should be
 | |
| virtually impossible under normal circumstances.
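The single-entry case can be checked with the usual birthday
approximation P = 1 - exp(-n(n-1)/(2N)).  The Python sketch below
applies it to n = 2^16 entries in N = 2^31 slots; the function name is
hypothetical and the formula is an approximation, not an exact count.

```python
import math


def collision_probability(entries: int, slots: int) -> float:
    """Approximate probability that at least two entries collide."""
    return 1.0 - math.exp(-entries * (entries - 1) / (2.0 * slots))


p = collision_probability(2**16, 2**31)
print(f"P(collision) = {p:.3f}")
```

The result is about 0.63, consistent with the P >= 0.5 stated above.
The 17-fold collisions needed to overflow a 16-entry bucket are not
covered by this formula.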
 | |
| 
 | |
| With carefully chosen filenames, it is obviously possible to cause an
 | |
| overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
 | |
| there may be a security concern if a malicious user has write access
 | |
| to a directory.
 | |
| 
 | |
| Open For Discussion
 | |
| ===================
 | |
| 
 | |
| Device Address Space
 | |
| --------------------
 | |
| 
 | |
| A device address space is used for caching.  Both block devices and
 | |
| MTD provide functions to either read a single page or write a segment.
 | |
| Partial segments may be written for data integrity, but where possible
 | |
| complete segments are written for performance on simple block device
 | |
| flash media.
 | |
| 
 | |
| Meta Inodes
 | |
| -----------
 | |
| 
 | |
| Inodes are stored in the inode file, which is just a regular file for
 | |
| most purposes.  At umount time, however, the inode file needs to
 | |
| remain open until all dirty inodes are written.  So
 | |
| generic_shutdown_super() may not close this inode, but shouldn't
 | |
| complain about remaining inodes due to the inode file either.  The same
 | |
| goes for the mapping inode of the device address space.
 | |
| 
 | |
| Currently logfs uses a hack that essentially copies part of fs/inode.c
 | |
| code over.  A general solution would be preferred.
 | |
| 
 | |
| Indirect block mapping
 | |
| ----------------------
 | |
| 
 | |
| With compression, the block device (or mapping inode) cannot be used
 | |
| to cache indirect blocks.  Some other place is required.  Currently
 | |
| logfs uses the top half of each inode's address space.  The low 8TB
 | |
| (on 32bit) are filled with file data, the high 8TB are used for
 | |
| indirect blocks.
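The 8TB figures follow from simple arithmetic, shown here as a Python
sketch.  The 4096-byte page size and the 32-bit page index are
assumptions, and the names are hypothetical.

```python
PAGE_SIZE = 4096      # assumed page size in bytes
PAGE_INDEX_BITS = 32  # assumed width of the page index on 32bit

ADDRESS_SPACE = PAGE_SIZE << PAGE_INDEX_BITS  # 16 TiB per inode
INDIRECT_BASE = ADDRESS_SPACE // 2            # upper half starts at 8 TiB


def is_indirect_offset(offset: int) -> bool:
    """True if a byte offset falls into the half used for indirect blocks."""
    if not 0 <= offset < ADDRESS_SPACE:
        raise ValueError("offset outside the inode's address space")
    return offset >= INDIRECT_BASE
```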
 | |
| 
 | |
| One problem is that 16TB files created on 64bit systems actually have
 | |
| data in the top 8TB.  But files >16TB would cause problems anyway, so
 | |
| only the limit has changed.
 |