docs: fs: convert docs without extension to ReST

There are 3 remaining files without an extension inside the fs docs
dir.

Manually convert them to ReST.

In the case of the nfs/exporting.rst file, as the nfs docs
aren't ported yet, I opted to convert and add a :orphan: there,
with should be removed when it gets added into a nfs-specific
part of the fs documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This commit is contained in:
Mauro Carvalho Chehab 2019-07-26 09:51:27 -03:00 committed by Jonathan Corbet
parent 5a5e045bb3
commit ec23eb54fb
11 changed files with 225 additions and 119 deletions

View File

@ -1,12 +1,17 @@
Locking scheme used for directory operations is based on two =================
Directory Locking
=================
Locking scheme used for directory operations is based on two
kinds of locks - per-inode (->i_rwsem) and per-filesystem kinds of locks - per-inode (->i_rwsem) and per-filesystem
(->s_vfs_rename_mutex). (->s_vfs_rename_mutex).
When taking the i_rwsem on multiple non-directory objects, we When taking the i_rwsem on multiple non-directory objects, we
always acquire the locks in order by increasing address. We'll call always acquire the locks in order by increasing address. We'll call
that "inode pointer" order in the following. that "inode pointer" order in the following.
For our purposes all operations fall in 5 classes: For our purposes all operations fall in 5 classes:
1) read access. Locking rules: caller locks directory we are accessing. 1) read access. Locking rules: caller locks directory we are accessing.
The lock is taken shared. The lock is taken shared.
@ -27,25 +32,29 @@ NB: we might get away with locking the the source (and target in exchange
case) shared. case) shared.
5) link creation. Locking rules: 5) link creation. Locking rules:
* lock parent * lock parent
* check that source is not a directory * check that source is not a directory
* lock source * lock source
* call the method. * call the method.
All locks are exclusive. All locks are exclusive.
6) cross-directory rename. The trickiest in the whole bunch. Locking 6) cross-directory rename. The trickiest in the whole bunch. Locking
rules: rules:
* lock the filesystem * lock the filesystem
* lock parents in "ancestors first" order. * lock parents in "ancestors first" order.
* find source and target. * find source and target.
* if old parent is equal to or is a descendent of target * if old parent is equal to or is a descendent of target
fail with -ENOTEMPTY fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source * if new parent is equal to or is a descendent of source
fail with -ELOOP fail with -ELOOP
* If it's an exchange, lock both the source and the target. * If it's an exchange, lock both the source and the target.
* If the target exists, lock it. If the source is a non-directory, * If the target exists, lock it. If the source is a non-directory,
lock it. If we need to lock both, do so in inode pointer order. lock it. If we need to lock both, do so in inode pointer order.
* call the method. * call the method.
All ->i_rwsem are taken exclusive. Again, we might get away with locking All ->i_rwsem are taken exclusive. Again, we might get away with locking
the the source (and target in exchange case) shared. the the source (and target in exchange case) shared.
@ -54,10 +63,11 @@ read, modified or removed by method will be locked by caller.
If no directory is its own ancestor, the scheme above is deadlock-free. If no directory is its own ancestor, the scheme above is deadlock-free.
Proof: Proof:
First of all, at any moment we have a partial ordering of the First of all, at any moment we have a partial ordering of the
objects - A < B iff A is an ancestor of B. objects - A < B iff A is an ancestor of B.
That ordering can change. However, the following is true: That ordering can change. However, the following is true:
@ -77,32 +87,32 @@ objects - A < B iff A is an ancestor of B.
non-directory object, except renames, which take locks on source and non-directory object, except renames, which take locks on source and
target in inode pointer order in the case they are not directories.) target in inode pointer order in the case they are not directories.)
Now consider the minimal deadlock. Each process is blocked on Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's attempt to acquire some lock and already holds at least one lock. Let's
consider the set of contended locks. First of all, filesystem lock is consider the set of contended locks. First of all, filesystem lock is
not contended, since any process blocked on it is not holding any locks. not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_rwsem. Thus all processes are blocked on ->i_rwsem.
By (3), any process holding a non-directory lock can only be By (3), any process holding a non-directory lock can only be
waiting on another non-directory lock with a larger address. Therefore waiting on another non-directory lock with a larger address. Therefore
the process holding the "largest" such lock can always make progress, and the process holding the "largest" such lock can always make progress, and
non-directory objects are not included in the set of contended locks. non-directory objects are not included in the set of contended locks.
Thus link creation can't be a part of deadlock - it can't be Thus link creation can't be a part of deadlock - it can't be
blocked on source and it means that it doesn't hold any locks. blocked on source and it means that it doesn't hold any locks.
Any contended object is either held by cross-directory rename or Any contended object is either held by cross-directory rename or
has a child that is also contended. Indeed, suppose that it is held by has a child that is also contended. Indeed, suppose that it is held by
operation other than cross-directory rename. Then the lock this operation operation other than cross-directory rename. Then the lock this operation
is blocked on belongs to child of that object due to (1). is blocked on belongs to child of that object due to (1).
It means that one of the operations is cross-directory rename. It means that one of the operations is cross-directory rename.
Otherwise the set of contended objects would be infinite - each of them Otherwise the set of contended objects would be infinite - each of them
would have a contended child and we had assumed that no object is its would have a contended child and we had assumed that no object is its
own descendent. Moreover, there is exactly one cross-directory rename own descendent. Moreover, there is exactly one cross-directory rename
(see above). (see above).
Consider the object blocking the cross-directory rename. One Consider the object blocking the cross-directory rename. One
of its descendents is locked by cross-directory rename (otherwise we of its descendents is locked by cross-directory rename (otherwise we
would again have an infinite set of contended objects). But that would again have an infinite set of contended objects). But that
means that cross-directory rename is taking locks out of order. Due means that cross-directory rename is taking locks out of order. Due
@ -112,7 +122,7 @@ try to acquire lock on descendent before the lock on ancestor.
Contradiction. I.e. deadlock is impossible. Q.E.D. Contradiction. I.e. deadlock is impossible. Q.E.D.
These operations are guaranteed to avoid loop creation. Indeed, These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename. the only operation that could introduce loops is cross-directory rename.
Since the only new (parent, child) pair added by rename() is (new parent, Since the only new (parent, child) pair added by rename() is (new parent,
source), such loop would have to contain these objects and the rest of it source), such loop would have to contain these objects and the rest of it
@ -123,13 +133,13 @@ new parent had been equal to or a descendent of source since the moment when
we had acquired filesystem lock and rename() would fail with -ELOOP in that we had acquired filesystem lock and rename() would fail with -ELOOP in that
case. case.
While this locking scheme works for arbitrary DAGs, it relies on While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current ability to check that directory is a descendent of another object. Current
implementation assumes that directory graph is a tree. This assumption is implementation assumes that directory graph is a tree. This assumption is
also preserved by all operations (cross-directory rename on a tree that would also preserved by all operations (cross-directory rename on a tree that would
not introduce a cycle will leave it a tree and link() fails for directories). not introduce a cycle will leave it a tree and link() fails for directories).
Notice that "directory" in the above == "anything that might have Notice that "directory" in the above == "anything that might have
children", so if we are going to introduce hybrid objects we will need children", so if we are going to introduce hybrid objects we will need
either to make sure that link(2) doesn't work for them or to make changes either to make sure that link(2) doesn't work for them or to make changes
in is_subdir() that would make it work even in presence of such beasts. in is_subdir() that would make it work even in presence of such beasts.

View File

@ -20,6 +20,8 @@ algorithms work.
path-lookup path-lookup
api-summary api-summary
splice splice
locking
directory-locking
Filesystem support layers Filesystem support layers
========================= =========================

View File

@ -1,14 +1,22 @@
The text below describes the locking rules for VFS-related methods. =======
Locking
=======
The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/ instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases in the end of this file. etc. At the very least, put the list of dubious cases in the end of this file.
Don't turn it into log - maintainers of out-of-the-tree code are supposed to Don't turn it into log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1). be able to use diff(1).
Thing currently missing here: socket operations. Alexey?
--------------------------- dentry_operations -------------------------- Thing currently missing here: socket operations. Alexey?
prototypes:
dentry_operations
=================
prototypes::
int (*d_revalidate)(struct dentry *, unsigned int); int (*d_revalidate)(struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int); int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *); int (*d_hash)(const struct dentry *, struct qstr *);
@ -24,23 +32,30 @@ prototypes:
struct dentry *(*d_real)(struct dentry *, const struct inode *); struct dentry *(*d_real)(struct dentry *, const struct inode *);
locking rules: locking rules:
rename_lock ->d_lock may block rcu-walk
d_revalidate: no no yes (ref-walk) maybe
d_weak_revalidate:no no yes no
d_hash no no no maybe
d_compare: yes no no maybe
d_delete: no yes no no
d_init: no no yes no
d_release: no no yes no
d_prune: no yes no no
d_iput: no no yes no
d_dname: no no no no
d_automount: no no yes no
d_manage: no no yes (ref-walk) maybe
d_real no no yes no
--------------------------- inode_operations --------------------------- ================== =========== ======== ============== ========
prototypes: ops rename_lock ->d_lock may block rcu-walk
================== =========== ======== ============== ========
d_revalidate: no no yes (ref-walk) maybe
d_weak_revalidate: no no yes no
d_hash no no no maybe
d_compare: yes no no maybe
d_delete: no yes no no
d_init: no no yes no
d_release: no no yes no
d_prune: no yes no no
d_iput: no no yes no
d_dname: no no no no
d_automount: no no yes no
d_manage: no no yes (ref-walk) maybe
d_real no no yes no
================== =========== ======== ============== ========
inode_operations
================
prototypes::
int (*create) (struct inode *,struct dentry *,umode_t, bool); int (*create) (struct inode *,struct dentry *,umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *); int (*link) (struct dentry *,struct inode *,struct dentry *);
@ -68,7 +83,10 @@ prototypes:
locking rules: locking rules:
all may block all may block
i_rwsem(inode)
============ =============================================
ops i_rwsem(inode)
============ =============================================
lookup: shared lookup: shared
create: exclusive create: exclusive
link: exclusive (both) link: exclusive (both)
@ -89,17 +107,21 @@ fiemap: no
update_time: no update_time: no
atomic_open: exclusive atomic_open: exclusive
tmpfile: no tmpfile: no
============ =============================================
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
exclusive on victim. exclusive on victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
See Documentation/filesystems/directory-locking for more detailed discussion See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations. of the locking scheme for directory operations.
----------------------- xattr_handler operations ----------------------- xattr_handler operations
prototypes: ========================
prototypes::
bool (*list)(struct dentry *dentry); bool (*list)(struct dentry *dentry);
int (*get)(const struct xattr_handler *handler, struct dentry *dentry, int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
struct inode *inode, const char *name, void *buffer, struct inode *inode, const char *name, void *buffer,
@ -110,13 +132,20 @@ prototypes:
locking rules: locking rules:
all may block all may block
i_rwsem(inode)
===== ==============
ops i_rwsem(inode)
===== ==============
list: no list: no
get: no get: no
set: exclusive set: exclusive
===== ==============
super_operations
================
prototypes::
--------------------------- super_operations ---------------------------
prototypes:
struct inode *(*alloc_inode)(struct super_block *sb); struct inode *(*alloc_inode)(struct super_block *sb);
void (*free_inode)(struct inode *); void (*free_inode)(struct inode *);
void (*destroy_inode)(struct inode *); void (*destroy_inode)(struct inode *);
@ -138,7 +167,10 @@ prototypes:
locking rules: locking rules:
All may block [not true, see below] All may block [not true, see below]
s_umount
====================== ============ ========================
ops s_umount note
====================== ============ ========================
alloc_inode: alloc_inode:
free_inode: called from RCU callback free_inode: called from RCU callback
destroy_inode: destroy_inode:
@ -157,6 +189,7 @@ show_options: no (namespace_sem)
quota_read: no (see below) quota_read: no (see below)
quota_write: no (see below) quota_write: no (see below)
bdev_try_to_free_page: no (see below) bdev_try_to_free_page: no (see below)
====================== ============ ========================
->statfs() has s_umount (shared) when called by ustat(2) (native or ->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin compat), but that's an accident of bad API; s_umount is used to pin
@ -164,31 +197,44 @@ the superblock down when we only have dev_t given us by userland to
identify the superblock. Everything else (statfs(), fstatfs(), etc.) identify the superblock. Everything else (statfs(), fstatfs(), etc.)
doesn't hold it when calling ->statfs() - superblock is pinned down doesn't hold it when calling ->statfs() - superblock is pinned down
by resolving the pathname passed to syscall. by resolving the pathname passed to syscall.
->quota_read() and ->quota_write() functions are both guaranteed to ->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking writes to quota files with quotas on). For other details about locking
see also dquot_operations section. see also dquot_operations section.
->bdev_try_to_free_page is called from the ->releasepage handler of ->bdev_try_to_free_page is called from the ->releasepage handler of
the block device inode. See there for more details. the block device inode. See there for more details.
--------------------------- file_system_type --------------------------- file_system_type
prototypes: ================
prototypes::
struct dentry *(*mount) (struct file_system_type *, int, struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *); const char *, void *);
void (*kill_sb) (struct super_block *); void (*kill_sb) (struct super_block *);
locking rules: locking rules:
may block
======= =========
ops may block
======= =========
mount yes mount yes
kill_sb yes kill_sb yes
======= =========
->mount() returns ERR_PTR or the root dentry; its superblock should be locked ->mount() returns ERR_PTR or the root dentry; its superblock should be locked
on return. on return.
->kill_sb() takes a write-locked superblock, does all shutdown work on it, ->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference. unlocks and drops the reference.
--------------------------- address_space_operations -------------------------- address_space_operations
prototypes: ========================
prototypes::
int (*writepage)(struct page *page, struct writeback_control *wbc); int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *); int (*readpage)(struct file *, struct page *);
int (*writepages)(struct address_space *, struct writeback_control *); int (*writepages)(struct address_space *, struct writeback_control *);
@ -218,14 +264,16 @@ prototypes:
locking rules: locking rules:
All except set_page_dirty and freepage may block All except set_page_dirty and freepage may block
PageLocked(page) i_rwsem ====================== ======================== =========
ops PageLocked(page) i_rwsem
====================== ======================== =========
writepage: yes, unlocks (see below) writepage: yes, unlocks (see below)
readpage: yes, unlocks readpage: yes, unlocks
writepages: writepages:
set_page_dirty no set_page_dirty no
readpages: readpages:
write_begin: locks the page exclusive write_begin: locks the page exclusive
write_end: yes, unlocks exclusive write_end: yes, unlocks exclusive
bmap: bmap:
invalidatepage: yes invalidatepage: yes
releasepage: yes releasepage: yes
@ -239,17 +287,18 @@ is_partially_uptodate: yes
error_remove_page: yes error_remove_page: yes
swap_activate: no swap_activate: no
swap_deactivate: no swap_deactivate: no
====================== ======================== =========
->write_begin(), ->write_end() and ->readpage() may be called from ->write_begin(), ->write_end() and ->readpage() may be called from
the request handler (/dev/loop). the request handler (/dev/loop).
->readpage() unlocks the page, either synchronously or via I/O ->readpage() unlocks the page, either synchronously or via I/O
completion. completion.
->readpages() populates the pagecache with the passed pages and starts ->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion. I/O against them. They come unlocked upon I/O completion.
->writepage() is used for two purposes: for "memory cleansing" and for ->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ "sync". These are quite different operations and the behaviour may differ
depending upon the mode. depending upon the mode.
@ -297,70 +346,81 @@ will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data. in the filesystem like having dirty inodes at umount and losing written data.
->writepages() is used for periodic writeback and for syscall-initiated ->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least sync operations. The address_space should start I/O against at least
*nr_to_write pages. *nr_to_write must be decremented for each page which is ``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
written. The address_space implementation may write more (or less) pages which is written. The address_space implementation may write more (or less)
than *nr_to_write asks for, but it should try to be reasonably close. If pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
nr_to_write is NULL, all dirty pages must be written. If nr_to_write is NULL, all dirty pages must be written.
writepages should _only_ write pages which are present on writepages should _only_ write pages which are present on
mapping->io_pages. mapping->io_pages.
->set_page_dirty() is called from various places in the kernel ->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback. It may be called when the target page is marked as needing writeback. It may be called
under spinlock (it cannot block) and is sometimes called with the page under spinlock (it cannot block) and is sometimes called with the page
not locked. not locked.
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please, filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers. keep it that way and don't breed new callers.
->invalidatepage() is called when the filesystem must attempt to drop ->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses returns zero on success. If ->invalidatepage is zero, the kernel uses
block_invalidatepage() instead. block_invalidatepage() instead.
->releasepage() is called when the kernel is about to try to drop the ->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it. It returns zero to buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero, indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers. the kernel assumes that the fs has no private interest in the buffers.
->freepage() is called when the kernel is done dropping the page ->freepage() is called when the kernel is done dropping the page
from the page cache. from the page cache.
->launder_page() may be called prior to releasing a page if ->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked getting mapped back in and redirtied, it needs to be kept locked
across the entire operation. across the entire operation.
->swap_activate will be called with a non-zero argument on ->swap_activate will be called with a non-zero argument on
files backing (non block device backed) swapfiles. A return value files backing (non block device backed) swapfiles. A return value
of zero indicates success, in which case this file can be used for of zero indicates success, in which case this file can be used for
backing swapspace. The swapspace operations will be proxied to the backing swapspace. The swapspace operations will be proxied to the
address space operations. address space operations.
->swap_deactivate() will be called in the sys_swapoff() ->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success. path after ->swap_activate() returned success.
----------------------- file_lock_operations ------------------------------ file_lock_operations
prototypes: ====================
prototypes::
void (*fl_copy_lock)(struct file_lock *, struct file_lock *); void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
void (*fl_release_private)(struct file_lock *); void (*fl_release_private)(struct file_lock *);
locking rules: locking rules:
inode->i_lock may block
=================== ============= =========
ops inode->i_lock may block
=================== ============= =========
fl_copy_lock: yes no fl_copy_lock: yes no
fl_release_private: maybe maybe[1] fl_release_private: maybe maybe[1]_
=================== ============= =========
[1]: ->fl_release_private for flock or POSIX locks is currently allowed .. [1]:
to block. Leases however can still be freed while the i_lock is held and ->fl_release_private for flock or POSIX locks is currently allowed
so fl_release_private called on a lease should not block. to block. Leases however can still be freed while the i_lock is held and
so fl_release_private called on a lease should not block.
lock_manager_operations
=======================
prototypes::
----------------------- lock_manager_operations ---------------------------
prototypes:
void (*lm_notify)(struct file_lock *); /* unblock callback */ void (*lm_notify)(struct file_lock *); /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int); int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */ void (*lm_break)(struct file_lock *); /* break_lease callback */
@ -368,24 +428,33 @@ prototypes:
locking rules: locking rules:
inode->i_lock blocked_lock_lock may block ========== ============= ================= =========
ops inode->i_lock blocked_lock_lock may block
========== ============= ================= =========
lm_notify: yes yes no lm_notify: yes yes no
lm_grant: no no no lm_grant: no no no
lm_break: yes no no lm_break: yes no no
lm_change yes no no lm_change yes no no
========== ============= ================= =========
buffer_head
===========
prototypes::
--------------------------- buffer_head -----------------------------------
prototypes:
void (*b_end_io)(struct buffer_head *bh, int uptodate); void (*b_end_io)(struct buffer_head *bh, int uptodate);
locking rules: locking rules:
called from interrupts. In other words, extreme care is needed here.
called from interrupts. In other words, extreme care is needed here.
bh is locked, but that's all warranties we have here. Currently only RAID1, bh is locked, but that's all warranties we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
call this method upon the IO completion. call this method upon the IO completion.
--------------------------- block_device_operations ----------------------- block_device_operations
prototypes: =======================
prototypes::
int (*open) (struct block_device *, fmode_t); int (*open) (struct block_device *, fmode_t);
int (*release) (struct gendisk *, fmode_t); int (*release) (struct gendisk *, fmode_t);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
@ -399,7 +468,10 @@ prototypes:
void (*swap_slot_free_notify) (struct block_device *, unsigned long); void (*swap_slot_free_notify) (struct block_device *, unsigned long);
locking rules: locking rules:
bd_mutex
======================= ===================
ops bd_mutex
======================= ===================
open: yes open: yes
release: yes release: yes
ioctl: no ioctl: no
@ -410,6 +482,7 @@ unlock_native_capacity: no
revalidate_disk: no revalidate_disk: no
getgeo: no getgeo: no
swap_slot_free_notify: no (see below) swap_slot_free_notify: no (see below)
======================= ===================
media_changed, unlock_native_capacity and revalidate_disk are called only from media_changed, unlock_native_capacity and revalidate_disk are called only from
check_disk_change(). check_disk_change().
@ -418,8 +491,11 @@ swap_slot_free_notify is called with swap_lock and sometimes the page lock
held. held.
--------------------------- file_operations ------------------------------- file_operations
prototypes: ===============
prototypes::
loff_t (*llseek) (struct file *, loff_t, int); loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
@ -455,7 +531,6 @@ prototypes:
size_t, unsigned int); size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **); int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *, int, loff_t, loff_t); long (*fallocate)(struct file *, int, loff_t, loff_t);
};
locking rules: locking rules:
All may block. All may block.
@ -490,8 +565,11 @@ in sys_read() and friends.
the lease within the individual filesystem to record the result of the the lease within the individual filesystem to record the result of the
operation operation
--------------------------- dquot_operations ------------------------------- dquot_operations
prototypes: ================
prototypes::
int (*write_dquot) (struct dquot *); int (*write_dquot) (struct dquot *);
int (*acquire_dquot) (struct dquot *); int (*acquire_dquot) (struct dquot *);
int (*release_dquot) (struct dquot *); int (*release_dquot) (struct dquot *);
@ -503,20 +581,26 @@ a proper locking wrt the filesystem and call the generic quota operations.
What filesystem should expect from the generic quota functions: What filesystem should expect from the generic quota functions:
FS recursion Held locks when called ============== ============ =========================
ops FS recursion Held locks when called
============== ============ =========================
write_dquot: yes dqonoff_sem or dqptr_sem write_dquot: yes dqonoff_sem or dqptr_sem
acquire_dquot: yes dqonoff_sem or dqptr_sem acquire_dquot: yes dqonoff_sem or dqptr_sem
release_dquot: yes dqonoff_sem or dqptr_sem release_dquot: yes dqonoff_sem or dqptr_sem
mark_dirty: no - mark_dirty: no -
write_info: yes dqonoff_sem write_info: yes dqonoff_sem
============== ============ =========================
FS recursion means calling ->quota_read() and ->quota_write() from superblock FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations. operations.
More details about quota locking can be found in fs/dquot.c. More details about quota locking can be found in fs/dquot.c.
--------------------------- vm_operations_struct ----------------------------- vm_operations_struct
prototypes: ====================
prototypes::
void (*open)(struct vm_area_struct*); void (*open)(struct vm_area_struct*);
void (*close)(struct vm_area_struct*); void (*close)(struct vm_area_struct*);
vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *); vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
@ -525,7 +609,10 @@ prototypes:
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules: locking rules:
mmap_sem PageLocked(page)
============= ======== ===========================
ops mmap_sem PageLocked(page)
============= ======== ===========================
open: yes open: yes
close: yes close: yes
fault: yes can return with page locked fault: yes can return with page locked
@ -533,8 +620,9 @@ map_pages: yes
page_mkwrite: yes can return with page locked page_mkwrite: yes can return with page locked
pfn_mkwrite: yes pfn_mkwrite: yes
access: yes access: yes
============= ======== ===========================
->fault() is called when a previously not present pte is about ->fault() is called when a previously not present pte is about
to be faulted in. The filesystem must find and return the page associated to be faulted in. The filesystem must find and return the page associated
with the passed in "pgoff" in the vm_fault structure. If it is possible that with the passed in "pgoff" in the vm_fault structure. If it is possible that
the page may be truncated and/or invalidated, then the filesystem must lock the page may be truncated and/or invalidated, then the filesystem must lock
@ -542,7 +630,7 @@ the page, then ensure it is not already truncated (the page lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page. locked. The VM will unlock the page.
->map_pages() is called when VM asks to map easy accessible pages. ->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff" Filesystem should find and map pages associated with offsets from "start_pgoff"
till "end_pgoff". ->map_pages() is called with page table locked and must till "end_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking, not block. If it's not possible to reach a page without blocking,
@ -551,25 +639,26 @@ page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets "pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte". should be calculated relative to "pte".
->page_mkwrite() is called when a previously read-only pte is ->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are about to become writeable. The filesystem again must ensure that there are
no truncate/invalidate races, and then return with the page locked. If no truncate/invalidate races, and then return with the page locked. If
the page has been truncated, the filesystem should not look up a new page the page has been truncated, the filesystem should not look up a new page
like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
will cause the VM to retry the fault. will cause the VM to retry the fault.
->pfn_mkwrite() is the same as page_mkwrite but when the pte is ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
after this call is to make the pte read-write, unless pfn_mkwrite returns after this call is to make the pte read-write, unless pfn_mkwrite returns
an error. an error.
->access() is called when get_user_pages() fails in ->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for /proc/pid/mem or ptrace. This function is needed only for
VM_IO | VM_PFNMAP VMAs. VM_IO | VM_PFNMAP VMAs.
================================================================================ --------------------------------------------------------------------------------
Dubious stuff Dubious stuff
(if you break something or notice that it is broken and do not fix it yourself (if you break something or notice that it is broken and do not fix it yourself

View File

@ -1,3 +1,4 @@
:orphan:
Making Filesystems Exportable Making Filesystems Exportable
============================= =============================
@ -42,9 +43,9 @@ filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access. the dcache that are not needed for normal filesystem access.
1/ The dcache must sometimes contain objects that are not part of the 1. The dcache must sometimes contain objects that are not part of the
proper prefix. i.e that are not connected to the root. proper prefix. i.e that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory 2. The dcache must be prepared for a newly found (via ->lookup) directory
to already have a (non-connected) dentry, and must be able to move to already have a (non-connected) dentry, and must be able to move
that dentry into place (based on the parent and name in the that dentry into place (based on the parent and name in the
->lookup). This is particularly needed for directories as ->lookup). This is particularly needed for directories as
@ -52,7 +53,7 @@ the dcache that are not needed for normal filesystem access.
To implement these features, the dcache has: To implement these features, the dcache has:
a/ A dentry flag DCACHE_DISCONNECTED which is set on a. A dentry flag DCACHE_DISCONNECTED which is set on
any dentry that might not be part of the proper prefix. any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper dentry is noticed to be a child of a dentry which is in the proper
@ -71,48 +72,52 @@ a/ A dentry flag DCACHE_DISCONNECTED which is set on
dentries. That guarantees that we won't need to hunt them down upon dentries. That guarantees that we won't need to hunt them down upon
umount. umount.
b/ A primitive for creation of secondary roots - d_obtain_root(inode). b. A primitive for creation of secondary roots - d_obtain_root(inode).
Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the
per-superblock list (->s_roots), so they can be located at umount per-superblock list (->s_roots), so they can be located at umount
time for eviction purposes. time for eviction purposes.
c/ Helper routines to allocate anonymous dentries, and to help attach c. Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are: loose directory dentries at lookup time. They are:
d_obtain_alias(inode) will return a dentry for the given inode. d_obtain_alias(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned. If the inode already has a dentry, one of those is returned.
If it doesn't, a new anonymous (IS_ROOT and If it doesn't, a new anonymous (IS_ROOT and
DCACHE_DISCONNECTED) dentry is allocated and attached. DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry In the case of a directory, care is taken that only one dentry
can ever be attached. can ever be attached.
d_splice_alias(inode, dentry) will introduce a new dentry into the tree; d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
either the passed-in dentry or a preexisting alias for the given inode either the passed-in dentry or a preexisting alias for the given inode
(such as an anonymous one created by d_obtain_alias), if appropriate. (such as an anonymous one created by d_obtain_alias), if appropriate.
It returns NULL when the passed-in dentry is used, following the calling It returns NULL when the passed-in dentry is used, following the calling
convention of ->lookup. convention of ->lookup.
Filesystem Issues Filesystem Issues
----------------- -----------------
For a filesystem to be exportable it must: For a filesystem to be exportable it must:
1/ provide the filehandle fragment routines described below. 1. provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add 2. make sure that d_splice_alias is used rather than d_add
when ->lookup finds an inode for a given parent and name. when ->lookup finds an inode for a given parent and name.
If inode is NULL, d_splice_alias(inode, dentry) is equivalent to If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::
d_add(dentry, inode), NULL d_add(dentry, inode), NULL
Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
Typically the ->lookup routine will simply end with a: Typically the ->lookup routine will simply end with a::
return d_splice_alias(inode, dentry); return d_splice_alias(inode, dentry);
} }
A file system implementation declares that instances of the filesystem A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations" super_block. This field must point to a "struct export_operations"
struct which has the following members: struct which has the following members:

View File

@ -20,7 +20,7 @@ kernel which allows different filesystem implementations to coexist.
VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
are called from a process context. Filesystem locking is described in are called from a process context. Filesystem locking is described in
the document Documentation/filesystems/Locking. the document Documentation/filesystems/locking.rst.
Directory Entry Cache (dcache) Directory Entry Cache (dcache)

View File

@ -24,7 +24,7 @@
*/ */
/* /*
* See Documentation/filesystems/nfs/Exporting * See Documentation/filesystems/nfs/exporting.rst
* and examples in fs/exportfs * and examples in fs/exportfs
* *
* Since cifs is a network file system, an "fsid" must be included for * Since cifs is a network file system, an "fsid" must be included for

View File

@ -7,7 +7,7 @@
* and for mapping back from file handles to dentries. * and for mapping back from file handles to dentries.
* *
* For details on why we do all the strange and hairy things in here * For details on why we do all the strange and hairy things in here
* take a look at Documentation/filesystems/nfs/Exporting. * take a look at Documentation/filesystems/nfs/exporting.rst.
*/ */
#include <linux/exportfs.h> #include <linux/exportfs.h>
#include <linux/fs.h> #include <linux/fs.h>

View File

@ -10,7 +10,7 @@
* *
* The following files are helpful: * The following files are helpful:
* *
* Documentation/filesystems/nfs/Exporting * Documentation/filesystems/nfs/exporting.rst
* fs/exportfs/expfs.c. * fs/exportfs/expfs.c.
*/ */

View File

@ -555,7 +555,7 @@ static int orangefs_fsync(struct file *file,
* Change the file pointer position for an instance of an open file. * Change the file pointer position for an instance of an open file.
* *
* \note If .llseek is overriden, we must acquire lock as described in * \note If .llseek is overriden, we must acquire lock as described in
* Documentation/filesystems/Locking. * Documentation/filesystems/locking.rst.
* *
* Future upgrade could support SEEK_DATA and SEEK_HOLE but would * Future upgrade could support SEEK_DATA and SEEK_HOLE but would
* require much changes to the FS * require much changes to the FS

View File

@ -151,7 +151,7 @@ struct dentry_operations {
/* /*
* Locking rules for dentry_operations callbacks are to be found in * Locking rules for dentry_operations callbacks are to be found in
* Documentation/filesystems/Locking. Keep it updated! * Documentation/filesystems/locking.rst. Keep it updated!
* *
* FUrther descriptions are found in Documentation/filesystems/vfs.rst. * FUrther descriptions are found in Documentation/filesystems/vfs.rst.
* Keep it updated too! * Keep it updated too!

View File

@ -139,7 +139,7 @@ struct fid {
* @get_parent: find the parent of a given directory * @get_parent: find the parent of a given directory
* @commit_metadata: commit metadata changes to stable storage * @commit_metadata: commit metadata changes to stable storage
* *
* See Documentation/filesystems/nfs/Exporting for details on how to use * See Documentation/filesystems/nfs/exporting.rst for details on how to use
* this interface correctly. * this interface correctly.
* *
* encode_fh: * encode_fh: