linux/Documentation/filesystems
Nick Piggin b827e496c8 mm: close page_mkwrite races
Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty.  This allows the filesystem to have
full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite.  It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.  The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window.  This provides the filesystem with a strong
synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
..
caching CacheFiles: Fix the documentation to use the correct credential pointer names 2009-04-24 13:28:30 -07:00
configfs docsrc: build Documentation/ sources 2008-08-12 16:07:30 -07:00
pohmelfs Staging: Pohmelfs: Added IO permissions and priorities. 2009-04-17 11:06:30 -07:00
9p.txt
00-INDEX nilfs2: add document 2009-04-07 08:31:12 -07:00
adfs.txt
affs.txt
afs.txt
autofs4-mount-control.txt autofs4: device node ioctl documentation 2008-10-16 11:21:39 -07:00
automount-support.txt
befs.txt
bfs.txt remove mention of CONFIG_KMOD from documentation 2008-07-22 19:24:29 +10:00
btrfs.txt Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING 2009-01-07 09:54:24 -05:00
cifs.txt
coda.txt
cramfs.txt
dentry-locking.txt
devpts.txt Document usage of multiple-instances of devpts 2009-01-02 10:19:36 -08:00
directory-locking
dlmfs.txt
dnotify.txt Documentation: move dnotify.txt to filesystems/ 2008-02-07 08:42:17 -08:00
ecryptfs.txt
exofs.txt exofs: Documentation 2009-03-31 19:44:38 +03:00
Exporting
ext2.txt trivial: fix orphan dates in ext2 documentation 2009-03-23 14:21:26 -07:00
ext3.txt trivial: document ext3 semantics of 'ro' option a bit better 2009-03-30 15:21:56 +02:00
ext4.txt ext4: Regularize mount options 2009-03-28 10:59:57 -04:00
fiemap.txt vfs: vfs-level fiemap interface 2008-10-08 19:44:18 -04:00
files.txt fix f_count description in Documentation/filesystems/files.txt 2008-12-31 18:07:42 -05:00
fuse.txt
gfs2-glocks.txt [GFS2] Glock documentation 2008-06-27 09:39:53 +01:00
gfs2.txt
hfs.txt
hfsplus.txt
hpfs.txt
inotify.txt
isofs.txt isofs: implement dmode option 2008-02-08 09:22:38 -08:00
jfs.txt
knfsd-stats.txt Document /proc/fs/nfsd/pool_stats 2009-03-27 19:24:27 -04:00
Locking mm: close page_mkwrite races 2009-05-02 15:36:09 -07:00
locks.txt
mandatory-locking.txt
ncpfs.txt
nfs41-server.txt nfsd41: Documentation/filesystems/nfs41-server.txt 2009-04-03 17:41:24 -07:00
nfs-rdma.txt update port number in NFS/RDMA documentation 2009-01-27 17:20:14 -05:00
nfsroot.txt doc: typo in Documentation/filesystems/nfsroot.txt 2008-10-16 11:21:31 -07:00
nilfs2.txt nilfs2: clean up sketch file 2009-04-07 08:31:19 -07:00
ntfs.txt NTFS: update homepage 2008-09-02 19:21:37 -07:00
ocfs2.txt ocfs2: add mount option and Kconfig option for acl 2009-01-05 08:36:52 -08:00
omfs.txt omfs: add filesystem documentation 2008-07-26 12:00:05 -07:00
porting iget: remove iget() and the read_inode() super op as being obsolete 2008-02-07 08:42:29 -08:00
proc.txt documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls 2009-04-02 19:04:53 -07:00
quota.txt quota: documentation for sending "below quota" messages via netlink and tiny doc update 2008-08-12 16:07:27 -07:00
ramfs-rootfs-initramfs.txt Trivial Documentation/filesystems/ramfs-rootfs-initramfs.txt fix 2008-11-30 11:40:56 -08:00
relay.txt relay: add buffer-only channels; useful for early logging 2008-07-26 12:00:04 -07:00
romfs.txt
rpc-cache.txt Documentation: move rpc-cache.txt to filesystems/ 2008-04-11 13:20:52 -06:00
seq_file.txt Document seq_path_root() 2008-04-25 11:56:37 -06:00
sharedsubtree.txt Documentation: move sharedsubtrees.txt to filesystems/ 2008-02-07 08:42:17 -08:00
smbfs.txt
spufs.txt
squashfs.txt Squashfs: fix documentation typo, Cramfs filesystem limit is 256 MiB 2009-03-05 00:40:13 +00:00
sysfs-pci.txt PCI: Introduce /sys/bus/pci/devices/.../remove 2009-03-20 14:58:48 -07:00
sysfs.txt PATCH [2/2] Documentation/filesystems/sysfs.txt: fix descriptions of device attributes 2009-02-22 09:28:15 -08:00
sysv-fs.txt
tmpfs.txt mempolicy: update NUMA memory policy documentation 2008-04-28 08:58:19 -07:00
ubifs.txt UBIFS: remove fast unmounting 2009-01-29 16:34:30 +02:00
udf.txt udf: implement mode and dmode mounting options 2009-04-02 12:29:50 +02:00
ufs.txt
vfat.txt fat: Fix ATTR_RO for directory 2008-11-06 15:41:21 -08:00
vfs.txt Documentation/filesystems: remove out of date reference to BKL being held 2009-04-20 23:01:16 -04:00
xfs.txt [XFS] remove restricted chown parameter from xfs linux 2008-10-30 18:30:09 +11:00
xip.txt DOC: update xip method info 2008-11-12 17:17:17 -08:00