linux/fs
Dave Chinner d3bc815afb xfs: punch new delalloc blocks out of failed writes inside EOF.
When a partial write inside EOF fails, it can leave delayed
allocation blocks lying around because they don't get punched back
out. This leads to assert failures like:

XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847

when evicting inodes from the cache. This can be trivially triggered
by xfstests 083, which takes between 5 and 15 executions on a 512
byte block size filesystem to trip over this. Debugging shows a
failed write due to ENOSPC calling xfs_vm_write_failed such as:

[ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae

and no action is taken on it. This leaves behind a delayed
allocation extent that has no page covering it and no data in it:

[ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
[ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
[ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
[ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
                                    ^^^^^^^^^^^^^^^
[ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
[ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa

So the delayed allocation extent is one block long at offset
0x16c00. Tracing shows that a bigger write:

xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags

allocates the block, and then fails with ENOSPC trying to allocate
the last block on the page, leading to a failed write with stale
delalloc blocks on it.

Because we've had an ENOSPC when trying to allocate 0x16e00, it
means that we are never goinge to call ->write_end on the page and
so the allocated new buffer will not get marked dirty or have the
buffer_new state cleared. In other works, what the above write is
supposed to end up with is this mapping for the page:

    +------+------+------+------+------+------+------+------+
      UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL

where:  U = uptodate
        M = mapped
        N = new
        A = allocated
        D = delalloc
        FAIL = block we ENOSPC'd on.

and the key point being the buffer_new() state for the newly
allocated delayed allocation block. Except it doesn't - we're not
marking buffers new correctly.

That buffer_new() problem goes back to the xfs_iomap removal days,
where xfs_iomap() used to return a "new" status for any map with
newly allocated blocks, so that __xfs_get_blocks() could call
set_buffer_new() on it. We still have the "new" variable and the
check for it in the set_buffer_new() logic - except we never set it
now!

Hence that newly allocated delalloc block doesn't have the new flag
set on it, so when the write fails we cannot tell which blocks we
are supposed to punch out. WHy do we need the buffer_new flag? Well,
that's because we can have this case:

    +------+------+------+------+------+------+------+------+
      UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL

where all the UMD buffers contain valid data from a previously
successful write() system call. We only want to punch the UND buffer
because that's the only one that we added in this write and it was
only this write that failed.

That implies that even the old buffer_new() logic was wrong -
because it would result in all those UMD buffers on the page having
set_buffer_new() called on them even though they aren't new. Hence
we shoul donly be calling set_buffer_new() for delalloc buffers that
were allocated (i.e. were a hole before xfs_iomap_write_delay() was
called).

So, fix this set_buffer_new logic according to how we need it to
work for handling failed writes correctly. Also, restore the new
buffer logic handling for blocks allocated via
xfs_iomap_write_direct(), because it should still set the buffer_new
flag appropriately for newly allocated blocks, too.

SO, now we have the buffer_new() being set appropriately in
__xfs_get_blocks(), we can detect the exact delalloc ranges that
we allocated in a failed write, and hence can now do a walk of the
buffers on a page to find them.

Except, it's not that easy. When block_write_begin() fails, it
unlocks and releases the page that we just had an error on, so we
can't use that page to handle errors anymore. We have to get access
to the page while it is still locked to walk the buffers. Hence we
have to open code block_write_begin() in xfs_vm_write_begin() to be
able to insert xfs_vm_write_failed() is the right place.

With that, we can pass the page and write range to
xfs_vm_write_failed() and walk the buffers on the page, looking for
delalloc buffers that are either new or beyond EOF and punch them
out. Handling buffers beyond EOF ensures we still handle the
existing case that xfs_vm_write_failed() handles.

Of special note is the truncate_pagecache() handling - that only
should be done for pages outside EOF - pages within EOF can still
contain valid, dirty data so we must not punch them out of the
cache.

That just leaves the xfs_vm_write_end() failure handling.
The only failure case here is that we didn't copy the entire range,
and generic_write_end() handles that by zeroing the region of the
page that wasn't copied, we don't have to punch out blocks within
the file because they are guaranteed to contain zeros. Hence we only
have to handle the existing "beyond EOF" case and don't need access
to the buffers on the page. Hence it remains largely unchanged.

Note that xfs_getbmap() can still trip over delalloc blocks beyond
EOF that are left there by speculative delayed allocation. Hence
this bug fix does not solve all known issues with bmap vs delalloc,
but it does fix all the the known accidental occurances of the
problem.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:36 -05:00
..
9p 9p changes for the 3.4 merge window 2012-03-28 09:58:38 -07:00
adfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
affs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
afs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
autofs4 Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
befs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
bfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2012-03-30 12:44:29 -07:00
cachefiles switch touch_atime to struct path 2012-03-20 21:29:41 -04:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2012-03-28 10:01:29 -07:00
cifs Fix UNC parsing on mount 2012-04-03 20:46:09 -05:00
coda Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
configfs make configfs_pin_fs() return root dentry on success 2012-03-20 21:29:48 -04:00
cramfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
debugfs simple_open: automatically convert to simple_open() 2012-04-05 15:25:50 -07:00
devpts Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
dlm simple_open: automatically convert to simple_open() 2012-04-05 15:25:50 -07:00
ecryptfs ecryptfs: make register_filesystem() the last potential failure exit 2012-03-20 21:29:49 -04:00
efs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
exofs Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd 2012-03-28 20:04:27 -07:00
exportfs
ext2 migrate ext2_fs.h guts to fs/ext2/ext2.h 2012-03-31 16:03:16 -04:00
ext3 ext3: move headers to fs/ext3/ 2012-03-31 16:03:16 -04:00
ext4 Revert "ext4: don't release page refs in ext4_end_bio()" 2012-03-29 17:00:56 -07:00
fat fat: fix bug in enforcing Long File Name length 2012-03-23 16:58:40 -07:00
freevxfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
fscache
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
gfs2 get rid of pointless includes of ext2_fs.h 2012-03-31 16:03:15 -04:00
hfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
hfsplus hfsplus: add an ioctl to bless files 2012-03-20 21:29:53 -04:00
hostfs Merge branch 'for-linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml 2012-03-27 18:29:53 -07:00
hpfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
hppfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
hugetlbfs hugetlbfs: remove unregister_filesystem() when initializing module 2012-04-05 15:25:50 -07:00
isofs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
jbd Power management updates for 3.4 2012-03-21 10:15:51 -07:00
jbd2 Disintegrate and delete asm/system.h 2012-03-28 15:58:21 -07:00
jffs2 MTD merge for 3.4 2012-03-30 17:31:56 -07:00
jfs jfs: mising cleanup on register_filesystem() failure 2012-03-20 21:29:48 -04:00
lockd Merge nfs containerization work from Trond's tree 2012-03-26 11:48:54 -04:00
logfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
minix Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
ncpfs Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
nfs NFS client bugfixes for Linux 3.4 2012-03-28 19:02:35 -07:00
nfs_common
nfsd Merge branch 'for-3.4' of git://linux-nfs.org/~bfields/linux 2012-03-29 14:53:25 -07:00
nilfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
nls
notify fs/notify/notification.c: make subsys_initcall function static 2012-03-23 16:58:31 -07:00
ntfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
ocfs2 get rid of pointless includes of ext2_fs.h 2012-03-31 16:03:15 -04:00
omfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
openpromfs switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
proc Merge branch 'akpm' (Andrew's patch-bomb) 2012-04-05 15:30:34 -07:00
pstore Merge branch 'akpm' (Andrew's patch-bomb) 2012-04-05 15:30:34 -07:00
qnx4 qnx4: new helper - try_extent() 2012-03-20 21:29:52 -04:00
qnx6 fs: initial qnx6fs addition 2012-03-20 21:29:38 -04:00
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2012-03-28 10:00:14 -07:00
ramfs tidy up after d_make_root() conversion 2012-03-20 21:29:37 -04:00
reiserfs Disintegrate and delete asm/system.h 2012-03-28 15:58:21 -07:00
romfs MTD merge for 3.4 2012-03-30 17:31:56 -07:00
squashfs Add an extra mount time sanity check, plus some code cleanups and bug fixes. 2012-03-28 18:05:54 -07:00
sysfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
sysv switch open-coded instances of d_make_root() to new helper 2012-03-20 21:29:35 -04:00
ubifs - Improve error messages 2012-03-23 09:27:40 -07:00
udf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2012-03-28 10:00:14 -07:00
ufs Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
xfs xfs: punch new delalloc blocks out of failed writes inside EOF. 2012-05-14 16:20:36 -05:00
aio.c aio: take final put_ioctx() into callers of io_destroy() 2012-03-31 16:03:15 -04:00
anon_inodes.c anon_inodes: move allocation of anon_inode into ->mount() 2012-03-20 21:29:45 -04:00
attr.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
bad_inode.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
binfmt_aout.c Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
binfmt_elf_fdpic.c Add #includes needed to permit the removal of asm/system.h 2012-03-28 18:30:03 +01:00
binfmt_elf.c Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
binfmt_em86.c __register_binfmt() made void 2012-03-20 21:29:46 -04:00
binfmt_flat.c Disintegrate and delete asm/system.h 2012-03-28 15:58:21 -07:00
binfmt_misc.c magic.h: move some FS magic numbers into magic.h 2012-03-23 16:58:31 -07:00
binfmt_script.c __register_binfmt() made void 2012-03-20 21:29:46 -04:00
binfmt_som.c take removal of PF_FORKNOEXEC to flush_old_exec() 2012-03-20 21:29:51 -04:00
bio-integrity.c fs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:21 +08:00
bio.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
block_dev.c magic.h: move some FS magic numbers into magic.h 2012-03-23 16:58:31 -07:00
buffer.c fs: only send IPI to invalidate LRU BH when needed 2012-03-28 17:14:35 -07:00
char_dev.c char_dev.c: fix up some whitespace errors 2011-12-13 11:18:17 -08:00
compat_binfmt_elf.c
compat_ioctl.c The following text was taken from the original review request: 2012-03-24 10:24:31 -07:00
compat.c Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
dcache.c vfs: fix d_ancestor() case in d_materialize_unique 2012-03-28 09:54:34 -07:00
dcookies.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
direct-io.c Restore direct_io / truncate locking API 2012-02-23 15:56:21 -08:00
drop_caches.c
eventfd.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
eventpoll.c Disintegrate and delete asm/system.h 2012-03-28 15:58:21 -07:00
exec.c Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-04-04 10:04:42 -07:00
fcntl.c Wrap accesses to the fd_sets in struct fdtable 2012-02-19 10:30:52 -08:00
fhandle.c vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb 2012-01-06 23:16:53 -05:00
fifo.c
file_table.c vfs: drop_file_write_access() made static 2012-03-20 21:29:32 -04:00
file.c Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
filesystems.c vfs: convert fs_supers to hlist 2012-01-03 22:52:39 -05:00
fs_struct.c The following text was taken from the original review request: 2012-03-24 10:24:31 -07:00
fs-writeback.c trivial writeback fixes 2012-03-28 10:07:27 -07:00
generic_acl.c
inode.c trim includes in inode.c 2012-03-20 21:29:51 -04:00
internal.h vfs: protect remounting superblock read-only 2012-01-06 23:20:12 -05:00
ioctl.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
ioprio.c block: strip out locking optimization in put_io_context() 2012-02-07 07:51:30 +01:00
Kconfig Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-03-21 13:36:41 -07:00
Kconfig.binfmt fs: binfmt_elf: create Kconfig variable for PIE randomization 2012-01-10 16:30:51 -08:00
libfs.c libfs: add simple_open() 2012-04-05 15:25:50 -07:00
locks.c CIFS: Fix VFS lock usage for oplocked files 2012-04-01 13:54:27 -05:00
Makefile fs: initial qnx6fs addition 2012-03-20 21:29:38 -04:00
mbcache.c
mount.h vfs: keep list of mounts for each superblock 2012-01-06 23:20:12 -05:00
mpage.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
namei.c Make the "word-at-a-time" helper functions more commonly usable 2012-04-06 13:54:56 -07:00
namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-01-08 13:21:22 -08:00
no-block.c
open.c Wrap accesses to the fd_sets in struct fdtable 2012-02-19 10:30:52 -08:00
pipe.c magic.h: move some FS magic numbers into magic.h 2012-03-23 16:58:31 -07:00
pnode.c vfs: switch pnode.h macros to struct mount * 2012-01-03 22:57:11 -05:00
pnode.h vfs: switch pnode.h macros to struct mount * 2012-01-03 22:57:11 -05:00
posix_acl.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
proc_namespace.c vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
read_write.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
read_write.h
readdir.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
select.c Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
seq_file.c The following text was taken from the original review request: 2012-03-24 10:24:31 -07:00
signalfd.c epoll: ep_unregister_pollwait() can use the freed pwq->whead 2012-02-24 11:42:50 -08:00
splice.c tcp: tcp_sendpages() should call tcp_push() once 2012-04-05 19:04:27 -04:00
stack.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
stat.c The following text was taken from the original review request: 2012-03-24 10:24:31 -07:00
statfs.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
super.c The following text was taken from the original review request: 2012-03-24 10:24:31 -07:00
sync.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
timerfd.c
utimes.c
xattr_acl.c fs: reduce the use of module.h wherever possible 2012-02-28 19:31:58 -05:00
xattr.c fs/xattr.c:setxattr(): improve handling of allocation failures 2012-04-05 15:25:50 -07:00