linux

Author	SHA1	Message	Date
Dave Chinner	6ffc4db5de	xfs: page type check in writeback only checks last buffer xfs_is_delayed_page() checks to see if a page has buffers matching the given IO type passed in. It does so by walking the buffer heads on the page and checking if the state flags match the IO type. However, the "acceptable" variable that is calculated is overwritten every time a new buffer is checked. Hence if the first buffer on the page is of the right type, this state is lost if the second buffer is not of the correct type. This means that xfs_aops_discard_page() may not discard delalloc regions when it is supposed to, and xfs_convert_page() may not cluster IO as efficiently as possible. This problem only occurs on filesystems with a block size smaller than page size. Also, rename xfs_is_delayed_page() to xfs_check_page_type() to better describe what it is doing - it is not delalloc specific anymore. The problem was first noticed by Peter Watkins. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:35 -05:00
Dave Chinner	01c84d2dc1	xfs: punch all delalloc blocks beyond EOF on write failure. I've been seeing regular ASSERT failures in xfstests when running fsstress based tests over the past month. xfs_getbmap() has been failing this test: XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) \|\| (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c, line: 5650 where it is encountering a delayed allocation extent after writing all the dirty data to disk and then walking the extent map atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed allocation extents from being created. Test 083 on a 512 byte block size filesystem was used to reproduce the problem, because it only had a 5s run timeand would usually fail every 3-4 runs. This test is exercising ENOSPC behaviour by running fsstress on a nearly full filesystem. The following trace extract shows the final few events on the inode that tripped the assert: xfs_ilock: flags ILOCK_EXCL caller xfs_setfilesize xfs_setfilesize: isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680 file size updated to 0x180000 by IO completion xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_iext_insert: state idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay xfs_get_blocks_alloc: size 0x180000 offset 0x180000 count 512 type startoff 0xc00 startblock -1 blockcount 0x1 xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks delalloc write, adding a single block at offset 0x180000 xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180200 count 512 ENOSPC trying to allocate a dellalloc block at offset 0x180200 xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_get_blocks_alloc: size 0x180000 offset 0x180200 count 512 type startoff 0xc00 startblock -1 blockcount 0x2 And succeeding on retry after flushing dirty inodes. xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512 ENOSPC trying to allocate a dellalloc block at offset 0x180400 xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512 And failing the retry, giving a real ENOSPC error. xfs_ilock: flags ILOCK_EXCL caller xfs_vm_write_failed ^^^^^^^^^^^^^^^^^^^ The smoking gun - the write being failed and cleaning up delalloc blocks beyond EOF allocated by the failed write. xfs_getattr: xfs_ilock: flags IOLOCK_SHARED caller xfs_getbmap xfs_ilock: flags ILOCK_SHARED caller xfs_ilock_map_shared And that's where we died almost immediately afterwards. xfs_bmapi_read() found delalloc extent beyond current file in memory file size. Some debug I added to xfs_getbmap() showed the state just before the assert failure: ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000 start_fsb 0x106, end_fsb 0x638 ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00 ext 0: off 0x1fc, fsb 0x24782, len 0x254 ext 1: off 0x450, fsb 0x40851, len 0x30 ext 2: off 0x480, fsb 0xd99, len 0x1b8 ext 3: off 0x92f, fsb 0x4099a, len 0x3b ext 4: off 0x96d, fsb 0x41844, len 0x98 ext 5: off 0xbf1, fsb 0x408ab, len 0xf which shows that we found a single delalloc block beyond EOF (first line of output) when we were returning the map for a length somewhere around 10^16 bytes long (second line), and the on-disk extents showed they didn't go past EOF (last lines). Further debug added to xfs_vm_write_failed() showed this happened when punching out delalloc blocks beyond the end of the file after the failed write: [ 132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000 [ 132.609573] start_fsb 0xc01, end_fsb 0xc08 It punched the range 0xc01 -> 0xc08, but the range we really need to punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was run on a 512 byte block size filesystem (8 blocks per page). the punch from is 0xc00. So end_fsb is correct, but start_fsb is wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks. Hence we are not punching the delalloc block beyond EOF in the case. The fix is simple - it's a silly off-by-one mistake in calculating the range. It's especially silly because the macro used to calculate the start_fsb already takes into account the case where the inode size is an exact multiple of the filesystem block size... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:22 -05:00
Dave Chinner	507630b29f	xfs: use shared ilock mode for direct IO writes by default For the direct IO write path, we only really need the ilock to be taken in exclusive mode during IO submission if we need to do extent allocation instead of all the time. Change the block mapping code to take the ilock in shared mode for the initial block mapping, and only retake it exclusively when we actually have to perform extent allocations. We were already dropping the ilock for the transaction allocation, so this doesn't introduce new race windows. Based on an earlier patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:21 -05:00
Christoph Hellwig	281627df3e	xfs: log file size updates at I/O completion time Do not use unlogged metadata updates and the VFS dirty bit for updating the file size after writeback. In addition to causing various problems with updates getting delayed for far too long this also drags in the unscalable VFS dirty tracking, and is one of the few remaining unlogged metadata updates. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-13 16:30:49 -05:00
Christoph Hellwig	84803fb782	xfs: log file size updates as part of unwritten extent conversion If we convert and unwritten extent past the current i_size log the size update as part of the extent manipulation transactions instead of doing an unlogged metadata update later. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-05 11:53:16 -06:00
Christoph Hellwig	6923e686f1	xfs: do not require an ioend for new EOF calculation Replace xfs_ioend_new_eof with a new inline xfs_new_eof helper that doesn't require and ioend, and is available also outside of xfs_aops.c. Also make the code a bit more clear by using a normal if statement instead of a slightly misleading MIN(). Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-05 11:19:26 -06:00
Christoph Hellwig	aa6bf01d39	xfs: use per-filesystem I/O completion workqueues The new concurrency managed workqueues are cheap enough that we can create per-filesystem instead of global workqueues. This allows us to remove the trylock or defer scheme on the ilock, which is not helpful once we have outstanding log reservations until finishing a size update. Also allow the default concurrency on this workqueues so that I/O completions blocking on the ilock for one inode do not block process for another inode. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-05 11:07:42 -06:00
Christoph Hellwig	2813d682e8	xfs: remove the i_new_size field in struct xfs_inode Now that we use the VFS i_size field throughout XFS there is no need for the i_new_size field any more given that the VFS i_size field gets updated in ->write_end before unlocking the page, and thus is always uptodate when writeback could see a page. Removing i_new_size also has the advantage that we will never have to trim back di_size during a failed buffered write, given that it never gets updated past i_size. Note that currently the generic direct I/O code only updates i_size after calling our end_io handler, which requires a small workaround to make sure di_size actually makes it to disk. I hope to fix this properly in the generic code. A downside is that we lose the support for parallel non-overlapping O_DIRECT appending writes that recently was added. I don't think keeping the complex and fragile i_new_size infrastructure for this is a good tradeoff - if we really care about parallel appending writers we should investigate turning the iolock into a range lock, which would also allow for parallel non-overlapping buffered writers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:10:19 -06:00
Christoph Hellwig	ce7ae151dd	xfs: remove the i_size field in struct xfs_inode There is no fundamental need to keep an in-memory inode size copy in the XFS inode. We already have the on-disk value in the dinode, and the separate in-memory copy that we need for regular files only in the XFS inode. Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the VFS inode i_size field for regular files. Switch code that was directly accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where we are limited to regular files direct access of the VFS inode i_size field. This also allows dropping some fairly complicated code in the write path which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size that is getting updated inside ->write_end. Note that we do not bother resetting the VFS i_size when truncating a file that gets freed to zero as there is no point in doing so because the VFS inode is no longer in use at this point. Just relax the assert in xfs_ifree to only check the on-disk size instead. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:08:53 -06:00
Christoph Hellwig	810627d9a6	xfs: fix force shutdown handling in xfs_end_io Ensure ioend->io_error gets propagated back to e.g. AIO completions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-11-08 10:48:23 -06:00
Mel Gorman	94054fa3fc	xfs: warn if direct reclaim tries to writeback pages Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Dave Chinner	5c8ed2021f	xfs: introduce xfs_bmapi_read() xfs_bmapi() currently handles both extent map reading and allocation. As a result, the code is littered with "if (wr)" branches to conditionally do allocation operations if required. This makes the code much harder to follow and causes significant indent issues with the code. Given that read mapping is much simpler than allocation, we can split out read mapping from xfs_bmapi() and reuse the logic that we have already factored out do do all the hard work of handling the extent map manipulations. The results in a much simpler function for the common extent read operations, and will allow the allocation code to be simplified in another commit. Once xfs_bmapi_read() is implemented, convert all the callers of xfs_bmapi() that are only reading extents to use the new function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	04f658ee22	xfs: improve ioend error handling Return unwritten extent conversion errors to aio_complete. Skip both unwritten extent conversion and size updates if we had an I/O error or the filesystem has been shut down. Return -EIO to the aio/buffer completion handlers in case of a forced shutdown. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	4a06fd262d	xfs: remove i_iocount We now have an i_dio_count filed and surrounding infrastructure to wait for direct I/O completion instead of i_icount, and we have never needed to iocount waits for buffered I/O given that we only set the page uptodate after finishing all required work. Thus remove i_iocount, and replace the actually needed waits with calls to inode_dio_wait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	fc0063c447	xfs: reduce ioend latency There is no reason to queue up ioends for processing in user context unless we actually need it. Just complete ioends that do not convert unwritten extents or need a size update from the end_io context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c859cdd1da	xfs: defer AIO/DIO completions We really shouldn't complete AIO or DIO requests until we have finished the unwritten extent conversion and size update. This means fsync never has to pick up any ioends as all work has been completed when signalling I/O completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	398d25ef23	xfs: remove dead ENODEV handling in xfs_destroy_ioend No driver returns ENODEV from it bio completion handler, not has this ever been documented. Remove the dead code dealing with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	2d2422aebc	xfs: fix a use after free in xfs_end_io_direct_write There is a window in which the ioend that we call inode_dio_wake on in xfs_end_io_direct_write is already free. Fix this by storing the inode pointer in a local variable. This is a fix for the regression introduced in 3.1-rc by "fs: move inode_dio_done to the end_io handler". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-09-14 08:56:35 -05:00
Christoph Hellwig	c59d87c460	xfs: remove subdirectories Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the annoying subdirectories in the XFS source code. Besides the large amount of file rename the only changes are to the Makefile, a few files including headers with the subdirectory prefix, and the binary sysctl compat code that includes a header under fs/xfs/ from kernel/. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 16:21:35 -05:00

19 Commits