Running a CONFIG_XFS_DEBUG kernel in production environments is not
the best idea as it introduces significant overhead, can change
the behaviour of algorithms (such as allocation) to improve test
coverage, and (most importantly) panic the machine on non-fatal
errors.
There are many cases where all we want to do is run a
kernel with more bounds checking enabled, such as is provided by the
ASSERT() statements throughout the code, but without all the
potential overhead and drawbacks.
This patch converts all the ASSERT statements to evaluate as
WARN_ON(1) statements and hence if they fail dump a warning and a
stack trace to the log. This has minimal overhead and does not
change any algorithms, and will allow us to find strange "out of
bounds" problems more easily on production machines.
There are a few places where assert statements contain debug only
code. These are converted to be debug-or-warn only code so that we
still get all the assert checks in the code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add support for larger btree blocks that contains a CRC32C checksum,
a filesystem uuid and block number for detecting filesystem
consistency and out of place writes.
[dchinner@redhat.com] Also include an owner field to allow reverse
mappings to be implemented for improved repairability and a LSN
field to so that log recovery can easily determine the last
modification that made it to disk for each buffer.
[dchinner@redhat.com] Add buffer log format flags to indicate the
type of buffer to recovery so that we don't have to do blind magic
number tests to determine what the buffer is.
[dchinner@redhat.com] Modified to fit into the verifier structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To separate the verifiers from iodone functions and associate read
and write verifiers at the same time, introduce a buffer verifier
operations structure to the xfs_buf.
This avoids the need for assigning the write verifier, clearing the
iodone function and re-running ioend processing in the read
verifier, and gets rid of the nasty "b_pre_io" name for the write
verifier function pointer. If we ever need to, it will also be
easier to add further content specific callbacks to a buffer with an
ops structure in place.
We also avoid needing to export verifier functions, instead we
can simply export the ops structures for those that are needed
outside the function they are defined in.
This patch also fixes a directory block readahead verifier issue
it exposed.
This patch also adds ops callbacks to the inode/alloc btree blocks
initialised by growfs. These will need more work before they will
work with CRCs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Metadata buffers that are read from disk have write verifiers
already attached to them, but newly allocated buffers do not. Add
appropriate write verifiers to all new metadata buffers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
These verifiers are essentially the same code as the read verifiers,
but do not require ioend processing. Hence factor the read verifier
functions and add a new write verifier wrapper that is used as the
callback.
This is done as one large patch for all verifiers rather than one
patch per verifier as the change is largely mechanical. This
includes hooking up the write verifier via the read verifier
function.
Hooking up the write verifier for buffers obtained via
xfs_trans_get_buf() will be done in a separate patch as that touches
code in many different places rather than just the verifier
functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an btree block verify callback function and pass it into the
buffer read functions. Because each different btree block type
requires different verification, add a function to the ops structure
that is called from the generic code.
Also, propagate the verification callback functions through the
readahead functions, and into the external bmap and bulkstat inode
readahead code that uses the generic btree buffer read functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we free a block from the alloc btree tree, we move it to the
freelist held in the AGFL and mark it busy in the busy extent tree.
This typically happens when we merge btree blocks.
Once the transaction is committed and checkpointed, the block can
remain on the free list for an indefinite amount of time. Now, this
isn't the end of the world at this point - if the free list is
shortened, the buffer is invalidated in the transaction that moves
it back to free space. If the buffer is allocated as metadata from
the free list, then all the modifications getted logged, and we have
no issues, either. And if it gets allocated as userdata direct from
the freelist, it gets invalidated and so will never get written.
However, during the time it sits on the free list, pressure on the
log can cause the AIL to be pushed and the buffer that covers the
block gets pushed for write. IOWs, we end up writing a freed
metadata block to disk. Again, this isn't the end of the world
because we know from the above we are only writing to free space.
The problem, however, is for validation callbacks. If the block was
on old btree root block, then the level of the block is going to be
higher than the current tree root, and so will fail validation.
There may be other inconsistencies in the block as well, and
currently we don't care because the block is in free space. Shutting
down the filesystem because a freed block doesn't pass write
validation, OTOH, is rather unfriendly.
So, make sure we always invalidate buffers as they move from the
free space trees to the free list so that we guarantee they never
get written to disk while on the free list.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
With the removal of xfs_rw.h and other changes over time, xfs_bit.h
is being included in many files that don't actually need it. Clean
up the includes as necessary.
Also move the only-used-once xfs_ialloc_find_free() static inline
function out of a header file that is widely included to reduce
the number of needless dependencies on xfs_bit.h.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Now that the busy extent tracking has been moved out of the
allocation files, clean up the namespace it uses to
"xfs_extent_busy" rather than a mix of "xfs_busy" and
"xfs_alloc_busy".
Signed-off-by: Dave Chinner<dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To make it easier to handle userspace code merges, move all the busy
extent handling out of the allocation code and into it's own file.
The userspace code does not need the busy extent code, so this
simplifies the merging of the kernel code into the userspace
xfsprogs library.
Because the busy extent code has been almost completely rewritten
over the past couple of years, also update the copyright on this new
file to include the authors that made all those changes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Untangle the header file includes a bit by moving the definition of
xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
xfs_ag.h.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Remove various bits left over from the old kdb-only btree tracing code, but
leave the actual trace point stubs in place to ease adding new event based
btree tracing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
GCC 4.6 complains about an array subscript is above array bounds when
using the btree index to index into the agf_levels array. The only
two indices passed in are 0 and 1, and we have an assert insuring that.
Replace the trick of using the array index directly with using constants
in the already existing branch for assigning the XFS_BTREE_LASTREC_UPDATE
flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Blocks for the allocation btree are allocated from and released to
the AGFL, and thus frequently reused. Even worse we do not have an
easy way to avoid using an AGFL block when it is discarded due to
the simple FILO list of free blocks, and thus can frequently stall
on blocks that are currently undergoing a discard.
Add a flag to the busy extent tracking structure to skip the discard
for allocation btree blocks. In normal operation these blocks are
reused frequently enough that there is no need to discard them
anyway, but if they spill over to the allocation btree as part of a
balance we "leak" blocks that we would otherwise discard. We could
fix this by adding another flag and keeping these block in the
rbtree even after they aren't busy any more so that we could discard
them when they migrate out of the AGFL. Given that this would cause
significant overhead I don't think it's worthwile for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Update the extent tree in case we have to reuse a busy extent, so that it
always is kept uptodate. This is done by replacing the busy list searches
with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
in case of a reuse. This allows us to allow reusing metadata extents
unconditionally, and thus avoid log forces especially for allocation btree
blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
While we need to make sure we do not reuse busy extents, there is no need
to force out busy extents when moving them between the AGFL and the
freespace btree as we still take care of that when doing the real allocation.
To avoid the log force when just moving extents from the different free
space tracking structures, move the busy search out of
xfs_alloc_get_freelist into the callers that need it, and move the busy
list insert from xfs_free_ag_extent which is used both by AGFL refills
and real allocation to xfs_free_extent, which is only used by the latter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The implementation os ->kill_root only differ by either simply
zeroing out the now unused buffer in the btree cursor in the inode
allocation btree or using xfs_btree_setbuf in the allocation btree.
Initially both of them used xfs_btree_setbuf, but the use in the
ialloc btree was removed early on because it interacted badly with
xfs_trans_binval.
In addition to zeroing out the buffer in the cursor xfs_btree_setbuf
updates the bc_ra array in the btree cursor, and calls
xfs_trans_brelse on the buffer previous occupying the slot.
The bc_ra update should be done for the alloc btree updated too,
although the lack of it does not cause serious problems. The
xfs_trans_brelse call on the other hand is effectively a no-op in
the end - it keeps decrementing the bli_recur refcount until it hits
zero, and then just skips out because the buffer will always be
dirty at this point. So removing it for the allocation btree is
just fine.
So unify the code and move it to xfs_btree.c. While we're at it
also replace the call to xfs_btree_setbuf with a NULL bp argument in
xfs_btree_del_cursor with a direct call to xfs_trans_brelse given
that the cursor is beeing freed just after this and the state
updates are superflous. After this xfs_btree_setbuf is only used
with a non-NULL bp argument and can thus be simplified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast pathes of the filesystem.
This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.
Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.
To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.
So, use a per-AG rbtree indexed by block number to track busy
extents. This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.
However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.
Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.
By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.
The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.
Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.
Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Start abstracting the perag references so that the indexing of the
structures is not directly coded into all the places that uses the
perag structures. This will allow us to separate the use of the
perag structure and the way it is indexed and hence avoid the known
deadlocks related to growing a busy filesystem.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Convert the old xfs tracing support that could only be used with the
out of tree kdb and xfsidbg patches to use the generic event tracer.
To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
all xfs trace channels by:
echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
or alternatively enable single events by just doing the same in one
event subdirectory, e.g.
echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
or set more complex filters, etc. In Documentation/trace/events.txt
all this is desctribed in more detail. To reads the events do a
cat /sys/kernel/debug/tracing/trace
Compared to the last posting this patch converts the tracing mostly to
the one tracepoint per callsite model that other users of the new
tracing facility also employ. This allows a very fine-grained control
of the tracing, a cleaner output of the traces and also enables the
perf tool to use each tracepoint as a virtual performance counter,
allowing us to e.g. count how often certain workloads git various
spots in XFS. Take a look at
http://lwn.net/Articles/346470/
for some examples.
Also the btree tracing isn't included at all yet, as it will require
additional core tracing features not in mainline yet, I plan to
deliver it later.
And the really nice thing about this patch is that it actually removes
many lines of code while adding this nice functionality:
fs/xfs/Makefile | 8
fs/xfs/linux-2.6/xfs_acl.c | 1
fs/xfs/linux-2.6/xfs_aops.c | 52 -
fs/xfs/linux-2.6/xfs_aops.h | 2
fs/xfs/linux-2.6/xfs_buf.c | 117 +--
fs/xfs/linux-2.6/xfs_buf.h | 33
fs/xfs/linux-2.6/xfs_fs_subr.c | 3
fs/xfs/linux-2.6/xfs_ioctl.c | 1
fs/xfs/linux-2.6/xfs_ioctl32.c | 1
fs/xfs/linux-2.6/xfs_iops.c | 1
fs/xfs/linux-2.6/xfs_linux.h | 1
fs/xfs/linux-2.6/xfs_lrw.c | 87 --
fs/xfs/linux-2.6/xfs_lrw.h | 45 -
fs/xfs/linux-2.6/xfs_super.c | 104 ---
fs/xfs/linux-2.6/xfs_super.h | 7
fs/xfs/linux-2.6/xfs_sync.c | 1
fs/xfs/linux-2.6/xfs_trace.c | 75 ++
fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/linux-2.6/xfs_vnode.h | 4
fs/xfs/quota/xfs_dquot.c | 110 ---
fs/xfs/quota/xfs_dquot.h | 21
fs/xfs/quota/xfs_qm.c | 40 -
fs/xfs/quota/xfs_qm_syscalls.c | 4
fs/xfs/support/ktrace.c | 323 ---------
fs/xfs/support/ktrace.h | 85 --
fs/xfs/xfs.h | 16
fs/xfs/xfs_ag.h | 14
fs/xfs/xfs_alloc.c | 230 +-----
fs/xfs/xfs_alloc.h | 27
fs/xfs/xfs_alloc_btree.c | 1
fs/xfs/xfs_attr.c | 107 ---
fs/xfs/xfs_attr.h | 10
fs/xfs/xfs_attr_leaf.c | 14
fs/xfs/xfs_attr_sf.h | 40 -
fs/xfs/xfs_bmap.c | 507 +++------------
fs/xfs/xfs_bmap.h | 49 -
fs/xfs/xfs_bmap_btree.c | 6
fs/xfs/xfs_btree.c | 5
fs/xfs/xfs_btree_trace.h | 17
fs/xfs/xfs_buf_item.c | 87 --
fs/xfs/xfs_buf_item.h | 20
fs/xfs/xfs_da_btree.c | 3
fs/xfs/xfs_da_btree.h | 7
fs/xfs/xfs_dfrag.c | 2
fs/xfs/xfs_dir2.c | 8
fs/xfs/xfs_dir2_block.c | 20
fs/xfs/xfs_dir2_leaf.c | 21
fs/xfs/xfs_dir2_node.c | 27
fs/xfs/xfs_dir2_sf.c | 26
fs/xfs/xfs_dir2_trace.c | 216 ------
fs/xfs/xfs_dir2_trace.h | 72 --
fs/xfs/xfs_filestream.c | 8
fs/xfs/xfs_fsops.c | 2
fs/xfs/xfs_iget.c | 111 ---
fs/xfs/xfs_inode.c | 67 --
fs/xfs/xfs_inode.h | 76 --
fs/xfs/xfs_inode_item.c | 5
fs/xfs/xfs_iomap.c | 85 --
fs/xfs/xfs_iomap.h | 8
fs/xfs/xfs_log.c | 181 +----
fs/xfs/xfs_log_priv.h | 20
fs/xfs/xfs_log_recover.c | 1
fs/xfs/xfs_mount.c | 2
fs/xfs/xfs_quota.h | 8
fs/xfs/xfs_rename.c | 1
fs/xfs/xfs_rtalloc.c | 1
fs/xfs/xfs_rw.c | 3
fs/xfs/xfs_trans.h | 47 +
fs/xfs/xfs_trans_buf.c | 62 -
fs/xfs/xfs_vnodeops.c | 8
70 files changed, 2151 insertions(+), 2592 deletions(-)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the last of the macros-defined-to-static-functions.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
structures.
Always use the generic xfs_btree_block type instead of the short / long
structures. Add XFS_BTREE_SBLOCK_LEN / XFS_BTREE_LBLOCK_LEN defines for
the length of a short / long form block. The rationale for this is that we
will grow more btree block header variants to support CRCs and other RAS
information, and always accessing them through the same datatype with
unions for the short / long form pointers makes implementing this much
easier.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32300a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Replace the generic record / key / ptr addressing macros that use cpp
token pasting with simpler macros that do the job for just one given btree
type. The new macros lose the cur argument and thus can be used outside
the core btree code, but also gain an xfs_mount * argument to allow for
checking the CRC flag in the near future. Note that many of these macros
aren't actually used in the kernel code, but only in userspace (mostly in
xfs_repair).
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32295a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Clean up the way the maximum and minimum records for the btree blocks are
calculated. For the alloc and inobt btrees all the values are
pre-calculated in xfs_mount_common, and we switch the current loop around
the ugly generic macros that use cpp token pasting to generate type names
to two small helpers in normal C code. For the bmbt and bmdr trees these
helpers also exist, but can be called during runtime, too. Here we also
kill various macros dealing with them and inline the logic into the
get_minrecs / get_maxrecs / get_dmaxrecs methods in xfs_bmap_btree.c.
Note that all these new helpers take an xfs_mount * argument which will be
needed to determine the size of a btree block once we add support for
extended btree blocks with CRCs and other RAS information.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32292a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add methods to check whether two keys/records are in the righ order. This
replaces the xfs_btree_check_key and xfs_btree_check_rec methods. For the
callers from xfs_bmap.c just opencode the bmbt-specific asserts.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32208a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Not really much reason to make it generic given that it's so small, but
this is the last non-method in xfs_alloc_btree.c and xfs_ialloc_btree.c,
so it makes the whole btree implementation more structured.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32206a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree delete code generic. Based on a patch from David Chinner
with lots of changes to follow the original btree implementations more
closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32205a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
xfs_bmbt_killroot is a mostly generic implementation of moving from a real
block based root to an inode based root. So move it to xfs_btree.c where
it can use all the nice infrastructure there and make it pointer size
agnostic
The new name for it is xfs_btree_kill_iroot, following the old naming but
making it clear we're dealing with the root in inode case here, and to
avoid confusion with xfs_btree_new_root which is used for the not inode
rooted case. I've also added a comment describing what it does and why
it's named the way it is.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32203a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree insert code generic. Based on a patch from David Chinner
with lots of changes to follow the original btree implementations more
closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32202a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Add a xfs_btree_new_root helper for the alloc and ialloc btrees. The bmap
btree needs it's own version and is not converted.
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32200a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree split code generic. Based on a patch from David Chinner
with lots of changes to follow the original btree implementations more
closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32198a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree left shift code generic. Based on a patch from David
Chinner with lots of changes to follow the original btree implementations
more closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32197a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree right shift code generic. Based on a patch from David
Chinner with lots of changes to follow the original btree implementations
more closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32196a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
The most complicated part here is the lastrec tracking for the alloc
btree. Most logic is in the update_lastrec method which has to do some
hopefully good enough dirty magic to maintain it.
[hch: split out from bigger patch and a rework of the lastrec
logic]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32194a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Note that there are many > 80 char lines introduced due to the
xfs_btree_key casts. But the places where this happens is throw-away code
once the whole btree code gets merged into a common implementation.
The same is true for the temporary xfs_alloc_log_keys define to the new
name. All old users will be gone after a few patches.
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32193a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32192a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32191a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Because this is the first major generic btree routine this patch includes
some infrastrucure, first a few routines to deal with a btree block that
can be either in short or long form, second xfs_btree_read_buf_block,
which is the new central routine to read a btree block given a cursor, and
third the new xfs_btree_ptr_addr routine to calculate the address for a
given btree pointer record.
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32190a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Add new helpers in xfs_btree.c to find the record, key and block pointer
entries inside a btree block. To implement this genericly the
->get_maxrecs methods and two new xfs_btree_ops entries for the key and
record sizes are used. Also add a big comment describing how the
addressing inside a btree block works.
Note that these helpers are unused until users are introduced in the next
patches and this patch will thus cause some harmless compiler warnings.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32189a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Factor xfs_btree_maxrecs into a per-btree operation.
The get_maxrecs method is based on a patch from Dave Chinner.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32188a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the existing bmap btree tracing generic so that it applies to all
btree types.
Some fragments lifted from a patch by Dave Chinner.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32187a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
xfs_btree_init_cursor contains close to little shared code for the
different btrees and will get even more non-common code in the future.
Split it up into one routine per btree type.
Because xfs_btree_dup_cursor needs to call the init routine for a generic
btree cursor add a new btree operation vector that contains a dup_cursor
method that initializes a new cursor based on an existing one.
The btree operations vector is based on an idea and code from Dave Chinner
and will grow more entries later during this series.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32176a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
remove beX_add functions and replace all uses with beX_add_cpu
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Dave Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.
When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.
The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.
The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.
This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? the answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.
It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.
One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means the when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.
As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....
SGI-PV: 964999
SGI-Modid: xfs-linux-melb:xfs-kern:28652a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
handling with sparse now, no need for comments.
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26557a
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>