With the change to dynamically allocate entries, the cache is never
disabled on the fly. Remove this flag.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The existing code keeps a fixed-size cache of 1024 entries. This is much
too small for a busy server, and wastes memory on an idle one. This
patch changes the code to dynamically allocate and free these cache
entries.
A cap on the number of entries is retained, but it's much larger than
the existing value and now scales with the amount of low memory in the
machine.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...otherwise, we end up with the list ordering wrong. Currently, it's
not a problem since we skip RC_INPROG entries, but keeping the ordering
strict will be necessary for a later patch that adds a cache cleaner.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Return EINVAL from write if the size is larger than
allowed. Do this before allocating kernel memory for
the bogus size, which could lead to OOM.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Jana Saout <jana@saout.de>
Signed-off-by: David Teigland <teigland@redhat.com>
The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group. That is, for
a file aystem with a 1k blocksize, and where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.
For a file system with a 4k blocksize, and where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.
To work around this code, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10? Aside from being a completely arbitrary number, it avoids
block allocation to be a power of two larger than 25% of the block
group. If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan the all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all blocks groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.
To fix this, in these we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
commit 626cf23660 "poll: add poll_requested_events()..." enabled us to send the
requested events to the filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
drop_nlink() warns if nlink is already zero. This is triggerable by a buggy
userspace filesystem. The cure, I think, is worse than the disease so disable
the warning.
Reported-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The all pointers within fuse_req must point to valid memory once
fuse_force_forget() returns.
This bug appeared in "fuse: implement NFS-like readdirplus support"
and was never in any official Linux release.
I tested the fuse_force_forget() code path by injecting to fake -ENOMEM and
verified the FORGET operation was called properly in userspace.
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Later, we'll need more than one call site for this, so break it out
into a new function.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a preprocessor constant for the expiry time of cache entries, and
move the test for an expired entry into a function. Note that the current
code does not test for RC_INPROG. It just assumes that it won't take more
than 2 minutes to fill out an in-progress entry.
I'm not sure how valid that assumption is though, so let's just ensure
that we never consider an RC_INPROG entry to be expired.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Entries can only get a c_type of RC_REPLBUFF iff they are
RC_DONE. Therefore the test for RC_DONE isn't necessary here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we use kmalloc() which wastes a little bit of memory on each
allocation since it's a power of 2 allocator. Since we're allocating a
1024 of these now, and may need even more later, let's create a new
slabcache for them.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The locking rules for cache entries say that locking the cache_lock
isn't needed if you're just touching the current entry. Earlier
in this function we set rp->c_state to RC_UNUSED without any locking,
so I believe it's ok to do the same here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently, it only stores the first 16 bytes of any address. struct
sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
12 bytes of the address.
Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in
as necessary. Also fix the comparitor to use the existing RPC
helpers for this.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The last orphan in the dnext list has its dnext set to NULL. Because
of that, ubifs_delete_orphan assumes that it is not on the dnext list
and frees it immediately instead ignoring it as a second delete. The
orphan is later freed again by erase_deleted.
This change adds an explicit flag to ubifs_orphan indicating whether
it is pending delete.
Signed-off-by: Adam Thomas <adamthomas1111@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: stable@vger.kernel.org
The last orphan in the cnext list has its cnext set to NULL. Because
of that, ubifs_delete_orphan assumes that it is not on the cnext list
and frees it immediately instead of adding it to the dnext list. The
freed orphan is later modified by write_orph_node.
This can cause various inconsistencies including directory entries
that cannot be removed and this error:
UBIFS error (pid 20685): layout_cnodes: LPT out of space at LEB 14:129009 needing 17, done_ltab 1, done_lsave 1
This is a regression introduced by
"7074e5eb UBIFS: remove invalid reference to list iterator variable".
This change adds an explicit flag to ubifs_orphan indicating whether
it is pending commit.
Signed-off-by: Adam Thomas <adamthomas1111@gmail.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v3.6+
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Check for incompatible mount options when using the ext4 file system
driver to mount ext2 or ext3 file systems.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If argument of inode_readahead_blk is too big, we just bail out
without printing any error. Fix this since it could confuse users.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The loop looking for correct mount option entry is more logical if it is
written rewritten as an empty loop looking for correct option entry and then
code handling the option. It also saves one level of indentation for a lot of
code so we can join a couple of split lines.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are
currently handled before we enter standard option handling loop. I don't
see a reason for this so move them to normal handling loop to make things
more regular.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It is unnecessary to check i<4 after the loop; just do it before the
break.
Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit 2147b1a6a4 resulted in a new smatch warning:
> fs/ext4/move_extent.c:693 mext_replace_branches()
> warn: variable dereferenced before check 'dext' (see line 683)
Fix this by adding a check to make sure dext is non-NULL before we
derefrence it.
Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com>
[ modified by tytso to make sure an ext4_error is called ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:
1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.
2) ((ext * (mp)->m_sb.sb_sectsize) + \
(ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
(128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.
This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.
In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Convert the calculation of quota off transaction log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For the transaction that write the incore superblock changes of quota flags
to disk, it would reserve the same log space to clear/reset quota flags
transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This patch allocates a block reservation structure before growing
or shrinking a file. Without this structure, the grow or shink code
can reference the bad pointer.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The intent here is to split the processing of the glock lru
list into two parts, so that the selection of glocks and the
disposal are separate functions. The plan is then, that further
updates can then be made to these functions in the future
to improve the selection of glocks and also the efficiency of
glock disposal.
The new feature which this patch brings is sorting the
glocks to be disposed of into glock number (and thus also
disk block number) order. Not all glocks will need i/o in
order to dispose of them, but some will, and at least we'll
generate mostly disk block order i/o now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a new helper xfs_calc_buf_res() to calcuate out the transaction space
reservations per item. xfs_buf_log_overhead() is used to figure out the
extra space for struct xfs_buf_log_format that gets written into the log
for every buffer as well as a log opheader, i.e. struct xlog_op_header.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.
It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.
The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.
This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself. All the extra contention just slows
down the operations and doesn't get things done any faster.
This commit changes things to use a wait queue instead. As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.
This adds a plugging callback to gather those smaller writes full stripe
writes.
Example on flash:
fio job to do 64K writes in batches of 3 (which makes a full stripe):
With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>