While in ext4_validate_block_bitmap(), if an block allocation bitmap
is found to be invalid, we call ext4_error() while the block group is
still locked. This causes ext4_commit_super() to call a function
which might sleep while in an atomic context.
There's no need to keep the block group locked at this point, so hoist
the ext4_error() call up to ext4_validate_block_bitmap() and release
the block group spinlock before calling ext4_error().
The reported stack trace can be found at:
http://article.gmane.org/gmane.comp.file-systems.ext4/33731
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Make it possible for ext4_count_free to operate on buffers and not
just data in buffer_heads.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Ext3 filesystems that are converted to use as many ext4 file system
features as possible will enable uninit_bg to speed up e2fsck times.
These file systems will have a native ext3 layout of inode tables and
block allocation bitmaps (as opposed to ext4's flex_bg layout).
Unfortunately, in these cases, when first allocating a block in an
uninitialized block group, ext4 would incorrectly calculate the number
of free blocks in that block group, and then errorneously report that
the file system was corrupt:
EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd
This problem can be reproduced via:
mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g
mount -t ext4 /dev/vdd /mnt
fallocate -l 4600m /mnt/test
The problem was caused by a bone headed mistake in the check to see if a
particular metadata block was part of the block group.
Many thanks to Kees Cook for finding and bisecting the buggy commit
which introduced this bug (commit fd034a84e1, present since v3.2).
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: stable@kernel.org
The major new feature added in this update is Darrick J. Wong's
metadata checksum feature, which adds crc32 checksums to ext4's
metadata fields. There is also the usual set of cleanups and bug
fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABCAAGBQJPyNleAAoJENNvdpvBGATwtLMP/i3WsPyTvxmYP6HttHXQb8Jk
GYCoTQ5bZMuTbOwOGg3w137cXWBv5uuPpxIk79YVLHSWx6HuanlGIa7/VnPKIaLu
2ihuvVfnrDqpwQ4MJaSq4R1Eka9JCwZ7HbYYo+fYOVobxgw588JVV9VVI9EdKRGz
z11UkW8iHE0f6Xa5gOhdAMkR0uaPnxwJX/qHZYiHuognRivuwMglqWJSiMr8nQmo
A2GmeoLehhW+k65IqgTCmSW6ZgFTvZdk6bskQIij3fOYHW3hHn/gcLFtmLTIZ/B5
LZdg/lngPYve+R/UyypliGKi+pv1qNEiTiBm0rrBgsdZFkBdGj0soSvGZzeK+Mp4
Q1vAmOBPYPFzs6nVzPst2n/osryyykFCK6TgSGZ50dosJ0NO8cBeDdX/gh9JKD2R
yQUMUltOCCSj/eWU4iwqZ0T3FXRiH/+S3XMHznoKJiwUyGDBNQy4+Yg2k2WzUXrz
Cu5t5BwNG2WNP7y5Et/wmUIzpC7VPId4qYmGyHe7OwTxSJgW+6f7GVkHfjWcDMuv
pGgEUiInbMmLajP3v2/LKfVU4hXLZy4uJbhoBgDdeIpZrnPifJG/MwDOS4W+dLVT
tDzgO1SAh3/E4jATreZ5bjzD/HGsfe1OX09UH3Pbc1EcgkrLnyrQXFwdHshdVu4A
cxMoKNPVCQJySb1UrLkO
=SdJJ
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull Ext4 updates from Theodore Ts'o:
"The major new feature added in this update is Darrick J Wong's
metadata checksum feature, which adds crc32 checksums to ext4's
metadata fields.
There is also the usual set of cleanups and bug fixes."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
ext4: hole-punch use truncate_pagecache_range
jbd2: use kmem_cache_zalloc wrapper instead of flag
ext4: remove mb_groups before tearing down the buddy_cache
ext4: add ext4_mb_unload_buddy in the error path
ext4: don't trash state flags in EXT4_IOC_SETFLAGS
ext4: let getattr report the right blocks in delalloc+bigalloc
ext4: add missing save_error_info() to ext4_error()
ext4: add debugging trigger for ext4_error()
ext4: protect group inode free counting with group lock
ext4: use consistent ssize_t type in ext4_file_write()
ext4: fix format flag in ext4_ext_binsearch_idx()
ext4: cleanup in ext4_discard_allocated_blocks()
ext4: return ENOMEM when mounts fail due to lack of memory
ext4: remove redundundant "(char *) bh->b_data" casts
ext4: disallow hard-linked directory in ext4_lookup
ext4: fix potential integer overflow in alloc_flex_gd()
ext4: remove needs_recovery in ext4_mb_init()
ext4: force ro mount if ext4_setup_super() fails
ext4: fix potential NULL dereference in ext4_free_inodes_counts()
ext4/jbd2: add metadata checksumming to the list of supported features
...
metadata_csum supersedes uninit_bg. Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set. Print a warning if
we try to mount a filesystem with both feature flags set.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Compute and verify the checksum of the block bitmap; this checksum is
stored in the block group descriptor.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Get rid of this one:
fs/ext4/balloc.c: In function 'ext4_wait_block_bitmap':
fs/ext4/balloc.c:405:3: warning: format '%llu' expects argument of
type 'long long unsigned int', but argument 6 has type 'sector_t' [-Wformat]
Happens because sector_t is u64 (unsigned long long) or unsigned long
dependent on CONFIG_64BIT.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate()
before submitting the buffer for read. The is bad, since we check
bitmap_uptodate() without locking the buffer, and so if another
process is racing with us, it's possible that they will think the
bitmap is uptodate even though the read has not completed yet,
resulting in inodes and blocks potentially getting allocated more than
once if we get really unlucky.
Addresses-Google-Bug: 2828254
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A couple more functions can reasonably be made static if desired.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
sbi is not defined, so let ext4_free_blocks use EXT4_SB(sb) instead
when EXT4FS_DEBUG is defined.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Rename the function so it is more clear what is going on. Also rename
the various variables so it's clearer what's happening.
Also fix a missing blocks to cluster conversion when reading the
number of reserved blocks for root.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really returns the number of clusters after initializing
an uninitalized block bitmap has been initialized.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions. So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.
Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Certain parts of the ext4 code base, primarily in mballoc.c, use a
block group number and offset from the beginning of the block group.
This offset is invariably used to index into the allocation bitmap, so
change the offset to be denominated in units of clusters.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap(). This actually made it difficult to
understand how the function worked, and made it hard make changes to
support clusters. So as an initial cleanup, I've separated out the
functionality of initializing block bitmap from calculating the number
of free blocks in the new block group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This makes it easier to understand how ext4_init_block_bitmap() works,
and it will assist when we split out ext4_free_blocks_after_init() in
the next commit.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I found that ext4_ext_find_goal() and ext4_find_near()
share the same code for returning a coloured start block
based on i_block_group.
We can refactor this into a common function so that they
don't diverge in the future.
Thanks to adilger for suggesting the new function name.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks. This will allow a
punch hole to proceed even if the disk is full. Punching a hole may
require additional blocks to first split the extents.
Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:
ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks
[ext4 punch hole patch series 1/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
In preparation for the next patch, the function ext4_add_groupblocks()
is moved to mballoc.c, where it could use some static functions.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
percpu_counter_sum_positive() never returns a negative value.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
- Add more ext4 tracepoints.
- Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros
so that we can save 4 bytes in the ring buffer on some platforms.
- Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and
ext4_da_writepages_result tracepoints. Also remove for_reclaim
field from ext4_da_writepages since it is usually not very useful.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
These functions have no need to be exported beyond file context.
No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.
(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We don't need to set s_dirt in most of the ext4 code when journaling
is enabled. In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted. As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt. There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.
This patch will optimize out some unneeded disk writes when using ext4
with a journal.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Because we can badly over-reserve metadata when we
calculate worst-case, it complicates things for quota, since
we must reserve and then claim later, retry on EDQUOT, etc.
Quota is also a generally smaller pool than fs free blocks,
so this over-reservation hurts more, and more often.
I'm of the opinion that it's not the worst thing to allow
metadata to push a user slightly over quota. This simplifies
the code and avoids the false quota rejections that result
from worst-case speculation.
This patch stops the speculative quota-charging for
worst-case metadata requirements, and just charges quota
when the blocks are allocated at writeout. It also is
able to remove the try-again loop on EDQUOT.
This patch has been tested indirectly by running the xfstests
suite with a hack to mount & enable quota prior to the test.
I also did a more specific test of fragmenting freespace
and then doing a large delalloc write under quota; quota
stopped me at the right amount of file IO, and then the
writeout generated enough metadata (due to the fragmentation)
that it put me slightly over quota, as expected.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are duplicate macro definitions of in_range() in mballoc.h and
balloc.c. This consolidates these two definitions into ext4.h, and
changes extents.c to use in_range() as well.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
This is a cleanup and simplification patch which takes some open-coded
calculations to calculate the first block number of a group and
converts them to use the (already defined) ext4_group_first_block_no()
function.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.
I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_mb_free_blocks() is only called by ext4_free_blocks(), and the
latter function doesn't really do much. So merge the two functions
together, such that ext4_free_blocks() is now found in
fs/ext4/mballoc.c. This saves about 200 bytes of compiled text space.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The number of old-style block group descriptor blocks is
s_meta_first_bg when the meta_bg feature flag is set.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
ext4_mb_update_group_info is only called in one place, and it's
extremely simple. There's no reason to have it in a separate function
in a separate file as far as I can tell, it just obfuscates what's
really going on.
Perhaps it was intended to keep the grp->bb_* manipulation local to
mballoc.c but we're already accessing other grp-> fields in balloc.c
directly so this seems ok.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We have sb_bgl_lock() and ext4_group_info.bb_state
bit spinlock to protech group information. The later is only
used within mballoc code. Consolidate them to use sb_bgl_lock().
This makes the mballoc.c code much simpler and also avoid
confusion with two locks protecting same info.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Ext4's on-line resizing adds a new block group and then, only at the
last step adjusts s_groups_count. However, it's possible on SMP
systems that another CPU could see the updated the s_group_count and
not see the newly initialized data structures for the just-added block
group. For this reason, it's important to insert a SMP read barrier
after reading s_groups_count and before reading any (for example) the
new block group descriptors allowed by the increased value of
s_groups_count.
Unfortunately, we rather blatently violate this locking protocol
documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will immediately try to use just-added block group before any
memory ordering issues resolve themselves. So apparently problems
here are relatively hard to hit, since ext3 has been vulnerable to the
same issue for years with no one apparently complaining.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's
to track the number of free blocks and inodes in each flex_group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The static function ext4_group_used_meta_blocks() only has one caller,
who already has access to the block group's group descriptor. So it's
better to have ext4_init_block_bitmap() pass the group descriptor to
ext4_group_used_meta_blocks(), so it doesn't need to call
ext4_group_desc(). Previously this function did not check if
ext4_group_desc() returned NULL due to an error, potentially causing a
kernel OOPS report. This avoids the issue entirely.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
Running without a journal, I oopsed when I ran out of space,
because we called jbd2_journal_force_commit_nested() from
ext4_should_retry_alloc() without a journal.
This should take care of it, I think.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in
560671a0, its uses under EXT4FS_DEBUG were not changed to the helper
ext4_free_blks_count.
Another commit, 498e5f24, also did not change everything needed under
EXT4FS_DEBUG, thus making it spill some warnings related to printing
format.
This commit fixes both issues and makes ext4 build again when
EXT4FS_DEBUG is enabled.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
For uninit block group, the on-disk bitmap is not initialized. That
implies we cannot depend on the uptodate flag on the bitmap
buffer_head to find bitmap validity. Use a new buffer_head flag which
would be set after we properly initialize the bitmap. This also
prevents (re-)initializing the uninit group bitmap every time we call
ext4_read_block_bitmap().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Rename some variables. We also unlock locks in the reverse order we
acquired as a part of cleanup.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>