2017-12-18 03:00:59 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2006-10-11 08:20:50 +00:00
|
|
|
/*
|
2006-10-11 08:20:53 +00:00
|
|
|
* linux/fs/ext4/super.c
|
2006-10-11 08:20:50 +00:00
|
|
|
*
|
|
|
|
* Copyright (C) 1992, 1993, 1994, 1995
|
|
|
|
* Remy Card (card@masi.ibp.fr)
|
|
|
|
* Laboratoire MASI - Institut Blaise Pascal
|
|
|
|
* Universite Pierre et Marie Curie (Paris VI)
|
|
|
|
*
|
|
|
|
* from
|
|
|
|
*
|
|
|
|
* linux/fs/minix/inode.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 1991, 1992 Linus Torvalds
|
|
|
|
*
|
|
|
|
* Big-endian to little-endian byte-swapping/bitmaps by
|
|
|
|
* David S. Miller (davem@caip.rutgers.edu), 1995
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/string.h>
|
|
|
|
#include <linux/fs.h>
|
|
|
|
#include <linux/time.h>
|
2009-04-28 02:48:48 +00:00
|
|
|
#include <linux/vmalloc.h>
|
2006-10-11 08:20:50 +00:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/blkdev.h>
|
2015-05-22 21:13:32 +00:00
|
|
|
#include <linux/backing-dev.h>
|
2006-10-11 08:20:50 +00:00
|
|
|
#include <linux/parser.h>
|
|
|
|
#include <linux/buffer_head.h>
|
2007-07-17 11:04:28 +00:00
|
|
|
#include <linux/exportfs.h>
|
2006-10-11 08:20:50 +00:00
|
|
|
#include <linux/vfs.h>
|
|
|
|
#include <linux/random.h>
|
|
|
|
#include <linux/mount.h>
|
|
|
|
#include <linux/namei.h>
|
|
|
|
#include <linux/quotaops.h>
|
|
|
|
#include <linux/seq_file.h>
|
2009-03-31 13:10:09 +00:00
|
|
|
#include <linux/ctype.h>
|
2007-07-18 13:11:02 +00:00
|
|
|
#include <linux/log2.h>
|
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests measuring e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time depends solely on the
number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor, if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
The inode table is not initialized/used in this group, so we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used, so we can skip the block bitmap
verification for this group.
We also add two new fields to the group descriptor as part of the
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
then bg_itable_unused gives the offset within
the inode table up to which the inodes are used. This can be
used by fsck to skip the inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
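The decision rules described in the commit message above can be sketched in a few lines of C. This is only an illustration, not the e2fsprogs or kernel implementation: the struct, the `SKETCH_*` flag values, the `csum_ok` field, and both function names are hypothetical, and the sketch assumes that bg_itable_unused counts unused inodes at the tail of the group's inode table, as the field comment "Unused inodes count" suggests.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the bg_flags bits named in the commit message. */
#define SKETCH_BG_INODE_UNINIT 0x0001u
#define SKETCH_BG_BLOCK_UNINIT 0x0002u

/* Hypothetical, simplified view of one group descriptor. */
struct sketch_group_desc {
	uint16_t bg_flags;         /* SKETCH_BG_* bits */
	uint16_t bg_itable_unused; /* assumed: unused inodes at the table's tail */
	bool     csum_ok;          /* assumed result of verifying bg_checksum */
};

/*
 * How many inodes of this group a checker would still have to scan.
 * A descriptor that fails its checksum is not trusted, so the whole
 * inode table is treated as in use.
 */
static uint32_t sketch_inodes_to_scan(const struct sketch_group_desc *gd,
				      uint32_t inodes_per_group)
{
	if (!gd->csum_ok)
		return inodes_per_group;
	if (gd->bg_flags & SKETCH_BG_INODE_UNINIT)
		return 0;
	if (gd->bg_itable_unused > inodes_per_group)
		return inodes_per_group; /* inconsistent count: scan everything */
	return inodes_per_group - gd->bg_itable_unused;
}

/* Whether the block bitmap of this group still needs to be verified. */
static bool sketch_must_check_block_bitmap(const struct sketch_group_desc *gd)
{
	return !gd->csum_ok || !(gd->bg_flags & SKETCH_BG_BLOCK_UNINIT);
}
```

With 8192 inodes per group and bg_itable_unused equal to 8000, the sketch would scan only the first 192 inodes, which is where the reduction in e2fsck time for sparsely used filesystems would come from.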
2007-10-16 22:38:25 +00:00
|
|
|
#include <linux/crc16.h>
|
2017-05-08 17:55:27 +00:00
|
|
|
#include <linux/dax.h>
|
2011-05-26 16:02:03 +00:00
|
|
|
#include <linux/cleancache.h>
|
2016-12-24 19:46:01 +00:00
|
|
|
#include <linux/uaccess.h>
|
2018-01-09 13:21:39 +00:00
|
|
|
#include <linux/iversion.h>
|
2019-04-25 18:05:42 +00:00
|
|
|
#include <linux/unicode.h>
|
2020-03-25 15:48:42 +00:00
|
|
|
#include <linux/part_stat.h>
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set
with the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read
alloc_sem in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
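The scheduling rule described in the commit message above (delay the next request by the last zeroing time multiplied by the wait multiplier) can be illustrated with a small sketch. The function name, the constant name, and the use of plain integer time units are assumptions made for this example; it does not reproduce the kernel's actual code.

```c
#include <stdint.h>

/* Default wait multiplier stated in the commit message (init_itable=n). */
#define SKETCH_DEFAULT_WAIT_MULT 10u

/*
 * Hypothetical helper: given the current time and how long the last
 * inode table took to zero out (both in the same time unit), return
 * the time at which the next zeroing request should run.
 */
static uint64_t sketch_next_schedule(uint64_t now, uint64_t zeroing_time,
				     unsigned int wait_mult)
{
	return now + zeroing_time * (uint64_t)wait_mult;
}
```

Under this rule, with the default multiplier of 10, a zeroing pass that took 5 time units is followed by a 50-unit pause, so the thread would spend roughly 1/11 of its time issuing I/O, which matches the stated goal of not taking the whole I/O bandwidth.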
2010-10-28 01:30:05 +00:00
|
|
|
#include <linux/kthread.h>
|
|
|
|
#include <linux/freezer.h>
|
|
|
|
|
2008-04-29 22:13:32 +00:00
|
|
|
#include "ext4.h"
|
2012-11-28 18:03:30 +00:00
|
|
|
#include "ext4_extents.h" /* Needed for trace points definition */
|
2008-04-29 22:13:32 +00:00
|
|
|
#include "ext4_jbd2.h"
|
2006-10-11 08:20:50 +00:00
|
|
|
#include "xattr.h"
|
|
|
|
#include "acl.h"
|
2009-09-15 02:59:50 +00:00
|
|
|
#include "mballoc.h"
|
2017-04-30 04:36:53 +00:00
|
|
|
#include "fsmap.h"
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2009-06-17 15:48:11 +00:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/ext4.h>
|
|
|
|
|
2011-02-23 17:22:49 +00:00
|
|
|
static struct ext4_lazy_init *ext4_li_info;
|
2020-12-24 13:22:44 +00:00
|
|
|
static DEFINE_MUTEX(ext4_li_mtx);
|
2015-08-15 18:59:44 +00:00
|
|
|
static struct ratelimit_state ext4_mount_msg_ratelimit;
|
2008-09-23 13:18:24 +00:00
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
|
2006-10-11 08:20:50 +00:00
|
|
|
unsigned long journal_devnum);
|
2012-03-04 04:20:50 +00:00
|
|
|
static int ext4_show_options(struct seq_file *seq, struct dentry *root);
|
2020-12-16 10:18:40 +00:00
|
|
|
static void ext4_update_super(struct super_block *sb);
|
2020-12-16 10:18:38 +00:00
|
|
|
static int ext4_commit_super(struct super_block *sb);
|
2020-07-10 14:07:59 +00:00
|
|
|
static int ext4_mark_recovery_complete(struct super_block *sb,
|
2008-07-26 20:15:44 +00:00
|
|
|
struct ext4_super_block *es);
|
2020-07-10 14:07:59 +00:00
|
|
|
static int ext4_clear_journal_err(struct super_block *sb,
|
|
|
|
struct ext4_super_block *es);
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_sync_fs(struct super_block *sb, int wait);
|
2008-07-26 20:15:44 +00:00
|
|
|
static int ext4_remount(struct super_block *sb, int *flags, char *data);
|
|
|
|
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf);
|
2009-01-10 00:40:58 +00:00
|
|
|
static int ext4_unfreeze(struct super_block *sb);
|
|
|
|
static int ext4_freeze(struct super_block *sb);
|
2010-07-24 20:46:55 +00:00
|
|
|
static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
|
|
|
|
const char *dev_name, void *data);
|
2011-04-18 21:29:14 +00:00
|
|
|
static inline int ext2_feature_set_ok(struct super_block *sb);
|
|
|
|
static inline int ext3_feature_set_ok(struct super_block *sb);
|
2011-02-28 05:53:45 +00:00
|
|
|
static int ext4_feature_set_ok(struct super_block *sb, int readonly);
|
2010-10-28 01:30:05 +00:00
|
|
|
static void ext4_destroy_lazyinit_thread(void);
|
|
|
|
static void ext4_unregister_li_request(struct super_block *sb);
|
2011-02-03 19:33:15 +00:00
|
|
|
static void ext4_clear_request_list(void);
|
2016-09-30 06:05:09 +00:00
|
|
|
static struct inode *ext4_get_journal_inode(struct super_block *sb,
|
|
|
|
unsigned int journal_inum);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2015-12-07 19:35:49 +00:00
|
|
|
/*
|
|
|
|
* Lock ordering
|
|
|
|
*
|
|
|
|
* Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
|
|
|
|
* i_mmap_rwsem (inode->i_mmap_rwsem)!
|
|
|
|
*
|
|
|
|
* page fault path:
|
2020-06-09 04:33:54 +00:00
|
|
|
* mmap_lock -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
|
2015-12-07 19:35:49 +00:00
|
|
|
* page lock -> i_data_sem (rw)
|
|
|
|
*
|
|
|
|
* buffered write path:
|
2020-06-09 04:33:54 +00:00
|
|
|
* sb_start_write -> i_mutex -> mmap_lock
|
2015-12-07 19:35:49 +00:00
|
|
|
* sb_start_write -> i_mutex -> transaction start -> page lock ->
|
|
|
|
* i_data_sem (rw)
|
|
|
|
*
|
|
|
|
* truncate:
|
2018-03-22 15:52:10 +00:00
|
|
|
* sb_start_write -> i_mutex -> i_mmap_sem (w) -> i_mmap_rwsem (w) -> page lock
|
|
|
|
* sb_start_write -> i_mutex -> i_mmap_sem (w) -> transaction start ->
|
|
|
|
* i_data_sem (rw)
|
2015-12-07 19:35:49 +00:00
|
|
|
*
|
|
|
|
* direct IO:
|
2020-06-09 04:33:54 +00:00
|
|
|
* sb_start_write -> i_mutex -> mmap_lock
|
2018-03-22 15:52:10 +00:00
|
|
|
* sb_start_write -> i_mutex -> transaction start -> i_data_sem (rw)
|
2015-12-07 19:35:49 +00:00
|
|
|
*
|
|
|
|
* writepages:
|
|
|
|
* transaction start -> page lock(s) -> i_data_sem (rw)
|
|
|
|
*/
|
|
|
|
|
2015-06-18 14:52:29 +00:00
|
|
|
#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
|
2011-04-18 21:29:14 +00:00
|
|
|
static struct file_system_type ext2_fs_type = {
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.name = "ext2",
|
|
|
|
.mount = ext4_mount,
|
|
|
|
.kill_sb = kill_block_super,
|
|
|
|
.fs_flags = FS_REQUIRES_DEV,
|
|
|
|
};
|
2013-03-03 03:39:14 +00:00
|
|
|
MODULE_ALIAS_FS("ext2");
|
2013-03-13 01:27:41 +00:00
|
|
|
MODULE_ALIAS("ext2");
|
2011-04-18 21:29:14 +00:00
|
|
|
#define IS_EXT2_SB(sb) ((sb)->s_bdev->bd_holder == &ext2_fs_type)
|
|
|
|
#else
|
|
|
|
#define IS_EXT2_SB(sb) (0)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
2010-03-25 00:18:37 +00:00
|
|
|
static struct file_system_type ext3_fs_type = {
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.name = "ext3",
|
2010-07-24 20:46:55 +00:00
|
|
|
.mount = ext4_mount,
|
2010-03-25 00:18:37 +00:00
|
|
|
.kill_sb = kill_block_super,
|
|
|
|
.fs_flags = FS_REQUIRES_DEV,
|
|
|
|
};
|
2013-03-03 03:39:14 +00:00
|
|
|
MODULE_ALIAS_FS("ext3");
|
2013-03-13 01:27:41 +00:00
|
|
|
MODULE_ALIAS("ext3");
|
2010-03-25 00:18:37 +00:00
|
|
|
#define IS_EXT3_SB(sb) ((sb)->s_bdev->bd_holder == &ext3_fs_type)
|
2006-10-11 08:21:10 +00:00
|
|
|
|
2020-09-24 07:33:32 +00:00
|
|
|
|
|
|
|
static inline void __ext4_read_bh(struct buffer_head *bh, int op_flags,
|
|
|
|
bh_end_io_t *end_io)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* buffer's verified bit is no longer valid after reading from
|
|
|
|
* disk again due to write out error, clear it to make sure we
|
|
|
|
* recheck the buffer contents.
|
|
|
|
*/
|
|
|
|
clear_buffer_verified(bh);
|
|
|
|
|
|
|
|
bh->b_end_io = end_io ? end_io : end_buffer_read_sync;
|
|
|
|
get_bh(bh);
|
|
|
|
submit_bh(REQ_OP_READ, op_flags, bh);
|
|
|
|
}
|
|
|
|
|
|
|
|
void ext4_read_bh_nowait(struct buffer_head *bh, int op_flags,
|
|
|
|
bh_end_io_t *end_io)
|
|
|
|
{
|
|
|
|
BUG_ON(!buffer_locked(bh));
|
|
|
|
|
|
|
|
if (ext4_buffer_uptodate(bh)) {
|
|
|
|
unlock_buffer(bh);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
__ext4_read_bh(bh, op_flags, end_io);
|
|
|
|
}
|
|
|
|
|
|
|
|
int ext4_read_bh(struct buffer_head *bh, int op_flags, bh_end_io_t *end_io)
|
|
|
|
{
|
|
|
|
BUG_ON(!buffer_locked(bh));
|
|
|
|
|
|
|
|
if (ext4_buffer_uptodate(bh)) {
|
|
|
|
unlock_buffer(bh);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
__ext4_read_bh(bh, op_flags, end_io);
|
|
|
|
|
|
|
|
wait_on_buffer(bh);
|
|
|
|
if (buffer_uptodate(bh))
|
|
|
|
return 0;
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
int ext4_read_bh_lock(struct buffer_head *bh, int op_flags, bool wait)
|
|
|
|
{
|
|
|
|
if (trylock_buffer(bh)) {
|
|
|
|
if (wait)
|
|
|
|
return ext4_read_bh(bh, op_flags, NULL);
|
|
|
|
ext4_read_bh_nowait(bh, op_flags, NULL);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
if (wait) {
|
|
|
|
wait_on_buffer(bh);
|
|
|
|
if (buffer_uptodate(bh))
|
|
|
|
return 0;
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-11-25 22:20:31 +00:00
|
|
|
/*
|
2020-09-24 07:33:37 +00:00
|
|
|
* This works like __bread_gfp() except it uses ERR_PTR for error
|
2018-11-25 22:20:31 +00:00
|
|
|
* returns. Currently with sb_bread it's impossible to distinguish
|
|
|
|
* between ENOMEM and EIO situations (since both result in a NULL
|
|
|
|
* return).
|
|
|
|
*/
|
2020-09-24 07:33:37 +00:00
|
|
|
static struct buffer_head *__ext4_sb_bread_gfp(struct super_block *sb,
|
|
|
|
sector_t block, int op_flags,
|
|
|
|
gfp_t gfp)
|
2018-11-25 22:20:31 +00:00
|
|
|
{
|
2020-09-24 07:33:33 +00:00
|
|
|
struct buffer_head *bh;
|
|
|
|
int ret;
|
2018-11-25 22:20:31 +00:00
|
|
|
|
2020-09-24 07:33:37 +00:00
|
|
|
bh = sb_getblk_gfp(sb, block, gfp);
|
2018-11-25 22:20:31 +00:00
|
|
|
if (bh == NULL)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2019-12-14 21:42:52 +00:00
|
|
|
if (ext4_buffer_uptodate(bh))
|
2018-11-25 22:20:31 +00:00
|
|
|
return bh;
|
2020-09-24 07:33:33 +00:00
|
|
|
|
|
|
|
ret = ext4_read_bh_lock(bh, REQ_META | op_flags, true);
|
|
|
|
if (ret) {
|
|
|
|
put_bh(bh);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
return bh;
|
2018-11-25 22:20:31 +00:00
|
|
|
}
|
|
|
|
|
2020-09-24 07:33:37 +00:00
|
|
|
struct buffer_head *ext4_sb_bread(struct super_block *sb, sector_t block,
|
|
|
|
int op_flags)
|
|
|
|
{
|
|
|
|
return __ext4_sb_bread_gfp(sb, block, op_flags, __GFP_MOVABLE);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
|
|
|
|
sector_t block)
|
|
|
|
{
|
|
|
|
return __ext4_sb_bread_gfp(sb, block, 0, 0);
|
|
|
|
}
|
|
|
|
|
2020-09-24 07:33:35 +00:00
|
|
|
void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
|
|
|
|
{
|
|
|
|
struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
|
|
|
|
|
|
|
|
if (likely(bh)) {
|
|
|
|
ext4_read_bh_lock(bh, REQ_RAHEAD, false);
|
|
|
|
brelse(bh);
|
|
|
|
}
|
2018-11-25 22:20:31 +00:00
|
|
|
}
|
|
|
|
|
2012-04-29 22:25:10 +00:00
|
|
|
static int ext4_verify_csum_type(struct super_block *sb,
|
|
|
|
struct ext4_super_block *es)
|
|
|
|
{
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_metadata_csum(sb))
|
2012-04-29 22:25:10 +00:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
return es->s_checksum_type == EXT4_CRC32C_CHKSUM;
|
|
|
|
}
|
|
|
|
|
2012-04-29 22:29:10 +00:00
|
|
|
static __le32 ext4_superblock_csum(struct super_block *sb,
|
|
|
|
struct ext4_super_block *es)
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
int offset = offsetof(struct ext4_super_block, s_checksum);
|
|
|
|
__u32 csum;
|
|
|
|
|
|
|
|
csum = ext4_chksum(sbi, ~0, (char *)es, offset);
|
|
|
|
|
|
|
|
return cpu_to_le32(csum);
|
|
|
|
}
|
|
|
|
|
2014-05-12 14:50:23 +00:00
|
|
|
static int ext4_superblock_csum_verify(struct super_block *sb,
|
|
|
|
struct ext4_super_block *es)
|
2012-04-29 22:29:10 +00:00
|
|
|
{
|
2014-10-13 07:36:16 +00:00
|
|
|
if (!ext4_has_metadata_csum(sb))
|
2012-04-29 22:29:10 +00:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
return es->s_checksum == ext4_superblock_csum(sb, es);
|
|
|
|
}
|
|
|
|
|
2012-10-10 05:06:58 +00:00
|
|
|
void ext4_superblock_csum_set(struct super_block *sb)
|
2012-04-29 22:29:10 +00:00
|
|
|
{
|
2012-10-10 05:06:58 +00:00
|
|
|
struct ext4_super_block *es = EXT4_SB(sb)->s_es;
|
|
|
|
|
2014-10-13 07:36:16 +00:00
|
|
|
if (!ext4_has_metadata_csum(sb))
|
2012-04-29 22:29:10 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
es->s_checksum = ext4_superblock_csum(sb, es);
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
2006-10-11 08:21:10 +00:00
|
|
|
{
|
2007-10-16 22:38:25 +00:00
|
|
|
return le32_to_cpu(bg->bg_block_bitmap_lo) |
|
2006-10-11 08:21:15 +00:00
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(ext4_fsblk_t)le32_to_cpu(bg->bg_block_bitmap_hi) << 32 : 0);
|
2006-10-11 08:21:10 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
2006-10-11 08:21:10 +00:00
|
|
|
{
|
2007-10-16 22:38:25 +00:00
|
|
|
return le32_to_cpu(bg->bg_inode_bitmap_lo) |
|
2006-10-11 08:21:15 +00:00
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(ext4_fsblk_t)le32_to_cpu(bg->bg_inode_bitmap_hi) << 32 : 0);
|
2006-10-11 08:21:10 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
ext4_fsblk_t ext4_inode_table(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
2006-10-11 08:21:10 +00:00
|
|
|
{
|
2007-10-16 22:38:25 +00:00
|
|
|
return le32_to_cpu(bg->bg_inode_table_lo) |
|
2006-10-11 08:21:15 +00:00
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(ext4_fsblk_t)le32_to_cpu(bg->bg_inode_table_hi) << 32 : 0);
|
2006-10-11 08:21:10 +00:00
|
|
|
}
|
|
|
|
|
2011-09-09 23:08:51 +00:00
|
|
|
__u32 ext4_free_group_clusters(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
2009-01-06 03:20:24 +00:00
|
|
|
{
|
|
|
|
return le16_to_cpu(bg->bg_free_blocks_count_lo) |
|
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(__u32)le16_to_cpu(bg->bg_free_blocks_count_hi) << 16 : 0);
|
2009-01-06 03:20:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
__u32 ext4_free_inodes_count(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
|
|
|
{
|
|
|
|
return le16_to_cpu(bg->bg_free_inodes_count_lo) |
|
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(__u32)le16_to_cpu(bg->bg_free_inodes_count_hi) << 16 : 0);
|
2009-01-06 03:20:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
__u32 ext4_used_dirs_count(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
|
|
|
{
|
|
|
|
return le16_to_cpu(bg->bg_used_dirs_count_lo) |
|
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(__u32)le16_to_cpu(bg->bg_used_dirs_count_hi) << 16 : 0);
|
2009-01-06 03:20:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
__u32 ext4_itable_unused_count(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg)
|
|
|
|
{
|
|
|
|
return le16_to_cpu(bg->bg_itable_unused_lo) |
|
|
|
|
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
|
2009-06-03 21:59:28 +00:00
|
|
|
(__u32)le16_to_cpu(bg->bg_itable_unused_hi) << 16 : 0);
|
2009-01-06 03:20:24 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
void ext4_block_bitmap_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, ext4_fsblk_t blk)
|
2006-10-11 08:21:10 +00:00
|
|
|
{
|
2007-10-16 22:38:25 +00:00
|
|
|
bg->bg_block_bitmap_lo = cpu_to_le32((u32)blk);
|
2006-10-11 08:21:15 +00:00
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_block_bitmap_hi = cpu_to_le32(blk >> 32);
|
2006-10-11 08:21:10 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
void ext4_inode_bitmap_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, ext4_fsblk_t blk)
|
2006-10-11 08:21:10 +00:00
|
|
|
{
|
2007-10-16 22:38:25 +00:00
|
|
|
bg->bg_inode_bitmap_lo = cpu_to_le32((u32)blk);
|
2006-10-11 08:21:15 +00:00
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_inode_bitmap_hi = cpu_to_le32(blk >> 32);
|
2006-10-11 08:21:10 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
void ext4_inode_table_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, ext4_fsblk_t blk)
|
2006-10-11 08:21:10 +00:00
|
|
|
{
|
2007-10-16 22:38:25 +00:00
|
|
|
bg->bg_inode_table_lo = cpu_to_le32((u32)blk);
|
2006-10-11 08:21:15 +00:00
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_inode_table_hi = cpu_to_le32(blk >> 32);
|
2006-10-11 08:21:10 +00:00
|
|
|
}
|
|
|
|
|
2011-09-09 23:08:51 +00:00
|
|
|
void ext4_free_group_clusters_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, __u32 count)
|
2009-01-06 03:20:24 +00:00
|
|
|
{
|
|
|
|
bg->bg_free_blocks_count_lo = cpu_to_le16((__u16)count);
|
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_free_blocks_count_hi = cpu_to_le16(count >> 16);
|
|
|
|
}
|
|
|
|
|
|
|
|
void ext4_free_inodes_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, __u32 count)
|
|
|
|
{
|
|
|
|
bg->bg_free_inodes_count_lo = cpu_to_le16((__u16)count);
|
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_free_inodes_count_hi = cpu_to_le16(count >> 16);
|
|
|
|
}
|
|
|
|
|
|
|
|
void ext4_used_dirs_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, __u32 count)
|
|
|
|
{
|
|
|
|
bg->bg_used_dirs_count_lo = cpu_to_le16((__u16)count);
|
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_used_dirs_count_hi = cpu_to_le16(count >> 16);
|
|
|
|
}
|
|
|
|
|
|
|
|
void ext4_itable_unused_set(struct super_block *sb,
|
|
|
|
struct ext4_group_desc *bg, __u32 count)
|
|
|
|
{
|
|
|
|
bg->bg_itable_unused_lo = cpu_to_le16((__u16)count);
|
|
|
|
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
|
|
|
|
bg->bg_itable_unused_hi = cpu_to_le16(count >> 16);
|
|
|
|
}
|
|
|
|
|
2020-11-27 11:34:00 +00:00
|
|
|
static void __ext4_update_tstamp(__le32 *lo, __u8 *hi, time64_t now)
|
2018-07-29 19:51:48 +00:00
|
|
|
{
|
|
|
|
now = clamp_val(now, 0, (1ull << 40) - 1);
|
|
|
|
|
|
|
|
*lo = cpu_to_le32(lower_32_bits(now));
|
|
|
|
*hi = upper_32_bits(now);
|
|
|
|
}
|
|
|
|
|
|
|
|
static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
|
|
|
|
{
|
|
|
|
return ((time64_t)(*hi) << 32) + le32_to_cpu(*lo);
|
|
|
|
}
|
|
|
|
#define ext4_update_tstamp(es, tstamp) \
|
2020-11-27 11:34:00 +00:00
|
|
|
__ext4_update_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi, \
|
|
|
|
ktime_get_real_seconds())
|
2018-07-29 19:51:48 +00:00
|
|
|
#define ext4_get_tstamp(es, tstamp) \
|
|
|
|
__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
|
2009-09-29 15:01:03 +00:00
|
|
|
|
2015-08-16 14:03:57 +00:00
|
|
|
/*
|
|
|
|
* The del_gendisk() function uninitializes the disk-specific data
|
|
|
|
* structures, including the bdi structure, without telling anyone
|
|
|
|
* else. Once this happens, any attempt to call mark_buffer_dirty()
|
|
|
|
* (for example, by ext4_commit_super), will cause a kernel OOPS.
|
|
|
|
* This is a kludge to prevent these oops until we can put in a proper
|
|
|
|
* hook in del_gendisk() to inform the VFS and file system layers.
|
|
|
|
*/
|
|
|
|
static int block_device_ejected(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct inode *bd_inode = sb->s_bdev->bd_inode;
|
|
|
|
struct backing_dev_info *bdi = inode_to_bdi(bd_inode);
|
|
|
|
|
|
|
|
return bdi->dev == NULL;
|
|
|
|
}
|
|
|
|
|
2012-02-20 22:53:02 +00:00
|
|
|
static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
|
|
|
|
{
|
|
|
|
struct super_block *sb = journal->j_private;
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
int error = is_journal_aborted(journal);
|
2013-04-04 02:08:52 +00:00
|
|
|
struct ext4_journal_cb_entry *jce;
|
2012-02-20 22:53:02 +00:00
|
|
|
|
2013-04-04 02:08:52 +00:00
|
|
|
BUG_ON(txn->t_state == T_FINISHED);
|
2017-06-23 03:54:33 +00:00
|
|
|
|
|
|
|
ext4_process_freed_data(sb, txn->t_tid);
|
|
|
|
|
2012-02-20 22:53:02 +00:00
|
|
|
spin_lock(&sbi->s_md_lock);
|
2013-04-04 02:08:52 +00:00
|
|
|
while (!list_empty(&txn->t_private_list)) {
|
|
|
|
jce = list_entry(txn->t_private_list.next,
|
|
|
|
struct ext4_journal_cb_entry, jce_list);
|
2012-02-20 22:53:02 +00:00
|
|
|
list_del_init(&jce->jce_list);
|
|
|
|
spin_unlock(&sbi->s_md_lock);
|
|
|
|
jce->jce_func(sb, jce, error);
|
|
|
|
spin_lock(&sbi->s_md_lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&sbi->s_md_lock);
|
|
|
|
}
|
2010-07-27 15:56:03 +00:00
|
|
|
|
2020-10-06 00:48:41 +00:00
|
|
|
/*
|
|
|
|
* This writepage callback for write_cache_pages()
|
|
|
|
* takes care of a few cases after page cleaning.
|
|
|
|
*
|
|
|
|
* write_cache_pages() already checks for dirty pages
|
|
|
|
* and calls clear_page_dirty_for_io(), which we want,
|
|
|
|
* to write protect the pages.
|
|
|
|
*
|
|
|
|
* However, we may have to redirty a page (see below.)
|
|
|
|
*/
|
|
|
|
static int ext4_journalled_writepage_callback(struct page *page,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
transaction_t *transaction = (transaction_t *) data;
|
|
|
|
struct buffer_head *bh, *head;
|
|
|
|
struct journal_head *jh;
|
|
|
|
|
|
|
|
bh = head = page_buffers(page);
|
|
|
|
do {
|
|
|
|
/*
|
|
|
|
* We have to redirty a page in these cases:
|
|
|
|
* 1) If buffer is dirty, it means the page was dirty because it
|
|
|
|
* contains a buffer that needs checkpointing. So the dirty bit
|
|
|
|
* needs to be preserved so that checkpointing writes the buffer
|
|
|
|
* properly.
|
|
|
|
* 2) If buffer is not part of the committing transaction
|
|
|
|
* (we may have just accidentally come across this buffer because
|
|
|
|
* inode range tracking is not exact) or if the currently running
|
|
|
|
* transaction already contains this buffer as well, dirty bit
|
|
|
|
* needs to be preserved so that the buffer gets writeprotected
|
|
|
|
* properly on running transaction's commit.
|
|
|
|
*/
|
|
|
|
jh = bh2jh(bh);
|
|
|
|
if (buffer_dirty(bh) ||
|
|
|
|
(jh && (jh->b_transaction != transaction ||
|
|
|
|
jh->b_next_transaction))) {
|
|
|
|
redirty_page_for_writepage(wbc, page);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
} while ((bh = bh->b_this_page) != head);
|
|
|
|
|
|
|
|
out:
|
|
|
|
return AOP_WRITEPAGE_ACTIVATE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
|
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = WB_SYNC_ALL,
|
|
|
|
.nr_to_write = LONG_MAX,
|
|
|
|
.range_start = jinode->i_dirty_start,
|
|
|
|
.range_end = jinode->i_dirty_end,
|
|
|
|
};
|
|
|
|
|
|
|
|
return write_cache_pages(mapping, &wbc,
|
|
|
|
ext4_journalled_writepage_callback,
|
|
|
|
jinode->i_transaction);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (ext4_should_journal_data(jinode->i_vfs_inode))
|
|
|
|
ret = ext4_journalled_submit_inode_data_buffers(jinode);
|
|
|
|
else
|
|
|
|
ret = jbd2_journal_submit_inode_data_buffers(jinode);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!ext4_should_journal_data(jinode->i_vfs_inode))
|
|
|
|
ret = jbd2_journal_finish_inode_data_buffers(jinode);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-03-15 03:46:05 +00:00
|
|
|
static bool system_going_down(void)
|
|
|
|
{
|
|
|
|
return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
|
|
|
|
|| system_state == SYSTEM_RESTART;
|
|
|
|
}
|
|
|
|
|
2020-11-27 11:33:59 +00:00
|
|
|
struct ext4_err_translation {
	int code;
	int errno;
};

#define EXT4_ERR_TRANSLATE(err) { .code = EXT4_ERR_##err, .errno = err }

static struct ext4_err_translation err_translation[] = {
	EXT4_ERR_TRANSLATE(EIO),
	EXT4_ERR_TRANSLATE(ENOMEM),
	EXT4_ERR_TRANSLATE(EFSBADCRC),
	EXT4_ERR_TRANSLATE(EFSCORRUPTED),
	EXT4_ERR_TRANSLATE(ENOSPC),
	EXT4_ERR_TRANSLATE(ENOKEY),
	EXT4_ERR_TRANSLATE(EROFS),
	EXT4_ERR_TRANSLATE(EFBIG),
	EXT4_ERR_TRANSLATE(EEXIST),
	EXT4_ERR_TRANSLATE(ERANGE),
	EXT4_ERR_TRANSLATE(EOVERFLOW),
	EXT4_ERR_TRANSLATE(EBUSY),
	EXT4_ERR_TRANSLATE(ENOTDIR),
	EXT4_ERR_TRANSLATE(ENOTEMPTY),
	EXT4_ERR_TRANSLATE(ESHUTDOWN),
	EXT4_ERR_TRANSLATE(EFAULT),
};

static int ext4_errno_to_code(int errno)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(err_translation); i++)
		if (err_translation[i].errno == errno)
			return err_translation[i].code;
	return EXT4_ERR_UNKNOWN;
}
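
/*
 * Illustrative note (not in the original source): EXT4_ERR_TRANSLATE(EIO)
 * expands to
 *
 *	{ .code = EXT4_ERR_EIO, .errno = EIO }
 *
 * so ext4_errno_to_code(EIO) returns EXT4_ERR_EIO, while an errno with no
 * entry in err_translation[] falls through to EXT4_ERR_UNKNOWN. The lookup
 * compares against positive errno values, so callers are presumably expected
 * to pass EIO rather than -EIO.
 */
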
static void save_error_info(struct super_block *sb, int error,
			    __u32 ino, __u64 block,
			    const char *func, unsigned int line)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* We default to EFSCORRUPTED error... */
	if (error == 0)
		error = EFSCORRUPTED;

	spin_lock(&sbi->s_error_lock);
	sbi->s_add_error_count++;
	sbi->s_last_error_code = error;
	sbi->s_last_error_line = line;
	sbi->s_last_error_ino = ino;
	sbi->s_last_error_block = block;
	sbi->s_last_error_func = func;
	sbi->s_last_error_time = ktime_get_real_seconds();
	if (!sbi->s_first_error_time) {
		sbi->s_first_error_code = error;
		sbi->s_first_error_line = line;
		sbi->s_first_error_ino = ino;
		sbi->s_first_error_block = block;
		sbi->s_first_error_func = func;
		sbi->s_first_error_time = sbi->s_last_error_time;
	}
	spin_unlock(&sbi->s_error_lock);
}

/* Deal with the reporting of failure conditions on a filesystem such as
 * inconsistencies detected or read IO failures.
 *
 * On ext2, we can store the error state of the filesystem in the
 * superblock.  That is not possible on ext4, because we may have other
 * write ordering constraints on the superblock which prevent us from
 * writing it out straight away; and given that the journal is about to
 * be aborted, we can't rely on the current, or future, transactions to
 * write out the superblock safely.
 *
 * We'll just use the jbd2_journal_abort() error code to record an error in
 * the journal instead.  On recovery, the journal will complain about
 * that error until we've noted it down and cleared it.
 *
 * If force_ro is set, we unconditionally force the filesystem into an
 * ABORT|READONLY state, unless the error response on the fs has been set to
 * panic in which case we take the easy way out and panic immediately. This is
 * used to deal with unrecoverable failures such as journal IO errors or ENOMEM
 * at a critical moment in log management.
 */
static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
			      __u32 ino, __u64 block,
			      const char *func, unsigned int line)
{
	journal_t *journal = EXT4_SB(sb)->s_journal;
	bool continue_fs = !force_ro && test_opt(sb, ERRORS_CONT);

	EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
	if (test_opt(sb, WARN_ON_ERROR))
		WARN_ON_ONCE(1);

	if (!continue_fs && !sb_rdonly(sb)) {
		ext4_set_mount_flag(sb, EXT4_MF_FS_ABORTED);
		if (journal)
			jbd2_journal_abort(journal, -EIO);
	}

	if (!bdev_read_only(sb->s_bdev)) {
		save_error_info(sb, error, ino, block, func, line);
		/*
		 * In case the fs should keep running, we need to write out
		 * the superblock through the journal. Due to lock ordering
		 * constraints, it may not be safe to do it right here so we
		 * defer superblock flushing to a workqueue.
		 */
		if (continue_fs)
			schedule_work(&EXT4_SB(sb)->s_error_work);
		else
			ext4_commit_super(sb);
	}

	/*
	 * We force ERRORS_RO behavior when the system is rebooting. Otherwise
	 * we could panic during 'reboot -f' because the underlying device has
	 * already been disabled.
	 */
	if (test_opt(sb, ERRORS_PANIC) && !system_going_down()) {
		panic("EXT4-fs (device %s): panic forced after error\n",
			sb->s_id);
	}

	if (sb_rdonly(sb) || continue_fs)
		return;

	ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
	/*
	 * Make sure updated value of ->s_mount_flags will be visible before
	 * ->s_flags update
	 */
	smp_wmb();
	sb->s_flags |= SB_RDONLY;
}

static void flush_stashed_error_work(struct work_struct *work)
{
	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
						s_error_work);
	journal_t *journal = sbi->s_journal;
	handle_t *handle;

	/*
	 * If the journal is still running, we have to write out the superblock
	 * through the journal to avoid collisions with other journalled sb
	 * updates.
	 *
	 * We use jbd2 functions directly here to avoid recursing back into
	 * ext4 error handling code during handling of previous errors.
	 */
	if (!sb_rdonly(sbi->s_sb) && journal) {
		struct buffer_head *sbh = sbi->s_sbh;
		handle = jbd2_journal_start(journal, 1);
		if (IS_ERR(handle))
			goto write_directly;
		if (jbd2_journal_get_write_access(handle, sbh)) {
			jbd2_journal_stop(handle);
			goto write_directly;
		}
		ext4_update_super(sbi->s_sb);
		if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) {
			ext4_msg(sbi->s_sb, KERN_ERR, "previous I/O error to "
				 "superblock detected");
			clear_buffer_write_io_error(sbh);
			set_buffer_uptodate(sbh);
		}

		if (jbd2_journal_dirty_metadata(handle, sbh)) {
			jbd2_journal_stop(handle);
			goto write_directly;
		}
		jbd2_journal_stop(handle);
		ext4_notify_error_sysfs(sbi);
		return;
	}
write_directly:
	/*
	 * Write through journal failed. Write sb directly to get error info
	 * out and hope for the best.
	 */
	ext4_commit_super(sbi->s_sb);
	ext4_notify_error_sysfs(sbi);
}

#define ext4_error_ratelimit(sb)					\
		___ratelimit(&(EXT4_SB(sb)->s_err_ratelimit_state),	\
			     "EXT4-fs error")

void __ext4_error(struct super_block *sb, const char *function,
		  unsigned int line, bool force_ro, int error, __u64 block,
		  const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb))))
		return;

	trace_ext4_error(sb, function, line);
	if (ext4_error_ratelimit(sb)) {
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		printk(KERN_CRIT
		       "EXT4-fs error (device %s): %s:%d: comm %s: %pV\n",
		       sb->s_id, function, line, current->comm, &vaf);
		va_end(args);
	}
	ext4_handle_error(sb, force_ro, error, 0, block, function, line);
}
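
/*
 * Illustrative example (hypothetical values): for a superblock with s_id
 * "sda1", a call made from a function named example_fn at line 123 by a
 * task whose comm is "mount", with fmt "bad block %u" and argument 7, the
 * printk() in __ext4_error() would emit
 *
 *	EXT4-fs error (device sda1): example_fn:123: comm mount: bad block 7
 *
 * provided the rate limit allows it.
 */
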
void __ext4_error_inode(struct inode *inode, const char *function,
			unsigned int line, ext4_fsblk_t block, int error,
			const char *fmt, ...)
{
	va_list args;
	struct va_format vaf;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return;

	trace_ext4_error(inode->i_sb, function, line);
	if (ext4_error_ratelimit(inode->i_sb)) {
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		if (block)
			printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: "
			       "inode #%lu: block %llu: comm %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       block, current->comm, &vaf);
		else
			printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: "
			       "inode #%lu: comm %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       current->comm, &vaf);
		va_end(args);
	}
	ext4_handle_error(inode->i_sb, false, error, inode->i_ino, block,
			  function, line);
}

void __ext4_error_file(struct file *file, const char *function,
		       unsigned int line, ext4_fsblk_t block,
		       const char *fmt, ...)
{
	va_list args;
	struct va_format vaf;
	struct inode *inode = file_inode(file);
	char pathname[80], *path;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return;

	trace_ext4_error(inode->i_sb, function, line);
	if (ext4_error_ratelimit(inode->i_sb)) {
		path = file_path(file, pathname, sizeof(pathname));
		if (IS_ERR(path))
			path = "(unknown)";
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		if (block)
			printk(KERN_CRIT
			       "EXT4-fs error (device %s): %s:%d: inode #%lu: "
			       "block %llu: comm %s: path %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       block, current->comm, path, &vaf);
		else
			printk(KERN_CRIT
			       "EXT4-fs error (device %s): %s:%d: inode #%lu: "
			       "comm %s: path %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       current->comm, path, &vaf);
		va_end(args);
	}
	ext4_handle_error(inode->i_sb, false, EFSCORRUPTED, inode->i_ino, block,
			  function, line);
}

const char *ext4_decode_error(struct super_block *sb, int errno,
			      char nbuf[16])
{
	char *errstr = NULL;

	switch (errno) {
	case -EFSCORRUPTED:
		errstr = "Corrupt filesystem";
		break;
	case -EFSBADCRC:
		errstr = "Filesystem failed CRC";
		break;
	case -EIO:
		errstr = "IO failure";
		break;
	case -ENOMEM:
		errstr = "Out of memory";
		break;
	case -EROFS:
		if (!sb || (EXT4_SB(sb)->s_journal &&
			    EXT4_SB(sb)->s_journal->j_flags & JBD2_ABORT))
			errstr = "Journal has aborted";
		else
			errstr = "Readonly filesystem";
		break;
	default:
		/* If the caller passed in an extra buffer for unknown
		 * errors, textualise them now.  Else we just return
		 * NULL. */
		if (nbuf) {
			/* Check for truncated error codes... */
			if (snprintf(nbuf, 16, "error %d", -errno) >= 0)
				errstr = nbuf;
		}
		break;
	}

	return errstr;
}
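
/*
 * Illustrative example (assuming the usual Linux value ENOSPC == 28):
 *
 *	char nbuf[16];
 *	const char *s;
 *
 *	s = ext4_decode_error(sb, -EIO, nbuf);     s is "IO failure"
 *	s = ext4_decode_error(sb, -ENOSPC, nbuf);  s is nbuf, holding "error 28"
 *	s = ext4_decode_error(sb, -ENOSPC, NULL);  s is NULL
 *
 * Note that ext4_decode_error() expects a negative errno value.
 */
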
/* __ext4_std_error decodes expected errors from journaling functions
 * automatically and invokes the appropriate error response.  */

void __ext4_std_error(struct super_block *sb, const char *function,
		      unsigned int line, int errno)
{
	char nbuf[16];
	const char *errstr;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb))))
		return;

	/* Special case: if the error is EROFS, and we're not already
	 * inside a transaction, then there's really no point in logging
	 * an error. */
	if (errno == -EROFS && journal_current_handle() == NULL && sb_rdonly(sb))
		return;

	if (ext4_error_ratelimit(sb)) {
		errstr = ext4_decode_error(sb, errno, nbuf);
		printk(KERN_CRIT "EXT4-fs error (device %s) in %s:%d: %s\n",
		       sb->s_id, function, line, errstr);
	}

	ext4_handle_error(sb, false, -errno, 0, 0, function, line);
}

void __ext4_msg(struct super_block *sb,
		const char *prefix, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	atomic_inc(&EXT4_SB(sb)->s_msg_count);
	if (!___ratelimit(&(EXT4_SB(sb)->s_msg_ratelimit_state), "EXT4-fs"))
		return;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	printk("%sEXT4-fs (%s): %pV\n", prefix, sb->s_id, &vaf);
	va_end(args);
}

static int ext4_warning_ratelimit(struct super_block *sb)
{
	atomic_inc(&EXT4_SB(sb)->s_warning_count);
	return ___ratelimit(&(EXT4_SB(sb)->s_warning_ratelimit_state),
			    "EXT4-fs warning");
}

void __ext4_warning(struct super_block *sb, const char *function,
		    unsigned int line, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	if (!ext4_warning_ratelimit(sb))
		return;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	printk(KERN_WARNING "EXT4-fs warning (device %s): %s:%d: %pV\n",
	       sb->s_id, function, line, &vaf);
	va_end(args);
}

void __ext4_warning_inode(const struct inode *inode, const char *function,
			  unsigned int line, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	if (!ext4_warning_ratelimit(inode->i_sb))
		return;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	printk(KERN_WARNING "EXT4-fs warning (device %s): %s:%d: "
	       "inode #%lu: comm %s: %pV\n", inode->i_sb->s_id,
	       function, line, inode->i_ino, current->comm, &vaf);
	va_end(args);
}

void __ext4_grp_locked_error(const char *function, unsigned int line,
			     struct super_block *sb, ext4_group_t grp,
			     unsigned long ino, ext4_fsblk_t block,
			     const char *fmt, ...)
__releases(bitlock)
__acquires(bitlock)
{
	struct va_format vaf;
	va_list args;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb))))
		return;

	trace_ext4_error(sb, function, line);
	if (ext4_error_ratelimit(sb)) {
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: group %u, ",
		       sb->s_id, function, line, grp);
		if (ino)
			printk(KERN_CONT "inode %lu: ", ino);
		if (block)
			printk(KERN_CONT "block %llu:",
			       (unsigned long long) block);
		printk(KERN_CONT "%pV\n", &vaf);
		va_end(args);
	}

	if (test_opt(sb, ERRORS_CONT)) {
		if (test_opt(sb, WARN_ON_ERROR))
			WARN_ON_ONCE(1);
		EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
		if (!bdev_read_only(sb->s_bdev)) {
			save_error_info(sb, EFSCORRUPTED, ino, block, function,
					line);
			schedule_work(&EXT4_SB(sb)->s_error_work);
		}
		return;
	}
	ext4_unlock_group(sb, grp);
	ext4_handle_error(sb, false, EFSCORRUPTED, ino, block, function, line);
	/*
	 * We only get here in the ERRORS_RO case; relocking the group
	 * may be dangerous, but nothing bad will happen since the
	 * filesystem will have already been marked read/only and the
	 * journal has been aborted.  We return 1 as a hint to callers
	 * who might want to use the return value from
	 * ext4_grp_locked_error() to distinguish between the
	 * ERRORS_CONT and ERRORS_RO case, and perhaps return more
	 * aggressively from the ext4 function in question, with a
	 * more appropriate error code.
	 */
	ext4_lock_group(sb, grp);
	return;
}

void ext4_mark_group_bitmap_corrupted(struct super_block *sb,
				     ext4_group_t group,
				     unsigned int flags)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
	int ret;

	if (flags & EXT4_GROUP_INFO_BBITMAP_CORRUPT) {
		ret = ext4_test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
					    &grp->bb_state);
		if (!ret)
			percpu_counter_sub(&sbi->s_freeclusters_counter,
					   grp->bb_free);
	}

	if (flags & EXT4_GROUP_INFO_IBITMAP_CORRUPT) {
		ret = ext4_test_and_set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT,
					    &grp->bb_state);
		if (!ret && gdp) {
			int count;

			count = ext4_free_inodes_count(sb, gdp);
			percpu_counter_sub(&sbi->s_freeinodes_counter,
					   count);
		}
	}
}

void ext4_update_dynamic_rev(struct super_block *sb)
{
	struct ext4_super_block *es = EXT4_SB(sb)->s_es;

	if (le32_to_cpu(es->s_rev_level) > EXT4_GOOD_OLD_REV)
		return;

	ext4_warning(sb,
		     "updating to rev %d because of new feature flag, "
		     "running e2fsck is recommended",
		     EXT4_DYNAMIC_REV);

	es->s_first_ino = cpu_to_le32(EXT4_GOOD_OLD_FIRST_INO);
	es->s_inode_size = cpu_to_le16(EXT4_GOOD_OLD_INODE_SIZE);
	es->s_rev_level = cpu_to_le32(EXT4_DYNAMIC_REV);
	/* leave es->s_feature_*compat flags alone */
	/* es->s_uuid will be set by e2fsck if empty */

	/*
	 * The rest of the superblock fields should be zero, and if not it
	 * means they are likely already in use, so leave them alone.  We
	 * can leave it up to e2fsck to clean up any inconsistencies there.
	 */
}

/*
 * Open the external journal device
 */
static struct block_device *ext4_blkdev_get(dev_t dev, struct super_block *sb)
{
	struct block_device *bdev;

	bdev = blkdev_get_by_dev(dev, FMODE_READ|FMODE_WRITE|FMODE_EXCL, sb);
	if (IS_ERR(bdev))
		goto fail;
	return bdev;

fail:
	ext4_msg(sb, KERN_ERR,
		 "failed to open journal device unknown-block(%u,%u) %ld",
		 MAJOR(dev), MINOR(dev), PTR_ERR(bdev));
	return NULL;
}

/*
 * Release the journal device
 */
static void ext4_blkdev_put(struct block_device *bdev)
{
	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
}

static void ext4_blkdev_remove(struct ext4_sb_info *sbi)
{
	struct block_device *bdev;
	bdev = sbi->s_journal_bdev;
	if (bdev) {
		ext4_blkdev_put(bdev);
		sbi->s_journal_bdev = NULL;
	}
}

static inline struct inode *orphan_list_entry(struct list_head *l)
{
	return &list_entry(l, struct ext4_inode_info, i_orphan)->vfs_inode;
}

static void dump_orphan_list(struct super_block *sb, struct ext4_sb_info *sbi)
{
	struct list_head *l;

	ext4_msg(sb, KERN_ERR, "sb orphan head is %d",
		 le32_to_cpu(sbi->s_es->s_last_orphan));

	printk(KERN_ERR "sb_info orphan list:\n");
	list_for_each(l, &sbi->s_orphan) {
		struct inode *inode = orphan_list_entry(l);
		printk(KERN_ERR " "
		       "inode %s:%lu at %p: mode %o, nlink %d, next %d\n",
		       inode->i_sb->s_id, inode->i_ino, inode,
		       inode->i_mode, inode->i_nlink,
		       NEXT_ORPHAN(inode));
	}
}

#ifdef CONFIG_QUOTA
static int ext4_quota_off(struct super_block *sb, int type);

static inline void ext4_quota_off_umount(struct super_block *sb)
{
	int type;

	/* Use our quota_off function to clear inode flags etc. */
	for (type = 0; type < EXT4_MAXQUOTAS; type++)
		ext4_quota_off(sb, type);
}

/*
 * This is a helper function which is used in the mount/remount
 * codepaths (which holds s_umount) to fetch the quota file name.
 */
static inline char *get_qf_name(struct super_block *sb,
				struct ext4_sb_info *sbi,
				int type)
{
	return rcu_dereference_protected(sbi->s_qf_names[type],
					 lockdep_is_held(&sb->s_umount));
}
#else
static inline void ext4_quota_off_umount(struct super_block *sb)
{
}
#endif

static void ext4_put_super(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_super_block *es = sbi->s_es;
	struct buffer_head **group_desc;
	struct flex_groups **flex_groups;
	int aborted = 0;
	int i, err;

	ext4_unregister_li_request(sb);
	ext4_quota_off_umount(sb);

	flush_work(&sbi->s_error_work);
	destroy_workqueue(sbi->rsv_conversion_wq);

	/*
	 * Unregister sysfs before destroying jbd2 journal.
	 * Since we could still access attr_journal_task attribute via sysfs
	 * path which could have sbi->s_journal->j_task as NULL
	 */
	ext4_unregister_sysfs(sb);

	if (sbi->s_journal) {
		aborted = is_journal_aborted(sbi->s_journal);
		err = jbd2_journal_destroy(sbi->s_journal);
		sbi->s_journal = NULL;
		if ((err < 0) && !aborted) {
			ext4_abort(sb, -err, "Couldn't clean up the journal");
		}
	}

	ext4_es_unregister_shrinker(sbi);
	del_timer_sync(&sbi->s_err_report);
	ext4_release_system_zone(sb);
	ext4_mb_release(sb);
	ext4_ext_release(sb);

	if (!sb_rdonly(sb) && !aborted) {
		ext4_clear_feature_journal_needs_recovery(sb);
		es->s_state = cpu_to_le16(sbi->s_mount_state);
	}
	if (!sb_rdonly(sb))
		ext4_commit_super(sb);

	rcu_read_lock();
	group_desc = rcu_dereference(sbi->s_group_desc);
	for (i = 0; i < sbi->s_gdb_count; i++)
		brelse(group_desc[i]);
	kvfree(group_desc);
	flex_groups = rcu_dereference(sbi->s_flex_groups);
	if (flex_groups) {
		for (i = 0; i < sbi->s_flex_groups_allocated; i++)
			kvfree(flex_groups[i]);
		kvfree(flex_groups);
	}
	rcu_read_unlock();
	percpu_counter_destroy(&sbi->s_freeclusters_counter);
	percpu_counter_destroy(&sbi->s_freeinodes_counter);
	percpu_counter_destroy(&sbi->s_dirs_counter);
	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
	percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
	percpu_free_rwsem(&sbi->s_writepages_rwsem);
#ifdef CONFIG_QUOTA
	for (i = 0; i < EXT4_MAXQUOTAS; i++)
		kfree(get_qf_name(sb, sbi, i));
#endif

	/* Debugging code just in case the in-memory inode orphan list
	 * isn't empty.  The on-disk one can be non-empty if we've
	 * detected an error and taken the fs readonly, but the
	 * in-memory list had better be clean by this point. */
	if (!list_empty(&sbi->s_orphan))
		dump_orphan_list(sb, sbi);
	ASSERT(list_empty(&sbi->s_orphan));

	sync_blockdev(sb->s_bdev);
	invalidate_bdev(sb->s_bdev);
	if (sbi->s_journal_bdev && sbi->s_journal_bdev != sb->s_bdev) {
		/*
		 * Invalidate the journal device's buffers.  We don't want them
		 * floating about in memory - the physical journal device may
		 * be hotswapped, and it breaks the `ro-after' testing code.
		 */
		sync_blockdev(sbi->s_journal_bdev);
		invalidate_bdev(sbi->s_journal_bdev);
		ext4_blkdev_remove(sbi);
	}

	ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
	sbi->s_ea_inode_cache = NULL;

	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
	sbi->s_ea_block_cache = NULL;

	ext4_stop_mmpd(sbi);

	brelse(sbi->s_sbh);
	sb->s_fs_info = NULL;
	/*
	 * Now that we are completely done shutting down the
	 * superblock, we need to actually destroy the kobject.
	 */
	kobject_put(&sbi->s_kobj);
	wait_for_completion(&sbi->s_kobj_unregister);
	if (sbi->s_chksum_driver)
		crypto_free_shash(sbi->s_chksum_driver);
	kfree(sbi->s_blockgroup_lock);
	fs_put_dax(sbi->s_daxdev);
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-17 04:11:35 +00:00
|
|
|
fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
|
2019-04-25 18:05:42 +00:00
|
|
|
#ifdef CONFIG_UNICODE
|
2020-10-28 05:08:20 +00:00
|
|
|
utf8_unload(sb->s_encoding);
|
2019-04-25 18:05:42 +00:00
|
|
|
#endif
|
2006-10-11 08:20:50 +00:00
|
|
|
kfree(sbi);
|
|
|
|
}
|
|
|
|
|
2006-12-07 04:33:20 +00:00
|
|
|
static struct kmem_cache *ext4_inode_cachep;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Called inside transaction, so use GFP_NOFS
|
|
|
|
*/
|
2006-10-11 08:20:53 +00:00
|
|
|
static struct inode *ext4_alloc_inode(struct super_block *sb)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_inode_info *ei;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-12-07 04:33:14 +00:00
|
|
|
ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!ei)
|
|
|
|
return NULL;
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2018-01-09 13:21:39 +00:00
|
|
|
inode_set_iversion(&ei->vfs_inode, 1);
|
2014-04-21 18:37:55 +00:00
|
|
|
spin_lock_init(&ei->i_raw_lock);
|
2008-01-29 05:19:52 +00:00
|
|
|
INIT_LIST_HEAD(&ei->i_prealloc_list);
|
2020-08-17 07:36:15 +00:00
|
|
|
atomic_set(&ei->i_prealloc_active, 0);
|
2008-01-29 05:19:52 +00:00
|
|
|
spin_lock_init(&ei->i_prealloc_lock);
|
2012-11-09 02:57:30 +00:00
|
|
|
ext4_es_init_tree(&ei->i_es_tree);
|
|
|
|
rwlock_init(&ei->i_es_lock);
|
2014-11-25 16:45:37 +00:00
|
|
|
INIT_LIST_HEAD(&ei->i_es_list);
|
2014-09-02 02:26:49 +00:00
|
|
|
ei->i_es_all_nr = 0;
|
2014-11-25 16:45:37 +00:00
|
|
|
ei->i_es_shk_nr = 0;
|
2014-11-25 16:51:23 +00:00
|
|
|
ei->i_es_shrink_lblk = 0;
|
2008-07-14 21:52:37 +00:00
|
|
|
ei->i_reserved_data_blocks = 0;
|
|
|
|
spin_lock_init(&(ei->i_block_reservation_lock));
|
2018-10-01 18:17:41 +00:00
|
|
|
ext4_init_pending_tree(&ei->i_pending_tree);
|
2009-12-14 12:21:14 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
ei->i_reserved_quota = 0;
|
2014-09-29 12:58:25 +00:00
|
|
|
memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
|
2009-12-14 12:21:14 +00:00
|
|
|
#endif
|
2011-01-10 17:29:43 +00:00
|
|
|
ei->jinode = NULL;
|
2013-06-04 18:21:02 +00:00
|
|
|
INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
|
2010-03-04 21:14:02 +00:00
|
|
|
spin_lock_init(&ei->i_completed_io_lock);
|
2009-12-09 04:51:10 +00:00
|
|
|
ei->i_sync_tid = 0;
|
|
|
|
ei->i_datasync_tid = 0;
|
2012-09-29 03:24:52 +00:00
|
|
|
atomic_set(&ei->i_unwritten, 0);
|
2013-06-04 18:21:02 +00:00
|
|
|
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
|
2020-10-15 20:37:57 +00:00
|
|
|
ext4_fc_init_inode(&ei->vfs_inode);
|
|
|
|
mutex_init(&ei->i_fc_lock);
|
2006-10-11 08:20:50 +00:00
|
|
|
return &ei->vfs_inode;
|
|
|
|
}
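The allocator above hands the VFS a pointer to `vfs_inode`, which is embedded in the larger `struct ext4_inode_info`; the filesystem later recovers its private object from that pointer (the kernel does this with `container_of()`, wrapped by `EXT4_I()`). A minimal user-space sketch of the same pattern follows. All `demo_*` names are hypothetical stand-ins, and `calloc()` is used in place of a slab cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic struct inode. */
struct demo_inode {
	unsigned long i_ino;
};

/* Hypothetical stand-in for struct ext4_inode_info. */
struct demo_inode_info {
	int private_state;            /* filesystem-private field */
	struct demo_inode vfs_inode;  /* embedded generic object */
};

/* User-space equivalent of the kernel's container_of() macro. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate the large object but hand out only the embedded part. */
static struct demo_inode *demo_alloc_inode(void)
{
	struct demo_inode_info *ei = calloc(1, sizeof(*ei));

	if (!ei)
		return NULL;
	ei->private_state = 42;
	return &ei->vfs_inode;
}

/* Recover the containing object from the embedded pointer. */
static struct demo_inode_info *demo_info(struct demo_inode *inode)
{
	return demo_container_of(inode, struct demo_inode_info, vfs_inode);
}

/* Free the containing object, not the embedded member. */
static void demo_free_inode(struct demo_inode *inode)
{
	assert(inode != NULL);
	free(demo_info(inode));
}
```

Because the caller only ever sees the embedded member, freeing must go through the containing object, which is why the sketch frees `demo_info(inode)` rather than `inode` itself.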
|
|
|
|
|
2010-11-08 18:51:33 +00:00
|
|
|
static int ext4_drop_inode(struct inode *inode)
|
|
|
|
{
|
|
|
|
int drop = generic_drop_inode(inode);
|
|
|
|
|
2019-08-05 02:35:48 +00:00
|
|
|
if (!drop)
|
|
|
|
drop = fscrypt_drop_inode(inode);
|
|
|
|
|
2010-11-08 18:51:33 +00:00
|
|
|
trace_ext4_drop_inode(inode, drop);
|
|
|
|
return drop;
|
|
|
|
}
|
|
|
|
|
2019-04-15 23:28:34 +00:00
|
|
|
static void ext4_free_in_core_inode(struct inode *inode)
|
2011-01-07 06:49:49 +00:00
|
|
|
{
|
2019-04-10 20:21:15 +00:00
|
|
|
fscrypt_free_inode(inode);
|
2020-10-15 20:37:57 +00:00
|
|
|
if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
|
|
|
|
pr_warn("%s: inode %lu still in fc list\n",
|
|
|
|
__func__, inode->i_ino);
|
|
|
|
}
|
2011-01-07 06:49:49 +00:00
|
|
|
kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static void ext4_destroy_inode(struct inode *inode)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2007-07-16 06:40:45 +00:00
|
|
|
if (!list_empty(&(EXT4_I(inode)->i_orphan))) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(inode->i_sb, KERN_ERR,
|
|
|
|
"Inode %lu (%p): orphan list check failed!",
|
|
|
|
inode->i_ino, EXT4_I(inode));
|
2007-07-16 06:40:45 +00:00
|
|
|
print_hex_dump(KERN_INFO, "", DUMP_PREFIX_ADDRESS, 16, 4,
|
|
|
|
EXT4_I(inode), sizeof(struct ext4_inode_info),
|
|
|
|
true);
|
|
|
|
dump_stack();
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2008-07-26 02:45:34 +00:00
|
|
|
static void init_once(void *foo)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_inode_info *ei = (struct ext4_inode_info *) foo;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2007-05-17 05:10:57 +00:00
|
|
|
INIT_LIST_HEAD(&ei->i_orphan);
|
|
|
|
init_rwsem(&ei->xattr_sem);
|
2008-01-29 04:58:26 +00:00
|
|
|
init_rwsem(&ei->i_data_sem);
|
2015-12-07 19:28:03 +00:00
|
|
|
init_rwsem(&ei->i_mmap_sem);
|
2007-05-17 05:10:57 +00:00
|
|
|
inode_init_once(&ei->vfs_inode);
|
2020-10-15 20:37:57 +00:00
|
|
|
ext4_fc_init_inode(&ei->vfs_inode);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2014-02-18 01:34:53 +00:00
|
|
|
static int __init init_inodecache(void)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
ext4: Define usercopy region in ext4_inode_cache slab cache
The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data
and therefore contained in the ext4_inode_cache slab cache, need
to be copied to/from userspace.
cache object allocation:
fs/ext4/super.c:
ext4_alloc_inode(...):
struct ext4_inode_info *ei;
...
ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
...
return &ei->vfs_inode;
include/trace/events/ext4.h:
#define EXT4_I(inode) \
(container_of(inode, struct ext4_inode_info, vfs_inode))
fs/ext4/namei.c:
ext4_symlink(...):
...
inode->i_link = (char *)&EXT4_I(inode)->i_data;
example usage trace:
readlink_copy+0x43/0x70
vfs_readlink+0x62/0x110
SyS_readlinkat+0x100/0x130
fs/namei.c:
readlink_copy(..., link):
...
copy_to_user(..., link, len)
(inlined into vfs_readlink)
generic_readlink(dentry, ...):
struct inode *inode = d_inode(dentry);
const char *link = inode->i_link;
...
readlink_copy(..., link);
In support of usercopy hardening, this patch defines a region in the
ext4_inode_cache slab cache in which userspace copy operations are
allowed.
This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, provide usage trace]
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-11 02:50:36 +00:00
|
|
|
ext4_inode_cachep = kmem_cache_create_usercopy("ext4_inode_cache",
|
|
|
|
sizeof(struct ext4_inode_info), 0,
|
|
|
|
(SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD|
|
|
|
|
SLAB_ACCOUNT),
|
|
|
|
offsetof(struct ext4_inode_info, i_data),
|
|
|
|
sizeof_field(struct ext4_inode_info, i_data),
|
|
|
|
init_once);
|
2006-10-11 08:20:53 +00:00
|
|
|
if (ext4_inode_cachep == NULL)
|
2006-10-11 08:20:50 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
return 0;
|
|
|
|
}
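The usercopy region passed to `kmem_cache_create_usercopy()` is described by an offset and a size within each cached object, here the `i_data` field. The bounds test this implies can be sketched in user space as below. This is an illustration under stated assumptions, not the slab allocator's actual code: the `demo_*` types and functions are hypothetical, and the struct layout is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object layout; only i_data may be copied to/from users. */
struct demo_inode_info {
	unsigned int i_flags;
	char i_data[60];          /* e.g. an inline symlink target */
	unsigned long i_private;
};

/* Hypothetical description of a cache and its usercopy window. */
struct demo_cache {
	size_t object_size;
	size_t useroffset;
	size_t usersize;
};

static struct demo_cache demo_cache_create(void)
{
	struct demo_cache c = {
		.object_size = sizeof(struct demo_inode_info),
		.useroffset  = offsetof(struct demo_inode_info, i_data),
		.usersize    = sizeof(((struct demo_inode_info *)0)->i_data),
	};

	/* The window must lie entirely inside the object. */
	assert(c.useroffset + c.usersize <= c.object_size);
	return c;
}

/*
 * Return true only if [offset, offset + len) falls entirely inside the
 * usercopy window. Written with subtractions so it cannot overflow.
 */
static bool demo_usercopy_ok(const struct demo_cache *c,
			     size_t offset, size_t len)
{
	if (offset < c->useroffset)
		return false;
	if (offset - c->useroffset > c->usersize)
		return false;
	return len <= c->usersize - (offset - c->useroffset);
}
```

A copy that starts inside `i_data` but runs past its end is rejected, as is any copy that touches the fields before or after it.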
|
|
|
|
|
|
|
|
static void destroy_inodecache(void)
|
|
|
|
{
|
2012-09-26 01:33:07 +00:00
|
|
|
/*
|
|
|
|
* Make sure all delayed rcu free inodes are flushed before we
|
|
|
|
* destroy cache.
|
|
|
|
*/
|
|
|
|
rcu_barrier();
|
2006-10-11 08:20:53 +00:00
|
|
|
kmem_cache_destroy(ext4_inode_cachep);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2010-06-07 17:16:22 +00:00
|
|
|
void ext4_clear_inode(struct inode *inode)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2020-10-15 20:37:57 +00:00
|
|
|
ext4_fc_del(inode);
|
2010-06-07 17:16:22 +00:00
|
|
|
invalidate_inode_buffers(inode);
|
2012-05-03 12:48:02 +00:00
|
|
|
clear_inode(inode);
|
2020-08-17 07:36:15 +00:00
|
|
|
ext4_discard_preallocations(inode, 0);
|
2012-11-09 02:57:32 +00:00
|
|
|
ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
|
2019-11-08 11:45:11 +00:00
|
|
|
dquot_drop(inode);
|
2011-01-10 17:29:43 +00:00
|
|
|
if (EXT4_I(inode)->jinode) {
|
|
|
|
jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
|
|
|
|
EXT4_I(inode)->jinode);
|
|
|
|
jbd2_free_inode(EXT4_I(inode)->jinode);
|
|
|
|
EXT4_I(inode)->jinode = NULL;
|
|
|
|
}
|
2018-01-12 04:30:13 +00:00
|
|
|
fscrypt_put_encryption_info(inode);
|
2019-07-22 16:26:24 +00:00
|
|
|
fsverity_cleanup_inode(inode);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2007-10-21 23:42:08 +00:00
|
|
|
static struct inode *ext4_nfs_get_inode(struct super_block *sb,
|
2009-06-03 21:59:28 +00:00
|
|
|
u64 ino, u32 generation)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
|
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 17:29:13 +00:00
|
|
|
/*
|
2006-10-11 08:20:50 +00:00
|
|
|
* Currently we don't know the generation for parent directory, so
|
|
|
|
* a generation of 0 means "accept any"
|
|
|
|
*/
|
2018-12-19 17:29:13 +00:00
|
|
|
inode = ext4_iget(sb, ino, EXT4_IGET_HANDLE);
|
2008-02-07 08:15:37 +00:00
|
|
|
if (IS_ERR(inode))
|
|
|
|
return ERR_CAST(inode);
|
|
|
|
if (generation && inode->i_generation != generation) {
|
2006-10-11 08:20:50 +00:00
|
|
|
iput(inode);
|
|
|
|
return ERR_PTR(-ESTALE);
|
|
|
|
}
|
2007-10-21 23:42:08 +00:00
|
|
|
|
|
|
|
return inode;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct dentry *ext4_fh_to_dentry(struct super_block *sb, struct fid *fid,
|
2009-06-03 21:59:28 +00:00
|
|
|
int fh_len, int fh_type)
|
2007-10-21 23:42:08 +00:00
|
|
|
{
|
|
|
|
return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
|
|
|
|
ext4_nfs_get_inode);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid,
|
2009-06-03 21:59:28 +00:00
|
|
|
int fh_len, int fh_type)
|
2007-10-21 23:42:08 +00:00
|
|
|
{
|
|
|
|
return generic_fh_to_parent(sb, fid, fh_len, fh_type,
|
|
|
|
ext4_nfs_get_inode);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2018-12-19 19:07:58 +00:00
|
|
|
static int ext4_nfs_commit_metadata(struct inode *inode)
|
|
|
|
{
|
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = WB_SYNC_ALL
|
|
|
|
};
|
|
|
|
|
|
|
|
trace_ext4_nfs_commit_metadata(inode);
|
|
|
|
return ext4_write_inode(inode, &wbc);
|
|
|
|
}
|
|
|
|
|
2018-12-12 09:50:12 +00:00
|
|
|
#ifdef CONFIG_FS_ENCRYPTION
|
2016-07-10 18:01:03 +00:00
|
|
|
static int ext4_get_context(struct inode *inode, void *ctx, size_t len)
|
|
|
|
{
|
|
|
|
return ext4_xattr_get(inode, EXT4_XATTR_INDEX_ENCRYPTION,
|
|
|
|
EXT4_XATTR_NAME_ENCRYPTION_CONTEXT, ctx, len);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
|
|
|
|
void *fs_data)
|
|
|
|
{
|
ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:
xfs_io/4594 is trying to acquire lock:
(jbd2_handle){++++.+}, at:
[<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b
but task is already holding lock:
(jbd2_handle){++++.+}, at:
[<ffffffff813000de>] start_this_handle+0x354/0x3d8
The abbreviated call stack is:
[<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
[<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
[<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
[<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
[<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
[<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
[<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
[<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
[<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
[<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
[<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
[<ffffffff812bd371>] ext4_create+0xb7/0x161
When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file. This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.
If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open. However, the above lockdep warning is still triggered.
This was a false positive which was introduced by: 1eaa566d368b: jbd2:
track more dependencies on transaction commit
Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set(). And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.
Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 16:52:44 +00:00
|
|
|
handle_t *handle = fs_data;
|
2017-06-22 02:28:40 +00:00
|
|
|
int res, res2, credits, retries = 0;
|
2016-11-21 16:52:44 +00:00
|
|
|
|
ext4: forbid encrypting root directory
Currently it's possible to encrypt all files and directories on an ext4
filesystem by deleting everything, including lost+found, then setting an
encryption policy on the root directory. However, this is incompatible
with e2fsck because e2fsck expects to find, create, and/or write to
lost+found and does not have access to any encryption keys. Especially
problematic is that if e2fsck can't find lost+found, it will create it
without regard for whether the root directory is encrypted. This is
wrong for obvious reasons, and it causes a later run of e2fsck to
consider the lost+found directory entry to be corrupted.
Encrypting the root directory may also be of limited use because it is
the "all-or-nothing" use case, for which dm-crypt can be used instead.
(By design, encryption policies are inherited and cannot be overridden;
so the root directory having an encryption policy implies that all files
and directories on the filesystem have that same encryption policy.)
In any case, encrypting the root directory is broken currently and must
not be allowed; so start returning an error if userspace requests it.
For now only do this in ext4, because f2fs and ubifs do not appear to
have the lost+found requirement. We could move it into
fscrypt_ioctl_set_policy() later if desired, though.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-06-23 04:10:36 +00:00
|
|
|
/*
|
|
|
|
* Encrypting the root directory is not allowed because e2fsck expects
|
|
|
|
* lost+found to exist and be unencrypted, and encrypting the root
|
|
|
|
* directory would imply encrypting the lost+found directory as well as
|
|
|
|
* the filename "lost+found" itself.
|
|
|
|
*/
|
|
|
|
if (inode->i_ino == EXT4_ROOT_INO)
|
|
|
|
return -EPERM;
|
2016-11-21 16:52:44 +00:00
|
|
|
|
ext4: add sanity check for encryption + DAX
We prevent DAX from being used on inodes which are using ext4's built in
encryption via a check in ext4_set_inode_flags(). We do have what appears
to be an unsafe transition of S_DAX in ext4_set_context(), though, where
S_DAX can get disabled without us doing a proper writeback + invalidate.
There are also issues with mm-level races when changing the value of S_DAX,
as well as issues with the VM_MIXEDMAP flag:
https://www.spinics.net/lists/linux-xfs/msg09859.html
I actually think we are safe in this case because of the following:
1) You can't encrypt an existing file. Encryption can only be set on an
empty directory, with new inodes in that directory being created with
encryption turned on, so I don't think it's possible to turn encryption on
for a file that has open DAX mmaps or outstanding I/Os.
2) There is no way to turn encryption off on a given file. Once an inode
is encrypted, it stays encrypted for the life of that inode, so we don't
have to worry about the case where we turn encryption off and S_DAX
suddenly turns on.
3) The only way we end up in ext4_set_context() to turn on encryption is
when we are creating a new file in the encrypted directory. This happens
as part of ext4_create() before the inode has been allowed to do any I/O.
Here's the call tree:
ext4_create()
__ext4_new_inode()
ext4_set_inode_flags() // sets S_DAX
fscrypt_inherit_context()
fscrypt_get_encryption_info();
ext4_set_context() // sets EXT4_INODE_ENCRYPT, clears S_DAX
So, I actually think it's safe to transition S_DAX in ext4_set_context()
without any locking, writebacks or invalidations. I've added a
WARN_ON_ONCE() sanity check to make sure that we are notified if we ever
encounter a case where we are encrypting an inode that already has data,
in which case we need to add code to safely transition S_DAX.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-10-12 15:58:05 +00:00
|
|
|
if (WARN_ON_ONCE(IS_DAX(inode) && i_size_read(inode)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-05-28 15:00:02 +00:00
|
|
|
if (ext4_test_inode_flag(inode, EXT4_INODE_DAX))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2017-02-22 21:25:14 +00:00
|
|
|
res = ext4_convert_inline_data(inode);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
|
2016-11-21 16:52:44 +00:00
|
|
|
/*
|
|
|
|
* If a journal handle was specified, then the encryption context is
|
|
|
|
* being set on a new inode via inheritance and is part of a larger
|
|
|
|
* transaction to create the inode. Otherwise the encryption context is
|
|
|
|
* being set on an existing inode in its own transaction. Only in the
|
|
|
|
* latter case should the "retry on ENOSPC" logic be used.
|
|
|
|
*/
|
2016-07-10 18:01:03 +00:00
|
|
|
|
2016-11-21 16:52:44 +00:00
|
|
|
if (handle) {
|
|
|
|
res = ext4_xattr_set_handle(handle, inode,
|
|
|
|
EXT4_XATTR_INDEX_ENCRYPTION,
|
|
|
|
EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
|
|
|
|
ctx, len, 0);
|
2016-07-10 18:01:03 +00:00
|
|
|
if (!res) {
|
|
|
|
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
|
|
|
|
ext4_clear_inode_state(inode,
|
|
|
|
EXT4_STATE_MAY_INLINE_DATA);
|
2016-11-20 22:32:59 +00:00
|
|
|
/*
|
2017-10-09 19:15:35 +00:00
|
|
|
* Update inode->i_flags - S_ENCRYPTED will be enabled,
|
|
|
|
* S_DAX may be disabled
|
2016-11-20 22:32:59 +00:00
|
|
|
*/
|
2020-05-28 14:59:59 +00:00
|
|
|
ext4_set_inode_flags(inode, false);
|
2016-07-10 18:01:03 +00:00
|
|
|
}
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2017-05-24 22:24:07 +00:00
|
|
|
res = dquot_initialize(inode);
|
|
|
|
if (res)
|
|
|
|
return res;
|
2016-11-21 16:52:44 +00:00
|
|
|
retry:
|
2017-07-06 04:01:59 +00:00
|
|
|
res = ext4_xattr_set_credits(inode, len, false /* is_create */,
|
|
|
|
&credits);
|
2017-06-22 15:44:55 +00:00
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
|
2017-06-22 02:28:40 +00:00
|
|
|
handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
|
2016-07-10 18:01:03 +00:00
|
|
|
if (IS_ERR(handle))
|
|
|
|
return PTR_ERR(handle);
|
|
|
|
|
2016-11-21 16:52:44 +00:00
|
|
|
res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
|
|
|
|
EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
|
|
|
|
ctx, len, 0);
|
2016-07-10 18:01:03 +00:00
|
|
|
if (!res) {
|
|
|
|
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
|
2017-10-09 19:15:35 +00:00
|
|
|
/*
|
|
|
|
* Update inode->i_flags - S_ENCRYPTED will be enabled,
|
|
|
|
* S_DAX may be disabled
|
|
|
|
*/
|
2020-05-28 14:59:59 +00:00
|
|
|
ext4_set_inode_flags(inode, false);
|
2016-07-10 18:01:03 +00:00
|
|
|
res = ext4_mark_inode_dirty(handle, inode);
|
|
|
|
if (res)
|
|
|
|
EXT4_ERROR_INODE(inode, "Failed to mark inode dirty");
|
|
|
|
}
|
|
|
|
res2 = ext4_journal_stop(handle);
|
2016-11-21 16:52:44 +00:00
|
|
|
|
|
|
|
if (res == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
|
|
|
|
goto retry;
|
2016-07-10 18:01:03 +00:00
|
|
|
if (!res)
|
|
|
|
res = res2;
|
|
|
|
return res;
|
|
|
|
}
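The two paths through ext4_set_context() above (a caller-supplied handle versus a handle opened locally) can be illustrated with a small user-space sketch. This is not kernel code: `demo_fs`, `demo_handle`, `demo_xattr_set_handle()`, `demo_should_retry()` and the retry bound of 3 are all hypothetical stand-ins chosen for illustration. The point it tries to show is that only the outermost transaction forces a commit and retries on ENOSPC; a nested call just reports the error to its caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle (not the kernel's handle_t). */
struct demo_handle { int credits; };

/* Hypothetical filesystem state used to simulate ENOSPC failures. */
struct demo_fs {
	int enospc_failures_left; /* attempts that will still fail with ENOSPC */
	int commits_forced;       /* how often a journal commit was forced */
	int handles_started;      /* how many handles this function opened */
};

/* Simulated xattr update: fails with -ENOSPC a configurable number of times. */
static int demo_xattr_set_handle(struct demo_fs *fs, struct demo_handle *h)
{
	(void)h;
	if (fs->enospc_failures_left > 0) {
		fs->enospc_failures_left--;
		return -ENOSPC;
	}
	return 0;
}

/* Simulated "force a commit and try again"; the bound of 3 is arbitrary. */
static bool demo_should_retry(struct demo_fs *fs, int *retries)
{
	if (*retries >= 3)
		return false;
	(*retries)++;
	fs->commits_forced++;
	return true;
}

static int demo_set_context(struct demo_fs *fs, struct demo_handle *caller)
{
	int retries = 0;
	int res;

	/* Nested case: a transaction is already open, so never force a commit. */
	if (caller)
		return demo_xattr_set_handle(fs, caller);

retry:
	{
		/* Outermost case: open our own handle for this attempt. */
		struct demo_handle own = { .credits = 1 };

		fs->handles_started++;
		res = demo_xattr_set_handle(fs, &own);
		/* The handle is considered stopped when this block ends. */
	}
	if (res == -ENOSPC && demo_should_retry(fs, &retries))
		goto retry;
	return res;
}
```

With a caller-supplied handle the sketch returns -ENOSPC immediately and forces no commit; with a NULL handle it retries until the simulated space pressure clears or the retry bound is reached.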
|
|
|
|
|
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-17 04:11:35 +00:00
|
|
|
static const union fscrypt_policy *ext4_get_dummy_policy(struct super_block *sb)
|
2016-07-10 18:01:03 +00:00
|
|
|
{
|
2020-09-17 04:11:35 +00:00
|
|
|
return EXT4_SB(sb)->s_dummy_enc_policy.policy;
|
2016-07-10 18:01:03 +00:00
|
|
|
}
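The commit message above says the policy to inherit from a directory can be a real policy, a dummy policy, or no policy. A minimal user-space sketch of that three-way decision follows. The types `demo_policy` and `demo_dir` and the function `demo_policy_to_inherit()` are hypothetical and only model the idea; the real fscrypt_policy_to_inherit() works on inodes and also has error cases that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified encryption policy. */
struct demo_policy { int version; };

/* Hypothetical view of a parent directory for the purpose of inheritance. */
struct demo_dir {
	bool encrypted;                         /* is the directory itself encrypted? */
	const struct demo_policy *policy;       /* its real policy, if encrypted */
	const struct demo_policy *dummy_policy; /* non-NULL only when a dummy policy
						 * is configured for the filesystem */
};

/*
 * Return the policy a new file should inherit:
 *  - the directory's real policy if the directory is encrypted,
 *  - otherwise the filesystem's dummy policy if one is configured,
 *  - otherwise NULL, meaning the new file stays unencrypted.
 * No key is set up for an unencrypted directory in any of these cases.
 */
static const struct demo_policy *
demo_policy_to_inherit(const struct demo_dir *dir)
{
	if (dir->encrypted)
		return dir->policy;
	return dir->dummy_policy;
}
```

Keeping the dummy policy separate from the directory's own state is what lets an unencrypted directory stay free of key material in this model.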
|
|
|
|
|
2019-10-24 21:54:37 +00:00
|
|
|
static bool ext4_has_stable_inodes(struct super_block *sb)
|
|
|
|
{
|
|
|
|
return ext4_has_feature_stable_inodes(sb);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void ext4_get_ino_and_lblk_bits(struct super_block *sb,
|
|
|
|
int *ino_bits_ret, int *lblk_bits_ret)
|
|
|
|
{
|
|
|
|
*ino_bits_ret = 8 * sizeof(EXT4_SB(sb)->s_es->s_inodes_count);
|
|
|
|
*lblk_bits_ret = 8 * sizeof(ext4_lblk_t);
|
|
|
|
}
|
|
|
|
|
2017-02-07 20:42:10 +00:00
|
|
|
static const struct fscrypt_operations ext4_cryptops = {
|
2017-01-05 21:51:18 +00:00
|
|
|
.key_prefix = "ext4:",
|
2016-07-10 18:01:03 +00:00
|
|
|
.get_context = ext4_get_context,
|
|
|
|
.set_context = ext4_set_context,
|
2020-09-17 04:11:35 +00:00
|
|
|
.get_dummy_policy = ext4_get_dummy_policy,
|
2016-07-10 18:01:03 +00:00
|
|
|
.empty_dir = ext4_empty_dir,
|
2018-04-30 22:51:44 +00:00
|
|
|
.max_namelen = EXT4_NAME_LEN,
|
2019-10-24 21:54:37 +00:00
|
|
|
.has_stable_inodes = ext4_has_stable_inodes,
|
|
|
|
.get_ino_and_lblk_bits = ext4_get_ino_and_lblk_bits,
|
2016-07-10 18:01:03 +00:00
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2017-04-30 03:47:50 +00:00
|
|
|
static const char * const quotatypes[] = INITQFNAMES;
|
2016-01-08 21:01:22 +00:00
|
|
|
#define QTYPE2NAME(t) (quotatypes[t])
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_write_dquot(struct dquot *dquot);
|
|
|
|
static int ext4_acquire_dquot(struct dquot *dquot);
|
|
|
|
static int ext4_release_dquot(struct dquot *dquot);
|
|
|
|
static int ext4_mark_dquot_dirty(struct dquot *dquot);
|
|
|
|
static int ext4_write_info(struct super_block *sb, int type);
|
2008-04-28 09:14:34 +00:00
|
|
|
static int ext4_quota_on(struct super_block *sb, int type, int format_id,
|
2016-11-21 00:49:34 +00:00
|
|
|
const struct path *path);
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_quota_on_mount(struct super_block *sb, int type);
|
|
|
|
static ssize_t ext4_quota_read(struct super_block *sb, int type, char *data,
|
2006-10-11 08:20:50 +00:00
|
|
|
size_t len, loff_t off);
|
2006-10-11 08:20:53 +00:00
|
|
|
static ssize_t ext4_quota_write(struct super_block *sb, int type,
|
2006-10-11 08:20:50 +00:00
|
|
|
const char *data, size_t len, loff_t off);
|
ext4: make quota as first class supported feature
This patch adds support for quotas as a first-class feature in ext4,
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 00:21:31 +00:00
|
|
|
static int ext4_quota_enable(struct super_block *sb, int type, int format_id,
|
|
|
|
unsigned int flags);
|
|
|
|
static int ext4_enable_quotas(struct super_block *sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2014-09-29 12:58:25 +00:00
|
|
|
static struct dquot **ext4_get_dquots(struct inode *inode)
|
|
|
|
{
|
|
|
|
return EXT4_I(inode)->i_dquot;
|
|
|
|
}
|
|
|
|
|
2009-09-22 00:01:08 +00:00
|
|
|
static const struct dquot_operations ext4_quota_operations = {
|
2017-06-22 15:46:48 +00:00
|
|
|
.get_reserved_space = ext4_get_reserved_space,
|
|
|
|
.write_dquot = ext4_write_dquot,
|
|
|
|
.acquire_dquot = ext4_acquire_dquot,
|
|
|
|
.release_dquot = ext4_release_dquot,
|
|
|
|
.mark_dirty = ext4_mark_dquot_dirty,
|
|
|
|
.write_info = ext4_write_info,
|
|
|
|
.alloc_dquot = dquot_alloc,
|
|
|
|
.destroy_dquot = dquot_destroy,
|
|
|
|
.get_projid = ext4_get_projid,
|
|
|
|
.get_inode_usage = ext4_get_inode_usage,
|
2019-10-06 10:30:28 +00:00
|
|
|
.get_next_id = dquot_get_next_id,
|
2006-10-11 08:20:50 +00:00
|
|
|
};
|
|
|
|
|
2009-09-22 00:01:09 +00:00
|
|
|
static const struct quotactl_ops ext4_qctl_operations = {
|
2006-10-11 08:20:53 +00:00
|
|
|
.quota_on = ext4_quota_on,
|
2010-08-01 21:48:36 +00:00
|
|
|
.quota_off = ext4_quota_off,
|
2010-05-19 11:16:45 +00:00
|
|
|
.quota_sync = dquot_quota_sync,
|
2014-11-18 23:42:09 +00:00
|
|
|
.get_state = dquot_get_state,
|
2010-05-19 11:16:45 +00:00
|
|
|
.set_info = dquot_set_dqinfo,
|
|
|
|
.get_dqblk = dquot_get_dqblk,
|
2016-02-19 18:19:01 +00:00
|
|
|
.set_dqblk = dquot_set_dqblk,
|
|
|
|
.get_nextdqblk = dquot_get_next_dqblk,
|
2006-10-11 08:20:50 +00:00
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2007-02-12 08:55:41 +00:00
|
|
|
static const struct super_operations ext4_sops = {
|
2006-10-11 08:20:53 +00:00
|
|
|
.alloc_inode = ext4_alloc_inode,
|
2019-04-15 23:28:34 +00:00
|
|
|
.free_inode = ext4_free_in_core_inode,
|
2006-10-11 08:20:53 +00:00
|
|
|
.destroy_inode = ext4_destroy_inode,
|
|
|
|
.write_inode = ext4_write_inode,
|
|
|
|
.dirty_inode = ext4_dirty_inode,
|
2010-11-08 18:51:33 +00:00
|
|
|
.drop_inode = ext4_drop_inode,
|
2010-06-07 17:16:22 +00:00
|
|
|
.evict_inode = ext4_evict_inode,
|
2006-10-11 08:20:53 +00:00
|
|
|
.put_super = ext4_put_super,
|
|
|
|
.sync_fs = ext4_sync_fs,
|
2009-01-10 00:40:58 +00:00
|
|
|
.freeze_fs = ext4_freeze,
|
|
|
|
.unfreeze_fs = ext4_unfreeze,
|
2006-10-11 08:20:53 +00:00
|
|
|
.statfs = ext4_statfs,
|
|
|
|
.remount_fs = ext4_remount,
|
|
|
|
.show_options = ext4_show_options,
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2006-10-11 08:20:53 +00:00
|
|
|
.quota_read = ext4_quota_read,
|
|
|
|
.quota_write = ext4_quota_write,
|
2014-09-29 12:58:25 +00:00
|
|
|
.get_dquots = ext4_get_dquots,
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
|
|
|
};
|
|
|
|
|
2007-10-21 23:42:17 +00:00
|
|
|
static const struct export_operations ext4_export_ops = {
|
2007-10-21 23:42:08 +00:00
|
|
|
.fh_to_dentry = ext4_fh_to_dentry,
|
|
|
|
.fh_to_parent = ext4_fh_to_parent,
|
2006-10-11 08:20:53 +00:00
|
|
|
.get_parent = ext4_get_parent,
|
2018-12-19 19:07:58 +00:00
|
|
|
.commit_metadata = ext4_nfs_commit_metadata,
|
2006-10-11 08:20:50 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
enum {
|
|
|
|
Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
|
|
|
|
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
|
2012-03-03 23:04:40 +00:00
|
|
|
Opt_nouid32, Opt_debug, Opt_removed,
|
2006-10-11 08:20:50 +00:00
|
|
|
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
|
2012-03-03 23:04:40 +00:00
|
|
|
Opt_auto_da_alloc, Opt_noauto_da_alloc, Opt_noload,
|
ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do e.g.
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-28 23:05:07 +00:00
|
|
|
Opt_commit, Opt_min_batch_time, Opt_max_batch_time, Opt_journal_dev,
|
|
|
|
Opt_journal_path, Opt_journal_checksum, Opt_journal_async_commit,
|
2006-10-11 08:20:50 +00:00
|
|
|
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
|
2015-04-16 05:56:00 +00:00
|
|
|
Opt_data_err_abort, Opt_data_err_ignore, Opt_test_dummy_encryption,
|
2020-07-02 01:56:07 +00:00
|
|
|
Opt_inlinecrypt,
|
2006-10-11 08:20:50 +00:00
|
|
|
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
|
2009-11-30 22:58:32 +00:00
|
|
|
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
|
2012-03-02 17:14:24 +00:00
|
|
|
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
|
2020-05-28 15:00:00 +00:00
|
|
|
Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
|
|
|
|
Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
|
2018-06-13 03:34:57 +00:00
|
|
|
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
|
|
|
|
Opt_nowarn_on_error, Opt_mblk_io_submit,
|
2017-01-11 20:32:22 +00:00
|
|
|
Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
|
ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:
psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
Under memory pressure, it looks like part of a file can end up getting
replaced by zero's. Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page(). The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.
To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system is just at the point of triggering writeback
before running the following sql script:
begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;
If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.
This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.
Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 20:27:50 +00:00
|
|
|
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
|
2009-11-19 19:25:42 +00:00
|
|
|
Opt_inode_readahead_blks, Opt_journal_ioprio,
|
2010-03-04 21:14:02 +00:00
|
|
|
Opt_dioread_nolock, Opt_dioread_lock,
|
2011-12-13 03:06:18 +00:00
|
|
|
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
|
2017-06-22 15:55:14 +00:00
|
|
|
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
|
2021-04-01 17:21:29 +00:00
|
|
|
Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
|
2020-10-15 20:37:59 +00:00
|
|
|
#ifdef CONFIG_EXT4_DEBUG
|
2020-11-06 03:59:11 +00:00
|
|
|
Opt_fc_debug_max_replay, Opt_fc_debug_force
|
2020-10-15 20:37:59 +00:00
|
|
|
#endif
|
2006-10-11 08:20:50 +00:00
|
|
|
};
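The option enum above is paired with a table of option strings, and options such as journal_path carry a string argument. The following user-space sketch shows one way such a lookup could work. It is an assumption-laden illustration, not the kernel's match_token() implementation: `demo_opt`, `demo_token`, `demo_tokens` and `demo_match()` are hypothetical names, matching is by plain prefix, and rejecting an empty argument is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of mount option identifiers. */
enum demo_opt { DEMO_OPT_JOURNAL_DEV, DEMO_OPT_JOURNAL_PATH, DEMO_OPT_ERR };

/* Hypothetical table entry: an option id and the prefix that selects it. */
struct demo_token {
	enum demo_opt id;
	const char *prefix;
};

static const struct demo_token demo_tokens[] = {
	{ DEMO_OPT_JOURNAL_DEV,  "journal_dev="  },
	{ DEMO_OPT_JOURNAL_PATH, "journal_path=" },
};

/*
 * Match a single option string (already split on commas) against the table.
 * On success, *arg points at the argument text that follows the prefix.
 * An option with an empty argument is rejected in this sketch.
 */
static enum demo_opt demo_match(const char *opt, const char **arg)
{
	size_t i;

	for (i = 0; i < sizeof(demo_tokens) / sizeof(demo_tokens[0]); i++) {
		size_t n = strlen(demo_tokens[i].prefix);

		if (strncmp(opt, demo_tokens[i].prefix, n) == 0 && opt[n] != '\0') {
			*arg = opt + n;
			return demo_tokens[i].id;
		}
	}
	*arg = NULL;
	return DEMO_OPT_ERR;
}
```

For example, passing "journal_path=/dev/loop1" yields DEMO_OPT_JOURNAL_PATH with the argument pointing at "/dev/loop1", which a caller could then resolve to a block device.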
|
|
|
|
|
2008-10-13 09:46:57 +00:00
|
|
|
static const match_table_t tokens = {
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_bsd_df, "bsddf"},
|
|
|
|
{Opt_minix_df, "minixdf"},
|
|
|
|
{Opt_grpid, "grpid"},
|
|
|
|
{Opt_grpid, "bsdgroups"},
|
|
|
|
{Opt_nogrpid, "nogrpid"},
|
|
|
|
{Opt_nogrpid, "sysvgroups"},
|
|
|
|
{Opt_resgid, "resgid=%u"},
|
|
|
|
{Opt_resuid, "resuid=%u"},
|
|
|
|
{Opt_sb, "sb=%u"},
|
|
|
|
{Opt_err_cont, "errors=continue"},
|
|
|
|
{Opt_err_panic, "errors=panic"},
|
|
|
|
{Opt_err_ro, "errors=remount-ro"},
|
|
|
|
{Opt_nouid32, "nouid32"},
|
|
|
|
{Opt_debug, "debug"},
|
2012-03-03 23:04:40 +00:00
|
|
|
{Opt_removed, "oldalloc"},
|
|
|
|
{Opt_removed, "orlov"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_user_xattr, "user_xattr"},
|
|
|
|
{Opt_nouser_xattr, "nouser_xattr"},
|
|
|
|
{Opt_acl, "acl"},
|
|
|
|
{Opt_noacl, "noacl"},
|
2009-11-19 19:28:50 +00:00
|
|
|
{Opt_noload, "norecovery"},
|
2012-03-05 00:27:31 +00:00
|
|
|
{Opt_noload, "noload"},
|
2012-03-03 23:04:40 +00:00
|
|
|
{Opt_removed, "nobh"},
|
|
|
|
{Opt_removed, "bh"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_commit, "commit=%u"},
|
2009-01-04 01:27:38 +00:00
|
|
|
{Opt_min_batch_time, "min_batch_time=%u"},
|
|
|
|
{Opt_max_batch_time, "max_batch_time=%u"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_journal_dev, "journal_dev=%u"},
|
2013-08-28 23:05:07 +00:00
|
|
|
{Opt_journal_path, "journal_path=%s"},
|
2008-01-29 04:58:27 +00:00
|
|
|
{Opt_journal_checksum, "journal_checksum"},
|
2014-11-25 21:20:50 +00:00
|
|
|
{Opt_nojournal_checksum, "nojournal_checksum"},
|
2008-01-29 04:58:27 +00:00
|
|
|
{Opt_journal_async_commit, "journal_async_commit"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_abort, "abort"},
|
|
|
|
{Opt_data_journal, "data=journal"},
|
|
|
|
{Opt_data_ordered, "data=ordered"},
|
|
|
|
{Opt_data_writeback, "data=writeback"},
|
2008-10-11 02:12:43 +00:00
|
|
|
{Opt_data_err_abort, "data_err=abort"},
|
|
|
|
{Opt_data_err_ignore, "data_err=ignore"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_offusrjquota, "usrjquota="},
|
|
|
|
{Opt_usrjquota, "usrjquota=%s"},
|
|
|
|
{Opt_offgrpjquota, "grpjquota="},
|
|
|
|
{Opt_grpjquota, "grpjquota=%s"},
|
|
|
|
{Opt_jqfmt_vfsold, "jqfmt=vfsold"},
|
|
|
|
{Opt_jqfmt_vfsv0, "jqfmt=vfsv0"},
|
2009-11-30 22:58:32 +00:00
|
|
|
{Opt_jqfmt_vfsv1, "jqfmt=vfsv1"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_grpquota, "grpquota"},
|
|
|
|
{Opt_noquota, "noquota"},
|
|
|
|
{Opt_quota, "quota"},
|
|
|
|
{Opt_usrquota, "usrquota"},
|
2016-09-06 03:08:16 +00:00
|
|
|
{Opt_prjquota, "prjquota"},
|
2006-10-11 08:20:50 +00:00
|
|
|
{Opt_barrier, "barrier=%u"},
|
2009-03-28 14:59:57 +00:00
|
|
|
{Opt_barrier, "barrier"},
|
|
|
|
{Opt_nobarrier, "nobarrier"},
|
2008-01-29 04:58:27 +00:00
|
|
|
{Opt_i_version, "i_version"},
|
2015-02-16 23:59:38 +00:00
|
|
|
{Opt_dax, "dax"},
|
2020-05-28 15:00:00 +00:00
|
|
|
{Opt_dax_always, "dax=always"},
|
|
|
|
{Opt_dax_inode, "dax=inode"},
|
|
|
|
{Opt_dax_never, "dax=never"},
|
2008-01-29 05:19:52 +00:00
|
|
|
{Opt_stripe, "stripe=%u"},
|
2008-07-11 23:27:31 +00:00
|
|
|
{Opt_delalloc, "delalloc"},
|
2018-06-13 03:34:57 +00:00
|
|
|
{Opt_warn_on_error, "warn_on_error"},
|
|
|
|
{Opt_nowarn_on_error, "nowarn_on_error"},
|
2015-02-02 05:37:02 +00:00
|
|
|
{Opt_lazytime, "lazytime"},
|
|
|
|
{Opt_nolazytime, "nolazytime"},
|
2017-01-11 20:32:22 +00:00
|
|
|
{Opt_debug_want_extra_isize, "debug_want_extra_isize=%u"},
|
2008-07-11 23:27:31 +00:00
|
|
|
{Opt_nodelalloc, "nodelalloc"},
|
2013-01-28 14:30:52 +00:00
|
|
|
{Opt_removed, "mblk_io_submit"},
|
|
|
|
{Opt_removed, "nomblk_io_submit"},
|
2009-05-17 19:38:01 +00:00
|
|
|
{Opt_block_validity, "block_validity"},
|
|
|
|
{Opt_noblock_validity, "noblock_validity"},
|
2008-10-10 03:53:47 +00:00
|
|
|
{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
|
2009-01-06 03:46:26 +00:00
|
|
|
{Opt_journal_ioprio, "journal_ioprio=%u"},
|
2009-03-17 03:12:23 +00:00
|
|
|
{Opt_auto_da_alloc, "auto_da_alloc=%u"},
|
2009-03-28 14:59:57 +00:00
|
|
|
{Opt_auto_da_alloc, "auto_da_alloc"},
|
|
|
|
{Opt_noauto_da_alloc, "noauto_da_alloc"},
|
2010-03-04 21:14:02 +00:00
|
|
|
{Opt_dioread_nolock, "dioread_nolock"},
|
2020-01-23 17:23:17 +00:00
|
|
|
{Opt_dioread_lock, "nodioread_nolock"},
|
2010-03-04 21:14:02 +00:00
|
|
|
{Opt_dioread_lock, "dioread_lock"},
|
2009-11-19 19:25:42 +00:00
|
|
|
{Opt_discard, "discard"},
|
|
|
|
{Opt_nodiscard, "nodiscard"},
|
2011-12-13 03:06:18 +00:00
|
|
|
{Opt_init_itable, "init_itable=%u"},
|
|
|
|
{Opt_init_itable, "init_itable"},
|
|
|
|
{Opt_noinit_itable, "noinit_itable"},
|
2020-10-15 20:37:59 +00:00
|
|
|
#ifdef CONFIG_EXT4_DEBUG
|
2020-11-06 03:59:11 +00:00
|
|
|
{Opt_fc_debug_force, "fc_debug_force"},
|
2020-10-15 20:37:59 +00:00
|
|
|
{Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"},
|
|
|
|
#endif
|
2012-08-17 13:48:17 +00:00
|
|
|
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
|
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 23:32:50 +00:00
|
|
|
{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
|
2015-04-16 05:56:00 +00:00
|
|
|
{Opt_test_dummy_encryption, "test_dummy_encryption"},
|
2020-07-02 01:56:07 +00:00
|
|
|
{Opt_inlinecrypt, "inlinecrypt"},
|
2017-06-22 15:55:14 +00:00
|
|
|
{Opt_nombcache, "nombcache"},
|
|
|
|
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
|
2021-04-01 17:21:29 +00:00
|
|
|
{Opt_removed, "prefetch_block_bitmaps"},
|
|
|
|
{Opt_no_prefetch_block_bitmaps, "no_prefetch_block_bitmaps"},
|
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.
At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, the following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. The following command was run 3 times
consecutively:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
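The cr 0 idea described in the message above (one list per largest free order, scanned upward from the order of the request) can be sketched in user space as follows. This is only a minimal illustration under stated assumptions, not the code from the patch: `demo_group`, `demo_order_lists`, `DEMO_MAX_ORDER`, `demo_insert` and `demo_choose_cr0` are hypothetical names, and locking, the buddy bitmap and the cr 1 tree are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bound on the number of orders, for this sketch only. */
#define DEMO_MAX_ORDER 14

/* Hypothetical stand-in for per-group allocator info. */
struct demo_group {
	int id;
	int largest_free_order;
	struct demo_group *next;
};

/* One singly linked list per largest free order. */
static struct demo_group *demo_order_lists[DEMO_MAX_ORDER];

/* File a group under the list for its largest free order. */
static void demo_insert(struct demo_group *g)
{
	assert(g != NULL);
	assert(g->largest_free_order >= 0 &&
	       g->largest_free_order < DEMO_MAX_ORDER);
	g->next = demo_order_lists[g->largest_free_order];
	demo_order_lists[g->largest_free_order] = g;
}

/*
 * Walk the lists in increasing order, starting at the request's order,
 * and return the first group found. The walk is bounded by
 * DEMO_MAX_ORDER, not by the number of groups. NULL means nothing fits
 * and the caller would move on to the next criterion.
 */
static struct demo_group *demo_choose_cr0(int request_order)
{
	int order;

	if (request_order < 0)
		request_order = 0;
	for (order = request_order; order < DEMO_MAX_ORDER; order++)
		if (demo_order_lists[order])
			return demo_order_lists[order];
	return NULL;
}
```

With groups filed under orders 3 and 7, a request of order 4 skips the order-3 list and lands on the order-7 group, while a request of order 8 finds nothing and would fall through to cr 1.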
2021-04-01 17:21:27 +00:00
|
|
|
{Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
|
2012-03-05 03:00:53 +00:00
|
|
|
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
|
|
|
|
{Opt_removed, "nocheck"}, /* mount option from ext2/3 */
|
|
|
|
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
|
|
|
|
{Opt_removed, "noreservation"}, /* mount option from ext2/3 */
|
|
|
|
{Opt_removed, "journal=%u"}, /* mount option from ext2/3 */
|
2008-04-30 02:05:28 +00:00
|
|
|
{Opt_err, NULL},
|
2006-10-11 08:20:50 +00:00
|
|
|
};
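How a pattern table of this shape can be matched is sketched below in user-space C. This is a simplified illustration, not the kernel's parser: the kernel's `match_token()` from `<linux/parser.h>` is more general, and every `demo_` name here is hypothetical. The sketch assumes that a literal pattern must match the whole option, that `%u` takes decimal digits only, that `%s` takes any non-empty value, and that the first matching row wins, which is why `"usrjquota="` can sit in front of `"usrjquota=%s"`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token ids for this sketch only. */
enum {
	DEMO_OPT_ERR,
	DEMO_OPT_COMMIT,
	DEMO_OPT_OFFUSRJQUOTA,
	DEMO_OPT_USRJQUOTA,
	DEMO_OPT_BARRIER,
};

struct demo_pattern {
	int token;
	const char *pattern;
};

/* A few rows copied from the table above; a NULL pattern terminates. */
static const struct demo_pattern demo_tokens[] = {
	{DEMO_OPT_COMMIT, "commit=%u"},
	{DEMO_OPT_OFFUSRJQUOTA, "usrjquota="},
	{DEMO_OPT_USRJQUOTA, "usrjquota=%s"},
	{DEMO_OPT_BARRIER, "barrier=%u"},
	{DEMO_OPT_BARRIER, "barrier"},
	{DEMO_OPT_ERR, NULL},
};

/* Return 1 if s is a non-empty run of decimal digits. */
static int demo_all_digits(const char *s)
{
	if (*s == '\0')
		return 0;
	for (; *s; s++)
		if (!isdigit((unsigned char)*s))
			return 0;
	return 1;
}

/*
 * Return the token of the first row that matches opt. For rows with a
 * %u or %s suffix, *arg is set to the start of the value; otherwise it
 * is left NULL.
 */
static int demo_match(const char *opt, const char **arg)
{
	const struct demo_pattern *p;

	assert(opt != NULL && arg != NULL);
	*arg = NULL;
	for (p = demo_tokens; p->pattern; p++) {
		const char *pct = strchr(p->pattern, '%');
		const char *val;
		size_t prefix;

		if (!pct) {
			/* Literal pattern: the whole option must match. */
			if (strcmp(opt, p->pattern) == 0)
				return p->token;
			continue;
		}
		prefix = (size_t)(pct - p->pattern);
		if (strncmp(opt, p->pattern, prefix) != 0)
			continue;
		val = opt + prefix;
		if (pct[1] == 'u' ? !demo_all_digits(val) : *val == '\0')
			continue;
		*arg = val;
		return p->token;
	}
	return DEMO_OPT_ERR;
}
```

Listing one token twice, as the table does for `barrier=%u` and `barrier`, lets one handler serve both spellings; the handler can tell them apart by whether an argument was captured.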
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static ext4_fsblk_t get_sb_block(void **data)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_fsblk_t sb_block;
|
2006-10-11 08:20:50 +00:00
|
|
|
char *options = (char *) *data;
|
|
|
|
|
|
|
|
if (!options || strncmp(options, "sb=", 3) != 0)
|
|
|
|
return 1; /* Default location */
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
options += 3;
|
2009-06-03 21:59:28 +00:00
|
|
|
/* TODO: use simple_strtoll with >32bit ext4 */
|
2006-10-11 08:20:50 +00:00
|
|
|
sb_block = simple_strtoul(options, &options, 0);
|
|
|
|
if (*options && *options != ',') {
|
2008-09-09 03:00:52 +00:00
|
|
|
printk(KERN_ERR "EXT4-fs: Invalid sb specification: %s\n",
|
2006-10-11 08:20:50 +00:00
|
|
|
(char *) *data);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
if (*options == ',')
|
|
|
|
options++;
|
|
|
|
*data = (void *) options;
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
return sb_block;
|
|
|
|
}
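The behaviour of get_sb_block() can be tried out in user space with the sketch below. It is an approximation under stated assumptions: `demo_get_sb_block` is a hypothetical name, the standard `strtoul()` stands in for the kernel's `simple_strtoul()`, `unsigned long` stands in for `ext4_fsblk_t`, and the error message is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * User-space sketch of get_sb_block(): consume a leading "sb=<n>" from
 * the option string and return <n>. Return 1 (the default superblock
 * location) if the prefix is absent or the number is malformed, and
 * leave *data untouched in those cases.
 */
static unsigned long demo_get_sb_block(const char **data)
{
	const char *options;
	char *end;
	unsigned long sb_block;

	assert(data != NULL);
	options = *data;
	if (!options || strncmp(options, "sb=", 3) != 0)
		return 1;	/* Default location */

	options += 3;
	sb_block = strtoul(options, &end, 0);
	if (*end && *end != ',')
		return 1;	/* Trailing junk after the number */
	if (*end == ',')
		end++;		/* Skip the separator */
	*data = end;		/* Remaining options for the caller */
	return sb_block;
}
```

Given "sb=8193,noatime" the function returns 8193 and leaves the caller's pointer at "noatime", so the rest of the option parser never sees the `sb=` entry.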
|
|
|
|
|
2009-01-06 03:46:26 +00:00
|
|
|
#define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
|
2021-04-01 17:21:27 +00:00
|
|
|
#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
|
|
|
|
|
2017-04-30 03:47:50 +00:00
|
|
|
static const char deprecated_msg[] =
|
|
|
|
"Mount option \"%s\" will be removed by %s\n"
|
2010-03-02 03:29:21 +00:00
|
|
|
"Contact linux-ext4@vger.kernel.org if you think we should keep it.\n";
|
2009-01-06 03:46:26 +00:00
|
|
|
|
2010-03-02 04:28:41 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
static int set_qf_name(struct super_block *sb, int qtype, substring_t *args)
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2018-10-12 13:28:09 +00:00
|
|
|
char *qname, *old_qname = get_qf_name(sb, sbi, qtype);
|
2013-01-25 04:24:58 +00:00
|
|
|
int ret = -1;
|
2010-03-02 04:28:41 +00:00
|
|
|
|
2018-10-12 13:28:09 +00:00
|
|
|
if (sb_any_quota_loaded(sb) && !old_qname) {
|
2010-03-02 04:28:41 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Cannot change journaled "
|
|
|
|
"quota options when quota turned on");
|
2012-04-16 22:55:26 +00:00
|
|
|
return -1;
|
2010-03-02 04:28:41 +00:00
|
|
|
}
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_quota(sb)) {
|
2016-04-03 21:03:37 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "Journaled quota options "
|
|
|
|
"ignored when QUOTA feature is enabled");
|
|
|
|
return 1;
|
2013-03-02 22:57:08 +00:00
|
|
|
}
|
2010-03-02 04:28:41 +00:00
|
|
|
qname = match_strdup(args);
|
|
|
|
if (!qname) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Not enough memory for storing quotafile name");
|
2012-04-16 22:55:26 +00:00
|
|
|
return -1;
|
2010-03-02 04:28:41 +00:00
|
|
|
}
|
2018-10-12 13:28:09 +00:00
|
|
|
if (old_qname) {
|
|
|
|
if (strcmp(old_qname, qname) == 0)
|
2013-01-25 04:24:58 +00:00
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"%s quota file already specified",
|
|
|
|
QTYPE2NAME(qtype));
|
|
|
|
goto errout;
|
2010-03-02 04:28:41 +00:00
|
|
|
}
|
2013-01-25 04:24:58 +00:00
|
|
|
if (strchr(qname, '/')) {
|
2010-03-02 04:28:41 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"quotafile must be on filesystem root");
|
2013-01-25 04:24:58 +00:00
|
|
|
goto errout;
|
2010-03-02 04:28:41 +00:00
|
|
|
}
|
2018-10-12 13:28:09 +00:00
|
|
|
rcu_assign_pointer(sbi->s_qf_names[qtype], qname);
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, QUOTA);
|
2010-03-02 04:28:41 +00:00
|
|
|
return 1;
|
2013-01-25 04:24:58 +00:00
|
|
|
errout:
|
|
|
|
kfree(qname);
|
|
|
|
return ret;
|
2010-03-02 04:28:41 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int clear_qf_name(struct super_block *sb, int qtype)
|
|
|
|
{
|
|
|
|
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2018-10-12 13:28:09 +00:00
|
|
|
char *old_qname = get_qf_name(sb, sbi, qtype);
|
2010-03-02 04:28:41 +00:00
|
|
|
|
2018-10-12 13:28:09 +00:00
|
|
|
if (sb_any_quota_loaded(sb) && old_qname) {
|
2010-03-02 04:28:41 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "Cannot change journaled quota options"
|
|
|
|
" when quota turned on");
|
2012-04-16 22:55:26 +00:00
|
|
|
return -1;
|
2010-03-02 04:28:41 +00:00
|
|
|
}
|
2018-10-12 13:28:09 +00:00
|
|
|
rcu_assign_pointer(sbi->s_qf_names[qtype], NULL);
|
|
|
|
synchronize_rcu();
|
|
|
|
kfree(old_qname);
|
2010-03-02 04:28:41 +00:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2012-03-04 04:20:47 +00:00
|
|
|
#define MOPT_SET 0x0001
|
|
|
|
#define MOPT_CLEAR 0x0002
|
|
|
|
#define MOPT_NOSUPPORT 0x0004
|
|
|
|
#define MOPT_EXPLICIT 0x0008
|
|
|
|
#define MOPT_CLEAR_ERR 0x0010
|
|
|
|
#define MOPT_GTE0 0x0020
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2012-03-04 04:20:47 +00:00
|
|
|
#define MOPT_Q 0
|
|
|
|
#define MOPT_QFMT 0x0040
|
|
|
|
#else
|
|
|
|
#define MOPT_Q MOPT_NOSUPPORT
|
|
|
|
#define MOPT_QFMT MOPT_NOSUPPORT
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
2012-03-04 04:20:47 +00:00
|
|
|
#define MOPT_DATAJ 0x0080
|
2013-02-03 04:38:39 +00:00
|
|
|
#define MOPT_NO_EXT2 0x0100
|
|
|
|
#define MOPT_NO_EXT3 0x0200
|
|
|
|
#define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
|
2013-08-28 23:05:07 +00:00
|
|
|
#define MOPT_STRING 0x0400
|
2020-05-28 15:00:00 +00:00
|
|
|
#define MOPT_SKIP 0x0800
|
2020-10-15 20:37:54 +00:00
|
|
|
#define MOPT_2 0x1000
|
2012-03-04 04:20:47 +00:00
|
|
|
|
|
|
|
static const struct mount_opts {
|
|
|
|
int token;
|
|
|
|
int mount_opt;
|
|
|
|
int flags;
|
|
|
|
} ext4_mount_opts[] = {
|
|
|
|
{Opt_minix_df, EXT4_MOUNT_MINIX_DF, MOPT_SET},
|
|
|
|
{Opt_bsd_df, EXT4_MOUNT_MINIX_DF, MOPT_CLEAR},
|
|
|
|
{Opt_grpid, EXT4_MOUNT_GRPID, MOPT_SET},
|
|
|
|
{Opt_nogrpid, EXT4_MOUNT_GRPID, MOPT_CLEAR},
|
|
|
|
{Opt_block_validity, EXT4_MOUNT_BLOCK_VALIDITY, MOPT_SET},
|
|
|
|
{Opt_noblock_validity, EXT4_MOUNT_BLOCK_VALIDITY, MOPT_CLEAR},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_dioread_nolock, EXT4_MOUNT_DIOREAD_NOLOCK,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_SET},
|
|
|
|
{Opt_dioread_lock, EXT4_MOUNT_DIOREAD_NOLOCK,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_CLEAR},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_discard, EXT4_MOUNT_DISCARD, MOPT_SET},
|
|
|
|
{Opt_nodiscard, EXT4_MOUNT_DISCARD, MOPT_CLEAR},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_delalloc, EXT4_MOUNT_DELALLOC,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
|
|
|
|
{Opt_nodelalloc, EXT4_MOUNT_DELALLOC,
|
2013-08-09 03:01:24 +00:00
|
|
|
MOPT_EXT4_ONLY | MOPT_CLEAR},
|
2018-06-13 03:34:57 +00:00
|
|
|
{Opt_warn_on_error, EXT4_MOUNT_WARN_ON_ERROR, MOPT_SET},
|
|
|
|
{Opt_nowarn_on_error, EXT4_MOUNT_WARN_ON_ERROR, MOPT_CLEAR},
|
2014-11-25 21:20:50 +00:00
|
|
|
{Opt_nojournal_checksum, EXT4_MOUNT_JOURNAL_CHECKSUM,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_CLEAR},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_journal_checksum, EXT4_MOUNT_JOURNAL_CHECKSUM,
|
ext4: do not allow journal_opts for fs w/o journal
It turns out that we can pass journal-related mount options to a filesystem
without a journal, and such options are then shown in /proc/mounts.
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered" have
nothing to do with reality because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-19 03:50:26 +00:00
|
|
|
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
|
2013-02-03 04:38:39 +00:00
|
|
|
EXT4_MOUNT_JOURNAL_CHECKSUM),
|
2015-10-19 03:50:26 +00:00
|
|
|
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_noload, EXT4_MOUNT_NOLOAD, MOPT_NO_EXT2 | MOPT_SET},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR},
|
|
|
|
{Opt_err_ro, EXT4_MOUNT_ERRORS_RO, MOPT_SET | MOPT_CLEAR_ERR},
|
|
|
|
{Opt_err_cont, EXT4_MOUNT_ERRORS_CONT, MOPT_SET | MOPT_CLEAR_ERR},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_data_err_abort, EXT4_MOUNT_DATA_ERR_ABORT,
|
2016-03-13 02:55:50 +00:00
|
|
|
MOPT_NO_EXT2},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_data_err_ignore, EXT4_MOUNT_DATA_ERR_ABORT,
|
2016-03-13 02:55:50 +00:00
|
|
|
MOPT_NO_EXT2},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_barrier, EXT4_MOUNT_BARRIER, MOPT_SET},
|
|
|
|
{Opt_nobarrier, EXT4_MOUNT_BARRIER, MOPT_CLEAR},
|
|
|
|
{Opt_noauto_da_alloc, EXT4_MOUNT_NO_AUTO_DA_ALLOC, MOPT_SET},
|
|
|
|
{Opt_auto_da_alloc, EXT4_MOUNT_NO_AUTO_DA_ALLOC, MOPT_CLEAR},
|
|
|
|
{Opt_noinit_itable, EXT4_MOUNT_INIT_INODE_TABLE, MOPT_CLEAR},
|
|
|
|
{Opt_commit, 0, MOPT_GTE0},
|
|
|
|
{Opt_max_batch_time, 0, MOPT_GTE0},
|
|
|
|
{Opt_min_batch_time, 0, MOPT_GTE0},
|
|
|
|
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
|
|
|
|
{Opt_init_itable, 0, MOPT_GTE0},
|
2020-05-28 15:00:00 +00:00
|
|
|
{Opt_dax, EXT4_MOUNT_DAX_ALWAYS, MOPT_SET | MOPT_SKIP},
|
|
|
|
{Opt_dax_always, EXT4_MOUNT_DAX_ALWAYS,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
|
|
|
|
{Opt_dax_inode, EXT4_MOUNT2_DAX_INODE,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
|
|
|
|
{Opt_dax_never, EXT4_MOUNT2_DAX_NEVER,
|
|
|
|
MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_stripe, 0, MOPT_GTE0},
|
2013-02-03 03:52:19 +00:00
|
|
|
{Opt_resuid, 0, MOPT_GTE0},
|
|
|
|
{Opt_resgid, 0, MOPT_GTE0},
|
2015-07-22 03:57:59 +00:00
|
|
|
{Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0},
|
|
|
|
{Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING},
|
|
|
|
{Opt_journal_ioprio, 0, MOPT_NO_EXT2 | MOPT_GTE0},
|
2013-02-03 04:38:39 +00:00
|
|
|
{Opt_data_journal, EXT4_MOUNT_JOURNAL_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
|
|
|
|
{Opt_data_ordered, EXT4_MOUNT_ORDERED_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
|
|
|
|
{Opt_data_writeback, EXT4_MOUNT_WRITEBACK_DATA,
|
|
|
|
MOPT_NO_EXT2 | MOPT_DATAJ},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_user_xattr, EXT4_MOUNT_XATTR_USER, MOPT_SET},
|
|
|
|
{Opt_nouser_xattr, EXT4_MOUNT_XATTR_USER, MOPT_CLEAR},
|
2008-10-11 00:02:48 +00:00
|
|
|
#ifdef CONFIG_EXT4_FS_POSIX_ACL
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_acl, EXT4_MOUNT_POSIX_ACL, MOPT_SET},
|
|
|
|
{Opt_noacl, EXT4_MOUNT_POSIX_ACL, MOPT_CLEAR},
|
2006-10-11 08:20:50 +00:00
|
|
|
#else
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_acl, 0, MOPT_NOSUPPORT},
|
|
|
|
{Opt_noacl, 0, MOPT_NOSUPPORT},
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_nouid32, EXT4_MOUNT_NO_UID32, MOPT_SET},
|
|
|
|
{Opt_debug, EXT4_MOUNT_DEBUG, MOPT_SET},
|
2017-01-11 20:32:22 +00:00
|
|
|
{Opt_debug_want_extra_isize, 0, MOPT_GTE0},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_quota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA, MOPT_SET | MOPT_Q},
|
|
|
|
{Opt_usrquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA,
|
|
|
|
MOPT_SET | MOPT_Q},
|
|
|
|
{Opt_grpquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_GRPQUOTA,
|
|
|
|
MOPT_SET | MOPT_Q},
|
2016-09-06 03:08:16 +00:00
|
|
|
{Opt_prjquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_PRJQUOTA,
|
|
|
|
MOPT_SET | MOPT_Q},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_noquota, (EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
|
2016-09-06 03:08:16 +00:00
|
|
|
EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA),
|
|
|
|
MOPT_CLEAR | MOPT_Q},
|
2020-10-29 15:46:36 +00:00
|
|
|
{Opt_usrjquota, 0, MOPT_Q | MOPT_STRING},
|
|
|
|
{Opt_grpjquota, 0, MOPT_Q | MOPT_STRING},
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_offusrjquota, 0, MOPT_Q},
|
|
|
|
{Opt_offgrpjquota, 0, MOPT_Q},
|
|
|
|
{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
|
|
|
|
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
|
|
|
|
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
|
2012-08-17 13:48:17 +00:00
|
|
|
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
|
2020-05-12 23:32:50 +00:00
|
|
|
{Opt_test_dummy_encryption, 0, MOPT_STRING},
|
2017-06-22 15:55:14 +00:00
|
|
|
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
|
2021-04-01 17:21:29 +00:00
|
|
|
{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
|
2020-07-17 04:14:40 +00:00
|
|
|
MOPT_SET},
|
2021-04-01 17:21:27 +00:00
|
|
|
{Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
|
2020-11-06 03:59:11 +00:00
|
|
|
#ifdef CONFIG_EXT4_DEBUG
|
2020-10-15 20:38:00 +00:00
|
|
|
{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
|
|
|
|
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
|
2020-10-15 20:37:59 +00:00
|
|
|
{Opt_fc_debug_max_replay, 0, MOPT_GTE0},
|
|
|
|
#endif
|
2012-03-04 04:20:47 +00:00
|
|
|
{Opt_err, 0, 0}
|
|
|
|
};
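A table like ext4_mount_opts lets most boolean options share one generic code path. The sketch below shows one way such a table could be consumed; it is an assumption-laden illustration, not the handler from this file. All `DEMO_` and `demo_` names, the bit values and the return convention are hypothetical, and only the set and clear flags are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag, token and mount-bit values for this sketch only. */
#define DEMO_MOPT_SET		0x0001
#define DEMO_MOPT_CLEAR		0x0002

#define DEMO_MOUNT_DISCARD	0x0001
#define DEMO_MOUNT_GRPID	0x0004

enum {
	DEMO_OPT_ERR,
	DEMO_OPT_DISCARD,
	DEMO_OPT_NODISCARD,
	DEMO_OPT_GRPID,
	DEMO_OPT_NOGRPID,
};

struct demo_mount_opts {
	int token;
	unsigned int mount_opt;
	int flags;
};

/* Each on/off pair shares one mount bit and differs only in flags. */
static const struct demo_mount_opts demo_table[] = {
	{DEMO_OPT_DISCARD, DEMO_MOUNT_DISCARD, DEMO_MOPT_SET},
	{DEMO_OPT_NODISCARD, DEMO_MOUNT_DISCARD, DEMO_MOPT_CLEAR},
	{DEMO_OPT_GRPID, DEMO_MOUNT_GRPID, DEMO_MOPT_SET},
	{DEMO_OPT_NOGRPID, DEMO_MOUNT_GRPID, DEMO_MOPT_CLEAR},
	{DEMO_OPT_ERR, 0, 0}
};

/*
 * Look up token in the table and set or clear its bit in *mount_flags.
 * Returns 0 on success and -1 if the token has no table entry.
 */
static int demo_apply(int token, unsigned int *mount_flags)
{
	const struct demo_mount_opts *m;

	assert(mount_flags != NULL);
	for (m = demo_table; m->token != DEMO_OPT_ERR; m++)
		if (m->token == token)
			break;
	if (m->token == DEMO_OPT_ERR)
		return -1;

	if (m->flags & DEMO_MOPT_SET)
		*mount_flags |= m->mount_opt;
	else if (m->flags & DEMO_MOPT_CLEAR)
		*mount_flags &= ~m->mount_opt;
	return 0;
}
```

The design keeps the per-option knowledge in data: adding a new on/off pair is two table rows rather than two new switch cases.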
|
|
|
|
|
2019-04-25 18:05:42 +00:00
|
|
|
#ifdef CONFIG_UNICODE
|
|
|
|
static const struct ext4_sb_encodings {
|
|
|
|
__u16 magic;
|
|
|
|
char *name;
|
|
|
|
char *version;
|
|
|
|
} ext4_sb_encoding_map[] = {
|
|
|
|
{EXT4_ENC_UTF8_12_1, "utf8", "12.1.0"},
|
|
|
|
};
|
|
|
|
|
|
|
|
static int ext4_sb_read_encoding(const struct ext4_super_block *es,
|
|
|
|
const struct ext4_sb_encodings **encoding,
|
|
|
|
__u16 *flags)
|
|
|
|
{
|
|
|
|
__u16 magic = le16_to_cpu(es->s_encoding);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ARRAY_SIZE(ext4_sb_encoding_map); i++)
|
|
|
|
if (magic == ext4_sb_encoding_map[i].magic)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (i >= ARRAY_SIZE(ext4_sb_encoding_map))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
*encoding = &ext4_sb_encoding_map[i];
|
|
|
|
*flags = le16_to_cpu(es->s_encoding_flags);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
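The lookup in ext4_sb_read_encoding() can be tried outside the kernel with a small stand-in. This is only a sketch: sketch_le16_to_cpu() is a hand-written replacement for le16_to_cpu(), the value 1 for SKETCH_ENC_UTF8_12_1 is an assumption made for illustration, and the flags output of the real function is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define SKETCH_ENC_UTF8_12_1 1	/* assumed value, for illustration only */

struct sketch_encoding {
	uint16_t magic;
	const char *name;
	const char *version;
};

static const struct sketch_encoding sketch_encoding_map[] = {
	{ SKETCH_ENC_UTF8_12_1, "utf8", "12.1.0" },
};

/* Decode a little-endian 16-bit on-disk field regardless of host order. */
static uint16_t sketch_le16_to_cpu(const unsigned char raw[2])
{
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/* Linear search over the table; an unknown magic is rejected with -EINVAL. */
static int sketch_read_encoding(const unsigned char raw_magic[2],
				const struct sketch_encoding **encoding)
{
	uint16_t magic = sketch_le16_to_cpu(raw_magic);
	size_t i;

	assert(encoding != NULL);
	for (i = 0; i < SKETCH_ARRAY_SIZE(sketch_encoding_map); i++)
		if (magic == sketch_encoding_map[i].magic)
			break;

	if (i >= SKETCH_ARRAY_SIZE(sketch_encoding_map))
		return -EINVAL;

	*encoding = &sketch_encoding_map[i];
	return 0;
}
```

Leaving the output pointer untouched on failure mirrors the original, which only writes *encoding after the bounds check.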
|
|
|
|
|
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 23:32:50 +00:00
|
|
|
static int ext4_set_test_dummy_encryption(struct super_block *sb,
|
|
|
|
const char *opt,
|
|
|
|
const substring_t *arg,
|
|
|
|
bool is_remount)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_FS_ENCRYPTION
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
int err;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This mount option is just for testing, and it's not worthwhile to
|
|
|
|
* implement the extra complexity (e.g. RCU protection) that would be
|
|
|
|
* needed to allow it to be set or changed during remount. We do allow
|
|
|
|
* it to be specified during remount, but only if there is no change.
|
|
|
|
*/
|
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-17 04:11:35 +00:00
|
|
|
if (is_remount && !sbi->s_dummy_enc_policy.policy) {
|
2020-05-12 23:32:50 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"Can't set test_dummy_encryption on remount");
|
|
|
|
return -1;
|
|
|
|
}
|
2020-09-17 04:11:36 +00:00
|
|
|
err = fscrypt_set_test_dummy_encryption(sb, arg->from,
|
2020-09-17 04:11:35 +00:00
|
|
|
&sbi->s_dummy_enc_policy);
|
2020-05-12 23:32:50 +00:00
|
|
|
if (err) {
|
|
|
|
if (err == -EEXIST)
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"Can't change test_dummy_encryption on remount");
|
|
|
|
else if (err == -EINVAL)
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"Value of option \"%s\" is unrecognized", opt);
|
|
|
|
else
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"Error processing option \"%s\" [%d]",
|
|
|
|
opt, err);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
ext4_msg(sb, KERN_WARNING, "Test dummy encryption mode enabled");
|
|
|
|
#else
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"Test dummy encryption mount option ignored");
|
|
|
|
#endif
|
|
|
|
return 1;
|
|
|
|
}
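The remount rules enforced by ext4_set_test_dummy_encryption() can be summarized in a small user-space model. This is a sketch under stated assumptions: the policy is reduced to a version number (0 meaning "not set"), the no-argument default is taken to be v2 as the commit message says the follow-up patch makes it, and every sketch_* name is hypothetical. The return values follow the function above: 1 for success, -1 for failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DEFAULT_VERSION 2	/* assumed default when no "=vN" is given */

struct sketch_sb {
	int dummy_policy_version;	/* 0 = no dummy policy set */
};

/* Map the option argument to a policy version; 0 means unrecognized. */
static int sketch_parse_version(const char *arg)
{
	if (arg == NULL)
		return SKETCH_DEFAULT_VERSION;
	if (strcmp(arg, "v1") == 0)
		return 1;
	if (strcmp(arg, "v2") == 0)
		return 2;
	return 0;
}

static int sketch_set_test_dummy_encryption(struct sketch_sb *sb,
					     const char *arg, int is_remount)
{
	int version;

	assert(sb != NULL);

	/* The option may not be newly set on remount. */
	if (is_remount && sb->dummy_policy_version == 0)
		return -1;

	version = sketch_parse_version(arg);
	if (version == 0)
		return -1;	/* unrecognized value (the -EINVAL case) */

	/* Respecifying is fine only if nothing changes (the -EEXIST case). */
	if (sb->dummy_policy_version != 0 &&
	    sb->dummy_policy_version != version)
		return -1;

	sb->dummy_policy_version = version;
	return 1;
}
```

Modeling the policy as a plain integer keeps the three outcomes (new on remount, unrecognized value, changed value) visible without any fscrypt types.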
|
|
|
|
|
2021-04-01 17:21:24 +00:00
|
|
|
struct ext4_parsed_options {
|
|
|
|
unsigned long journal_devnum;
|
|
|
|
unsigned int journal_ioprio;
|
2021-04-01 17:21:27 +00:00
|
|
|
int mb_optimize_scan;
|
2021-04-01 17:21:24 +00:00
|
|
|
};
|
|
|
|
|
2012-03-04 04:20:47 +00:00
|
|
|
static int handle_mount_opt(struct super_block *sb, char *opt, int token,
|
2021-04-01 17:21:24 +00:00
|
|
|
substring_t *args, struct ext4_parsed_options *parsed_opts,
|
|
|
|
int is_remount)
|
2012-03-04 04:20:47 +00:00
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
const struct mount_opts *m;
|
2012-02-07 23:41:49 +00:00
|
|
|
kuid_t uid;
|
|
|
|
kgid_t gid;
|
2012-03-04 04:20:47 +00:00
|
|
|
int arg = 0;
|
|
|
|
|
2012-04-16 22:55:26 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
if (token == Opt_usrjquota)
|
|
|
|
return set_qf_name(sb, USRQUOTA, &args[0]);
|
|
|
|
else if (token == Opt_grpjquota)
|
|
|
|
return set_qf_name(sb, GRPQUOTA, &args[0]);
|
|
|
|
else if (token == Opt_offusrjquota)
|
|
|
|
return clear_qf_name(sb, USRQUOTA);
|
|
|
|
else if (token == Opt_offgrpjquota)
|
|
|
|
return clear_qf_name(sb, GRPQUOTA);
|
|
|
|
#endif
|
2012-03-04 04:20:47 +00:00
|
|
|
switch (token) {
|
2012-03-05 03:06:20 +00:00
|
|
|
case Opt_noacl:
|
|
|
|
case Opt_nouser_xattr:
|
|
|
|
ext4_msg(sb, KERN_WARNING, deprecated_msg, opt, "3.5");
|
|
|
|
break;
|
2012-03-04 04:20:47 +00:00
|
|
|
case Opt_sb:
|
|
|
|
return 1; /* handled by get_sb_block() */
|
|
|
|
case Opt_removed:
|
2013-02-03 04:09:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING, "Ignoring removed %s option", opt);
|
2012-03-04 04:20:47 +00:00
|
|
|
return 1;
|
|
|
|
case Opt_abort:
|
2020-11-06 03:59:09 +00:00
|
|
|
ext4_set_mount_flag(sb, EXT4_MF_FS_ABORTED);
|
2012-03-04 04:20:47 +00:00
|
|
|
return 1;
|
|
|
|
case Opt_i_version:
|
2017-10-18 20:56:26 +00:00
|
|
|
sb->s_flags |= SB_I_VERSION;
|
2012-03-04 04:20:47 +00:00
|
|
|
return 1;
|
2015-02-02 05:37:02 +00:00
|
|
|
case Opt_lazytime:
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags |= SB_LAZYTIME;
|
2015-02-02 05:37:02 +00:00
|
|
|
return 1;
|
|
|
|
case Opt_nolazytime:
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags &= ~SB_LAZYTIME;
|
2015-02-02 05:37:02 +00:00
|
|
|
return 1;
|
2020-07-02 01:56:07 +00:00
|
|
|
case Opt_inlinecrypt:
|
|
|
|
#ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
|
|
|
|
sb->s_flags |= SB_INLINECRYPT;
|
|
|
|
#else
|
|
|
|
ext4_msg(sb, KERN_ERR, "inline encryption not supported");
|
|
|
|
#endif
|
|
|
|
return 1;
|
2012-03-04 04:20:47 +00:00
|
|
|
}
|
|
|
|
|
2013-02-03 04:09:36 +00:00
|
|
|
for (m = ext4_mount_opts; m->token != Opt_err; m++)
|
|
|
|
if (token == m->token)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (m->token == Opt_err) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Unrecognized mount option \"%s\" "
|
|
|
|
"or missing value", opt);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2013-02-03 04:38:39 +00:00
|
|
|
if ((m->flags & MOPT_NO_EXT2) && IS_EXT2_SB(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Mount option \"%s\" incompatible with ext2", opt);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
if ((m->flags & MOPT_NO_EXT3) && IS_EXT3_SB(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Mount option \"%s\" incompatible with ext3", opt);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do, e.g.,
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-28 23:05:07 +00:00
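The "Journal device: 0x0701" value in the commit message can be reproduced by hand. The sketch below restates the arithmetic of the kernel's new_encode_dev() helper in user space, taking major and minor numbers directly instead of a dev_t; sketch_new_encode_dev() is a hypothetical name, and the worked value assumes /dev/loop1 is major 7, minor 1.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a device number the way new_encode_dev() does: the low 8 bits of
 * the minor, then the 12-bit major, then the remaining minor bits.
 */
static uint32_t sketch_new_encode_dev(unsigned int major, unsigned int minor)
{
	assert(major < (1u << 12));	/* major numbers are 12 bits wide */
	assert(minor < (1u << 20));	/* minor numbers are 20 bits wide */
	return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}
```

With major 7 and minor 1 this gives (1) | (7 << 8) = 0x0701, matching the dumpe2fs output quoted in the commit message.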
|
|
|
if (args->from && !(m->flags & MOPT_STRING) && match_int(args, &arg))
|
2013-02-03 04:09:36 +00:00
|
|
|
return -1;
|
|
|
|
if (args->from && (m->flags & MOPT_GTE0) && (arg < 0))
|
|
|
|
return -1;
|
2015-10-19 03:35:32 +00:00
|
|
|
if (m->flags & MOPT_EXPLICIT) {
|
|
|
|
if (m->mount_opt & EXT4_MOUNT_DELALLOC) {
|
|
|
|
set_opt2(sb, EXPLICIT_DELALLOC);
|
ext4: do not allow journal_opts for fs w/o journal
It turns out that journal-related mount options can be passed to a filesystem
that has no journal, and those options are then shown in /proc/mounts.
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered" do
not reflect reality, because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-19 03:50:26 +00:00
|
|
|
} else if (m->mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) {
|
|
|
|
set_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM);
|
2015-10-19 03:35:32 +00:00
|
|
|
} else
|
|
|
|
return -1;
|
|
|
|
}
|
2013-02-03 04:09:36 +00:00
|
|
|
if (m->flags & MOPT_CLEAR_ERR)
|
|
|
|
clear_opt(sb, ERRORS_MASK);
|
|
|
|
if (token == Opt_noquota && sb_any_quota_loaded(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Cannot change quota "
|
|
|
|
"options when quota turned on");
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (m->flags & MOPT_NOSUPPORT) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "%s option not supported", opt);
|
|
|
|
} else if (token == Opt_commit) {
|
|
|
|
if (arg == 0)
|
|
|
|
arg = JBD2_DEFAULT_MAX_COMMIT_AGE;
|
2019-08-28 15:25:01 +00:00
|
|
|
else if (arg > INT_MAX / HZ) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Invalid commit interval %d, "
|
|
|
|
"must be smaller than %d",
|
|
|
|
arg, INT_MAX / HZ);
|
|
|
|
return -1;
|
|
|
|
}
|
2013-02-03 04:09:36 +00:00
|
|
|
sbi->s_commit_interval = HZ * arg;
|
2017-01-11 20:32:22 +00:00
|
|
|
} else if (token == Opt_debug_want_extra_isize) {
|
2019-12-15 06:09:03 +00:00
|
|
|
if ((arg & 1) ||
|
|
|
|
(arg < 4) ||
|
|
|
|
(arg > (sbi->s_inode_size - EXT4_GOOD_OLD_INODE_SIZE))) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Invalid want_extra_isize %d", arg);
|
|
|
|
return -1;
|
|
|
|
}
|
2017-01-11 20:32:22 +00:00
|
|
|
sbi->s_want_extra_isize = arg;
|
2013-02-03 04:09:36 +00:00
|
|
|
} else if (token == Opt_max_batch_time) {
|
|
|
|
sbi->s_max_batch_time = arg;
|
|
|
|
} else if (token == Opt_min_batch_time) {
|
|
|
|
sbi->s_min_batch_time = arg;
|
|
|
|
} else if (token == Opt_inode_readahead_blks) {
|
2013-02-03 04:14:31 +00:00
|
|
|
if (arg && (arg > (1 << 30) || !is_power_of_2(arg))) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"EXT4-fs: inode_readahead_blks must be "
|
|
|
|
"0 or a power of 2 smaller than 2^31");
|
2012-03-04 04:20:47 +00:00
|
|
|
return -1;
|
2013-02-03 04:09:36 +00:00
|
|
|
}
|
|
|
|
sbi->s_inode_readahead_blks = arg;
|
|
|
|
} else if (token == Opt_init_itable) {
|
|
|
|
set_opt(sb, INIT_INODE_TABLE);
|
|
|
|
if (!args->from)
|
|
|
|
arg = EXT4_DEF_LI_WAIT_MULT;
|
|
|
|
sbi->s_li_wait_mult = arg;
|
|
|
|
} else if (token == Opt_max_dir_size_kb) {
|
|
|
|
sbi->s_max_dir_size_kb = arg;
|
2020-10-15 20:37:59 +00:00
|
|
|
#ifdef CONFIG_EXT4_DEBUG
|
|
|
|
} else if (token == Opt_fc_debug_max_replay) {
|
|
|
|
sbi->s_fc_debug_max_replay = arg;
|
|
|
|
#endif
|
2013-02-03 04:09:36 +00:00
|
|
|
} else if (token == Opt_stripe) {
|
|
|
|
sbi->s_stripe = arg;
|
|
|
|
} else if (token == Opt_resuid) {
|
|
|
|
uid = make_kuid(current_user_ns(), arg);
|
|
|
|
if (!uid_valid(uid)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Invalid uid value %d", arg);
|
2012-03-04 04:20:47 +00:00
|
|
|
return -1;
|
|
|
|
}
|
2013-02-03 04:09:36 +00:00
|
|
|
sbi->s_resuid = uid;
|
|
|
|
} else if (token == Opt_resgid) {
|
|
|
|
gid = make_kgid(current_user_ns(), arg);
|
|
|
|
if (!gid_valid(gid)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Invalid gid value %d", arg);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
sbi->s_resgid = gid;
|
|
|
|
} else if (token == Opt_journal_dev) {
|
|
|
|
if (is_remount) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Cannot specify journal on remount");
|
|
|
|
return -1;
|
|
|
|
}
|
2021-04-01 17:21:24 +00:00
|
|
|
parsed_opts->journal_devnum = arg;
|
2013-08-28 23:05:07 +00:00
|
|
|
} else if (token == Opt_journal_path) {
|
|
|
|
char *journal_path;
|
|
|
|
struct inode *journal_inode;
|
|
|
|
struct path path;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
if (is_remount) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Cannot specify journal on remount");
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
journal_path = match_strdup(&args[0]);
|
|
|
|
if (!journal_path) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "error: could not dup "
|
|
|
|
"journal device string");
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = kern_path(journal_path, LOOKUP_FOLLOW, &path);
|
|
|
|
if (error) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "error: could not find "
|
|
|
|
"journal device path: error %d", error);
|
|
|
|
kfree(journal_path);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2015-03-17 22:25:59 +00:00
|
|
|
journal_inode = d_inode(path.dentry);
|
2013-08-28 23:05:07 +00:00
|
|
|
if (!S_ISBLK(journal_inode->i_mode)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "error: journal path %s "
|
|
|
|
"is not a block device", journal_path);
|
|
|
|
path_put(&path);
|
|
|
|
kfree(journal_path);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2021-04-01 17:21:24 +00:00
|
|
|
parsed_opts->journal_devnum = new_encode_dev(journal_inode->i_rdev);
|
2013-08-28 23:05:07 +00:00
|
|
|
path_put(&path);
|
|
|
|
kfree(journal_path);
|
2013-02-03 04:09:36 +00:00
|
|
|
} else if (token == Opt_journal_ioprio) {
|
|
|
|
if (arg > 7) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Invalid journal IO priority"
|
|
|
|
" (must be 0-7)");
|
|
|
|
return -1;
|
|
|
|
}
|
2021-04-01 17:21:24 +00:00
|
|
|
parsed_opts->journal_ioprio =
|
2013-02-03 04:09:36 +00:00
|
|
|
IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, arg);
|
2015-04-16 05:56:00 +00:00
|
|
|
} else if (token == Opt_test_dummy_encryption) {
|
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
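The option syntax described in this message can be sketched as a small parser.
This is an illustration only, not the fscrypt implementation: the enum, the
default macro, and sketch_parse_dummy_encryption() are hypothetical names, and
the default is set to v1 because the message says that is the current default.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not kernel constants. */
enum sketch_policy {
	SKETCH_POLICY_INVALID = -1,
	SKETCH_POLICY_V1 = 1,
	SKETCH_POLICY_V2 = 2
};

/* "currently v1" according to the commit message above */
#define SKETCH_POLICY_DEFAULT SKETCH_POLICY_V1

/*
 * Accepts "test_dummy_encryption", "test_dummy_encryption=v1" and
 * "test_dummy_encryption=v2"; anything else is rejected.
 */
enum sketch_policy sketch_parse_dummy_encryption(const char *opt)
{
	static const char name[] = "test_dummy_encryption";
	const size_t n = sizeof(name) - 1;

	assert(opt != NULL);
	if (strncmp(opt, name, n) != 0)
		return SKETCH_POLICY_INVALID;
	if (opt[n] == '\0')
		return SKETCH_POLICY_DEFAULT;	/* no argument given */
	if (opt[n] != '=')
		return SKETCH_POLICY_INVALID;
	if (strcmp(opt + n + 1, "v1") == 0)
		return SKETCH_POLICY_V1;
	if (strcmp(opt + n + 1, "v2") == 0)
		return SKETCH_POLICY_V2;
	return SKETCH_POLICY_INVALID;
}
```

The remount restriction mentioned in the message is deliberately left out; the
sketch covers the syntax only.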
2020-05-12 23:32:50 +00:00
|
|
|
return ext4_set_test_dummy_encryption(sb, opt, &args[0],
|
|
|
|
is_remount);
|
2013-02-03 04:09:36 +00:00
|
|
|
} else if (m->flags & MOPT_DATAJ) {
|
|
|
|
if (is_remount) {
|
|
|
|
if (!sbi->s_journal)
|
|
|
|
ext4_msg(sb, KERN_WARNING, "Remounting file system with no journal so ignoring journalled data option");
|
|
|
|
else if (test_opt(sb, DATA_FLAGS) != m->mount_opt) {
|
2013-02-03 03:52:19 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
2012-03-04 04:20:47 +00:00
|
|
|
"Cannot change data mode on remount");
|
|
|
|
return -1;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2012-03-04 04:20:47 +00:00
|
|
|
} else {
|
2013-02-03 04:09:36 +00:00
|
|
|
clear_opt(sb, DATA_FLAGS);
|
|
|
|
sbi->s_mount_opt |= m->mount_opt;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2013-02-03 04:09:36 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
} else if (m->flags & MOPT_QFMT) {
|
|
|
|
if (sb_any_quota_loaded(sb) &&
|
|
|
|
sbi->s_jquota_fmt != m->mount_opt) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Cannot change journaled "
|
|
|
|
"quota options when quota turned on");
|
|
|
|
return -1;
|
|
|
|
}
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_quota(sb)) {
|
2016-04-03 21:03:37 +00:00
|
|
|
ext4_msg(sb, KERN_INFO,
|
|
|
|
"Quota format mount options ignored "
|
2013-03-02 22:57:08 +00:00
|
|
|
"when QUOTA feature is enabled");
|
2016-04-03 21:03:37 +00:00
|
|
|
return 1;
|
2013-03-02 22:57:08 +00:00
|
|
|
}
|
2013-02-03 04:09:36 +00:00
|
|
|
sbi->s_jquota_fmt = m->mount_opt;
|
2015-02-16 23:59:38 +00:00
|
|
|
#endif
|
2020-05-28 15:00:00 +00:00
|
|
|
} else if (token == Opt_dax || token == Opt_dax_always ||
|
|
|
|
token == Opt_dax_inode || token == Opt_dax_never) {
|
2015-09-29 19:48:11 +00:00
|
|
|
#ifdef CONFIG_FS_DAX
|
2020-05-28 15:00:00 +00:00
|
|
|
switch (token) {
|
|
|
|
case Opt_dax:
|
|
|
|
case Opt_dax_always:
|
2020-06-10 15:16:37 +00:00
|
|
|
if (is_remount &&
|
|
|
|
(!(sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) ||
|
|
|
|
(sbi->s_mount_opt2 & EXT4_MOUNT2_DAX_NEVER))) {
|
|
|
|
fail_dax_change_remount:
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't change "
|
|
|
|
"dax mount option while remounting");
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
if (is_remount &&
|
|
|
|
(test_opt(sb, DATA_FLAGS) ==
|
|
|
|
EXT4_MOUNT_JOURNAL_DATA)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"both data=journal and dax");
|
|
|
|
return -1;
|
|
|
|
}
|
2020-05-28 15:00:00 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
|
|
|
|
sbi->s_mount_opt |= EXT4_MOUNT_DAX_ALWAYS;
|
|
|
|
sbi->s_mount_opt2 &= ~EXT4_MOUNT2_DAX_NEVER;
|
|
|
|
break;
|
|
|
|
case Opt_dax_never:
|
2020-06-10 15:16:37 +00:00
|
|
|
if (is_remount &&
|
|
|
|
(!(sbi->s_mount_opt2 & EXT4_MOUNT2_DAX_NEVER) ||
|
|
|
|
(sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS)))
|
|
|
|
goto fail_dax_change_remount;
|
2020-05-28 15:00:00 +00:00
|
|
|
sbi->s_mount_opt2 |= EXT4_MOUNT2_DAX_NEVER;
|
|
|
|
sbi->s_mount_opt &= ~EXT4_MOUNT_DAX_ALWAYS;
|
|
|
|
break;
|
|
|
|
case Opt_dax_inode:
|
2020-06-10 15:16:37 +00:00
|
|
|
if (is_remount &&
|
|
|
|
((sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) ||
|
|
|
|
(sbi->s_mount_opt2 & EXT4_MOUNT2_DAX_NEVER) ||
|
|
|
|
!(sbi->s_mount_opt2 & EXT4_MOUNT2_DAX_INODE)))
|
|
|
|
goto fail_dax_change_remount;
|
2020-05-28 15:00:00 +00:00
|
|
|
sbi->s_mount_opt &= ~EXT4_MOUNT_DAX_ALWAYS;
|
|
|
|
sbi->s_mount_opt2 &= ~EXT4_MOUNT2_DAX_NEVER;
|
|
|
|
/* Strictly for printing options */
|
|
|
|
sbi->s_mount_opt2 |= EXT4_MOUNT2_DAX_INODE;
|
|
|
|
break;
|
|
|
|
}
|
2015-09-29 19:48:11 +00:00
|
|
|
#else
|
2015-02-16 23:59:38 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "dax option not supported");
|
2020-05-28 15:00:00 +00:00
|
|
|
sbi->s_mount_opt2 |= EXT4_MOUNT2_DAX_NEVER;
|
|
|
|
sbi->s_mount_opt &= ~EXT4_MOUNT_DAX_ALWAYS;
|
2015-02-16 23:59:38 +00:00
|
|
|
return -1;
|
2013-02-03 04:09:36 +00:00
|
|
|
#endif
|
2016-03-13 02:55:50 +00:00
|
|
|
} else if (token == Opt_data_err_abort) {
|
|
|
|
sbi->s_mount_opt |= m->mount_opt;
|
|
|
|
} else if (token == Opt_data_err_ignore) {
|
|
|
|
sbi->s_mount_opt &= ~m->mount_opt;
|
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fall back to cr 1
immediately.
At cr 1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For cr 1, we maintain an rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at cr 1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fall back to linear search
even in cr 0 and cr 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, the following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. The following command was run three times
consecutively:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
|         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s                     | 2m47.642s                 |
| 2nd run | 2m28.390s                    | 0m0.611s                  |
| 3rd run | 2m26.530s                    | 0m1.255s                  |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
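The cr 0 lookup described in this message can be sketched in a few lines. This
is a simplified illustration under stated assumptions, not the mballoc code:
every name prefixed with sketch_ or SKETCH_ is hypothetical, the number of
orders is an arbitrary value, and locking is omitted. Each group sits on the
list for its largest free order, and the lookup walks the lists upward from the
requested order. Because the number of orders is a small fixed bound, the walk
does not depend on the number of groups.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ORDER 14	/* hypothetical number of buddy orders */
#define SKETCH_NO_GROUP  (-1)

struct sketch_group {
	int id;
	int largest_free_order;		/* largest free order in the buddy bitmap */
	struct sketch_group *next;
};

/* One singly linked list of groups per possible largest free order. */
struct sketch_order_lists {
	struct sketch_group *head[SKETCH_MAX_ORDER];
};

void sketch_lists_init(struct sketch_order_lists *ol)
{
	assert(ol != NULL);
	for (int i = 0; i < SKETCH_MAX_ORDER; i++)
		ol->head[i] = NULL;
}

/* Put a group on the list matching its largest free order. */
int sketch_lists_insert(struct sketch_order_lists *ol, struct sketch_group *g)
{
	assert(ol != NULL && g != NULL);
	if (g->largest_free_order < 0 ||
	    g->largest_free_order >= SKETCH_MAX_ORDER)
		return -1;
	g->next = ol->head[g->largest_free_order];
	ol->head[g->largest_free_order] = g;
	return 0;
}

/*
 * cr 0: return the id of a group whose largest free order is the
 * smallest one that still satisfies the request, or SKETCH_NO_GROUP
 * so that the caller can fall back to cr 1.
 */
int sketch_cr0_pick(const struct sketch_order_lists *ol, int request_order)
{
	assert(ol != NULL);
	if (request_order < 0)
		request_order = 0;
	for (int order = request_order; order < SKETCH_MAX_ORDER; order++)
		if (ol->head[order] != NULL)
			return ol->head[order]->id;
	return SKETCH_NO_GROUP;
}
```

The cr 1 path would replace the array of lists with a tree keyed by average
fragment size; that part is not sketched here.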
2021-04-01 17:21:27 +00:00
|
|
|
} else if (token == Opt_mb_optimize_scan) {
|
|
|
|
if (arg != 0 && arg != 1) {
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"mb_optimize_scan should be set to 0 or 1.");
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
parsed_opts->mb_optimize_scan = arg;
|
2013-02-03 04:09:36 +00:00
|
|
|
} else {
|
|
|
|
if (!args->from)
|
|
|
|
arg = 1;
|
|
|
|
if (m->flags & MOPT_CLEAR)
|
|
|
|
arg = !arg;
|
|
|
|
else if (unlikely(!(m->flags & MOPT_SET))) {
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"buggy handling of option %s", opt);
|
|
|
|
WARN_ON(1);
|
|
|
|
return -1;
|
|
|
|
}
|
2020-10-15 20:37:54 +00:00
|
|
|
if (m->flags & MOPT_2) {
|
|
|
|
if (arg != 0)
|
|
|
|
sbi->s_mount_opt2 |= m->mount_opt;
|
|
|
|
else
|
|
|
|
sbi->s_mount_opt2 &= ~m->mount_opt;
|
|
|
|
} else {
|
|
|
|
if (arg != 0)
|
|
|
|
sbi->s_mount_opt |= m->mount_opt;
|
|
|
|
else
|
|
|
|
sbi->s_mount_opt &= ~m->mount_opt;
|
|
|
|
}
|
2012-03-04 04:20:47 +00:00
|
|
|
}
|
2013-02-03 04:09:36 +00:00
|
|
|
return 1;
|
2012-03-04 04:20:47 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int parse_options(char *options, struct super_block *sb,
|
2021-04-01 17:21:24 +00:00
|
|
|
struct ext4_parsed_options *ret_opts,
|
2012-03-04 04:20:47 +00:00
|
|
|
int is_remount)
|
|
|
|
{
|
2019-11-11 02:25:23 +00:00
|
|
|
struct ext4_sb_info __maybe_unused *sbi = EXT4_SB(sb);
|
2018-10-12 13:28:09 +00:00
|
|
|
char *p, __maybe_unused *usr_qf_name, __maybe_unused *grp_qf_name;
|
2012-03-04 04:20:47 +00:00
|
|
|
substring_t args[MAX_OPT_ARGS];
|
|
|
|
int token;
|
|
|
|
|
|
|
|
if (!options)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
while ((p = strsep(&options, ",")) != NULL) {
|
|
|
|
if (!*p)
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* Initialize args struct so we know whether arg was
|
|
|
|
* found; some options take optional arguments.
|
|
|
|
*/
|
2012-08-19 02:29:18 +00:00
|
|
|
args[0].to = args[0].from = NULL;
|
2012-03-04 04:20:47 +00:00
|
|
|
token = match_token(p, tokens, args);
|
2021-04-01 17:21:24 +00:00
|
|
|
if (handle_mount_opt(sb, p, token, args, ret_opts,
|
|
|
|
is_remount) < 0)
|
2012-03-04 04:20:47 +00:00
|
|
|
return 0;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
#ifdef CONFIG_QUOTA
|
2016-09-06 03:08:16 +00:00
|
|
|
/*
|
|
|
|
* We do the test below only for project quotas. 'usrquota' and
|
|
|
|
* 'grpquota' mount options are allowed even without quota feature
|
|
|
|
* to support legacy quotas in quota files.
|
|
|
|
*/
|
|
|
|
if (test_opt(sb, PRJQUOTA) && !ext4_has_feature_project(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Project quota feature not enabled. "
|
|
|
|
"Cannot enable project quota enforcement.");
|
|
|
|
return 0;
|
|
|
|
}
|
2018-10-12 13:28:09 +00:00
|
|
|
usr_qf_name = get_qf_name(sb, sbi, USRQUOTA);
|
|
|
|
grp_qf_name = get_qf_name(sb, sbi, GRPQUOTA);
|
|
|
|
if (usr_qf_name || grp_qf_name) {
|
|
|
|
if (test_opt(sb, USRQUOTA) && usr_qf_name)
|
2010-12-16 01:26:48 +00:00
|
|
|
clear_opt(sb, USRQUOTA);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2018-10-12 13:28:09 +00:00
|
|
|
if (test_opt(sb, GRPQUOTA) && grp_qf_name)
|
2010-12-16 01:26:48 +00:00
|
|
|
clear_opt(sb, GRPQUOTA);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2010-03-02 04:28:41 +00:00
|
|
|
if (test_opt(sb, GRPQUOTA) || test_opt(sb, USRQUOTA)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "old and new quota "
|
|
|
|
"format mixing");
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!sbi->s_jquota_fmt) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "journaled quota format "
|
|
|
|
"not specified");
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
2020-03-27 20:07:44 +00:00
|
|
|
if (test_opt(sb, DIOREAD_NOLOCK)) {
|
|
|
|
int blocksize =
|
|
|
|
BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
|
|
|
|
if (blocksize < PAGE_SIZE)
|
|
|
|
ext4_msg(sb, KERN_WARNING, "Warning: mounting with an "
|
|
|
|
"experimental mount option 'dioread_nolock' "
|
|
|
|
"for blocksize < PAGE_SIZE");
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-03-04 04:20:50 +00:00
|
|
|
static inline void ext4_show_quota_options(struct seq_file *seq,
|
|
|
|
struct super_block *sb)
|
|
|
|
{
|
|
|
|
#if defined(CONFIG_QUOTA)
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2018-10-12 13:28:09 +00:00
|
|
|
char *usr_qf_name, *grp_qf_name;
|
2012-03-04 04:20:50 +00:00
|
|
|
|
|
|
|
if (sbi->s_jquota_fmt) {
|
|
|
|
char *fmtname = "";
|
|
|
|
|
|
|
|
switch (sbi->s_jquota_fmt) {
|
|
|
|
case QFMT_VFS_OLD:
|
|
|
|
fmtname = "vfsold";
|
|
|
|
break;
|
|
|
|
case QFMT_VFS_V0:
|
|
|
|
fmtname = "vfsv0";
|
|
|
|
break;
|
|
|
|
case QFMT_VFS_V1:
|
|
|
|
fmtname = "vfsv1";
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
seq_printf(seq, ",jqfmt=%s", fmtname);
|
|
|
|
}
|
|
|
|
|
2018-10-12 13:28:09 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
usr_qf_name = rcu_dereference(sbi->s_qf_names[USRQUOTA]);
|
|
|
|
grp_qf_name = rcu_dereference(sbi->s_qf_names[GRPQUOTA]);
|
|
|
|
if (usr_qf_name)
|
|
|
|
seq_show_option(seq, "usrjquota", usr_qf_name);
|
|
|
|
if (grp_qf_name)
|
|
|
|
seq_show_option(seq, "grpjquota", grp_qf_name);
|
|
|
|
rcu_read_unlock();
|
2012-03-04 04:20:50 +00:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2012-03-05 00:27:31 +00:00
|
|
|
static const char *token2str(int token)
|
|
|
|
{
|
2012-09-24 02:49:12 +00:00
|
|
|
const struct match_token *t;
|
2012-03-05 00:27:31 +00:00
|
|
|
|
|
|
|
for (t = tokens; t->token != Opt_err; t++)
|
|
|
|
if (t->token == token && !strchr(t->pattern, '='))
|
|
|
|
break;
|
|
|
|
return t->pattern;
|
|
|
|
}
|
|
|
|
|
2012-03-04 04:20:50 +00:00
|
|
|
/*
|
|
|
|
* Show an option if
|
|
|
|
* - it's set to a non-default value OR
|
|
|
|
* - if the per-sb default is different from the global default
|
|
|
|
*/
|
2012-03-05 01:21:38 +00:00
|
|
|
static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
|
|
|
|
int nodefs)
|
2012-03-04 04:20:50 +00:00
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
struct ext4_super_block *es = sbi->s_es;
|
2018-03-30 04:51:10 +00:00
|
|
|
int def_errors, def_mount_opt = sbi->s_def_mount_opt;
|
2012-03-05 00:27:31 +00:00
|
|
|
const struct mount_opts *m;
|
2012-03-05 01:21:38 +00:00
|
|
|
char sep = nodefs ? '\n' : ',';
|
2012-03-04 04:20:50 +00:00
|
|
|
|
2012-03-05 01:21:38 +00:00
|
|
|
#define SEQ_OPTS_PUTS(str) seq_printf(seq, "%c" str, sep)
|
|
|
|
#define SEQ_OPTS_PRINT(str, arg) seq_printf(seq, "%c" str, sep, arg)
|
2012-03-04 04:20:50 +00:00
|
|
|
|
|
|
|
if (sbi->s_sb_block != 1)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("sb=%llu", sbi->s_sb_block);
|
|
|
|
|
|
|
|
for (m = ext4_mount_opts; m->token != Opt_err; m++) {
|
|
|
|
int want_set = m->flags & MOPT_SET;
|
|
|
|
if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
|
2020-05-28 15:00:00 +00:00
|
|
|
(m->flags & MOPT_CLEAR_ERR) || m->flags & MOPT_SKIP)
|
2012-03-05 00:27:31 +00:00
|
|
|
continue;
|
2018-03-30 04:51:10 +00:00
|
|
|
if (!nodefs && !(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
|
2012-03-05 00:27:31 +00:00
|
|
|
continue; /* skip if same as the default */
|
|
|
|
if ((want_set &&
|
|
|
|
(sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
|
|
|
|
(!want_set && (sbi->s_mount_opt & m->mount_opt)))
|
|
|
|
continue; /* select Opt_noFoo vs Opt_Foo */
|
|
|
|
SEQ_OPTS_PRINT("%s", token2str(m->token));
|
2012-03-04 04:20:50 +00:00
|
|
|
}
|
2012-03-05 00:27:31 +00:00
|
|
|
|
2012-02-07 23:41:49 +00:00
|
|
|
if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
|
2012-03-05 00:27:31 +00:00
|
|
|
le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
|
2012-02-07 23:41:49 +00:00
|
|
|
SEQ_OPTS_PRINT("resuid=%u",
|
|
|
|
from_kuid_munged(&init_user_ns, sbi->s_resuid));
|
|
|
|
if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
|
2012-03-05 00:27:31 +00:00
|
|
|
le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
|
2012-02-07 23:41:49 +00:00
|
|
|
SEQ_OPTS_PRINT("resgid=%u",
|
|
|
|
from_kgid_munged(&init_user_ns, sbi->s_resgid));
|
2012-03-05 01:21:38 +00:00
|
|
|
def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
|
2012-03-05 00:27:31 +00:00
|
|
|
if (test_opt(sb, ERRORS_RO) && def_errors != EXT4_ERRORS_RO)
|
|
|
|
SEQ_OPTS_PUTS("errors=remount-ro");
|
2012-03-04 04:20:50 +00:00
|
|
|
if (test_opt(sb, ERRORS_CONT) && def_errors != EXT4_ERRORS_CONTINUE)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PUTS("errors=continue");
|
2012-03-04 04:20:50 +00:00
|
|
|
if (test_opt(sb, ERRORS_PANIC) && def_errors != EXT4_ERRORS_PANIC)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PUTS("errors=panic");
|
2012-03-05 01:21:38 +00:00
|
|
|
if (nodefs || sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("commit=%lu", sbi->s_commit_interval / HZ);
|
2012-03-05 01:21:38 +00:00
|
|
|
if (nodefs || sbi->s_min_batch_time != EXT4_DEF_MIN_BATCH_TIME)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
|
2012-03-05 01:21:38 +00:00
|
|
|
if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
|
2017-10-18 20:56:26 +00:00
|
|
|
if (sb->s_flags & SB_I_VERSION)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PUTS("i_version");
|
2012-03-05 01:21:38 +00:00
|
|
|
if (nodefs || sbi->s_stripe)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
|
2018-03-30 04:51:10 +00:00
|
|
|
if (nodefs || EXT4_MOUNT_DATA_FLAGS &
|
|
|
|
(sbi->s_mount_opt ^ def_mount_opt)) {
|
2012-03-05 00:27:31 +00:00
|
|
|
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
|
|
|
|
SEQ_OPTS_PUTS("data=journal");
|
|
|
|
else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
|
|
|
|
SEQ_OPTS_PUTS("data=ordered");
|
|
|
|
else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
|
|
|
|
SEQ_OPTS_PUTS("data=writeback");
|
|
|
|
}
|
2012-03-05 01:21:38 +00:00
|
|
|
if (nodefs ||
|
|
|
|
sbi->s_inode_readahead_blks != EXT4_DEF_INODE_READAHEAD_BLKS)
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("inode_readahead_blks=%u",
|
|
|
|
sbi->s_inode_readahead_blks);
|
2012-03-04 04:20:50 +00:00
|
|
|
|
2018-03-30 04:53:33 +00:00
|
|
|
if (test_opt(sb, INIT_INODE_TABLE) && (nodefs ||
|
2012-03-05 01:21:38 +00:00
|
|
|
(sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
|
2012-03-05 00:27:31 +00:00
|
|
|
SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
|
2012-08-17 13:48:17 +00:00
|
|
|
if (nodefs || sbi->s_max_dir_size_kb)
|
|
|
|
SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
|
2016-03-13 02:55:50 +00:00
|
|
|
if (test_opt(sb, DATA_ERR_ABORT))
|
|
|
|
SEQ_OPTS_PUTS("data_err=abort");
|
2020-05-12 23:32:50 +00:00
|
|
|
|
|
|
|
fscrypt_show_test_dummy_encryption(seq, sep, sb);
|
2012-03-04 04:20:50 +00:00
|
|
|
|
2020-07-02 01:56:07 +00:00
|
|
|
if (sb->s_flags & SB_INLINECRYPT)
|
|
|
|
SEQ_OPTS_PUTS("inlinecrypt");
|
|
|
|
|
2020-05-28 15:00:00 +00:00
|
|
|
if (test_opt(sb, DAX_ALWAYS)) {
|
|
|
|
if (IS_EXT2_SB(sb))
|
|
|
|
SEQ_OPTS_PUTS("dax");
|
|
|
|
else
|
|
|
|
SEQ_OPTS_PUTS("dax=always");
|
|
|
|
} else if (test_opt2(sb, DAX_NEVER)) {
|
|
|
|
SEQ_OPTS_PUTS("dax=never");
|
|
|
|
} else if (test_opt2(sb, DAX_INODE)) {
|
|
|
|
SEQ_OPTS_PUTS("dax=inode");
|
|
|
|
}
|
2012-03-04 04:20:50 +00:00
|
|
|
ext4_show_quota_options(seq, sb);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-03-05 01:21:38 +00:00
|
|
|
static int ext4_show_options(struct seq_file *seq, struct dentry *root)
|
|
|
|
{
|
|
|
|
return _ext4_show_options(seq, root->d_sb, 0);
|
|
|
|
}
|
|
|
|
|
2015-09-23 16:46:17 +00:00
|
|
|
int ext4_seq_options_show(struct seq_file *seq, void *offset)
|
2012-03-05 01:21:38 +00:00
|
|
|
{
|
|
|
|
struct super_block *sb = seq->private;
|
|
|
|
int rc;
|
|
|
|
|
2017-07-17 07:45:34 +00:00
|
|
|
seq_puts(seq, sb_rdonly(sb) ? "ro" : "rw");
|
2012-03-05 01:21:38 +00:00
|
|
|
rc = _ext4_show_options(seq, sb, 1);
|
|
|
|
seq_puts(seq, "\n");
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
|
2006-10-11 08:20:50 +00:00
|
|
|
int read_only)
|
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2018-05-14 03:02:19 +00:00
|
|
|
int err = 0;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
if (le32_to_cpu(es->s_rev_level) > EXT4_MAX_SUPP_REV) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "revision level too high, "
|
|
|
|
"forcing read-only mode");
|
2018-05-14 03:02:19 +00:00
|
|
|
err = -EROFS;
|
2020-06-01 07:34:04 +00:00
|
|
|
goto done;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
if (read_only)
|
2011-09-09 22:34:51 +00:00
|
|
|
goto done;
|
2006-10-11 08:20:53 +00:00
|
|
|
if (!(sbi->s_mount_state & EXT4_VALID_FS))
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING, "warning: mounting unchecked fs, "
|
|
|
|
"running e2fsck is recommended");
|
2014-05-12 16:55:07 +00:00
|
|
|
else if (sbi->s_mount_state & EXT4_ERROR_FS)
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"warning: mounting fs with errors, "
|
|
|
|
"running e2fsck is recommended");
|
2011-05-18 17:29:57 +00:00
|
|
|
else if ((__s16) le16_to_cpu(es->s_max_mnt_count) > 0 &&
|
2006-10-11 08:20:50 +00:00
|
|
|
le16_to_cpu(es->s_mnt_count) >=
|
|
|
|
(unsigned short) (__s16) le16_to_cpu(es->s_max_mnt_count))
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"warning: maximal mount count reached, "
|
|
|
|
"running e2fsck is recommended");
|
2006-10-11 08:20:50 +00:00
|
|
|
else if (le32_to_cpu(es->s_checkinterval) &&
|
2018-07-29 19:51:48 +00:00
|
|
|
(ext4_get_tstamp(es, s_lastcheck) +
|
|
|
|
le32_to_cpu(es->s_checkinterval) <= ktime_get_real_seconds()))
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"warning: checktime reached, "
|
|
|
|
"running e2fsck is recommended");
|
2009-06-03 21:59:28 +00:00
|
|
|
if (!sbi->s_journal)
|
2009-01-07 05:06:22 +00:00
|
|
|
es->s_state &= cpu_to_le16(~EXT4_VALID_FS);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!(__s16) le16_to_cpu(es->s_max_mnt_count))
|
2006-10-11 08:20:53 +00:00
|
|
|
es->s_max_mnt_count = cpu_to_le16(EXT4_DFL_MAX_MNT_COUNT);
|
2008-04-17 14:38:59 +00:00
|
|
|
le16_add_cpu(&es->s_mnt_count, 1);
|
2018-07-29 19:51:48 +00:00
|
|
|
ext4_update_tstamp(es, s_mtime);
|
2009-01-07 05:06:22 +00:00
|
|
|
if (sbi->s_journal)
|
2015-10-17 20:18:43 +00:00
|
|
|
ext4_set_feature_journal_needs_recovery(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2020-12-16 10:18:38 +00:00
|
|
|
err = ext4_commit_super(sb);
|
2011-09-09 22:34:51 +00:00
|
|
|
done:
|
2006-10-11 08:20:50 +00:00
|
|
|
if (test_opt(sb, DEBUG))
|
2009-01-06 03:18:16 +00:00
|
|
|
printk(KERN_INFO "[EXT4 FS bs=%lu, gc=%u, "
|
2010-12-16 01:30:48 +00:00
|
|
|
"bpg=%lu, ipg=%lu, mo=%04x, mo2=%04x]\n",
|
2006-10-11 08:20:50 +00:00
|
|
|
sb->s_blocksize,
|
|
|
|
sbi->s_groups_count,
|
2006-10-11 08:20:53 +00:00
|
|
|
EXT4_BLOCKS_PER_GROUP(sb),
|
|
|
|
EXT4_INODES_PER_GROUP(sb),
|
2010-12-16 01:30:48 +00:00
|
|
|
sbi->s_mount_opt, sbi->s_mount_opt2);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2011-05-26 16:02:03 +00:00
|
|
|
cleancache_init_fs(sb);
|
2018-05-14 03:02:19 +00:00
|
|
|
return err;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2012-09-05 05:29:50 +00:00
|
|
|
int ext4_alloc_flex_bg_array(struct super_block *sb, ext4_group_t ngroup)
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2020-02-19 03:08:51 +00:00
|
|
|
struct flex_groups **old_groups, **new_groups;
|
2020-02-28 09:22:56 +00:00
|
|
|
int size, i, j;
|
2012-09-05 05:29:50 +00:00
|
|
|
|
|
|
|
if (!sbi->s_log_groups_per_flex)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
size = ext4_flex_group(sbi, ngroup - 1) + 1;
|
|
|
|
if (size <= sbi->s_flex_groups_allocated)
|
|
|
|
return 0;
|
|
|
|
|
2020-02-19 03:08:51 +00:00
|
|
|
new_groups = kvzalloc(roundup_pow_of_two(size *
|
|
|
|
sizeof(*sbi->s_flex_groups)), GFP_KERNEL);
|
2012-09-05 05:29:50 +00:00
|
|
|
if (!new_groups) {
|
2020-02-19 03:08:51 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"not enough memory for %d flex group pointers", size);
|
2012-09-05 05:29:50 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2020-02-19 03:08:51 +00:00
|
|
|
for (i = sbi->s_flex_groups_allocated; i < size; i++) {
|
|
|
|
new_groups[i] = kvzalloc(roundup_pow_of_two(
|
|
|
|
sizeof(struct flex_groups)),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!new_groups[i]) {
|
2020-02-28 09:22:56 +00:00
|
|
|
for (j = sbi->s_flex_groups_allocated; j < i; j++)
|
|
|
|
kvfree(new_groups[j]);
|
2020-02-19 03:08:51 +00:00
|
|
|
kvfree(new_groups);
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"not enough memory for %d flex groups", size);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2012-09-05 05:29:50 +00:00
|
|
|
}
|
2020-02-19 03:08:51 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
old_groups = rcu_dereference(sbi->s_flex_groups);
|
|
|
|
if (old_groups)
|
|
|
|
memcpy(new_groups, old_groups,
|
|
|
|
(sbi->s_flex_groups_allocated *
|
|
|
|
sizeof(struct flex_groups *)));
|
|
|
|
rcu_read_unlock();
|
|
|
|
rcu_assign_pointer(sbi->s_flex_groups, new_groups);
|
|
|
|
sbi->s_flex_groups_allocated = size;
|
|
|
|
if (old_groups)
|
|
|
|
ext4_kvfree_array_rcu(old_groups);
|
2012-09-05 05:29:50 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-07-11 23:27:31 +00:00
|
|
|
static int ext4_fill_flex_info(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
struct ext4_group_desc *gdp = NULL;
|
2020-02-19 03:08:51 +00:00
|
|
|
struct flex_groups *fg;
|
2008-07-11 23:27:31 +00:00
|
|
|
ext4_group_t flex_group;
|
2012-09-05 05:29:50 +00:00
|
|
|
int i, err;
|
2008-07-11 23:27:31 +00:00
|
|
|
|
2009-11-23 12:24:46 +00:00
|
|
|
sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
|
ext4: fix undefined behavior in ext4_fill_flex_info()
Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by
zero when trying to mount a corrupted file system") fixes CVE-2009-4307
by performing a sanity check on s_log_groups_per_flex, since it can be
set to a bogus value by an attacker.
sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
groups_per_flex = 1 << sbi->s_log_groups_per_flex;
if (groups_per_flex < 2) { ... }
This patch fixes two potential issues in the previous commit.
1) The sanity check might only work on architectures like PowerPC.
On x86, 5 bits are used for the shifting amount. That means, given a
large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
is essentially 1 << 4 = 16, rather than 0. This will bypass the check,
leaving s_log_groups_per_flex and groups_per_flex inconsistent.
2) The sanity check relies on undefined behavior, i.e., oversized shift.
A standard-conforming C compiler could rewrite the check in unexpected
ways. Consider the following equivalent form, assuming groups_per_flex
is unsigned for simplicity.
groups_per_flex = 1 << sbi->s_log_groups_per_flex;
if (groups_per_flex == 0 || groups_per_flex == 1) {
We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will
completely optimize away the check groups_per_flex == 0, leaving the
patched code as vulnerable as the original. GCC keeps the check, but
there is no guarantee that future versions will do the same.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-01-10 16:51:10 +00:00
|
|
|
if (sbi->s_log_groups_per_flex < 1 || sbi->s_log_groups_per_flex > 31) {
|
2008-07-11 23:27:31 +00:00
|
|
|
sbi->s_log_groups_per_flex = 0;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-09-05 05:29:50 +00:00
|
|
|
err = ext4_alloc_flex_bg_array(sb, sbi->s_groups_count);
|
|
|
|
if (err)
|
2011-08-01 12:45:02 +00:00
|
|
|
goto failed;
|
2008-07-11 23:27:31 +00:00
|
|
|
|
|
|
|
for (i = 0; i < sbi->s_groups_count; i++) {
|
2009-05-25 15:50:39 +00:00
|
|
|
gdp = ext4_get_group_desc(sb, i, NULL);
|
2008-07-11 23:27:31 +00:00
|
|
|
|
|
|
|
flex_group = ext4_flex_group(sbi, i);
|
2020-02-19 03:08:51 +00:00
|
|
|
fg = sbi_array_rcu_deref(sbi, s_flex_groups, flex_group);
|
|
|
|
atomic_add(ext4_free_inodes_count(sb, gdp), &fg->free_inodes);
|
2013-03-12 03:39:59 +00:00
|
|
|
atomic64_add(ext4_free_group_clusters(sb, gdp),
|
2020-02-19 03:08:51 +00:00
|
|
|
&fg->free_clusters);
|
|
|
|
atomic_add(ext4_used_dirs_count(sb, gdp), &fg->used_dirs);
|
2008-07-11 23:27:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
failed:
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
static __le16 ext4_group_desc_csum(struct super_block *sb, __u32 block_group,
|
2012-04-29 22:45:10 +00:00
|
|
|
struct ext4_group_desc *gdp)
|
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time-consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-16 22:38:25 +00:00
|
|
|
{
|
2016-07-03 21:51:39 +00:00
|
|
|
int offset = offsetof(struct ext4_group_desc, bg_checksum);
|
2007-10-16 22:38:25 +00:00
|
|
|
__u16 crc = 0;
|
2012-04-29 22:45:10 +00:00
|
|
|
__le32 le_group = cpu_to_le32(block_group);
|
2015-10-17 20:18:43 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2007-10-16 22:38:25 +00:00
|
|
|
|
2014-10-13 07:36:16 +00:00
|
|
|
if (ext4_has_metadata_csum(sbi->s_sb)) {
|
2012-04-29 22:45:10 +00:00
|
|
|
/* Use new metadata_csum algorithm */
|
|
|
|
__u32 csum32;
|
2016-07-03 21:51:39 +00:00
|
|
|
__u16 dummy_csum = 0;
|
2012-04-29 22:45:10 +00:00
|
|
|
|
|
|
|
csum32 = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)&le_group,
|
|
|
|
sizeof(le_group));
|
2016-07-03 21:51:39 +00:00
|
|
|
csum32 = ext4_chksum(sbi, csum32, (__u8 *)gdp, offset);
|
|
|
|
csum32 = ext4_chksum(sbi, csum32, (__u8 *)&dummy_csum,
|
|
|
|
sizeof(dummy_csum));
|
|
|
|
offset += sizeof(dummy_csum);
|
|
|
|
if (offset < sbi->s_desc_size)
|
|
|
|
csum32 = ext4_chksum(sbi, csum32, (__u8 *)gdp + offset,
|
|
|
|
sbi->s_desc_size - offset);
|
2012-04-29 22:45:10 +00:00
|
|
|
|
|
|
|
crc = csum32 & 0xFFFF;
|
|
|
|
goto out;
|
2007-10-16 22:38:25 +00:00
|
|
|
}
|
|
|
|
|
2012-04-29 22:45:10 +00:00
|
|
|
/* old crc16 code */
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_gdt_csum(sb))
|
2014-10-14 06:35:49 +00:00
|
|
|
return 0;
|
|
|
|
|
2012-04-29 22:45:10 +00:00
|
|
|
crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
|
|
|
|
crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
|
|
|
|
crc = crc16(crc, (__u8 *)gdp, offset);
|
|
|
|
offset += sizeof(gdp->bg_checksum); /* skip checksum */
|
|
|
|
/* for checksum of struct ext4_group_desc do the rest...*/
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_64bit(sb) &&
|
2012-04-29 22:45:10 +00:00
|
|
|
offset < le16_to_cpu(sbi->s_es->s_desc_size))
|
|
|
|
crc = crc16(crc, (__u8 *)gdp + offset,
|
|
|
|
le16_to_cpu(sbi->s_es->s_desc_size) -
|
|
|
|
offset);
|
|
|
|
|
|
|
|
out:
|
2007-10-16 22:38:25 +00:00
|
|
|
return cpu_to_le16(crc);
|
|
|
|
}
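The metadata_csum branch of ext4_group_desc_csum() above can be tried out in userspace. The following is a minimal sketch under stated assumptions, not the kernel implementation: `crc32c_update` is a plain bitwise CRC32C standing in for `ext4_chksum()`, and `sketch_desc_csum32`, its `seed` argument, and its `csum_off` parameter are hypothetical names introduced only for illustration. It shows the two ideas visible in the branch: two zero bytes are fed in place of the stored checksum field, and only the low 16 bits of the 32-bit result are kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Raw update step: no initial or final inversion is applied here. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
	}
	return crc;
}

/* Hypothetical helper (not a kernel function): checksum order is
 * seed -> little-endian group number -> bytes before the checksum field
 * -> two zero bytes standing in for the field -> remaining bytes. */
static uint16_t sketch_desc_csum32(uint32_t seed, uint32_t group,
				   const uint8_t *desc, size_t desc_size,
				   size_t csum_off)
{
	/* Serialize the group number as little-endian bytes. */
	uint8_t le_group[4] = {
		(uint8_t)group, (uint8_t)(group >> 8),
		(uint8_t)(group >> 16), (uint8_t)(group >> 24),
	};
	uint16_t dummy_csum = 0;
	size_t offset = csum_off;
	uint32_t csum32;

	/* The checksum field must lie inside the descriptor. */
	assert(csum_off + sizeof(dummy_csum) <= desc_size);

	csum32 = crc32c_update(seed, le_group, sizeof(le_group));
	csum32 = crc32c_update(csum32, desc, offset);
	csum32 = crc32c_update(csum32, &dummy_csum, sizeof(dummy_csum));
	offset += sizeof(dummy_csum);
	if (offset < desc_size)
		csum32 = crc32c_update(csum32, desc + offset,
				       desc_size - offset);

	/* Only 16 bits fit in the on-disk field, so truncate. */
	return (uint16_t)(csum32 & 0xFFFF);
}
```

Because the zero bytes replace the stored field, the result does not depend on whatever checksum is currently written in the descriptor, so the same routine can serve both when setting and when verifying.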
|
|
|
|
|
2012-04-29 22:45:10 +00:00
|
|
|
int ext4_group_desc_csum_verify(struct super_block *sb, __u32 block_group,
|
2007-10-16 22:38:25 +00:00
|
|
|
struct ext4_group_desc *gdp)
|
|
|
|
{
|
2012-04-29 22:45:10 +00:00
|
|
|
if (ext4_has_group_desc_csum(sb) &&
|
2015-10-17 20:18:43 +00:00
|
|
|
(gdp->bg_checksum != ext4_group_desc_csum(sb, block_group, gdp)))
|
2007-10-16 22:38:25 +00:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-04-29 22:45:10 +00:00
|
|
|
void ext4_group_desc_csum_set(struct super_block *sb, __u32 block_group,
|
|
|
|
struct ext4_group_desc *gdp)
|
|
|
|
{
|
|
|
|
if (!ext4_has_group_desc_csum(sb))
|
|
|
|
return;
|
2015-10-17 20:18:43 +00:00
|
|
|
gdp->bg_checksum = ext4_group_desc_csum(sb, block_group, gdp);
|
2012-04-29 22:45:10 +00:00
|
|
|
}
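The set/verify pair above, together with the legacy crc16 ordering (UUID, then group number, then the descriptor with the stored checksum skipped), can be illustrated with a small userspace round trip. This is a sketch under assumptions rather than the kernel code: `struct toy_desc`, `toy_desc_csum`, `toy_desc_csum_set`, and `toy_desc_csum_verify` are hypothetical names, the CRC-16 is a bitwise version with reflected polynomial 0xA001, and the bytes after the checksum field are always included, whereas the kernel only does so when the 64bit feature is present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, reflected polynomial 0xA001 (assumed stand-in for the
 * kernel's crc16() helper). */
static uint16_t crc16_update(uint16_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
					: (uint16_t)(crc >> 1);
	}
	return crc;
}

/* Hypothetical toy descriptor: the checksum sits in the middle. */
struct toy_desc {
	uint8_t head[30];
	uint8_t checksum[2];	/* stored little-endian */
	uint8_t tail[32];
};

/* Checksum everything except the stored checksum field itself. */
static uint16_t toy_desc_csum(const uint8_t uuid[16], uint32_t group,
			      const struct toy_desc *d)
{
	const uint8_t *raw = (const uint8_t *)d;
	size_t offset = offsetof(struct toy_desc, checksum);
	uint8_t le_group[4] = {
		(uint8_t)group, (uint8_t)(group >> 8),
		(uint8_t)(group >> 16), (uint8_t)(group >> 24),
	};
	uint16_t crc;

	crc = crc16_update(0xFFFF, uuid, 16);
	crc = crc16_update(crc, le_group, sizeof(le_group));
	crc = crc16_update(crc, raw, offset);
	offset += sizeof(d->checksum);	/* skip the stored checksum */
	crc = crc16_update(crc, raw + offset, sizeof(*d) - offset);
	return crc;
}

/* Store the freshly computed checksum in the descriptor. */
static void toy_desc_csum_set(const uint8_t uuid[16], uint32_t group,
			      struct toy_desc *d)
{
	uint16_t crc = toy_desc_csum(uuid, group, d);

	d->checksum[0] = (uint8_t)(crc & 0xFF);
	d->checksum[1] = (uint8_t)(crc >> 8);
}

/* Returns 1 when the stored checksum matches, 0 otherwise. */
static int toy_desc_csum_verify(const uint8_t uuid[16], uint32_t group,
				const struct toy_desc *d)
{
	uint16_t stored = (uint16_t)(d->checksum[0] | (d->checksum[1] << 8));

	return stored == toy_desc_csum(uuid, group, d);
}
```

Mixing the UUID and the group number into the checksum means a descriptor copied from another filesystem, or from another group of the same filesystem, fails verification even if its own bytes are intact.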
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/* Called at mount-time, super-block is locked */
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they simply do not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read alloc_sem
in ext4_claim_inode, so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 01:30:05 +00:00
|
|
|
static int ext4_check_descriptors(struct super_block *sb,
|
2016-08-01 04:51:02 +00:00
|
|
|
ext4_fsblk_t sb_block,
|
2010-10-28 01:30:05 +00:00
|
|
|
ext4_group_t *first_not_zeroed)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
ext4_fsblk_t first_block = le32_to_cpu(sbi->s_es->s_first_data_block);
|
|
|
|
ext4_fsblk_t last_block;
|
2018-07-08 23:35:02 +00:00
|
|
|
ext4_fsblk_t last_bg_block = sb_block + ext4_bg_num_gdb(sb, 0);
|
2006-10-11 08:21:10 +00:00
|
|
|
ext4_fsblk_t block_bitmap;
|
|
|
|
ext4_fsblk_t inode_bitmap;
|
|
|
|
ext4_fsblk_t inode_table;
|
2007-10-16 22:38:25 +00:00
|
|
|
int flexbg_flag = 0;
|
2010-10-28 01:30:05 +00:00
|
|
|
ext4_group_t i, grp = sbi->s_groups_count;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_flex_bg(sb))
|
2007-10-16 22:38:25 +00:00
|
|
|
flexbg_flag = 1;
|
|
|
|
|
2008-09-09 02:25:24 +00:00
|
|
|
ext4_debug("Checking group descriptors");
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2008-02-06 09:40:16 +00:00
|
|
|
for (i = 0; i < sbi->s_groups_count; i++) {
|
|
|
|
struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
|
|
|
|
|
2007-10-16 22:38:25 +00:00
|
|
|
if (i == sbi->s_groups_count - 1 || flexbg_flag)
|
2006-10-11 08:21:10 +00:00
|
|
|
last_block = ext4_blocks_count(sbi->s_es) - 1;
|
2006-10-11 08:20:50 +00:00
|
|
|
else
|
|
|
|
last_block = first_block +
|
2006-10-11 08:20:53 +00:00
|
|
|
(EXT4_BLOCKS_PER_GROUP(sb) - 1);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by the wait multiplier, which can be
set with the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is
no longer necessary (the request list is empty) it frees the appropriate
structures and exits (and can be created again later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For
that we take the alloc_sem lock for writing in ext4_init_inode_table()
and for reading in ext4_claim_inode(), so when we are unlucky and the
allocator hits the group which is currently being zeroed, it just has
to wait.
Lazy initialization can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
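The throttling rule described in the commit message above (next schedule time = time taken for the last inode table, multiplied by the wait multiplier) can be sketched as follows. This is a minimal illustration, not the kernel's code: the function names, the tick units and the overflow guard are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Default wait multiplier quoted in the commit message (init_itable=n). */
#define SKETCH_DEFAULT_WAIT_MULT 10u

/*
 * Hypothetical helper: how long to wait before zeroing the next inode
 * table, given how long the previous one took (in arbitrary ticks).
 */
static uint64_t sketch_next_delay(uint64_t elapsed_ticks, unsigned int wait_mult)
{
	/* Guard against overflow of the multiplication. */
	assert(wait_mult == 0 || elapsed_ticks <= UINT64_MAX / wait_mult);
	return elapsed_ticks * wait_mult;
}

/* Hypothetical helper: absolute time at which the request runs again. */
static uint64_t sketch_next_schedule(uint64_t now, uint64_t elapsed_ticks,
				     unsigned int wait_mult)
{
	return now + sketch_next_delay(elapsed_ticks, wait_mult);
}
```

With the default multiplier, a table that took 5 ticks to zero leaves the thread idle for 50 ticks, so zeroing uses roughly one eleventh of the elapsed time.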
2010-10-28 01:30:05 +00:00
|
|
|
if ((grp == sbi->s_groups_count) &&
|
|
|
|
!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
|
|
|
|
grp = i;
|
|
|
|
|
2006-10-11 08:21:15 +00:00
|
|
|
block_bitmap = ext4_block_bitmap(sb, gdp);
|
2016-08-01 04:51:02 +00:00
|
|
|
if (block_bitmap == sb_block) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Block bitmap for group %u overlaps "
|
|
|
|
"superblock", i);
|
2018-03-30 02:10:35 +00:00
|
|
|
if (!sb_rdonly(sb))
|
|
|
|
return 0;
|
2016-08-01 04:51:02 +00:00
|
|
|
}
|
2018-06-14 03:08:26 +00:00
|
|
|
if (block_bitmap >= sb_block + 1 &&
|
|
|
|
block_bitmap <= last_bg_block) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Block bitmap for group %u overlaps "
|
|
|
|
"block group descriptors", i);
|
|
|
|
if (!sb_rdonly(sb))
|
|
|
|
return 0;
|
|
|
|
}
|
2008-07-26 20:15:44 +00:00
|
|
|
if (block_bitmap < first_block || block_bitmap > last_block) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
2009-01-06 03:18:16 +00:00
|
|
|
"Block bitmap for group %u not in group "
|
2009-06-04 21:36:36 +00:00
|
|
|
"(block %llu)!", i, block_bitmap);
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2006-10-11 08:21:15 +00:00
|
|
|
inode_bitmap = ext4_inode_bitmap(sb, gdp);
|
2016-08-01 04:51:02 +00:00
|
|
|
if (inode_bitmap == sb_block) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Inode bitmap for group %u overlaps "
|
|
|
|
"superblock", i);
|
2018-03-30 02:10:35 +00:00
|
|
|
if (!sb_rdonly(sb))
|
|
|
|
return 0;
|
2016-08-01 04:51:02 +00:00
|
|
|
}
|
2018-06-14 03:08:26 +00:00
|
|
|
if (inode_bitmap >= sb_block + 1 &&
|
|
|
|
inode_bitmap <= last_bg_block) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Inode bitmap for group %u overlaps "
|
|
|
|
"block group descriptors", i);
|
|
|
|
if (!sb_rdonly(sb))
|
|
|
|
return 0;
|
|
|
|
}
|
2008-07-26 20:15:44 +00:00
|
|
|
if (inode_bitmap < first_block || inode_bitmap > last_block) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
2009-01-06 03:18:16 +00:00
|
|
|
"Inode bitmap for group %u not in group "
|
2009-06-04 21:36:36 +00:00
|
|
|
"(block %llu)!", i, inode_bitmap);
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2006-10-11 08:21:15 +00:00
|
|
|
inode_table = ext4_inode_table(sb, gdp);
|
2016-08-01 04:51:02 +00:00
|
|
|
if (inode_table == sb_block) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Inode table for group %u overlaps "
|
|
|
|
"superblock", i);
|
2018-03-30 02:10:35 +00:00
|
|
|
if (!sb_rdonly(sb))
|
|
|
|
return 0;
|
2016-08-01 04:51:02 +00:00
|
|
|
}
|
2018-06-14 03:08:26 +00:00
|
|
|
if (inode_table >= sb_block + 1 &&
|
|
|
|
inode_table <= last_bg_block) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Inode table for group %u overlaps "
|
|
|
|
"block group descriptors", i);
|
|
|
|
if (!sb_rdonly(sb))
|
|
|
|
return 0;
|
|
|
|
}
|
2006-10-11 08:21:10 +00:00
|
|
|
if (inode_table < first_block ||
|
2008-07-26 20:15:44 +00:00
|
|
|
inode_table + sbi->s_itb_per_group - 1 > last_block) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
2009-01-06 03:18:16 +00:00
|
|
|
"Inode table for group %u not in group "
|
2009-06-04 21:36:36 +00:00
|
|
|
"(block %llu)!", i, inode_table);
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2009-05-03 00:35:09 +00:00
|
|
|
ext4_lock_group(sb, i);
|
2012-04-29 22:45:10 +00:00
|
|
|
if (!ext4_group_desc_csum_verify(sb, i, gdp)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
|
|
|
|
"Checksum for group %u failed (%u!=%u)",
|
2015-10-17 20:18:43 +00:00
|
|
|
i, le16_to_cpu(ext4_group_desc_csum(sb, i,
|
2009-06-04 21:36:36 +00:00
|
|
|
gdp)), le16_to_cpu(gdp->bg_checksum));
|
2017-07-17 07:45:34 +00:00
|
|
|
if (!sb_rdonly(sb)) {
|
2009-05-03 00:35:09 +00:00
|
|
|
ext4_unlock_group(sb, i);
|
2008-07-26 18:34:21 +00:00
|
|
|
return 0;
|
2008-09-08 14:47:19 +00:00
|
|
|
}
|
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests measuring e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor, if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
The inode table is not initialized/used in this group, so we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used, so we can skip the block bitmap
verification for this group.
We also add two new fields to the group descriptor as part of the
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
then bg_itable_unused gives the offset within
the inode table up to which inodes are used. This can be
used by fsck to skip the inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group as used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
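The bg_checksum field above is described as crc16(sb_uuid+group+desc). A rough sketch of that chaining is shown below. It assumes the CRC-16 variant with the reflected polynomial 0xA001 and an all-ones seed; the helper names and the csum_offset parameter (the number of descriptor bytes that precede the checksum field) are hypothetical, so treat this as an illustration rather than the exact on-disk algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 (reflected polynomial 0xA001), continuing from 'crc'. */
static uint16_t sketch_crc16(uint16_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
					: (uint16_t)(crc >> 1);
	}
	return crc;
}

/*
 * Hypothetical sketch of crc16(sb_uuid + group + desc): the running CRC is
 * fed the filesystem UUID, the little-endian group number, and then the
 * descriptor bytes that come before the checksum field itself.
 */
static uint16_t sketch_group_desc_csum(const uint8_t uuid[16], uint32_t group,
				       const uint8_t *desc, size_t csum_offset)
{
	uint8_t le_group[4] = {
		(uint8_t)(group & 0xff), (uint8_t)((group >> 8) & 0xff),
		(uint8_t)((group >> 16) & 0xff), (uint8_t)((group >> 24) & 0xff),
	};
	uint16_t crc;

	assert(uuid != NULL && desc != NULL);
	crc = sketch_crc16(0xFFFF, uuid, 16);	/* assumed all-ones seed */
	crc = sketch_crc16(crc, le_group, sizeof(le_group));
	return sketch_crc16(crc, desc, csum_offset);
}
```

Because the group number and UUID are mixed in, a descriptor copied to the wrong group or from another filesystem fails verification even if its own bytes are intact.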
2007-10-16 22:38:25 +00:00
|
|
|
}
|
2009-05-03 00:35:09 +00:00
|
|
|
ext4_unlock_group(sb, i);
|
2007-10-16 22:38:25 +00:00
|
|
|
if (!flexbg_flag)
|
|
|
|
first_block += EXT4_BLOCKS_PER_GROUP(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
if (NULL != first_not_zeroed)
|
|
|
|
*first_not_zeroed = grp;
|
2006-10-11 08:20:50 +00:00
|
|
|
return 1;
|
|
|
|
}
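The three placement checks that ext4_check_descriptors() applies to each block bitmap, inode bitmap and inode table can be condensed into one predicate. The sketch below uses hypothetical names and assumes a read-write mount; in the code above, the two overlap cases only fail the check when the filesystem is not read-only, and that distinction is deliberately left out here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical predicate: returns 1 if a metadata area of 'nblocks' blocks
 * starting at 'blk' is acceptable, 0 otherwise.
 *   sb_block      - block holding the superblock
 *   last_bg_block - last block of the group descriptor table
 *   first_block   - first block of the group being checked
 *   last_block    - last block of the group being checked
 */
static int sketch_meta_block_ok(uint64_t blk, uint64_t nblocks,
				uint64_t sb_block, uint64_t last_bg_block,
				uint64_t first_block, uint64_t last_block)
{
	assert(nblocks >= 1 && first_block <= last_block);

	if (blk == sb_block)
		return 0;	/* overlaps the superblock */
	if (blk >= sb_block + 1 && blk <= last_bg_block)
		return 0;	/* overlaps the block group descriptors */
	if (blk < first_block || blk + nblocks - 1 > last_block)
		return 0;	/* not inside the group */
	return 1;
}
```

For the bitmaps nblocks is 1; for the inode table it corresponds to s_itb_per_group, which is why the last check in the original adds `sbi->s_itb_per_group - 1` before comparing against last_block.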
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
/* ext4_orphan_cleanup() walks a singly-linked list of inodes (starting at
|
2006-10-11 08:20:50 +00:00
|
|
|
* the superblock) which were deleted from all directories, but held open by
|
|
|
|
* a process at the time of a crash. We walk the list and try to delete these
|
|
|
|
* inodes at recovery time (only with a read-write filesystem).
|
|
|
|
*
|
|
|
|
* In order to keep the orphan inode chain consistent during traversal (in
|
|
|
|
* case of crash during recovery), we link each inode into the superblock
|
|
|
|
* orphan list_head and handle it the same way as an inode deletion during
|
|
|
|
* normal operation (which journals the operations for us).
|
|
|
|
*
|
|
|
|
* We only do an iget() and an iput() on each inode, which is very safe if we
|
|
|
|
* accidentally point at an in-use or already deleted inode. The worst that
|
|
|
|
* can happen in this case is that we get a "bit already cleared" message from
|
2006-10-11 08:20:53 +00:00
|
|
|
* ext4_free_inode(). The only reason we would point at a wrong inode is if
|
2006-10-11 08:20:50 +00:00
|
|
|
* e2fsck was run on this filesystem, and it must have already done the orphan
|
|
|
|
* inode cleanup for us, so we can safely abort without any further action.
|
|
|
|
*/
|
2008-07-26 20:15:44 +00:00
|
|
|
static void ext4_orphan_cleanup(struct super_block *sb,
|
|
|
|
struct ext4_super_block *es)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
unsigned int s_flags = sb->s_flags;
|
2016-11-14 03:02:26 +00:00
|
|
|
int ret, nr_orphans = 0, nr_truncates = 0;
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2017-08-24 19:21:50 +00:00
|
|
|
int quota_update = 0;
|
2006-10-11 08:20:50 +00:00
|
|
|
int i;
|
|
|
|
#endif
|
|
|
|
if (!es->s_last_orphan) {
|
|
|
|
jbd_debug(4, "no orphan inodes to clean up\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-12-07 04:40:13 +00:00
|
|
|
if (bdev_read_only(sb->s_bdev)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "write access "
|
|
|
|
"unavailable, skipping orphan cleanup");
|
2006-12-07 04:40:13 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2011-02-28 05:53:45 +00:00
|
|
|
/* Check if feature set would not allow a r/w mount */
|
|
|
|
if (!ext4_feature_set_ok(sb, 0)) {
|
|
|
|
ext4_msg(sb, KERN_INFO, "Skipping orphan cleanup due to "
|
|
|
|
"unknown ROCOMPAT features");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
if (EXT4_SB(sb)->s_mount_state & EXT4_ERROR_FS) {
|
2012-09-27 03:30:12 +00:00
|
|
|
/* don't clear list on RO mount w/ errors */
|
2017-11-27 21:05:09 +00:00
|
|
|
if (es->s_last_orphan && !(s_flags & SB_RDONLY)) {
|
2014-09-16 18:52:03 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "Errors on filesystem, "
|
2006-10-11 08:20:50 +00:00
|
|
|
"clearing orphan list.\n");
|
2012-09-27 03:30:12 +00:00
|
|
|
es->s_last_orphan = 0;
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
jbd_debug(1, "Skipping orphan recovery on fs with errors.\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-11-27 21:05:09 +00:00
|
|
|
if (s_flags & SB_RDONLY) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "orphan cleanup on readonly fs");
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags &= ~SB_RDONLY;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
#ifdef CONFIG_QUOTA
|
2017-08-24 19:21:50 +00:00
|
|
|
/*
|
|
|
|
* Turn on quotas which were not enabled for read-only mounts if
|
|
|
|
* filesystem has quota feature, so that they are updated correctly.
|
|
|
|
*/
|
2017-11-27 21:05:09 +00:00
|
|
|
if (ext4_has_feature_quota(sb) && (s_flags & SB_RDONLY)) {
|
2017-08-24 19:21:50 +00:00
|
|
|
int ret = ext4_enable_quotas(sb);
|
|
|
|
|
|
|
|
if (!ret)
|
|
|
|
quota_update = 1;
|
|
|
|
else
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Cannot turn on quotas: error %d", ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Turn on journaled quotas used for old style */
|
2014-09-11 15:15:15 +00:00
|
|
|
for (i = 0; i < EXT4_MAXQUOTAS; i++) {
|
2006-10-11 08:20:53 +00:00
|
|
|
if (EXT4_SB(sb)->s_qf_names[i]) {
|
|
|
|
int ret = ext4_quota_on_mount(sb, i);
|
2017-08-24 19:21:50 +00:00
|
|
|
|
|
|
|
if (!ret)
|
|
|
|
quota_update = 1;
|
|
|
|
else
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Cannot turn on journaled "
|
2017-08-24 19:21:50 +00:00
|
|
|
"quota: type %d: error %d", i, ret);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
while (es->s_last_orphan) {
|
|
|
|
struct inode *inode;
|
|
|
|
|
2016-07-15 03:21:35 +00:00
|
|
|
/*
|
|
|
|
* We may have encountered an error during cleanup; if
|
|
|
|
* so, skip the rest.
|
|
|
|
*/
|
|
|
|
if (EXT4_SB(sb)->s_mount_state & EXT4_ERROR_FS) {
|
|
|
|
jbd_debug(1, "Skipping orphan recovery on fs with errors.\n");
|
|
|
|
es->s_last_orphan = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2008-04-30 02:04:56 +00:00
|
|
|
inode = ext4_orphan_get(sb, le32_to_cpu(es->s_last_orphan));
|
|
|
|
if (IS_ERR(inode)) {
|
2006-10-11 08:20:50 +00:00
|
|
|
es->s_last_orphan = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
|
2010-03-03 14:05:07 +00:00
|
|
|
dquot_initialize(inode);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (inode->i_nlink) {
|
2013-05-28 11:51:21 +00:00
|
|
|
if (test_opt(sb, DEBUG))
|
|
|
|
ext4_msg(sb, KERN_DEBUG,
|
|
|
|
"%s: truncating inode %lu to %lld bytes",
|
|
|
|
__func__, inode->i_ino, inode->i_size);
|
2008-09-09 02:25:04 +00:00
|
|
|
jbd_debug(2, "truncating inode %lu to %lld bytes\n",
|
2006-10-11 08:20:50 +00:00
|
|
|
inode->i_ino, inode->i_size);
|
2016-01-22 20:40:57 +00:00
|
|
|
inode_lock(inode);
|
2013-05-28 03:32:35 +00:00
|
|
|
truncate_inode_pages(inode->i_mapping, inode->i_size);
|
2016-11-14 03:02:26 +00:00
|
|
|
ret = ext4_truncate(inode);
|
2021-05-07 07:19:04 +00:00
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* We need to clean up the in-core orphan list
|
|
|
|
* manually if ext4_truncate() failed to get a
|
|
|
|
* transaction handle.
|
|
|
|
*/
|
|
|
|
ext4_orphan_del(NULL, inode);
|
2016-11-14 03:02:26 +00:00
|
|
|
ext4_std_error(inode->i_sb, ret);
|
2021-05-07 07:19:04 +00:00
|
|
|
}
|
2016-01-22 20:40:57 +00:00
|
|
|
inode_unlock(inode);
|
2006-10-11 08:20:50 +00:00
|
|
|
nr_truncates++;
|
|
|
|
} else {
|
2013-05-28 11:51:21 +00:00
|
|
|
if (test_opt(sb, DEBUG))
|
|
|
|
ext4_msg(sb, KERN_DEBUG,
|
|
|
|
"%s: deleting unreferenced inode %lu",
|
|
|
|
__func__, inode->i_ino);
|
2006-10-11 08:20:50 +00:00
|
|
|
jbd_debug(2, "deleting unreferenced inode %lu\n",
|
|
|
|
inode->i_ino);
|
|
|
|
nr_orphans++;
|
|
|
|
}
|
|
|
|
iput(inode); /* The delete magic happens here! */
|
|
|
|
}
|
|
|
|
|
2008-07-26 20:15:44 +00:00
|
|
|
#define PLURAL(x) (x), ((x) == 1) ? "" : "s"
|
2006-10-11 08:20:50 +00:00
|
|
|
|
|
|
|
if (nr_orphans)
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "%d orphan inode%s deleted",
|
|
|
|
PLURAL(nr_orphans));
|
2006-10-11 08:20:50 +00:00
|
|
|
if (nr_truncates)
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "%d truncate%s cleaned up",
|
|
|
|
PLURAL(nr_truncates));
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2017-08-24 19:21:50 +00:00
|
|
|
/* Turn off quotas if they were enabled for orphan cleanup */
|
|
|
|
if (quota_update) {
|
|
|
|
for (i = 0; i < EXT4_MAXQUOTAS; i++) {
|
|
|
|
if (sb_dqopt(sb)->files[i])
|
|
|
|
dquot_quota_off(sb, i);
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
#endif
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags = s_flags; /* Restore SB_RDONLY status */
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2008-01-29 04:58:27 +00:00
|
|
|
/*
|
|
|
|
* Maximal extent format file size.
|
|
|
|
* Resulting logical blkno at s_maxbytes must fit in our on-disk
|
|
|
|
* extent format containers, within a sector_t, and within i_blocks
|
|
|
|
* in the vfs. ext4 inode has 48 bits of i_block in fsblock units,
|
|
|
|
* so that won't be a limiting factor.
|
|
|
|
*
|
2011-06-06 04:05:17 +00:00
|
|
|
* However, there is another limiting factor. We do store extents in the form
|
|
|
|
* of starting block and length, hence the resulting length of the extent
|
|
|
|
* covering maximum file size must fit into on-disk format containers as
|
|
|
|
* well. Given that the length is always 1 unit bigger than the max unit (because
|
|
|
|
* we count 0 as well) we have to lower the s_maxbytes by one fs block.
|
|
|
|
*
|
2008-01-29 04:58:27 +00:00
|
|
|
* Note, this does *not* consider any metadata overhead for vfs i_blocks.
|
|
|
|
*/
|
2008-10-17 02:50:48 +00:00
|
|
|
static loff_t ext4_max_size(int blkbits, int has_huge_files)
|
2008-01-29 04:58:27 +00:00
|
|
|
{
|
|
|
|
loff_t res;
|
|
|
|
loff_t upper_limit = MAX_LFS_FILESIZE;
|
|
|
|
|
2019-04-05 16:08:59 +00:00
|
|
|
BUILD_BUG_ON(sizeof(blkcnt_t) < sizeof(u64));
|
|
|
|
|
|
|
|
if (!has_huge_files) {
|
2008-01-29 04:58:27 +00:00
|
|
|
upper_limit = (1LL << 32) - 1;
|
|
|
|
|
|
|
|
/* total blocks in file system block size */
|
|
|
|
upper_limit >>= (blkbits - 9);
|
|
|
|
upper_limit <<= blkbits;
|
|
|
|
}
|
|
|
|
|
2011-06-06 04:05:17 +00:00
|
|
|
/*
|
|
|
|
* 32-bit extent-start container, ee_block. We lower the maxbytes
|
|
|
|
* by one fs block, so ee_len can cover the extent of maximum file
|
|
|
|
* size
|
|
|
|
*/
|
|
|
|
res = (1LL << 32) - 1;
|
2008-01-29 04:58:27 +00:00
|
|
|
res <<= blkbits;
|
|
|
|
|
|
|
|
/* Sanity check against vm- & vfs- imposed limits */
|
|
|
|
if (res > upper_limit)
|
|
|
|
res = upper_limit;
|
|
|
|
|
|
|
|
return res;
|
|
|
|
}
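As a worked example of the arithmetic in ext4_max_size(), the following standalone sketch repeats the same steps outside the kernel. The function name is hypothetical, and INT64_MAX stands in for MAX_LFS_FILESIZE, which assumes a 64-bit loff_t.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: MAX_LFS_FILESIZE on a platform with a 64-bit loff_t. */
#define SKETCH_MAX_LFS_FILESIZE INT64_MAX

/* Userspace restatement of the extent-format size limit. */
static int64_t sketch_ext4_max_size(int blkbits, int has_huge_files)
{
	int64_t res;
	int64_t upper_limit = SKETCH_MAX_LFS_FILESIZE;

	assert(blkbits >= 10 && blkbits <= 16);

	if (!has_huge_files) {
		/* i_blocks holds at most 2^32 - 1 sectors of 512 bytes. */
		upper_limit = (1LL << 32) - 1;
		upper_limit >>= (blkbits - 9);	/* sectors -> fs blocks */
		upper_limit <<= blkbits;	/* fs blocks -> bytes */
	}

	/* 32-bit logical block number, lowered by one fs block. */
	res = (1LL << 32) - 1;
	res <<= blkbits;

	if (res > upper_limit)
		res = upper_limit;
	return res;
}
```

With 4 KiB blocks (blkbits = 12) this gives 2^44 - 4096 bytes (one block short of 16 TiB) when huge files are enabled, and 2^41 - 4096 bytes (one block short of 2 TiB) when they are not.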
|
2006-10-11 08:20:50 +00:00
|
|
|
|
|
|
|
/*
|
2008-01-29 04:58:27 +00:00
|
|
|
* Maximal bitmap file size. There is a direct, and {,double-,triple-}indirect
|
2008-01-29 04:58:26 +00:00
|
|
|
* block limit, and also a limit of (2^48 - 1) 512-byte sectors in i_blocks.
|
|
|
|
* We need to be 1 filesystem block less than the 2^48 sector limit.
|
2006-10-11 08:20:50 +00:00
|
|
|
*/
|
2008-10-17 02:50:48 +00:00
|
|
|
static loff_t ext4_max_bitmap_size(int bits, int has_huge_files)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
loff_t res = EXT4_NDIR_BLOCKS;
|
2008-01-29 04:58:26 +00:00
|
|
|
int meta_blocks;
|
|
|
|
loff_t upper_limit;
|
2009-06-03 21:59:28 +00:00
|
|
|
/* This is calculated to be the largest file size for a dense, block
|
|
|
|
* mapped file such that the file's total number of 512-byte sectors,
|
|
|
|
* including data and all indirect blocks, does not exceed (2^48 - 1).
|
|
|
|
*
|
|
|
|
* __u32 i_blocks_lo and __u16 i_blocks_high represent the total
|
|
|
|
* number of 512-byte sectors of the file.
|
2008-01-29 04:58:26 +00:00
|
|
|
*/
|
|
|
|
|
2019-04-05 16:08:59 +00:00
|
|
|
if (!has_huge_files) {
|
2008-01-29 04:58:26 +00:00
|
|
|
/*
|
2019-04-05 16:08:59 +00:00
|
|
|
* !has_huge_files implies that the inode i_block field
|
|
|
|
* represents total file blocks in 2^32 512-byte sectors ==
|
|
|
|
* size of vfs inode i_blocks * 8
|
2008-01-29 04:58:26 +00:00
|
|
|
*/
|
|
|
|
upper_limit = (1LL << 32) - 1;
|
|
|
|
|
|
|
|
/* total blocks in file system block size */
|
|
|
|
upper_limit >>= (bits - 9);
|
|
|
|
|
|
|
|
} else {
|
2008-01-29 04:58:27 +00:00
|
|
|
/*
|
|
|
|
* We use 48 bit ext4_inode i_blocks
|
|
|
|
* With EXT4_HUGE_FILE_FL set the i_blocks
|
|
|
|
* represent total number of blocks in
|
|
|
|
* file system block size
|
|
|
|
*/
|
2008-01-29 04:58:26 +00:00
|
|
|
upper_limit = (1LL << 48) - 1;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
/* indirect blocks */
|
|
|
|
meta_blocks = 1;
|
|
|
|
/* double indirect blocks */
|
|
|
|
meta_blocks += 1 + (1LL << (bits-2));
|
|
|
|
/* triple indirect blocks */
|
|
|
|
meta_blocks += 1 + (1LL << (bits-2)) + (1LL << (2*(bits-2)));
|
|
|
|
|
|
|
|
upper_limit -= meta_blocks;
|
|
|
|
upper_limit <<= bits;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
|
|
|
res += 1LL << (bits-2);
|
|
|
|
res += 1LL << (2*(bits-2));
|
|
|
|
res += 1LL << (3*(bits-2));
|
|
|
|
res <<= bits;
|
|
|
|
if (res > upper_limit)
|
|
|
|
res = upper_limit;
|
2008-01-29 04:58:26 +00:00
|
|
|
|
|
|
|
if (res > MAX_LFS_FILESIZE)
|
|
|
|
res = MAX_LFS_FILESIZE;
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
return res;
|
|
|
|
}
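The limit for block-mapped (non-extent) files combines two bounds: how many blocks the direct and indirect pointers can address, and how many blocks i_blocks can count once the indirect blocks themselves are subtracted. A standalone sketch of the same arithmetic follows. The names are hypothetical, the constant 12 is assumed for EXT4_NDIR_BLOCKS, INT64_MAX stands in for MAX_LFS_FILESIZE, and meta_blocks is widened to 64 bits for simplicity.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NDIR_BLOCKS 12		/* assumed value of EXT4_NDIR_BLOCKS */
#define SKETCH_MAX_LFS_FILESIZE INT64_MAX

static int64_t sketch_max_bitmap_size(int bits, int has_huge_files)
{
	int64_t res = SKETCH_NDIR_BLOCKS;
	int64_t meta_blocks;
	int64_t upper_limit;

	assert(bits >= 10 && bits <= 16);

	if (!has_huge_files) {
		upper_limit = (1LL << 32) - 1;	/* 512-byte sectors */
		upper_limit >>= (bits - 9);	/* -> fs blocks */
	} else {
		upper_limit = (1LL << 48) - 1;	/* already in fs blocks */
	}

	/* Blocks consumed by the indirect tree itself. */
	meta_blocks = 1;						/* indirect */
	meta_blocks += 1 + (1LL << (bits - 2));				/* double */
	meta_blocks += 1 + (1LL << (bits - 2)) + (1LL << (2 * (bits - 2)));	/* triple */

	upper_limit -= meta_blocks;
	upper_limit <<= bits;

	/* Data blocks addressable: direct + single + double + triple. */
	res += 1LL << (bits - 2);
	res += 1LL << (2 * (bits - 2));
	res += 1LL << (3 * (bits - 2));
	res <<= bits;

	if (res > upper_limit)
		res = upper_limit;
	if (res > SKETCH_MAX_LFS_FILESIZE)
		res = SKETCH_MAX_LFS_FILESIZE;
	return res;
}
```

For 4 KiB blocks each indirect block holds 2^10 pointers, so the addressing bound is (12 + 2^10 + 2^20 + 2^30) blocks, a little over 4 TiB. Without huge files the i_blocks bound is the tighter one, at a little under 2 TiB.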
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static ext4_fsblk_t descriptor_loc(struct super_block *sb,
|
2009-06-03 21:59:28 +00:00
|
|
|
ext4_fsblk_t logical_sb_block, int nr)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2008-01-29 04:58:27 +00:00
|
|
|
ext4_group_t bg, first_meta_bg;
|
2006-10-11 08:20:50 +00:00
|
|
|
int has_super = 0;
|
|
|
|
|
|
|
|
first_meta_bg = le32_to_cpu(sbi->s_es->s_first_meta_bg);
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_meta_bg(sb) || nr < first_meta_bg)
|
2006-10-11 08:21:20 +00:00
|
|
|
return logical_sb_block + nr + 1;
|
2006-10-11 08:20:50 +00:00
|
|
|
bg = sbi->s_desc_per_block * nr;
|
2006-10-11 08:20:53 +00:00
|
|
|
if (ext4_bg_has_super(sb, bg))
|
2006-10-11 08:20:50 +00:00
|
|
|
has_super = 1;
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2014-05-12 14:06:27 +00:00
|
|
|
/*
|
|
|
|
* If we have a meta_bg fs with 1k blocks, group 0's GDT is at
|
|
|
|
* block 2, not 1. If s_first_data_block == 0 (bigalloc is enabled
|
|
|
|
* on modern mke2fs or blksize > 1k on older mke2fs) then we must
|
|
|
|
* compensate.
|
|
|
|
*/
|
|
|
|
if (sb->s_blocksize == 1024 && nr == 0 &&
|
2018-01-11 18:17:49 +00:00
|
|
|
le32_to_cpu(sbi->s_es->s_first_data_block) == 0)
|
2014-05-12 14:06:27 +00:00
|
|
|
has_super++;
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
return (has_super + ext4_group_first_block_no(sb, bg));
|
2006-10-11 08:20:50 +00:00
|
|
|
}
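The placement rule in descriptor_loc() can be tried out in isolation with a small model of the filesystem geometry. Everything in this sketch is an assumption made for illustration: the struct and its field names are hypothetical, and the backup-superblock test uses the classic sparse_super rule (groups 0 and 1 and powers of 3, 5 and 7), whereas the real ext4_bg_has_super() also consults feature flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of the geometry descriptor_loc() needs. */
struct sketch_fs {
	uint32_t first_data_block;
	uint32_t blocks_per_group;
	uint32_t desc_per_block;
	uint32_t first_meta_bg;
	uint32_t block_size;
	int has_meta_bg;
};

static int sketch_is_power_of(uint32_t n, uint32_t base)
{
	while (n > 1 && n % base == 0)
		n /= base;
	return n == 1;
}

/* Assumed sparse_super rule for groups carrying a superblock copy. */
static int sketch_bg_has_super(uint32_t group)
{
	if (group <= 1)
		return 1;
	return sketch_is_power_of(group, 3) || sketch_is_power_of(group, 5) ||
	       sketch_is_power_of(group, 7);
}

/* Block number holding group descriptor block 'nr'. */
static uint64_t sketch_descriptor_loc(const struct sketch_fs *fs,
				      uint64_t logical_sb_block, uint32_t nr)
{
	uint32_t bg;
	int has_super;

	assert(fs->desc_per_block > 0 && fs->blocks_per_group > 0);

	/* Without meta_bg the descriptors follow the superblock directly. */
	if (!fs->has_meta_bg || nr < fs->first_meta_bg)
		return logical_sb_block + nr + 1;

	/* With meta_bg, block 'nr' lives at the start of its own meta group. */
	bg = fs->desc_per_block * nr;
	has_super = sketch_bg_has_super(bg);

	/* 1k blocks with first_data_block == 0: group 0's GDT is at block 2. */
	if (fs->block_size == 1024 && nr == 0 && fs->first_data_block == 0)
		has_super++;

	return has_super + (uint64_t)bg * fs->blocks_per_group +
	       fs->first_data_block;
}
```

For a 4 KiB meta_bg filesystem with 64 descriptors per block and 32768 blocks per group, descriptor block 1 sits at the first block of group 64, because that group carries no superblock copy under the assumed rule.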
|
|
|
|
|
2008-01-29 05:19:52 +00:00
|
|
|
/**
|
|
|
|
* ext4_get_stripe_size: Get the stripe size.
|
|
|
|
* @sbi: In memory super block info
|
|
|
|
*
|
|
|
|
* If we have specified it via mount option, then
|
|
|
|
* use the mount option value. If the value specified at mount time is
|
|
|
|
* greater than the blocks per group use the super block value.
|
|
|
|
* If the super block value is greater than blocks per group return 0.
|
|
|
|
* The allocator needs it to be less than blocks per group.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
|
|
|
|
{
|
|
|
|
unsigned long stride = le16_to_cpu(sbi->s_es->s_raid_stride);
|
|
|
|
unsigned long stripe_width =
|
|
|
|
le32_to_cpu(sbi->s_es->s_raid_stripe_width);
|
2011-07-18 01:18:51 +00:00
|
|
|
int ret;
|
2008-01-29 05:19:52 +00:00
|
|
|
|
|
|
|
if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group)
|
2011-07-18 01:18:51 +00:00
|
|
|
ret = sbi->s_stripe;
|
2017-02-10 05:56:09 +00:00
|
|
|
else if (stripe_width && stripe_width <= sbi->s_blocks_per_group)
|
2011-07-18 01:18:51 +00:00
|
|
|
ret = stripe_width;
|
2017-02-10 05:56:09 +00:00
|
|
|
else if (stride && stride <= sbi->s_blocks_per_group)
|
2011-07-18 01:18:51 +00:00
|
|
|
ret = stride;
|
|
|
|
else
|
|
|
|
ret = 0;
|
2008-01-29 05:19:52 +00:00
|
|
|
|
2011-07-18 01:18:51 +00:00
|
|
|
/*
|
|
|
|
* If the stripe width is 1, this makes no sense and
|
|
|
|
* we set it to 0 to turn off stripe handling code.
|
|
|
|
*/
|
|
|
|
if (ret <= 1)
|
|
|
|
ret = 0;
|
2008-01-29 05:19:52 +00:00
|
|
|
|
2011-07-18 01:18:51 +00:00
|
|
|
return ret;
|
2008-01-29 05:19:52 +00:00
|
|
|
}
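The fallback order in ext4_get_stripe_size() (mount option, then superblock stripe width, then superblock stride, each only if it fits within a block group) can be restated as a pure function. The function and parameter names below are hypothetical; the logic follows the code above.

```c
#include <assert.h>

/*
 * Hypothetical restatement of the stripe-size selection:
 *   mount_stripe     - value from the stripe= mount option (0 if unset)
 *   sb_stripe_width  - s_raid_stripe_width from the superblock
 *   sb_stride        - s_raid_stride from the superblock
 *   blocks_per_group - upper bound any candidate must respect
 */
static unsigned long sketch_stripe_size(unsigned long mount_stripe,
					unsigned long sb_stripe_width,
					unsigned long sb_stride,
					unsigned long blocks_per_group)
{
	unsigned long ret;

	assert(blocks_per_group > 0);

	if (mount_stripe && mount_stripe <= blocks_per_group)
		ret = mount_stripe;
	else if (sb_stripe_width && sb_stripe_width <= blocks_per_group)
		ret = sb_stripe_width;
	else if (sb_stride && sb_stride <= blocks_per_group)
		ret = sb_stride;
	else
		ret = 0;

	/* A stripe of 1 is meaningless, so it disables stripe handling. */
	if (ret <= 1)
		ret = 0;
	return ret;
}
```

An oversized mount option does not disable striping outright; the function simply falls through to the superblock values.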
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2009-08-18 04:20:23 +00:00
|
|
|
/*
|
|
|
|
* Check whether this filesystem can be mounted based on
|
|
|
|
* the features present and the RDONLY/RDWR mount requested.
|
|
|
|
* Returns 1 if this filesystem can be mounted as requested,
|
|
|
|
* 0 if it cannot be.
|
|
|
|
*/
|
|
|
|
static int ext4_feature_set_ok(struct super_block *sb, int readonly)
|
|
|
|
{
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_unknown_ext4_incompat_features(sb)) {
|
2009-08-18 04:20:23 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Couldn't mount because of "
|
|
|
|
"unsupported optional features (%x)",
|
|
|
|
(le32_to_cpu(EXT4_SB(sb)->s_es->s_feature_incompat) &
|
|
|
|
~EXT4_FEATURE_INCOMPAT_SUPP));
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-04-25 18:05:42 +00:00
|
|
|
#ifndef CONFIG_UNICODE
|
|
|
|
if (ext4_has_feature_casefold(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Filesystem with casefold feature cannot be "
|
|
|
|
"mounted without CONFIG_UNICODE");
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2009-08-18 04:20:23 +00:00
|
|
|
if (readonly)
|
|
|
|
return 1;
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_readonly(sb)) {
|
2015-02-13 03:31:21 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "filesystem is read-only");
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags |= SB_RDONLY;
|
2015-02-13 03:31:21 +00:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2009-08-18 04:20:23 +00:00
|
|
|
/* Check that feature set is OK for a read-write mount */
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_unknown_ext4_ro_compat_features(sb)) {
|
2009-08-18 04:20:23 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "couldn't mount RDWR because of "
|
|
|
|
"unsupported optional features (%x)",
|
|
|
|
(le32_to_cpu(EXT4_SB(sb)->s_es->s_feature_ro_compat) &
|
|
|
|
~EXT4_FEATURE_RO_COMPAT_SUPP));
|
|
|
|
return 0;
|
|
|
|
}
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_bigalloc(sb) && !ext4_has_feature_extents(sb)) {
|
2011-09-09 22:36:51 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Can't support bigalloc feature without "
|
|
|
|
"extents feature\n");
|
|
|
|
return 0;
|
|
|
|
}
|
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 00:21:31 +00:00
|
|
|
|
2020-02-21 10:08:35 +00:00
|
|
|
#if !IS_ENABLED(CONFIG_QUOTA) || !IS_ENABLED(CONFIG_QFMT_V2)
|
2020-02-14 23:11:19 +00:00
|
|
|
if (!readonly && (ext4_has_feature_quota(sb) ||
|
|
|
|
ext4_has_feature_project(sb))) {
|
2012-07-23 00:21:31 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
2020-02-14 23:11:19 +00:00
|
|
|
"The kernel was not built with CONFIG_QUOTA and CONFIG_QFMT_V2");
|
2016-01-08 21:01:22 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2012-07-23 00:21:31 +00:00
|
|
|
#endif /* CONFIG_QUOTA */
|
2009-08-18 04:20:23 +00:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2010-07-27 15:56:04 +00:00
|
|
|
/*
|
|
|
|
* This function is called once a day if we have errors logged
|
|
|
|
* on the file system
|
|
|
|
*/
|
2017-10-18 16:45:17 +00:00
|
|
|
static void print_daily_error_info(struct timer_list *t)
|
2010-07-27 15:56:04 +00:00
|
|
|
{
|
2017-10-18 16:45:17 +00:00
|
|
|
struct ext4_sb_info *sbi = from_timer(sbi, t, s_err_report);
|
|
|
|
struct super_block *sb = sbi->s_sb;
|
|
|
|
struct ext4_super_block *es = sbi->s_es;
|
2010-07-27 15:56:04 +00:00
|
|
|
|
|
|
|
if (es->s_error_count)
|
2014-07-05 22:40:52 +00:00
|
|
|
/* fsck newer than v1.41.13 is needed to clean this condition. */
|
|
|
|
ext4_msg(sb, KERN_NOTICE, "error count since last fsck: %u",
|
2010-07-27 15:56:04 +00:00
|
|
|
le32_to_cpu(es->s_error_count));
|
|
|
|
if (es->s_first_error_time) {
|
2018-07-29 19:51:48 +00:00
|
|
|
printk(KERN_NOTICE "EXT4-fs (%s): initial error at time %llu: %.*s:%d",
|
|
|
|
sb->s_id,
|
|
|
|
ext4_get_tstamp(es, s_first_error_time),
|
2010-07-27 15:56:04 +00:00
|
|
|
(int) sizeof(es->s_first_error_func),
|
|
|
|
es->s_first_error_func,
|
|
|
|
le32_to_cpu(es->s_first_error_line));
|
|
|
|
if (es->s_first_error_ino)
|
2016-10-13 03:12:53 +00:00
|
|
|
printk(KERN_CONT ": inode %u",
|
2010-07-27 15:56:04 +00:00
|
|
|
le32_to_cpu(es->s_first_error_ino));
|
|
|
|
if (es->s_first_error_block)
|
2016-10-13 03:12:53 +00:00
|
|
|
printk(KERN_CONT ": block %llu", (unsigned long long)
|
2010-07-27 15:56:04 +00:00
|
|
|
le64_to_cpu(es->s_first_error_block));
|
2016-10-13 03:12:53 +00:00
|
|
|
printk(KERN_CONT "\n");
|
2010-07-27 15:56:04 +00:00
|
|
|
}
|
|
|
|
if (es->s_last_error_time) {
|
2018-07-29 19:51:48 +00:00
|
|
|
printk(KERN_NOTICE "EXT4-fs (%s): last error at time %llu: %.*s:%d",
|
|
|
|
sb->s_id,
|
|
|
|
ext4_get_tstamp(es, s_last_error_time),
|
2010-07-27 15:56:04 +00:00
|
|
|
(int) sizeof(es->s_last_error_func),
|
|
|
|
es->s_last_error_func,
|
|
|
|
le32_to_cpu(es->s_last_error_line));
|
|
|
|
if (es->s_last_error_ino)
|
2016-10-13 03:12:53 +00:00
|
|
|
printk(KERN_CONT ": inode %u",
|
2010-07-27 15:56:04 +00:00
|
|
|
le32_to_cpu(es->s_last_error_ino));
|
|
|
|
if (es->s_last_error_block)
|
2016-10-13 03:12:53 +00:00
|
|
|
printk(KERN_CONT ": block %llu", (unsigned long long)
|
2010-07-27 15:56:04 +00:00
|
|
|
le64_to_cpu(es->s_last_error_block));
|
2016-10-13 03:12:53 +00:00
|
|
|
printk(KERN_CONT "\n");
|
2010-07-27 15:56:04 +00:00
|
|
|
}
|
|
|
|
mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ); /* Once a day */
|
|
|
|
}
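The printk() calls above pass `(int) sizeof(es->s_first_error_func)` as the precision for `%.*s`, which caps how many bytes are read from a fixed-width field that may not be NUL-terminated, and they convert the little-endian line number before printing it. A minimal user-space sketch of the same pattern follows. The struct, the helper names (`demo_err_record`, `demo_set_func`, `demo_le32_to_cpu`, `demo_format_error`) and the 32-byte field width are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the on-disk error fields; not the real ext4_super_block. */
struct demo_err_record {
	char    func[32];   /* function name; may occupy all 32 bytes with no NUL */
	uint8_t line_le[4]; /* line number stored little-endian */
};

/* Store a name as a fixed-width field would: truncated, zero-padded, NUL not guaranteed. */
static void demo_set_func(struct demo_err_record *r, const char *name)
{
	size_t n = strlen(name);

	assert(r != NULL);
	if (n > sizeof(r->func))
		n = sizeof(r->func);
	memset(r->func, 0, sizeof(r->func));
	memcpy(r->func, name, n);
}

/* Decode a 32-bit little-endian value byte by byte, independent of host endianness. */
static uint32_t demo_le32_to_cpu(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Format "func:line" into buf; the precision stops the read at sizeof(func). */
static int demo_format_error(char *buf, size_t len, const struct demo_err_record *r)
{
	return snprintf(buf, len, "%.*s:%u", (int)sizeof(r->func), r->func,
			(unsigned int)demo_le32_to_cpu(r->line_le));
}
```

Without the precision, a name that fills the whole field would let `%s` run on into the bytes that follow it in the structure.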
|
|
|
|
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it
took to zero out the last inode table by the wait multiplier, which
can be set with the (init_itable=n) mount option (default is 10). We
do this so that we do not take the whole I/O bandwidth. When the
thread is no longer necessary (the request list is empty), it frees
the appropriate structures and exits (and can be created later by
another filesystem).
We do not disturb regular inode allocations in any way; they simply do
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the alloc_sem lock for writing in ext4_init_inode_table() and for
reading in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group that is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 01:30:05 +00:00
|
|
|
/* Find next suitable group and run ext4_init_inode_table */
|
|
|
|
static int ext4_run_li_request(struct ext4_li_request *elr)
|
|
|
|
{
|
|
|
|
struct ext4_group_desc *gdp = NULL;
|
2020-07-17 04:14:40 +00:00
|
|
|
struct super_block *sb = elr->lr_super;
|
|
|
|
ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
|
|
|
|
ext4_group_t group = elr->lr_next_group;
|
2010-10-28 01:30:05 +00:00
|
|
|
unsigned long timeout = 0;
|
2020-07-17 04:14:40 +00:00
|
|
|
unsigned int prefetch_ios = 0;
|
2010-10-28 01:30:05 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
2020-07-17 04:14:40 +00:00
|
|
|
if (elr->lr_mode == EXT4_LI_MODE_PREFETCH_BBITMAP) {
|
|
|
|
elr->lr_next_group = ext4_mb_prefetch(sb, group,
|
|
|
|
EXT4_SB(sb)->s_mb_prefetch, &prefetch_ios);
|
|
|
|
if (prefetch_ios)
|
|
|
|
ext4_mb_prefetch_fini(sb, elr->lr_next_group,
|
|
|
|
prefetch_ios);
|
|
|
|
trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group,
|
|
|
|
prefetch_ios);
|
|
|
|
if (group >= elr->lr_next_group) {
|
|
|
|
ret = 1;
|
|
|
|
if (elr->lr_first_not_zeroed != ngroups &&
|
|
|
|
!sb_rdonly(sb) && test_opt(sb, INIT_INODE_TABLE)) {
|
|
|
|
elr->lr_next_group = elr->lr_first_not_zeroed;
|
|
|
|
elr->lr_mode = EXT4_LI_MODE_ITABLE;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2020-07-17 04:14:40 +00:00
|
|
|
for (; group < ngroups; group++) {
|
2010-10-28 01:30:05 +00:00
|
|
|
gdp = ext4_get_group_desc(sb, group, NULL);
|
|
|
|
if (!gdp) {
|
|
|
|
ret = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-01-13 13:41:45 +00:00
|
|
|
if (group >= ngroups)
|
2010-10-28 01:30:05 +00:00
|
|
|
ret = 1;
|
|
|
|
|
|
|
|
if (!ret) {
|
|
|
|
timeout = jiffies;
|
|
|
|
ret = ext4_init_inode_table(sb, group,
|
|
|
|
elr->lr_timeout ? 0 : 1);
|
2020-07-17 04:14:40 +00:00
|
|
|
trace_ext4_lazy_itable_init(sb, group);
|
2010-10-28 01:30:05 +00:00
|
|
|
if (elr->lr_timeout == 0) {
|
2011-05-20 17:55:16 +00:00
|
|
|
timeout = (jiffies - timeout) *
|
2020-07-17 04:14:40 +00:00
|
|
|
EXT4_SB(elr->lr_super)->s_li_wait_mult;
|
2010-10-28 01:30:05 +00:00
|
|
|
elr->lr_timeout = timeout;
|
|
|
|
}
|
|
|
|
elr->lr_next_sched = jiffies + elr->lr_timeout;
|
|
|
|
elr->lr_next_group = group + 1;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
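The inode-table branch of ext4_run_li_request() above does two things: it scans forward for the first group whose descriptor lacks EXT4_BG_INODE_ZEROED, and, after zeroing that group, it derives the delay before the next run from the elapsed time multiplied by s_li_wait_mult (only while lr_timeout is still zero). The following user-space sketch mirrors that logic under simplifying assumptions: the `demo_` names are hypothetical, a plain byte array stands in for the group descriptors' flags, and tick values are passed in rather than read from jiffies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request state; the fields only mirror the kernel's names. */
struct demo_li_request {
	unsigned long timeout;    /* delay between runs in ticks; 0 = not measured yet */
	unsigned long next_sched; /* tick at which the next run is due */
	unsigned int  next_group; /* first group to examine on the next run */
};

/* Return the first group at or after start whose "zeroed" flag is clear, or ngroups. */
static unsigned int demo_next_unzeroed(const unsigned char *zeroed,
				       unsigned int ngroups, unsigned int start)
{
	unsigned int g = start;

	assert(zeroed != NULL);
	while (g < ngroups && zeroed[g])
		g++;
	return g;
}

/*
 * Record the outcome of zeroing one group, which ran from tick started to
 * tick now. Only the first measured run fixes the delay, as elapsed * wait_mult.
 */
static void demo_reschedule(struct demo_li_request *req, unsigned int group,
			    unsigned long started, unsigned long now,
			    unsigned long wait_mult)
{
	assert(req != NULL);
	if (req->timeout == 0)
		req->timeout = (now - started) * wait_mult;
	req->next_sched = now + req->timeout;
	req->next_group = group + 1;
}
```

With a multiplier of 10, a group that took 4 ticks to zero yields a 40-tick pause, so zeroing occupies only a small fraction of the elapsed time and leaves most of the I/O bandwidth to other work.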
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove lr_request from the list_request and free the
|
2011-05-20 17:49:04 +00:00
|
|
|
* request structure. Should be called with li_list_mtx held
|
2010-10-28 01:30:05 +00:00
|
|
|
*/
|
|
|
|
static void ext4_remove_li_request(struct ext4_li_request *elr)
|
|
|
|
{
|
|
|
|
if (!elr)
|
|
|
|
return;
|
|
|
|
|
|
|
|
list_del(&elr->lr_request);
|
2020-07-17 04:14:40 +00:00
|
|
|
EXT4_SB(elr->lr_super)->s_li_request = NULL;
|
2010-10-28 01:30:05 +00:00
|
|
|
kfree(elr);
|
|
|
|
}
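ext4_remove_li_request() above unlinks the request from its list, clears the superblock's back-pointer to it, and frees it. The sketch below shows the same three steps in user space. The names (`demo_node`, `demo_request`, `owner_slot`) are hypothetical, the hand-written list helpers stand in for the kernel's list_head API, and no locking is modelled, although the kernel comment requires the list mutex to be held.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly linked list node, standing in for struct list_head. */
struct demo_node {
	struct demo_node *prev, *next;
};

/* Hypothetical request: a list link plus a pointer to the owner's back-pointer slot. */
struct demo_request {
	struct demo_node link;
	struct demo_request **owner_slot; /* plays the role of sbi->s_li_request */
};

/* An empty list is a head that points at itself in both directions. */
static void demo_list_init(struct demo_node *head)
{
	assert(head != NULL);
	head->prev = head->next = head;
}

/* Insert n right after head. */
static void demo_list_add(struct demo_node *n, struct demo_node *head)
{
	assert(n != NULL && head != NULL);
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink the request, clear the owner's pointer to it, then free it. */
static void demo_remove_request(struct demo_request *req)
{
	if (!req)
		return;
	req->link.prev->next = req->link.next;
	req->link.next->prev = req->link.prev;
	if (req->owner_slot)
		*req->owner_slot = NULL;
	free(req);
}
```

Clearing the back-pointer before freeing means a later lookup through the owner sees NULL rather than a dangling pointer.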
|
|
|
|
|
|
|
|
static void ext4_unregister_li_request(struct super_block *sb)
|
|
|
|
{
|
2011-05-20 17:55:29 +00:00
|
|
|
mutex_lock(&ext4_li_mtx);
|
|
|
|
if (!ext4_li_info) {
|
|
|
|
mutex_unlock(&ext4_li_mtx);
|
2010-10-28 01:30:05 +00:00
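The throttling rule described in the commit message above (next run time = time spent on the last zeroing pass, multiplied by the wait multiplier) can be sketched in plain user-space C. This is only an illustration under stated assumptions: next_sched_time() is a hypothetical helper, not a kernel function, and time is modelled as an abstract 64-bit tick counter rather than jiffies.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of ext4): compute when the next zeroing
 * pass should run.
 *
 *   now       - current time, in ticks
 *   elapsed   - how long the previous pass took, in the same ticks
 *   wait_mult - wait multiplier (the commit message gives 10 as the
 *               default for init_itable=n)
 *
 * The longer the last pass took, the longer the thread backs off, so
 * zeroing only uses a bounded share of the I/O bandwidth.
 */
static uint64_t next_sched_time(uint64_t now, uint64_t elapsed,
                                unsigned int wait_mult)
{
	return now + elapsed * (uint64_t)wait_mult;
}
```

With a multiplier of 10, a pass that took 5 ticks and finished at tick 1000 would be followed by the next pass no earlier than tick 1050, so roughly one tenth of the elapsed time is spent zeroing.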
|
|
|
return;
|
2011-05-20 17:55:29 +00:00
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
|
|
|
|
mutex_lock(&ext4_li_info->li_list_mtx);
|
2011-05-20 17:55:29 +00:00
|
|
|
ext4_remove_li_request(EXT4_SB(sb)->s_li_request);
|
2010-10-28 01:30:05 +00:00
|
|
|
mutex_unlock(&ext4_li_info->li_list_mtx);
|
2011-05-20 17:55:29 +00:00
|
|
|
mutex_unlock(&ext4_li_mtx);
|
2010-10-28 01:30:05 +00:00
|
|
|
}
|
|
|
|
|
2011-02-03 19:33:15 +00:00
|
|
|
static struct task_struct *ext4_lazyinit_task;
|
|
|
|
|
2010-10-28 01:30:05 +00:00
|
|
|
/*
|
|
|
|
* This is the function where ext4lazyinit thread lives. It walks
|
|
|
|
* through the request list searching for the next scheduled filesystem.
|
|
|
|
* When such a fs is found, run the lazy initialization request
|
|
|
|
* (ext4_run_li_request) and keep track of the time spent in this
|
|
|
|
* function. Based on that time we compute next schedule time of
|
|
|
|
* the request. When walking through the list is complete, compute
|
|
|
|
* the next wakeup time and go to sleep.
|
|
|
|
*/
|
|
|
|
static int ext4_lazyinit_thread(void *arg)
|
|
|
|
{
|
|
|
|
struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
|
|
|
|
struct list_head *pos, *n;
|
|
|
|
struct ext4_li_request *elr;
|
2011-05-20 17:49:04 +00:00
|
|
|
unsigned long next_wakeup, cur;
|
2010-10-28 01:30:05 +00:00
|
|
|
|
|
|
|
BUG_ON(NULL == eli);
|
|
|
|
|
|
|
|
cont_thread:
|
|
|
|
while (true) {
|
|
|
|
next_wakeup = MAX_JIFFY_OFFSET;
|
|
|
|
|
|
|
|
mutex_lock(&eli->li_list_mtx);
|
|
|
|
if (list_empty(&eli->li_request_list)) {
|
|
|
|
mutex_unlock(&eli->li_list_mtx);
|
|
|
|
goto exit_thread;
|
|
|
|
}
|
|
|
|
list_for_each_safe(pos, n, &eli->li_request_list) {
|
2016-09-06 03:38:36 +00:00
|
|
|
int err = 0;
|
|
|
|
int progress = 0;
|
2010-10-28 01:30:05 +00:00
|
|
|
elr = list_entry(pos, struct ext4_li_request,
|
|
|
|
lr_request);
|
|
|
|
|
2016-09-06 03:38:36 +00:00
|
|
|
if (time_before(jiffies, elr->lr_next_sched)) {
|
|
|
|
if (time_before(elr->lr_next_sched, next_wakeup))
|
|
|
|
next_wakeup = elr->lr_next_sched;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (down_read_trylock(&elr->lr_super->s_umount)) {
|
|
|
|
if (sb_start_write_trylock(elr->lr_super)) {
|
|
|
|
progress = 1;
|
|
|
|
/*
|
|
|
|
* We hold sb->s_umount, sb can not
|
|
|
|
* be removed from the list, it is
|
|
|
|
* now safe to drop li_list_mtx
|
|
|
|
*/
|
|
|
|
mutex_unlock(&eli->li_list_mtx);
|
|
|
|
err = ext4_run_li_request(elr);
|
|
|
|
sb_end_write(elr->lr_super);
|
|
|
|
mutex_lock(&eli->li_list_mtx);
|
|
|
|
n = pos->next;
|
2010-11-02 18:19:30 +00:00
|
|
|
}
|
2016-09-06 03:38:36 +00:00
|
|
|
up_read((&elr->lr_super->s_umount));
|
|
|
|
}
|
|
|
|
/* error, remove the lazy_init job */
|
|
|
|
if (err) {
|
|
|
|
ext4_remove_li_request(elr);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (!progress) {
|
|
|
|
elr->lr_next_sched = jiffies +
|
|
|
|
(prandom_u32()
|
|
|
|
% (EXT4_DEF_LI_MAX_START_DELAY * HZ));
|
2010-10-28 01:30:05 +00:00
|
|
|
}
|
|
|
|
if (time_before(elr->lr_next_sched, next_wakeup))
|
|
|
|
next_wakeup = elr->lr_next_sched;
|
|
|
|
}
|
|
|
|
mutex_unlock(&eli->li_list_mtx);
|
|
|
|
|
2011-11-21 20:32:22 +00:00
|
|
|
try_to_freeze();
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2011-05-20 17:49:04 +00:00
|
|
|
cur = jiffies;
|
|
|
|
if ((time_after_eq(cur, next_wakeup)) ||
|
2010-11-02 18:07:17 +00:00
|
|
|
(MAX_JIFFY_OFFSET == next_wakeup)) {
|
2010-10-28 01:30:05 +00:00
|
|
|
cond_resched();
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2011-05-20 17:49:04 +00:00
|
|
|
schedule_timeout_interruptible(next_wakeup - cur);
|
|
|
|
|
2011-02-03 19:33:15 +00:00
|
|
|
if (kthread_should_stop()) {
|
|
|
|
ext4_clear_request_list();
|
|
|
|
goto exit_thread;
|
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
exit_thread:
|
|
|
|
/*
|
|
|
|
* It looks like the request list is empty, but we need
|
|
|
|
* to check it under the li_list_mtx lock, to prevent any
|
|
|
|
* additions into it, and of course we should lock ext4_li_mtx
|
|
|
|
* to atomically free the list and ext4_li_info, because at
|
|
|
|
* this point another ext4 filesystem could be registering
|
|
|
|
* a new one.
|
|
|
|
*/
|
|
|
|
mutex_lock(&ext4_li_mtx);
|
|
|
|
mutex_lock(&eli->li_list_mtx);
|
|
|
|
if (!list_empty(&eli->li_request_list)) {
|
|
|
|
mutex_unlock(&eli->li_list_mtx);
|
|
|
|
mutex_unlock(&ext4_li_mtx);
|
|
|
|
goto cont_thread;
|
|
|
|
}
|
|
|
|
mutex_unlock(&eli->li_list_mtx);
|
|
|
|
kfree(ext4_li_info);
|
|
|
|
ext4_li_info = NULL;
|
|
|
|
mutex_unlock(&ext4_li_mtx);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
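The loop in ext4_lazyinit_thread() above tracks the earliest lr_next_sched with time_before(), which stays correct when the tick counter wraps around. The following user-space sketch shows the same idea under stated assumptions: tick_before() and earliest_deadline() are hypothetical names introduced here, a 32-bit tick counter is assumed, and the conversion to int32_t relies on the usual two's-complement behaviour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical wrap-safe comparison: true if tick a is earlier than
 * tick b. The unsigned subtraction wraps, and reading the result as a
 * signed value gives the right answer as long as the two ticks are
 * less than half the counter range apart.
 */
static bool tick_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical helper: return the earliest deadline in the array, or
 * fallback if no deadline is earlier than it (including when n == 0).
 */
static uint32_t earliest_deadline(const uint32_t *deadlines, size_t n,
                                  uint32_t fallback)
{
	uint32_t best = fallback;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tick_before(deadlines[i], best))
			best = deadlines[i];
	}
	return best;
}
```

A plain `a < b` would misorder a deadline just before the wrap point and one just after it; the signed-difference form treats 0xFFFFFFF0 as earlier than 0x10, which is what a periodic scheduler needs.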
|
|
|
|
|
|
|
|
static void ext4_clear_request_list(void)
|
|
|
|
{
|
|
|
|
struct list_head *pos, *n;
|
|
|
|
struct ext4_li_request *elr;
|
|
|
|
|
|
|
|
mutex_lock(&ext4_li_info->li_list_mtx);
|
|
|
|
list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
|
|
|
|
elr = list_entry(pos, struct ext4_li_request,
|
|
|
|
lr_request);
|
|
|
|
ext4_remove_li_request(elr);
|
|
|
|
}
|
|
|
|
mutex_unlock(&ext4_li_info->li_list_mtx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int ext4_run_lazyinit_thread(void)
|
|
|
|
{
|
2011-02-03 19:33:15 +00:00
|
|
|
ext4_lazyinit_task = kthread_run(ext4_lazyinit_thread,
|
|
|
|
ext4_li_info, "ext4lazyinit");
|
|
|
|
if (IS_ERR(ext4_lazyinit_task)) {
|
|
|
|
int err = PTR_ERR(ext4_lazyinit_task);
|
2010-10-28 01:30:05 +00:00
|
|
|
ext4_clear_request_list();
|
|
|
|
kfree(ext4_li_info);
|
|
|
|
ext4_li_info = NULL;
|
2012-03-20 03:41:49 +00:00
|
|
|
printk(KERN_CRIT "EXT4-fs: error %d creating inode table "
|
2010-10-28 01:30:05 +00:00
|
|
|
"initialization thread\n",
|
|
|
|
err);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
|
|
|
|
return 0;
|
|
|
|
}
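The commit message for lazy inode table initialization describes the throttling rule: the delay before the next run is the time the last zeroing pass took, multiplied by the wait multiplier. A minimal userspace sketch of that rule follows. The struct and function names are hypothetical and are not the kernel's; the cached-timeout behaviour is modelled on the `lr_timeout = 0` reset seen in `ext4_register_li_request()`, and is an assumption about how the thread consumes it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-request state; not the kernel struct. */
struct li_request_sketch {
	uint64_t next_sched;	/* earliest tick at which to run again */
	uint64_t timeout;	/* cached delay between runs; 0 means recompute */
};

/*
 * After a zeroing pass that took `elapsed` ticks, schedule the next pass.
 * The delay is elapsed * wait_mult and is computed only when the cached
 * timeout is 0, so resetting timeout to 0 forces a recomputation.
 */
static void li_reschedule(struct li_request_sketch *req, uint64_t now,
			  uint64_t elapsed, unsigned int wait_mult)
{
	assert(wait_mult > 0);
	if (req->timeout == 0)
		req->timeout = elapsed * wait_mult;
	req->next_sched = now + req->timeout;
}
```

With a multiplier of 10, a pass that took 7 ticks leaves the thread idle for the following 70 ticks, which is what keeps it from consuming the whole I/O bandwidth.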
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether it makes sense to run the itable init thread or not.
|
|
|
|
* If there is at least one uninitialized inode table, return the
|
|
|
|
* corresponding group number; otherwise the loop goes through all
|
|
|
|
* groups and returns the total number of groups.
|
|
|
|
*/
|
|
|
|
static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
|
|
|
|
{
|
|
|
|
ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
|
|
|
|
struct ext4_group_desc *gdp = NULL;
|
|
|
|
|
2018-06-14 04:58:00 +00:00
|
|
|
if (!ext4_has_group_desc_csum(sb))
|
|
|
|
return ngroups;
|
|
|
|
|
2010-10-28 01:30:05 +00:00
|
|
|
for (group = 0; group < ngroups; group++) {
|
|
|
|
gdp = ext4_get_group_desc(sb, group, NULL);
|
|
|
|
if (!gdp)
|
|
|
|
continue;
|
|
|
|
|
2018-07-28 12:12:04 +00:00
|
|
|
if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
|
2010-10-28 01:30:05 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return group;
|
|
|
|
}
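The scan in ext4_has_uninit_itable() can be illustrated outside the kernel. The sketch below is a simplified userspace model, not the kernel code: the descriptor struct and function name are hypothetical, the flags are assumed to be already in CPU byte order (the kernel compares against cpu_to_le16() because the on-disk field is little-endian), and the checksum-feature and NULL-descriptor checks are left out.

```c
#include <assert.h>
#include <stdint.h>

#define BG_INODE_ZEROED 0x0004	/* mirrors EXT4_BG_INODE_ZEROED */

/* Hypothetical, simplified descriptor: flags already in CPU byte order. */
struct group_desc_sketch {
	uint16_t flags;
};

/*
 * Return the index of the first group whose inode table is not marked
 * zeroed, or ngroups when every group is already initialized.
 */
static uint32_t first_unzeroed_group(const struct group_desc_sketch *gd,
				     uint32_t ngroups)
{
	uint32_t group;

	for (group = 0; group < ngroups; group++) {
		if (!(gd[group].flags & BG_INODE_ZEROED))
			break;
	}
	return group;
}
```

Returning ngroups as the "nothing to do" value lets the caller test first_not_zeroed == ngroups, as ext4_register_li_request() does, without a separate status code.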
|
|
|
|
|
|
|
|
static int ext4_li_info_new(void)
|
|
|
|
{
|
|
|
|
struct ext4_lazy_init *eli = NULL;
|
|
|
|
|
|
|
|
eli = kzalloc(sizeof(*eli), GFP_KERNEL);
|
|
|
|
if (!eli)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&eli->li_request_list);
|
|
|
|
mutex_init(&eli->li_list_mtx);
|
|
|
|
|
|
|
|
eli->li_state |= EXT4_LAZYINIT_QUIT;
|
|
|
|
|
|
|
|
ext4_li_info = eli;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
|
|
|
|
ext4_group_t start)
|
|
|
|
{
|
|
|
|
struct ext4_li_request *elr;
|
|
|
|
|
|
|
|
elr = kzalloc(sizeof(*elr), GFP_KERNEL);
|
|
|
|
if (!elr)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
elr->lr_super = sb;
|
2020-07-17 04:14:40 +00:00
|
|
|
elr->lr_first_not_zeroed = start;
|
2021-04-01 17:21:29 +00:00
|
|
|
if (test_opt(sb, NO_PREFETCH_BLOCK_BITMAPS)) {
|
2020-07-17 04:14:40 +00:00
|
|
|
elr->lr_mode = EXT4_LI_MODE_ITABLE;
|
|
|
|
elr->lr_next_group = start;
|
2021-04-01 17:21:29 +00:00
|
|
|
} else {
|
|
|
|
elr->lr_mode = EXT4_LI_MODE_PREFETCH_BBITMAP;
|
2020-07-17 04:14:40 +00:00
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Randomize first schedule time of the request to
|
|
|
|
* spread the inode table initialization requests
|
|
|
|
* better.
|
|
|
|
*/
|
2013-11-08 05:14:53 +00:00
|
|
|
elr->lr_next_sched = jiffies + (prandom_u32() %
|
|
|
|
(EXT4_DEF_LI_MAX_START_DELAY * HZ));
|
2010-10-28 01:30:05 +00:00
|
|
|
return elr;
|
|
|
|
}
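The randomized first schedule time in ext4_li_request_new() can be sketched in isolation. This is a userspace illustration under stated assumptions: SKETCH_HZ and SKETCH_MAX_START_DELAY are placeholder values standing in for HZ and EXT4_DEF_LI_MAX_START_DELAY, the function name is hypothetical, and the random value is passed in as a parameter instead of coming from prandom_u32() so the result is reproducible.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ 250			/* assumed tick rate, stands in for HZ */
#define SKETCH_MAX_START_DELAY 5	/* assumed seconds, stands in for EXT4_DEF_LI_MAX_START_DELAY */

/*
 * Pick the first run time somewhere in [now, now + max_delay * HZ) so that
 * requests from several filesystems do not all fire on the same tick.
 */
static uint64_t first_sched_time(uint64_t now, uint32_t rnd)
{
	return now + (rnd % (SKETCH_MAX_START_DELAY * SKETCH_HZ));
}
```

The modulo bounds the offset strictly below the maximum start delay, so a request never waits longer than that window before its first pass.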
|
|
|
|
|
2013-01-13 13:41:45 +00:00
|
|
|
int ext4_register_li_request(struct super_block *sb,
|
|
|
|
ext4_group_t first_not_zeroed)
|
2010-10-28 01:30:05 +00:00
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2013-01-13 13:41:45 +00:00
|
|
|
struct ext4_li_request *elr = NULL;
|
2018-01-11 18:17:49 +00:00
|
|
|
ext4_group_t ngroups = sbi->s_groups_count;
|
2011-01-10 17:30:17 +00:00
|
|
|
int ret = 0;
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2013-01-13 13:41:45 +00:00
|
|
|
mutex_lock(&ext4_li_mtx);
|
2011-05-20 17:55:16 +00:00
|
|
|
if (sbi->s_li_request != NULL) {
|
|
|
|
/*
|
|
|
|
* Reset timeout so it can be computed again, because
|
|
|
|
* s_li_wait_mult might have changed.
|
|
|
|
*/
|
|
|
|
sbi->s_li_request->lr_timeout = 0;
|
2013-01-13 13:41:45 +00:00
|
|
|
goto out;
|
2011-05-20 17:55:16 +00:00
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2021-04-01 17:21:29 +00:00
|
|
|
if (test_opt(sb, NO_PREFETCH_BLOCK_BITMAPS) &&
|
2020-07-17 04:14:40 +00:00
|
|
|
(first_not_zeroed == ngroups || sb_rdonly(sb) ||
|
|
|
|
!test_opt(sb, INIT_INODE_TABLE)))
|
2013-01-13 13:41:45 +00:00
|
|
|
goto out;
|
2010-10-28 01:30:05 +00:00
|
|
|
|
|
|
|
elr = ext4_li_request_new(sb, first_not_zeroed);
|
2013-01-13 13:41:45 +00:00
|
|
|
if (!elr) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2010-10-28 01:30:05 +00:00
|
|
|
|
|
|
|
if (!ext4_li_info) {
|
|
|
|
ret = ext4_li_info_new();
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_lock(&ext4_li_info->li_list_mtx);
|
|
|
|
list_add(&elr->lr_request, &ext4_li_info->li_request_list);
|
|
|
|
mutex_unlock(&ext4_li_info->li_list_mtx);
|
|
|
|
|
|
|
|
sbi->s_li_request = elr;
|
2011-04-04 20:00:49 +00:00
|
|
|
/*
|
|
|
|
* Set elr to NULL here since it has been inserted into
|
|
|
|
* the request_list; its removal and freeing are
|
|
|
|
* handled by ext4_clear_request_list from now on.
|
|
|
|
*/
|
|
|
|
elr = NULL;
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for the request is computed by multiplying the time it
took to zero out the last inode table by the wait multiplier, which can
be set with the (init_itable=n) mount option (default is 10). We are
doing this so we do not take the whole I/O bandwidth. When the thread is
no longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator
does not care whether the inode table is zeroed or not. But when zeroing,
we have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take alloc_sem for writing in ext4_init_inode_table() and for reading
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
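The throttling rule described in the commit message above (wait for the time the last zeroing pass took, multiplied by the wait multiplier) can be sketched in plain userspace C. This is only an illustration under assumptions: next_schedule() and its parameters are hypothetical names, time is an abstract tick counter rather than jiffies, and this is not the kernel's implementation.

```c
#include <assert.h>

/*
 * Hypothetical helper: given the tick at which a zeroing pass started
 * and ended, return the tick at which the next pass may run.
 * wait_mult plays the role of the init_itable=n multiplier.
 */
unsigned long next_schedule(unsigned long now, unsigned long start,
                            unsigned long end, unsigned long wait_mult)
{
    unsigned long elapsed;

    assert(end >= start);       /* a pass cannot finish before it starts */
    elapsed = end - start;

    /* The slower the device, the longer the thread backs off. */
    return now + elapsed * wait_mult;
}
```

With the default multiplier of 10, a pass that took 30 ticks delays the next pass by 300 ticks, so zeroing occupies roughly one eleventh of the timeline.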
2010-10-28 01:30:05 +00:00
|
|
|
|
|
|
|
if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
|
|
|
|
ret = ext4_run_lazyinit_thread();
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
out:
|
2010-10-28 02:08:42 +00:00
|
|
|
mutex_unlock(&ext4_li_mtx);
|
|
|
|
if (ret)
|
ext4: add support for lazy inode table initialization
2010-10-28 01:30:05 +00:00
|
|
|
kfree(elr);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We do not need to lock anything since this is called on
|
|
|
|
* module unload.
|
|
|
|
*/
|
|
|
|
static void ext4_destroy_lazyinit_thread(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If thread exited earlier
|
|
|
|
* there's nothing to be done.
|
|
|
|
*/
|
2011-02-03 19:33:15 +00:00
|
|
|
if (!ext4_li_info || !ext4_lazyinit_task)
|
ext4: add support for lazy inode table initialization
2010-10-28 01:30:05 +00:00
|
|
|
return;
|
|
|
|
|
2011-02-03 19:33:15 +00:00
|
|
|
kthread_stop(ext4_lazyinit_task);
|
ext4: add support for lazy inode table initialization
2010-10-28 01:30:05 +00:00
|
|
|
}
|
|
|
|
|
2012-05-27 11:48:56 +00:00
|
|
|
static int set_journal_csum_feature_set(struct super_block *sb)
|
|
|
|
{
|
|
|
|
int ret = 1;
|
|
|
|
int compat, incompat;
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
|
2014-10-13 07:36:16 +00:00
|
|
|
if (ext4_has_metadata_csum(sb)) {
|
2014-08-27 22:40:07 +00:00
|
|
|
/* journal checksum v3 */
|
2012-05-27 11:48:56 +00:00
|
|
|
compat = 0;
|
2014-08-27 22:40:07 +00:00
|
|
|
incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3;
|
2012-05-27 11:48:56 +00:00
|
|
|
} else {
|
|
|
|
/* journal checksum v1 */
|
|
|
|
compat = JBD2_FEATURE_COMPAT_CHECKSUM;
|
|
|
|
incompat = 0;
|
|
|
|
}
|
|
|
|
|
2014-09-11 15:38:21 +00:00
|
|
|
jbd2_journal_clear_features(sbi->s_journal,
|
|
|
|
JBD2_FEATURE_COMPAT_CHECKSUM, 0,
|
|
|
|
JBD2_FEATURE_INCOMPAT_CSUM_V3 |
|
|
|
|
JBD2_FEATURE_INCOMPAT_CSUM_V2);
|
2012-05-27 11:48:56 +00:00
|
|
|
if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
|
|
|
|
ret = jbd2_journal_set_features(sbi->s_journal,
|
|
|
|
compat, 0,
|
|
|
|
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
|
|
|
|
incompat);
|
|
|
|
} else if (test_opt(sb, JOURNAL_CHECKSUM)) {
|
|
|
|
ret = jbd2_journal_set_features(sbi->s_journal,
|
|
|
|
compat, 0,
|
|
|
|
incompat);
|
|
|
|
jbd2_journal_clear_features(sbi->s_journal, 0, 0,
|
|
|
|
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
|
|
|
|
} else {
|
2014-09-11 15:38:21 +00:00
|
|
|
jbd2_journal_clear_features(sbi->s_journal, 0, 0,
|
|
|
|
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
|
2012-05-27 11:48:56 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
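The decision made by set_journal_csum_feature_set() can be summarised as a small pure function. The sketch below is a simplified userspace model, not the jbd2 API: the SKETCH_* bit values, struct sketch_features and pick_journal_features() are hypothetical, and it only reports which feature bits would be requested, without modelling the calls that clear features.

```c
#include <assert.h>

/* Illustrative bit values only; they are not the real JBD2 constants. */
enum {
    SKETCH_COMPAT_CHECKSUM       = 1 << 0,
    SKETCH_INCOMPAT_ASYNC_COMMIT = 1 << 1,
    SKETCH_INCOMPAT_CSUM_V3      = 1 << 2
};

struct sketch_features {
    unsigned compat;    /* compat bits that would be set */
    unsigned incompat;  /* incompat bits that would be set */
};

struct sketch_features pick_journal_features(int metadata_csum,
                                             int async_commit,
                                             int journal_checksum)
{
    struct sketch_features f = { 0, 0 };
    /* metadata_csum selects checksum v3, otherwise the v1 compat flag. */
    unsigned compat = metadata_csum ? 0 : SKETCH_COMPAT_CHECKSUM;
    unsigned incompat = metadata_csum ? SKETCH_INCOMPAT_CSUM_V3 : 0;

    assert(!(compat && incompat));  /* exactly one checksum flavour */

    if (async_commit) {
        f.compat = compat;
        f.incompat = incompat | SKETCH_INCOMPAT_ASYNC_COMMIT;
    } else if (journal_checksum) {
        f.compat = compat;
        f.incompat = incompat;
    }
    /* Neither option set: no checksum feature is requested. */
    return f;
}
```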
|
|
|
|
|
2012-07-09 20:27:05 +00:00
|
|
|
/*
|
|
|
|
* Note: calculating the overhead so we can be compatible with
|
|
|
|
* historical BSD practice is quite difficult in the face of
|
|
|
|
* clusters/bigalloc. This is because multiple metadata blocks from
|
|
|
|
* different block groups can end up in the same allocation cluster.
|
|
|
|
* Calculating the exact overhead in the face of clustered allocation
|
|
|
|
* requires either O(all block bitmaps) in memory or O(number of block
|
|
|
|
* groups**2) in time. We will still calculate the superblock for
|
|
|
|
* older file systems --- and if we come across a bigalloc file
|
|
|
|
* system with zero in s_overhead_clusters the estimate will be close to
|
|
|
|
* correct especially for very large cluster sizes --- but for newer
|
|
|
|
* file systems, it's better to calculate this figure once at mkfs
|
|
|
|
* time, and store it in the superblock. If the superblock value is
|
|
|
|
* present (even for non-bigalloc file systems), we will use it.
|
|
|
|
*/
|
|
|
|
static int count_overhead(struct super_block *sb, ext4_group_t grp,
|
|
|
|
char *buf)
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
struct ext4_group_desc *gdp;
|
|
|
|
ext4_fsblk_t first_block, last_block, b;
|
|
|
|
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
|
|
|
|
int s, j, count = 0;
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_bigalloc(sb))
|
2012-08-16 15:59:04 +00:00
|
|
|
return (ext4_bg_has_super(sb, grp) + ext4_bg_num_gdb(sb, grp) +
|
|
|
|
sbi->s_itb_per_group + 2);
|
|
|
|
|
2012-07-09 20:27:05 +00:00
|
|
|
first_block = le32_to_cpu(sbi->s_es->s_first_data_block) +
|
|
|
|
(grp * EXT4_BLOCKS_PER_GROUP(sb));
|
|
|
|
last_block = first_block + EXT4_BLOCKS_PER_GROUP(sb) - 1;
|
|
|
|
for (i = 0; i < ngroups; i++) {
|
|
|
|
gdp = ext4_get_group_desc(sb, i, NULL);
|
|
|
|
b = ext4_block_bitmap(sb, gdp);
|
|
|
|
if (b >= first_block && b <= last_block) {
|
|
|
|
ext4_set_bit(EXT4_B2C(sbi, b - first_block), buf);
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
b = ext4_inode_bitmap(sb, gdp);
|
|
|
|
if (b >= first_block && b <= last_block) {
|
|
|
|
ext4_set_bit(EXT4_B2C(sbi, b - first_block), buf);
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
b = ext4_inode_table(sb, gdp);
|
|
|
|
if (b >= first_block && b + sbi->s_itb_per_group <= last_block)
|
|
|
|
for (j = 0; j < sbi->s_itb_per_group; j++, b++) {
|
|
|
|
int c = EXT4_B2C(sbi, b - first_block);
|
|
|
|
ext4_set_bit(c, buf);
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
if (i != grp)
|
|
|
|
continue;
|
|
|
|
s = 0;
|
|
|
|
if (ext4_bg_has_super(sb, grp)) {
|
|
|
|
ext4_set_bit(s++, buf);
|
|
|
|
count++;
|
|
|
|
}
|
2016-11-18 18:37:47 +00:00
|
|
|
j = ext4_bg_num_gdb(sb, grp);
|
|
|
|
if (s + j > EXT4_BLOCKS_PER_GROUP(sb)) {
|
|
|
|
ext4_error(sb, "Invalid number of block group "
|
|
|
|
"descriptor blocks: %d", j);
|
|
|
|
j = EXT4_BLOCKS_PER_GROUP(sb) - s;
|
2012-07-09 20:27:05 +00:00
|
|
|
}
|
2016-11-18 18:37:47 +00:00
|
|
|
count += j;
|
|
|
|
for (; j > 0; j--)
|
|
|
|
ext4_set_bit(EXT4_B2C(sbi, s++), buf);
|
2012-07-09 20:27:05 +00:00
|
|
|
}
|
|
|
|
if (!count)
|
|
|
|
return 0;
|
|
|
|
return EXT4_CLUSTERS_PER_GROUP(sb) -
|
|
|
|
ext4_count_free(buf, EXT4_CLUSTERS_PER_GROUP(sb) / 8);
|
|
|
|
}
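The bigalloc branch of count_overhead() relies on a bitmap so that several metadata blocks falling into the same cluster are counted once. A minimal userspace sketch of that idea follows. The names count_overhead_clusters() and SKETCH_MAX_CLUSTERS are hypothetical, block numbers are taken relative to the start of the group, and the sketch counts set bits directly where the kernel subtracts the free count from EXT4_CLUSTERS_PER_GROUP().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_CLUSTERS 64  /* hypothetical clusters per group */

/*
 * Count how many distinct clusters are touched by the given metadata
 * blocks. cluster_bits is log2(blocks per cluster).
 */
unsigned count_overhead_clusters(const unsigned *meta_blocks, size_t n,
                                 unsigned cluster_bits)
{
    unsigned char bitmap[SKETCH_MAX_CLUSTERS / 8];
    unsigned used = 0;
    unsigned c;
    size_t i;

    memset(bitmap, 0, sizeof(bitmap));

    /* Mark the cluster of every metadata block; duplicates collapse. */
    for (i = 0; i < n; i++) {
        c = meta_blocks[i] >> cluster_bits;
        assert(c < SKETCH_MAX_CLUSTERS);
        bitmap[c / 8] |= (unsigned char)(1u << (c % 8));
    }

    /* Count the marked clusters. */
    for (c = 0; c < SKETCH_MAX_CLUSTERS; c++)
        if (bitmap[c / 8] & (1u << (c % 8)))
            used++;

    return used;
}
```

With 16 blocks per cluster (cluster_bits = 4), blocks 0, 1, 2, 17, 18 and 33 occupy only three clusters, which is why a plain per-block count would overestimate the overhead.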
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the overhead and stash it in sbi->s_overhead
|
|
|
|
*/
|
|
|
|
int ext4_calculate_overhead(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
struct ext4_super_block *es = sbi->s_es;
|
2016-09-30 06:08:49 +00:00
|
|
|
struct inode *j_inode;
|
|
|
|
unsigned int j_blocks, j_inum = le32_to_cpu(es->s_journal_inum);
|
2012-07-09 20:27:05 +00:00
|
|
|
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
|
|
|
|
ext4_fsblk_t overhead = 0;
|
2014-11-25 18:08:04 +00:00
|
|
|
char *buf = (char *) get_zeroed_page(GFP_NOFS);
|
2012-07-09 20:27:05 +00:00
|
|
|
|
|
|
|
if (!buf)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the overhead (FS structures). This is constant
|
|
|
|
* for a given filesystem unless the number of block groups
|
|
|
|
* changes so we cache the previous value until it does.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All of the blocks before first_data_block are overhead
|
|
|
|
*/
|
|
|
|
overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the overhead found in each block group
|
|
|
|
*/
|
|
|
|
for (i = 0; i < ngroups; i++) {
|
|
|
|
int blks;
|
|
|
|
|
|
|
|
blks = count_overhead(sb, i, buf);
|
|
|
|
overhead += blks;
|
|
|
|
if (blks)
|
|
|
|
memset(buf, 0, PAGE_SIZE);
|
|
|
|
cond_resched();
|
|
|
|
}
|
2016-09-30 06:08:49 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the internal journal blocks whether the journal has been
|
|
|
|
* loaded or not
|
|
|
|
*/
|
2020-09-24 03:03:42 +00:00
|
|
|
if (sbi->s_journal && !sbi->s_journal_bdev)
|
2020-11-06 03:58:54 +00:00
|
|
|
overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_total_len);
|
2020-03-16 09:30:38 +00:00
|
|
|
else if (ext4_has_feature_journal(sb) && !sbi->s_journal && j_inum) {
|
|
|
|
/* j_inum for internal journal is non-zero */
|
2016-09-30 06:08:49 +00:00
|
|
|
j_inode = ext4_get_journal_inode(sb, j_inum);
|
|
|
|
if (j_inode) {
|
|
|
|
j_blocks = j_inode->i_size >> sb->s_blocksize_bits;
|
|
|
|
overhead += EXT4_NUM_B2C(sbi, j_blocks);
|
|
|
|
iput(j_inode);
|
|
|
|
} else {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't get journal size");
|
|
|
|
}
|
|
|
|
}
|
2012-07-09 20:27:05 +00:00
|
|
|
sbi->s_overhead = overhead;
|
|
|
|
smp_wmb();
|
|
|
|
free_page((unsigned long) buf);
|
|
|
|
return 0;
|
|
|
|
}
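ext4_calculate_overhead() converts the journal length from blocks to clusters with a round-up conversion, so a partially used cluster still counts as overhead. The sketch below shows that arithmetic in userspace C. num_b2c() is a hypothetical stand-in written for illustration and takes the cluster shift directly instead of an ext4_sb_info.

```c
#include <assert.h>

/*
 * Convert a block count to a cluster count, rounding up.
 * cluster_bits is log2(blocks per cluster).
 */
unsigned long long num_b2c(unsigned long long blocks, unsigned cluster_bits)
{
    unsigned long long ratio;

    assert(cluster_bits < 64);
    ratio = 1ULL << cluster_bits;

    /* Adding ratio - 1 before shifting rounds any remainder up. */
    return (blocks + ratio - 1) >> cluster_bits;
}
```

For example, a 32768-block journal with 16 blocks per cluster occupies 2048 clusters, while 33 blocks occupy 3.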
|
|
|
|
|
2015-09-23 16:44:17 +00:00
|
|
|
static void ext4_set_resv_clusters(struct super_block *sb)
|
2013-04-10 02:11:22 +00:00
|
|
|
{
|
|
|
|
ext4_fsblk_t resv_clusters;
|
2015-09-23 16:44:17 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2013-04-10 02:11:22 +00:00
|
|
|
|
2013-12-09 02:11:59 +00:00
|
|
|
/*
|
|
|
|
* There's no need to reserve anything when we aren't using extents.
|
|
|
|
* The space estimates are exact, there are no unwritten extents,
|
|
|
|
* hole punching doesn't need new metadata... This is needed especially
|
|
|
|
* to keep ext2/3 backward compatibility.
|
|
|
|
*/
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_extents(sb))
|
2015-09-23 16:44:17 +00:00
|
|
|
return;
|
2013-04-10 02:11:22 +00:00
|
|
|
/*
|
|
|
|
* By default we reserve 2% or 4096 clusters, whichever is smaller.
|
|
|
|
* This should cover the situations where we can not afford to run
|
|
|
|
* out of space like for example punch hole, or converting
|
2014-04-21 03:45:47 +00:00
|
|
|
* unwritten extents in delalloc path. In most cases such
|
2013-04-10 02:11:22 +00:00
|
|
|
* allocation would require 1, or 2 blocks, higher numbers are
|
|
|
|
* very rare.
|
|
|
|
*/
|
2015-09-23 16:44:17 +00:00
|
|
|
resv_clusters = (ext4_blocks_count(sbi->s_es) >>
|
|
|
|
sbi->s_cluster_bits);
|
2013-04-10 02:11:22 +00:00
|
|
|
|
|
|
|
do_div(resv_clusters, 50);
|
|
|
|
resv_clusters = min_t(ext4_fsblk_t, resv_clusters, 4096);
|
|
|
|
|
2015-09-23 16:44:17 +00:00
|
|
|
atomic64_set(&sbi->s_resv_clusters, resv_clusters);
|
2013-04-10 02:11:22 +00:00
|
|
|
}
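The reservation rule in ext4_set_resv_clusters() (2% of the clusters, capped at 4096) is easy to check with a standalone calculation. The function below is a userspace sketch with a hypothetical name and plain integer types. It mirrors the arithmetic only and leaves out the extents feature check and the atomic store.

```c
#include <assert.h>

/*
 * Return the number of clusters to reserve for a filesystem with
 * blocks_count blocks and 2^cluster_bits blocks per cluster.
 */
unsigned long long resv_clusters(unsigned long long blocks_count,
                                 unsigned cluster_bits)
{
    unsigned long long resv;

    assert(cluster_bits < 64);

    /* 2% of all clusters ... */
    resv = (blocks_count >> cluster_bits) / 50;

    /* ... but never more than 4096 clusters. */
    return resv < 4096 ? resv : 4096;
}
```

A filesystem with 100000 single-block clusters reserves 2000 clusters, whereas one with 1000000 clusters hits the 4096 cap.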
|
|
|
|
|
2020-10-22 03:21:00 +00:00
|
|
|
static const char *ext4_quota_mode(struct super_block *sb)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
if (!ext4_quota_capable(sb))
|
|
|
|
return "none";
|
|
|
|
|
|
|
|
if (EXT4_SB(sb)->s_journal && ext4_is_quota_journalled(sb))
|
|
|
|
return "journalled";
|
|
|
|
else
|
|
|
|
return "writeback";
|
|
|
|
#else
|
|
|
|
return "disabled";
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2021-08-16 09:57:04 +00:00
|
|
|
static void ext4_setup_csum_trigger(struct super_block *sb,
|
|
|
|
enum ext4_journal_trigger_type type,
|
|
|
|
void (*trigger)(
|
|
|
|
struct jbd2_buffer_trigger_type *type,
|
|
|
|
struct buffer_head *bh,
|
|
|
|
void *mapped_data,
|
|
|
|
size_t size))
|
|
|
|
{
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
|
|
|
|
sbi->s_journal_triggers[type].sb = sb;
|
|
|
|
sbi->s_journal_triggers[type].tr_triggers.t_frozen = trigger;
|
|
|
|
}
|
|
|
|
|
2008-07-26 20:15:44 +00:00
|
|
|
static int ext4_fill_super(struct super_block *sb, void *data, int silent)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2017-08-24 23:42:48 +00:00
|
|
|
struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
|
2010-05-16 16:00:00 +00:00
|
|
|
char *orig_data = kstrdup(data, GFP_KERNEL);
|
2020-02-15 21:40:37 +00:00
|
|
|
struct buffer_head *bh, **group_desc;
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_super_block *es = NULL;
|
2016-11-18 18:24:26 +00:00
|
|
|
struct ext4_sb_info *sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
|
2020-02-19 03:08:51 +00:00
|
|
|
struct flex_groups **flex_groups;
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_fsblk_t block;
|
|
|
|
ext4_fsblk_t sb_block = get_sb_block(&data);
|
2006-10-11 08:21:20 +00:00
|
|
|
ext4_fsblk_t logical_sb_block;
|
2006-10-11 08:20:50 +00:00
|
|
|
unsigned long offset = 0;
|
|
|
|
unsigned long def_mount_opts;
|
|
|
|
struct inode *root;
|
2009-01-07 05:06:22 +00:00
|
|
|
const char *descr;
|
2010-07-27 15:56:07 +00:00
|
|
|
int ret = -ENOMEM;
|
2011-09-09 22:34:51 +00:00
|
|
|
int blocksize, clustersize;
|
2009-01-06 19:53:26 +00:00
|
|
|
unsigned int db_count;
|
|
|
|
unsigned int i;
|
2020-04-15 07:25:42 +00:00
|
|
|
int needs_recovery, has_huge_files;
|
2006-10-11 08:21:10 +00:00
|
|
|
__u64 blocks_count;
|
2012-11-08 20:16:54 +00:00
|
|
|
int err = 0;
|
ext4: add support for lazy inode table initialization
2010-10-28 01:30:05 +00:00
|
|
|
ext4_group_t first_not_zeroed;
|
2021-04-01 17:21:24 +00:00
|
|
|
struct ext4_parsed_options parsed_opts;
|
|
|
|
|
|
|
|
/* Set defaults for the variables that will be set during parsing */
|
|
|
|
parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
|
|
|
|
parsed_opts.journal_devnum = 0;
|
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fall back to cr 1
immediately.
At cr 1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For cr 1, we maintain an rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at cr 1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fall back to linear search
even in the cr 0 and cr 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at cr 2 too. That's because
at cr 2 we only consider groups where the bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, the following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. The following command was run 3 times
consecutively:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
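The cr 0 lookup described in the commit message above keeps one list of groups per largest-free-order and scans those lists upward from the requested order. The sketch below models only that selection step, under assumptions: each list is reduced to a group count, SKETCH_ORDERS and pick_order_list() are hypothetical names, and the locking and rb-tree logic of the real allocator are left out.

```c
#include <assert.h>

#define SKETCH_ORDERS 14  /* hypothetical number of buddy orders */

/*
 * groups_per_order[o] is the number of groups whose largest free
 * extent has order o. Return the smallest order >= request_order that
 * has at least one group, or -1 to signal a fallback to cr 1.
 */
int pick_order_list(const int groups_per_order[SKETCH_ORDERS],
                    int request_order)
{
    int o;

    assert(request_order >= 0);

    /* The number of lists is fixed, so this scan is bounded. */
    for (o = request_order; o < SKETCH_ORDERS; o++)
        if (groups_per_order[o] > 0)
            return o;

    return -1;
}
```

Because the scan is bounded by the number of orders and not by the number of groups, the cost of finding a candidate list does not grow with filesystem size.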
2021-04-01 17:21:27 +00:00
|
|
|
parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2016-11-18 18:24:26 +00:00
|
|
|
if ((data && !orig_data) || !sbi)
|
|
|
|
goto out_free_base;
|
2009-02-15 23:07:52 +00:00
|
|
|
|
2017-09-05 16:51:23 +00:00
|
|
|
sbi->s_daxdev = dax_dev;
|
2009-02-15 23:07:52 +00:00
|
|
|
sbi->s_blockgroup_lock =
|
|
|
|
kzalloc(sizeof(struct blockgroup_lock), GFP_KERNEL);
|
2016-11-18 18:24:26 +00:00
|
|
|
if (!sbi->s_blockgroup_lock)
|
|
|
|
goto out_free_base;
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
sb->s_fs_info = sbi;
|
2012-05-31 02:56:46 +00:00
|
|
|
sbi->s_sb = sb;
|
2008-10-10 03:53:47 +00:00
|
|
|
sbi->s_inode_readahead_blks = EXT4_DEF_INODE_READAHEAD_BLKS;
|
2007-10-17 06:26:27 +00:00
|
|
|
sbi->s_sb_block = sb_block;
|
2020-11-24 08:36:54 +00:00
|
|
|
sbi->s_sectors_written_start =
|
|
|
|
part_stat_read(sb->s_bdev, sectors[STAT_WRITE]);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2008-09-23 13:18:24 +00:00
|
|
|
/* Cleanup superblock name */
|
2015-06-25 22:02:41 +00:00
|
|
|
strreplace(sb->s_id, '/', '!');
|
2008-09-23 13:18:24 +00:00
|
|
|
|
2012-11-08 20:16:54 +00:00
|
|
|
/* -EINVAL is default */
|
2010-07-27 15:56:07 +00:00
|
|
|
ret = -EINVAL;
|
2006-10-11 08:20:53 +00:00
|
|
|
blocksize = sb_min_blocksize(sb, EXT4_MIN_BLOCK_SIZE);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!blocksize) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "unable to set blocksize");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto out_fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2006-10-11 08:20:53 +00:00
|
|
|
* The ext4 superblock will not be buffer aligned for other than 1kB
|
2006-10-11 08:20:50 +00:00
|
|
|
* block sizes. We need to calculate the offset from buffer start.
|
|
|
|
*/
|
2006-10-11 08:20:53 +00:00
|
|
|
if (blocksize != EXT4_MIN_BLOCK_SIZE) {
|
2006-10-11 08:21:20 +00:00
|
|
|
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
|
|
|
|
offset = do_div(logical_sb_block, blocksize);
|
2006-10-11 08:20:50 +00:00
|
|
|
} else {
|
2006-10-11 08:21:20 +00:00
|
|
|
logical_sb_block = sb_block;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2020-09-24 07:33:37 +00:00
|
|
|
bh = ext4_sb_bread_unmovable(sb, logical_sb_block);
|
|
|
|
if (IS_ERR(bh)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "unable to read superblock");
|
2020-09-24 07:33:37 +00:00
|
|
|
ret = PTR_ERR(bh);
|
2006-10-11 08:20:50 +00:00
|
|
|
goto out_fail;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Note: s_es must be initialized as soon as possible because
|
2006-10-11 08:20:53 +00:00
|
|
|
* some ext4 macro-instructions depend on its value
|
2006-10-11 08:20:50 +00:00
|
|
|
*/
|
2012-05-28 21:47:52 +00:00
|
|
|
es = (struct ext4_super_block *) (bh->b_data + offset);
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_es = es;
|
|
|
|
sb->s_magic = le16_to_cpu(es->s_magic);
|
2006-10-11 08:20:53 +00:00
|
|
|
if (sb->s_magic != EXT4_SUPER_MAGIC)
|
|
|
|
goto cantfind_ext4;
|
2009-03-01 00:39:58 +00:00
|
|
|
sbi->s_kbytes_written = le64_to_cpu(es->s_kbytes_written);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2012-04-29 22:45:10 +00:00
|
|
|
/* Warn if metadata_csum and gdt_csum are both set. */
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_metadata_csum(sb) &&
|
|
|
|
ext4_has_feature_gdt_csum(sb))
|
2015-01-02 20:31:14 +00:00
|
|
|
ext4_warning(sb, "metadata_csum and uninit_bg are "
|
2012-04-29 22:45:10 +00:00
|
|
|
"redundant flags; please run fsck.");
|
|
|
|
|
2012-04-29 22:25:10 +00:00
|
|
|
/* Check for a known checksum algorithm */
|
|
|
|
if (!ext4_verify_csum_type(sb, es)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "VFS: Found ext4 filesystem with "
|
|
|
|
"unknown checksum algorithm.");
|
|
|
|
silent = 1;
|
|
|
|
goto cantfind_ext4;
|
|
|
|
}
|
|
|
|
|
2012-04-29 22:27:10 +00:00
|
|
|
/* Load the checksum driver */
|
2018-03-30 02:10:31 +00:00
|
|
|
sbi->s_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
|
|
|
|
if (IS_ERR(sbi->s_chksum_driver)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Cannot load crc32c driver.");
|
|
|
|
ret = PTR_ERR(sbi->s_chksum_driver);
|
|
|
|
sbi->s_chksum_driver = NULL;
|
|
|
|
goto failed_mount;
|
2012-04-29 22:27:10 +00:00
|
|
|
}
|
|
|
|
|
2012-04-29 22:29:10 +00:00
|
|
|
/* Check superblock checksum */
|
|
|
|
if (!ext4_superblock_csum_verify(sb, es)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "VFS: Found ext4 filesystem with "
|
|
|
|
"invalid superblock checksum. Run e2fsck?");
|
|
|
|
silent = 1;
|
2015-10-17 20:16:04 +00:00
|
|
|
ret = -EFSBADCRC;
|
2012-04-29 22:29:10 +00:00
|
|
|
goto cantfind_ext4;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Precompute checksum seed for all metadata */
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_csum_seed(sb))
|
2015-10-17 20:16:02 +00:00
|
|
|
sbi->s_csum_seed = le32_to_cpu(es->s_checksum_seed);
|
2017-06-22 15:44:55 +00:00
|
|
|
else if (ext4_has_metadata_csum(sb) || ext4_has_feature_ea_inode(sb))
|
2012-04-29 22:29:10 +00:00
|
|
|
sbi->s_csum_seed = ext4_chksum(sbi, ~0, es->s_uuid,
|
|
|
|
sizeof(es->s_uuid));
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/* Set defaults before we parse the mount options */
|
|
|
|
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, INIT_INODE_TABLE);
|
2006-10-11 08:20:53 +00:00
|
|
|
if (def_mount_opts & EXT4_DEFM_DEBUG)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, DEBUG);
|
2012-03-02 05:03:21 +00:00
|
|
|
if (def_mount_opts & EXT4_DEFM_BSDGROUPS)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, GRPID);
|
2006-10-11 08:20:53 +00:00
|
|
|
if (def_mount_opts & EXT4_DEFM_UID16)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, NO_UID32);
|
2011-02-23 22:51:51 +00:00
|
|
|
/* xattr user namespace & acls are now defaulted on */
|
|
|
|
set_opt(sb, XATTR_USER);
|
2008-10-11 00:02:48 +00:00
|
|
|
#ifdef CONFIG_EXT4_FS_POSIX_ACL
|
2011-02-23 22:51:51 +00:00
|
|
|
set_opt(sb, POSIX_ACL);
|
2007-02-10 09:46:13 +00:00
|
|
|
#endif
|
2020-10-15 20:37:54 +00:00
|
|
|
if (ext4_has_feature_fast_commit(sb))
|
|
|
|
set_opt2(sb, JOURNAL_FAST_COMMIT);
|
2014-10-30 14:53:16 +00:00
|
|
|
/* don't forget to enable journal_csum when metadata_csum is enabled. */
|
|
|
|
if (ext4_has_metadata_csum(sb))
|
|
|
|
set_opt(sb, JOURNAL_CHECKSUM);
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_DATA)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, JOURNAL_DATA);
|
2006-10-11 08:20:53 +00:00
|
|
|
else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_ORDERED)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, ORDERED_DATA);
|
2006-10-11 08:20:53 +00:00
|
|
|
else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, WRITEBACK_DATA);
|
2006-10-11 08:20:53 +00:00
|
|
|
|
|
|
|
if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, ERRORS_PANIC);
|
2008-01-29 04:58:26 +00:00
|
|
|
else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, ERRORS_CONT);
|
2008-01-29 04:58:26 +00:00
|
|
|
else
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, ERRORS_RO);
|
2014-09-02 01:34:09 +00:00
|
|
|
/* block_validity enabled by default; disable with noblock_validity */
|
|
|
|
set_opt(sb, BLOCK_VALIDITY);
|
2010-08-02 03:14:20 +00:00
|
|
|
if (def_mount_opts & EXT4_DEFM_DISCARD)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, DISCARD);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2012-02-07 23:41:49 +00:00
|
|
|
sbi->s_resuid = make_kuid(&init_user_ns, le16_to_cpu(es->s_def_resuid));
|
|
|
|
sbi->s_resgid = make_kgid(&init_user_ns, le16_to_cpu(es->s_def_resgid));
|
2009-01-04 01:27:38 +00:00
|
|
|
sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
|
|
|
|
sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
|
|
|
|
sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2010-08-02 03:14:20 +00:00
|
|
|
if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, BARRIER);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2008-07-11 23:27:31 +00:00
|
|
|
/*
|
|
|
|
* enable delayed allocation by default
|
|
|
|
* Use -o nodelalloc to turn it off
|
|
|
|
*/
|
2012-09-18 02:54:36 +00:00
|
|
|
if (!IS_EXT3_SB(sb) && !IS_EXT2_SB(sb) &&
|
2010-08-02 03:14:20 +00:00
|
|
|
((def_mount_opts & EXT4_DEFM_NODELALLOC) == 0))
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, DELALLOC);
|
2008-07-11 23:27:31 +00:00
|
|
|
|
2011-05-20 17:55:16 +00:00
|
|
|
/*
|
|
|
|
* set default s_li_wait_mult for lazyinit, for the case there is
|
|
|
|
* no mount option specified.
|
|
|
|
*/
|
|
|
|
sbi->s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT;
|
|
|
|
|
2020-12-09 20:59:11 +00:00
|
|
|
if (le32_to_cpu(es->s_log_block_size) >
|
|
|
|
(EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
|
2020-02-06 22:35:01 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
2020-12-09 20:59:11 +00:00
|
|
|
"Invalid log block size: %u",
|
|
|
|
le32_to_cpu(es->s_log_block_size));
|
2020-02-06 22:35:01 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
2020-12-09 20:59:11 +00:00
|
|
|
if (le32_to_cpu(es->s_log_cluster_size) >
|
|
|
|
(EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
|
2020-02-06 22:35:01 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
2020-12-09 20:59:11 +00:00
|
|
|
"Invalid log cluster size: %u",
|
|
|
|
le32_to_cpu(es->s_log_cluster_size));
|
2020-02-06 22:35:01 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2020-12-09 20:59:11 +00:00
|
|
|
blocksize = EXT4_MIN_BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
|
|
|
|
|
|
|
|
if (blocksize == PAGE_SIZE)
|
|
|
|
set_opt(sb, DIOREAD_NOLOCK);
|
2020-02-06 22:35:01 +00:00
|
|
|
|
2019-12-15 06:09:03 +00:00
|
|
|
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
|
|
|
|
sbi->s_inode_size = EXT4_GOOD_OLD_INODE_SIZE;
|
|
|
|
sbi->s_first_ino = EXT4_GOOD_OLD_FIRST_INO;
|
|
|
|
} else {
|
|
|
|
sbi->s_inode_size = le16_to_cpu(es->s_inode_size);
|
|
|
|
sbi->s_first_ino = le32_to_cpu(es->s_first_ino);
|
|
|
|
if (sbi->s_first_ino < EXT4_GOOD_OLD_FIRST_INO) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "invalid first ino: %u",
|
|
|
|
sbi->s_first_ino);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
if ((sbi->s_inode_size < EXT4_GOOD_OLD_INODE_SIZE) ||
|
|
|
|
(!is_power_of_2(sbi->s_inode_size)) ||
|
|
|
|
(sbi->s_inode_size > blocksize)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"unsupported inode size: %d",
|
|
|
|
sbi->s_inode_size);
|
2020-02-06 22:35:01 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "blocksize: %d", blocksize);
|
2019-12-15 06:09:03 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* i_atime_extra is the last extra field available for
|
|
|
|
* [acm]times in struct ext4_inode. Checking for that
|
|
|
|
* field should suffice to ensure we have extra space
|
|
|
|
* for all three.
|
|
|
|
*/
|
|
|
|
if (sbi->s_inode_size >= offsetof(struct ext4_inode, i_atime_extra) +
|
|
|
|
sizeof(((struct ext4_inode *)0)->i_atime_extra)) {
|
|
|
|
sb->s_time_gran = 1;
|
|
|
|
sb->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
|
|
|
|
} else {
|
|
|
|
sb->s_time_gran = NSEC_PER_SEC;
|
|
|
|
sb->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
|
|
|
|
}
|
|
|
|
sb->s_time_min = EXT4_TIMESTAMP_MIN;
|
|
|
|
}
|
|
|
|
if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) {
|
|
|
|
sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
|
|
|
|
EXT4_GOOD_OLD_INODE_SIZE;
|
|
|
|
if (ext4_has_feature_extra_isize(sb)) {
|
|
|
|
unsigned v, max = (sbi->s_inode_size -
|
|
|
|
EXT4_GOOD_OLD_INODE_SIZE);
|
|
|
|
|
|
|
|
v = le16_to_cpu(es->s_want_extra_isize);
|
|
|
|
if (v > max) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"bad s_want_extra_isize: %d", v);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
if (sbi->s_want_extra_isize < v)
|
|
|
|
sbi->s_want_extra_isize = v;
|
|
|
|
|
|
|
|
v = le16_to_cpu(es->s_min_extra_isize);
|
|
|
|
if (v > max) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"bad s_min_extra_isize: %d", v);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
if (sbi->s_want_extra_isize < v)
|
|
|
|
sbi->s_want_extra_isize = v;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-11-18 18:24:26 +00:00
|
|
|
if (sbi->s_es->s_mount_opts[0]) {
|
|
|
|
char *s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
|
|
|
|
sizeof(sbi->s_es->s_mount_opts),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!s_mount_opts)
|
|
|
|
goto failed_mount;
|
2021-04-01 17:21:24 +00:00
|
|
|
if (!parse_options(s_mount_opts, sb, &parsed_opts, 0)) {
|
2016-11-18 18:24:26 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"failed to parse options in superblock: %s",
|
|
|
|
s_mount_opts);
|
|
|
|
}
|
|
|
|
kfree(s_mount_opts);
|
2010-08-02 03:14:20 +00:00
|
|
|
}
|
2012-03-05 00:27:31 +00:00
|
|
|
sbi->s_def_mount_opt = sbi->s_mount_opt;
|
2021-04-01 17:21:24 +00:00
|
|
|
if (!parse_options((char *) data, sb, &parsed_opts, 0))
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
|
2019-04-25 18:05:42 +00:00
|
|
|
#ifdef CONFIG_UNICODE
|
2020-10-28 05:08:20 +00:00
|
|
|
if (ext4_has_feature_casefold(sb) && !sb->s_encoding) {
|
2019-04-25 18:05:42 +00:00
|
|
|
const struct ext4_sb_encodings *encoding_info;
|
|
|
|
struct unicode_map *encoding;
|
|
|
|
__u16 encoding_flags;
|
|
|
|
|
|
|
|
if (ext4_sb_read_encoding(es, &encoding_info,
|
|
|
|
&encoding_flags)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Encoding requested by superblock is unknown");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
|
|
|
encoding = utf8_load(encoding_info->version);
|
|
|
|
if (IS_ERR(encoding)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"can't mount with superblock charset: %s-%s "
|
|
|
|
"not supported by the kernel. flags: 0x%x.",
|
|
|
|
encoding_info->name, encoding_info->version,
|
|
|
|
encoding_flags);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
ext4_msg(sb, KERN_INFO, "Using encoding defined by superblock: "
|
|
|
|
"%s-%s with flags 0x%hx", encoding_info->name,
|
|
|
|
encoding_info->version?:"\b", encoding_flags);
|
|
|
|
|
2020-10-28 05:08:20 +00:00
|
|
|
sb->s_encoding = encoding;
|
|
|
|
sb->s_encoding_flags = encoding_flags;
|
2019-04-25 18:05:42 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-09-03 22:22:38 +00:00
|
|
|
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
|
2020-11-06 03:59:07 +00:00
|
|
|
printk_once(KERN_WARNING "EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support!\n");
|
2020-04-13 04:24:22 +00:00
|
|
|
/* can't mount with both data=journal and dioread_nolock. */
|
2020-01-23 17:23:17 +00:00
|
|
|
clear_opt(sb, DIOREAD_NOLOCK);
|
2020-11-06 03:59:07 +00:00
|
|
|
clear_opt2(sb, JOURNAL_FAST_COMMIT);
|
2011-09-03 22:22:38 +00:00
|
|
|
if (test_opt2(sb, EXPLICIT_DELALLOC)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"both data=journal and delalloc");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
2020-05-28 14:59:57 +00:00
|
|
|
if (test_opt(sb, DAX_ALWAYS)) {
|
2015-02-16 23:59:38 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"both data=journal and dax");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
ext4: do not perform data journaling when data is encrypted
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.
Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to set up the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.
Solution:
If data=journal has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.
Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.
Fixes: 4461471107b7
Signed-off-by: Sergey Karamov <skaramov@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
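The fallback described in the commit message above can be sketched as a small decision function. This is a simplified user-space illustration, not the kernel's implementation: `enum data_mode`, `effective_data_mode` and `effective_data_mode_examples` are hypothetical names, and the sketch assumes that data=ordered is the mode used whenever data journaling is not requested.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the per-file data modes. */
enum data_mode {
	DATA_MODE_ORDERED,
	DATA_MODE_JOURNAL,
};

/*
 * Pick the effective data mode for one file.
 * Assumption: ordered mode is used when data journaling is not requested.
 */
enum data_mode effective_data_mode(bool mount_data_journal,
				   bool file_journal_flag,
				   bool is_regular, bool is_encrypted)
{
	bool wants_journal = mount_data_journal || file_journal_flag;

	/* Encrypted regular files never get data journaling. */
	if (wants_journal && is_regular && is_encrypted)
		return DATA_MODE_ORDERED;

	return wants_journal ? DATA_MODE_JOURNAL : DATA_MODE_ORDERED;
}

/* Small worked examples of the rule above. */
void effective_data_mode_examples(void)
{
	/* data=journal mount, encrypted regular file: falls back to ordered */
	assert(effective_data_mode(true, false, true, true) == DATA_MODE_ORDERED);
	/* data=journal mount, unencrypted regular file: data is journaled */
	assert(effective_data_mode(true, false, true, false) == DATA_MODE_JOURNAL);
}
```

The point of the sketch is that the conflict is resolved per file rather than by refusing the mount, which matches the warning printed a few lines below when the encrypt feature is present.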
2016-12-10 22:54:58 +00:00
|
|
|
if (ext4_has_feature_encrypt(sb)) {
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"encrypted files will use data=ordered "
|
|
|
|
"instead of data journaling mode");
|
|
|
|
}
|
2011-09-03 22:22:38 +00:00
|
|
|
if (test_opt(sb, DELALLOC))
|
|
|
|
clear_opt(sb, DELALLOC);
|
2015-07-22 03:51:26 +00:00
|
|
|
} else {
|
|
|
|
sb->s_iflags |= SB_I_CGROUPWB;
|
2011-09-03 22:22:38 +00:00
|
|
|
}
|
|
|
|
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
|
|
|
|
(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
|
2015-10-17 20:18:43 +00:00
|
|
|
(ext4_has_compat_features(sb) ||
|
|
|
|
ext4_has_ro_compat_features(sb) ||
|
|
|
|
ext4_has_incompat_features(sb)))
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"feature flags set on rev 0 fs, "
|
|
|
|
"running e2fsck is recommended");
|
2008-02-10 06:11:44 +00:00
|
|
|
|
2014-03-24 18:09:06 +00:00
|
|
|
if (es->s_creator_os == cpu_to_le32(EXT4_OS_HURD)) {
|
|
|
|
set_opt2(sb, HURD_COMPAT);
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_64bit(sb)) {
|
2014-03-24 18:09:06 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"The Hurd can't support 64-bit file systems");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
2017-06-22 15:44:55 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* ea_inode feature uses l_i_version field which is not
|
|
|
|
* available in HURD_COMPAT mode.
|
|
|
|
*/
|
|
|
|
if (ext4_has_feature_ea_inode(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"ea_inode feature is not supported for Hurd");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
2014-03-24 18:09:06 +00:00
|
|
|
}
|
|
|
|
|
2011-04-18 21:29:14 +00:00
|
|
|
if (IS_EXT2_SB(sb)) {
|
|
|
|
if (ext2_feature_set_ok(sb))
|
|
|
|
ext4_msg(sb, KERN_INFO, "mounting ext2 file system "
|
|
|
|
"using the ext4 subsystem");
|
|
|
|
else {
|
2018-03-22 15:59:00 +00:00
|
|
|
/*
|
|
|
|
* If we're probing, be silent if this looks like
|
|
|
|
* it's actually an ext[34] filesystem.
|
|
|
|
*/
|
|
|
|
if (silent && ext4_feature_set_ok(sb, sb_rdonly(sb)))
|
|
|
|
goto failed_mount;
|
2011-04-18 21:29:14 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "couldn't mount as ext2 due "
|
|
|
|
"to feature incompatibilities");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IS_EXT3_SB(sb)) {
|
|
|
|
if (ext3_feature_set_ok(sb))
|
|
|
|
ext4_msg(sb, KERN_INFO, "mounting ext3 file system "
|
|
|
|
"using the ext4 subsystem");
|
|
|
|
else {
|
2018-03-22 15:59:00 +00:00
|
|
|
/*
|
|
|
|
* If we're probing, be silent if this looks like
|
|
|
|
* it's actually an ext4 filesystem.
|
|
|
|
*/
|
|
|
|
if (silent && ext4_feature_set_ok(sb, sb_rdonly(sb)))
|
|
|
|
goto failed_mount;
|
2011-04-18 21:29:14 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "couldn't mount as ext3 due "
|
|
|
|
"to feature incompatibilities");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/*
|
|
|
|
* Check feature flags regardless of the revision level, since we
|
|
|
|
* previously didn't change the revision level when setting the flags,
|
|
|
|
* so there is a chance incompat flags are set on a rev 0 filesystem.
|
|
|
|
*/
|
2017-07-17 07:45:34 +00:00
|
|
|
if (!ext4_feature_set_ok(sb, (sb_rdonly(sb))))
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
2009-08-18 04:20:23 +00:00
|
|
|
|
2016-07-06 00:01:52 +00:00
|
|
|
if (le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) > (blocksize / 4)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Number of reserved GDT blocks insanely large: %d",
|
|
|
|
le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks));
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2020-05-28 14:59:58 +00:00
|
|
|
if (bdev_dax_supported(sb->s_bdev, blocksize))
|
|
|
|
set_bit(EXT4_FLAGS_BDEV_IS_DAX, &sbi->s_ext4_flags);
|
|
|
|
|
2020-05-28 14:59:57 +00:00
|
|
|
if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) {
|
2017-10-12 15:52:34 +00:00
|
|
|
if (ext4_has_feature_inline_data(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
|
|
|
|
" that may contain inline data");
|
2018-12-04 05:46:39 +00:00
|
|
|
goto failed_mount;
|
2017-10-12 15:52:34 +00:00
|
|
|
}
|
2020-05-28 14:59:58 +00:00
|
|
|
if (!test_bit(EXT4_FLAGS_BDEV_IS_DAX, &sbi->s_ext4_flags)) {
|
2017-12-22 01:04:07 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
2018-12-04 05:46:39 +00:00
|
|
|
"DAX unsupported by block device.");
|
|
|
|
goto failed_mount;
|
2017-12-22 01:04:07 +00:00
|
|
|
}
|
2015-02-16 23:59:38 +00:00
|
|
|
}
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_encrypt(sb) && es->s_encryption_level) {
|
2015-04-16 05:56:00 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "Unsupported encryption level %d",
|
|
|
|
es->s_encryption_level);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
if (sb->s_blocksize != blocksize) {
|
2021-05-21 07:55:33 +00:00
|
|
|
/*
|
|
|
|
* bh must be released before kill_bdev(), otherwise neither
|
|
|
|
* it nor its page will be freed. kill_bdev()
|
|
|
|
* is called by sb_set_blocksize().
|
|
|
|
*/
|
|
|
|
brelse(bh);
|
2008-01-29 04:58:27 +00:00
|
|
|
/* Validate the filesystem blocksize */
|
|
|
|
if (!sb_set_blocksize(sb, blocksize)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "bad block size %d",
|
2008-01-29 04:58:27 +00:00
|
|
|
blocksize);
|
2021-05-21 07:55:33 +00:00
|
|
|
bh = NULL;
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:20 +00:00
|
|
|
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
|
|
|
|
offset = do_div(logical_sb_block, blocksize);
|
2020-09-24 07:33:37 +00:00
|
|
|
bh = ext4_sb_bread_unmovable(sb, logical_sb_block);
|
|
|
|
if (IS_ERR(bh)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Can't read superblock on 2nd try");
|
2020-09-24 07:33:37 +00:00
|
|
|
ret = PTR_ERR(bh);
|
|
|
|
bh = NULL;
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
2012-05-28 21:47:52 +00:00
|
|
|
es = (struct ext4_super_block *)(bh->b_data + offset);
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_es = es;
|
2006-10-11 08:20:53 +00:00
|
|
|
if (es->s_magic != cpu_to_le16(EXT4_SUPER_MAGIC)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Magic mismatch, very weird!");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
has_huge_files = ext4_has_feature_huge_file(sb);
|
2008-10-17 02:50:48 +00:00
|
|
|
sbi->s_bitmap_maxbytes = ext4_max_bitmap_size(sb->s_blocksize_bits,
|
|
|
|
has_huge_files);
|
|
|
|
sb->s_maxbytes = ext4_max_size(sb->s_blocksize_bits, has_huge_files);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-10-11 08:21:14 +00:00
|
|
|
sbi->s_desc_size = le16_to_cpu(es->s_desc_size);
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_64bit(sb)) {
|
2006-10-11 08:21:15 +00:00
|
|
|
if (sbi->s_desc_size < EXT4_MIN_DESC_SIZE_64BIT ||
|
2006-10-11 08:21:14 +00:00
|
|
|
sbi->s_desc_size > EXT4_MAX_DESC_SIZE ||
|
2007-10-17 06:27:14 +00:00
|
|
|
!is_power_of_2(sbi->s_desc_size)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"unsupported descriptor size %lu",
|
2006-10-11 08:21:14 +00:00
|
|
|
sbi->s_desc_size);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
sbi->s_desc_size = EXT4_MIN_DESC_SIZE;
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
|
|
|
|
sbi->s_inodes_per_group = le32_to_cpu(es->s_inodes_per_group);
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
sbi->s_inodes_per_block = blocksize / EXT4_INODE_SIZE(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (sbi->s_inodes_per_block == 0)
|
2006-10-11 08:20:53 +00:00
|
|
|
goto cantfind_ext4;
|
2016-11-18 18:28:30 +00:00
|
|
|
if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
|
|
|
|
sbi->s_inodes_per_group > blocksize * 8) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
|
2020-03-28 22:34:15 +00:00
|
|
|
sbi->s_inodes_per_group);
|
2016-11-18 18:28:30 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_itb_per_group = sbi->s_inodes_per_group /
|
|
|
|
sbi->s_inodes_per_block;
|
2006-10-11 08:21:14 +00:00
|
|
|
sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_sbh = bh;
|
|
|
|
sbi->s_mount_state = le16_to_cpu(es->s_state);
|
2007-10-17 06:26:25 +00:00
|
|
|
sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
|
|
|
|
sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2008-07-26 20:15:44 +00:00
|
|
|
for (i = 0; i < 4; i++)
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
|
|
|
|
sbi->s_def_hash_version = es->s_def_hash_version;
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_dir_index(sb)) {
|
2014-02-12 17:16:04 +00:00
|
|
|
i = le32_to_cpu(es->s_flags);
|
|
|
|
if (i & EXT2_FLAGS_UNSIGNED_HASH)
|
|
|
|
sbi->s_hash_unsigned = 3;
|
|
|
|
else if ((i & EXT2_FLAGS_SIGNED_HASH) == 0) {
|
2008-10-28 17:21:44 +00:00
|
|
|
#ifdef __CHAR_UNSIGNED__
|
2017-07-17 07:45:34 +00:00
|
|
|
if (!sb_rdonly(sb))
|
2014-02-12 17:16:04 +00:00
|
|
|
es->s_flags |=
|
|
|
|
cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH);
|
|
|
|
sbi->s_hash_unsigned = 3;
|
2008-10-28 17:21:44 +00:00
|
|
|
#else
|
2017-07-17 07:45:34 +00:00
|
|
|
if (!sb_rdonly(sb))
|
2014-02-12 17:16:04 +00:00
|
|
|
es->s_flags |=
|
|
|
|
cpu_to_le32(EXT2_FLAGS_SIGNED_HASH);
|
2008-10-28 17:21:44 +00:00
|
|
|
#endif
|
2014-02-12 17:16:04 +00:00
|
|
|
}
|
2008-10-28 17:21:44 +00:00
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2011-09-09 22:34:51 +00:00
|
|
|
/* Handle clustersize */
|
|
|
|
clustersize = BLOCK_SIZE << le32_to_cpu(es->s_log_cluster_size);
|
2020-04-15 07:25:42 +00:00
|
|
|
if (ext4_has_feature_bigalloc(sb)) {
|
2011-09-09 22:34:51 +00:00
|
|
|
if (clustersize < blocksize) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"cluster size (%d) smaller than "
|
|
|
|
"block size (%d)", clustersize, blocksize);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
sbi->s_cluster_bits = le32_to_cpu(es->s_log_cluster_size) -
|
|
|
|
le32_to_cpu(es->s_log_block_size);
|
|
|
|
sbi->s_clusters_per_group =
|
|
|
|
le32_to_cpu(es->s_clusters_per_group);
|
|
|
|
if (sbi->s_clusters_per_group > blocksize * 8) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"#clusters per group too big: %lu",
|
|
|
|
sbi->s_clusters_per_group);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
if (sbi->s_blocks_per_group !=
|
|
|
|
(sbi->s_clusters_per_group * (clustersize / blocksize))) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "blocks per group (%lu) and "
|
|
|
|
"clusters per group (%lu) inconsistent",
|
|
|
|
sbi->s_blocks_per_group,
|
|
|
|
sbi->s_clusters_per_group);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
if (clustersize != blocksize) {
|
2018-06-17 22:11:20 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"fragment/cluster size (%d) != "
|
|
|
|
"block size (%d)", clustersize, blocksize);
|
|
|
|
goto failed_mount;
|
2011-09-09 22:34:51 +00:00
|
|
|
}
|
|
|
|
if (sbi->s_blocks_per_group > blocksize * 8) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"#blocks per group too big: %lu",
|
|
|
|
sbi->s_blocks_per_group);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
sbi->s_clusters_per_group = sbi->s_blocks_per_group;
|
|
|
|
sbi->s_cluster_bits = 0;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2011-09-09 22:34:51 +00:00
|
|
|
sbi->s_cluster_ratio = clustersize / blocksize;
|
|
|
|
|
2013-07-06 03:11:16 +00:00
|
|
|
/* Do we have standard group size of clustersize * 8 blocks ? */
|
|
|
|
if (sbi->s_blocks_per_group == clustersize << 3)
|
|
|
|
set_opt2(sb, STD_GROUP_SIZE);
|
|
|
|
|
2009-08-18 03:48:51 +00:00
|
|
|
/*
|
|
|
|
* Test whether we have more sectors than will fit in sector_t,
|
|
|
|
* and whether the max offset is addressable by the page cache.
|
|
|
|
*/
|
2010-11-19 14:56:44 +00:00
|
|
|
err = generic_check_addressable(sb->s_blocksize_bits,
|
2010-07-22 22:03:41 +00:00
|
|
|
ext4_blocks_count(es));
|
2010-11-19 14:56:44 +00:00
|
|
|
if (err) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "filesystem"
|
2009-08-18 03:48:51 +00:00
|
|
|
" too large to mount safely on this system");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
if (EXT4_BLOCKS_PER_GROUP(sb) == 0)
|
|
|
|
goto cantfind_ext4;
|
ext4: fix oops on corrupted ext4 mount
When mounting an ext4 filesystem with corrupted s_first_data_block, things
can go very wrong and oops.
Because blocks_count in ext4_fill_super is a u64, and we must use do_div,
the calculation of db_count is done differently than on ext4. If
first_data_block is corrupted such that it is larger than ext4_blocks_count,
for example, then the intermediate blocks_count value may go negative,
but sign-extend to a very large value:
    blocks_count = (ext4_blocks_count(es) -
                    le32_to_cpu(es->s_first_data_block) +
                    EXT4_BLOCKS_PER_GROUP(sb) - 1);
This is then assigned to s_groups_count which is an unsigned long:
    sbi->s_groups_count = blocks_count;
This may result in a value of 0xFFFFFFFF which is then used to compute
db_count:
    db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
               EXT4_DESC_PER_BLOCK(sb);
and in this case db_count will wind up as 0 because the addition overflows
32 bits. This in turn causes the kmalloc for group_desc to be of 0 size:
    sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
                                GFP_KERNEL);
and eventually in ext4_check_descriptors, dereferencing
sbi->s_group_desc[desc_block] will result in a NULL pointer dereference.
The simplest test seems to be to sanity check s_first_data_block,
EXT4_BLOCKS_PER_GROUP, and ext4_blocks_count values to be sure
their combination won't result in a bad intermediate value for
blocks_count. We could just check for db_count == 0, but
catching it at the root cause seems like it provides more info.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
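The wrap-around described in the commit message above can be reproduced outside the kernel. The following is a minimal user-space sketch, not code from this file: `db_count_32` and `db_count_64` are hypothetical helper names, `desc_per_block` stands in for EXT4_DESC_PER_BLOCK(sb), and the 32-bit variant assumes a platform where unsigned long is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round-up division carried out entirely in
 * 32-bit unsigned arithmetic. The sum wraps modulo 2^32 when
 * groups_count is close to UINT32_MAX.
 */
uint32_t db_count_32(uint32_t groups_count, uint32_t desc_per_block)
{
	assert(desc_per_block != 0);
	return (groups_count + desc_per_block - 1) / desc_per_block;
}

/*
 * Hypothetical helper: the same computation in 64-bit arithmetic,
 * where the intermediate sum cannot wrap for 32-bit inputs.
 */
uint64_t db_count_64(uint64_t groups_count, uint32_t desc_per_block)
{
	assert(desc_per_block != 0);
	return (groups_count + desc_per_block - 1) / desc_per_block;
}
```

With a group count of 0xFFFFFFFF and 32 descriptors per block, the 32-bit helper yields 0 while the 64-bit helper yields 134217728. A zero result is what leads to the zero-sized allocation mentioned in the commit message, which is why the geometry is sanity-checked before db_count is computed.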
2008-01-29 04:58:27 +00:00
|
|
|
|
2009-04-07 18:07:47 +00:00
|
|
|
/* check blocks count against device size */
|
|
|
|
blocks_count = sb->s_bdev->bd_inode->i_size >> sb->s_blocksize_bits;
|
|
|
|
if (blocks_count && ext4_blocks_count(es) > blocks_count) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING, "bad geometry: block count %llu "
|
|
|
|
"exceeds size of device (%llu blocks)",
|
2009-04-07 18:07:47 +00:00
|
|
|
ext4_blocks_count(es), blocks_count);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2009-06-03 21:59:28 +00:00
|
|
|
/*
|
|
|
|
* It makes no sense for the first data block to be beyond the end
|
|
|
|
* of the filesystem.
|
|
|
|
*/
|
|
|
|
if (le32_to_cpu(es->s_first_data_block) >= ext4_blocks_count(es)) {
|
2011-12-18 21:13:58 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING, "bad geometry: first data "
|
2009-06-04 21:36:36 +00:00
|
|
|
"block %u is beyond end of filesystem (%llu)",
|
|
|
|
le32_to_cpu(es->s_first_data_block),
|
|
|
|
ext4_blocks_count(es));
|
2008-01-29 04:58:27 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
2018-06-17 22:11:20 +00:00
|
|
|
if ((es->s_first_data_block == 0) && (es->s_log_block_size == 0) &&
|
|
|
|
(sbi->s_cluster_ratio == 1)) {
|
|
|
|
ext4_msg(sb, KERN_WARNING, "bad geometry: first data "
|
|
|
|
"block is 0 with a 1k block and cluster size");
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:10 +00:00
|
|
|
blocks_count = (ext4_blocks_count(es) -
|
|
|
|
le32_to_cpu(es->s_first_data_block) +
|
|
|
|
EXT4_BLOCKS_PER_GROUP(sb) - 1);
|
|
|
|
do_div(blocks_count, EXT4_BLOCKS_PER_GROUP(sb));
|
2009-01-06 19:53:26 +00:00
|
|
|
if (blocks_count > ((uint64_t)1<<32) - EXT4_DESC_PER_BLOCK(sb)) {
|
2020-03-28 21:54:01 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING, "groups count too large: %llu "
|
2009-01-06 19:53:26 +00:00
|
|
|
"(block count %llu, first data block %u, "
|
2020-03-28 21:54:01 +00:00
|
|
|
"blocks per group %lu)", blocks_count,
|
2009-01-06 19:53:26 +00:00
|
|
|
ext4_blocks_count(es),
|
|
|
|
le32_to_cpu(es->s_first_data_block),
|
|
|
|
EXT4_BLOCKS_PER_GROUP(sb));
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
2006-10-11 08:21:10 +00:00
|
|
|
sbi->s_groups_count = blocks_count;
|
ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.
This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.
This should address RH Bug 519471, ext4 bitmap allocator
must limit blocks to < 2^32
* ext4_find_goal() is modified to choose a goal < UINT_MAX,
so that our starting point is in an acceptable range.
* ext4_xattr_block_set() is modified such that the goal block
is < UINT_MAX, as above.
* ext4_mb_regular_allocator() is modified so that the group
search does not continue into groups which are too high
* ext4_mb_use_preallocated() has a check that we don't use
preallocated space which is too far out
* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs
No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes. Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.
For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
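The group limit that this commit introduces for indirect-block files can be illustrated with a short user-space sketch. The names `MAX_BLOCK_FILE_PHYS` and `blockfile_groups` are hypothetical; the value 0xFFFFFFFF is an assumption standing in for EXT4_MAX_BLOCK_FILE_PHYS, chosen to match the "< 2^32" limit stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for EXT4_MAX_BLOCK_FILE_PHYS: highest 32-bit block number. */
#define MAX_BLOCK_FILE_PHYS 0xFFFFFFFFu

/*
 * Hypothetical helper: number of block groups that indirect-block
 * (non-extent) files may allocate from, i.e. the smaller of the real
 * group count and the number of groups that fit below the 32-bit
 * block limit.
 */
uint32_t blockfile_groups(uint32_t groups_count, uint32_t blocks_per_group)
{
	uint32_t limit;

	assert(blocks_per_group != 0);
	limit = MAX_BLOCK_FILE_PHYS / blocks_per_group;
	return groups_count < limit ? groups_count : limit;
}
```

For example, with 32768 blocks per group the limit works out to 131071 groups, so a filesystem with 200000 groups would restrict indirect-block files to the first 131071, while a filesystem with 1000 groups is unaffected.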
2009-09-16 18:45:10 +00:00
|
|
|
sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
|
|
|
|
(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));
|
2018-11-07 15:32:53 +00:00
|
|
|
if (((u64)sbi->s_groups_count * sbi->s_inodes_per_group) !=
|
|
|
|
le32_to_cpu(es->s_inodes_count)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "inodes count not valid: %u vs %llu",
|
|
|
|
le32_to_cpu(es->s_inodes_count),
|
|
|
|
((u64)sbi->s_groups_count * sbi->s_inodes_per_group));
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
2006-10-11 08:20:53 +00:00
|
|
|
db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
|
|
|
|
EXT4_DESC_PER_BLOCK(sb);
|
ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. It turns out that the kernel crashed when
calculating fs overhead (ext4_calculate_overhead()). This is because
the image has a very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bits in the bitmap buffer, which is only PAGE_SIZE bytes.
ext4_calculate_overhead():
    buf = get_zeroed_page(GFP_NOFS); <=== PAGE_SIZE buffer
    blks = count_overhead(sb, i, buf);
count_overhead():
    for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
        ext4_set_bit(EXT4_B2C(sbi, s++), buf); <=== buffer overrun
        count++;
    }
This can be reproduced easily for me by this script:
    #!/bin/bash
    rm -f fs.img
    mkdir -p /mnt/ext4
    fallocate -l 16M fs.img
    mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
    debugfs -w -R "ssv first_meta_bg 842150400" fs.img
    mount -o loop fs.img /mnt/ext4
Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.
Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
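The mount-time check that this commit adds can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: `desc_block_count` and `first_meta_bg_ok` are hypothetical helper names, and `desc_per_block` stands in for EXT4_DESC_PER_BLOCK(sb).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of group descriptor blocks needed for
 * groups_count groups, rounded up. Computed in 64 bits so the
 * intermediate sum cannot wrap.
 */
uint32_t desc_block_count(uint32_t groups_count, uint32_t desc_per_block)
{
	assert(desc_per_block != 0);
	return (uint32_t)(((uint64_t)groups_count + desc_per_block - 1) /
			  desc_per_block);
}

/*
 * Hypothetical helper: a first_meta_bg value is acceptable only if it
 * does not exceed the number of descriptor blocks.
 */
bool first_meta_bg_ok(uint32_t first_meta_bg, uint32_t db_count)
{
	return first_meta_bg <= db_count;
}
```

A 16M image like the one in the reproducer has only a handful of groups, so its descriptor block count is tiny and the corrupted value 842150400 is rejected immediately instead of being used as a loop bound later.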
2016-12-01 20:08:37 +00:00
|
|
|
if (ext4_has_feature_meta_bg(sb)) {
|
2017-02-15 06:26:39 +00:00
|
|
|
if (le32_to_cpu(es->s_first_meta_bg) > db_count) {
|
2016-12-01 20:08:37 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"first meta block group too large: %u "
|
|
|
|
"(group descriptor block count %u)",
|
|
|
|
le32_to_cpu(es->s_first_meta_bg), db_count);
|
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
}
|
2020-02-15 21:40:37 +00:00
|
|
|
rcu_assign_pointer(sbi->s_group_desc,
|
|
|
|
kvmalloc_array(db_count,
|
|
|
|
sizeof(struct buffer_head *),
|
|
|
|
GFP_KERNEL));
|
2006-10-11 08:20:50 +00:00
|
|
|
if (sbi->s_group_desc == NULL) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "not enough memory");
|
2012-05-28 21:49:54 +00:00
|
|
|
ret = -ENOMEM;
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
}
|
|
|
|
|
2009-02-15 23:07:52 +00:00
|
|
|
bgl_lock_init(sbi->s_blockgroup_lock);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2017-04-30 04:46:35 +00:00
|
|
|
/* Pre-read the descriptors into the buffer cache */
|
|
|
|
for (i = 0; i < db_count; i++) {
|
|
|
|
block = descriptor_loc(sb, logical_sb_block, i);
|
2020-09-24 07:33:35 +00:00
|
|
|
ext4_sb_breadahead_unmovable(sb, block);
|
2017-04-30 04:46:35 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
for (i = 0; i < db_count; i++) {
|
2020-02-15 21:40:37 +00:00
|
|
|
struct buffer_head *bh;
|
|
|
|
|
2006-10-11 08:21:20 +00:00
|
|
|
block = descriptor_loc(sb, logical_sb_block, i);
|
2020-09-24 07:33:37 +00:00
|
|
|
bh = ext4_sb_bread_unmovable(sb, block);
|
|
|
|
if (IS_ERR(bh)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"can't read group descriptor %d", i);
|
2006-10-11 08:20:50 +00:00
|
|
|
db_count = i;
|
2020-09-24 07:33:37 +00:00
|
|
|
ret = PTR_ERR(bh);
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount2;
|
|
|
|
}
|
2020-02-15 21:40:37 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
rcu_dereference(sbi->s_group_desc)[i] = bh;
|
|
|
|
rcu_read_unlock();
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2018-07-08 23:35:02 +00:00
|
|
|
sbi->s_gdb_count = db_count;
|
2016-08-01 04:51:02 +00:00
|
|
|
if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
|
2015-10-17 20:16:04 +00:00
|
|
|
ret = -EFSCORRUPTED;
|
2014-07-11 17:55:40 +00:00
|
|
|
goto failed_mount2;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2008-07-11 23:27:31 +00:00
|
|
|
|
2017-10-18 16:45:17 +00:00
|
|
|
timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
|
2020-11-27 11:34:00 +00:00
|
|
|
spin_lock_init(&sbi->s_error_lock);
|
|
|
|
INIT_WORK(&sbi->s_error_work, flush_stashed_error_work);
|
2011-04-05 23:55:28 +00:00
|
|
|
|
2013-04-04 02:10:52 +00:00
|
|
|
/* Register extent status tree shrinker */
|
2014-09-02 02:26:49 +00:00
|
|
|
if (ext4_es_register_shrinker(sbi))
|
2010-11-03 16:03:21 +00:00
|
|
|
goto failed_mount3;
|
|
|
|
|
2008-01-29 05:19:52 +00:00
|
|
|
sbi->s_stripe = ext4_get_stripe_size(sbi);
|
2012-08-17 13:54:17 +00:00
|
|
|
sbi->s_extent_max_zeroout_kb = 32;
|
2008-01-29 05:19:52 +00:00
|
|
|
|
2014-07-11 17:55:40 +00:00
|
|
|
/*
|
|
|
|
* set up enough so that it can read an inode
|
|
|
|
*/
|
2014-09-18 21:12:30 +00:00
|
|
|
sb->s_op = &ext4_sops;
|
2006-10-11 08:20:53 +00:00
|
|
|
sb->s_export_op = &ext4_export_ops;
|
|
|
|
sb->s_xattr = ext4_xattr_handlers;
|
2018-12-12 09:50:12 +00:00
|
|
|
#ifdef CONFIG_FS_ENCRYPTION
|
2016-07-10 18:01:03 +00:00
|
|
|
sb->s_cop = &ext4_cryptops;
|
2017-10-09 19:15:38 +00:00
|
|
|
#endif
|
2019-07-22 16:26:24 +00:00
|
|
|
#ifdef CONFIG_FS_VERITY
|
|
|
|
sb->s_vop = &ext4_verityops;
|
|
|
|
#endif
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2006-10-11 08:20:53 +00:00
|
|
|
sb->dq_op = &ext4_quota_operations;
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_quota(sb))
|
2014-10-08 16:26:54 +00:00
|
|
|
sb->s_qcop = &dquot_quotactl_sysfile_ops;
|
2013-03-02 22:57:08 +00:00
|
|
|
else
|
|
|
|
sb->s_qcop = &ext4_qctl_operations;
|
2016-01-08 21:01:22 +00:00
|
|
|
sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
2017-05-10 13:06:33 +00:00
|
|
|
memcpy(&sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
|
2011-01-29 13:13:40 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
|
2009-04-26 02:54:04 +00:00
|
|
|
mutex_init(&sbi->s_orphan_lock);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2020-10-15 20:37:57 +00:00
|
|
|
/* Initialize fast commit stuff */
|
|
|
|
atomic_set(&sbi->s_fc_subtid, 0);
|
|
|
|
atomic_set(&sbi->s_fc_ineligible_updates, 0);
|
|
|
|
INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_MAIN]);
|
|
|
|
INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_STAGING]);
|
|
|
|
INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_MAIN]);
|
|
|
|
INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_STAGING]);
|
|
|
|
sbi->s_fc_bytes = 0;
|
2020-11-06 03:59:09 +00:00
|
|
|
ext4_clear_mount_flag(sb, EXT4_MF_FC_INELIGIBLE);
|
|
|
|
ext4_clear_mount_flag(sb, EXT4_MF_FC_COMMITTING);
|
2020-10-15 20:37:57 +00:00
|
|
|
spin_lock_init(&sbi->s_fc_lock);
|
|
|
|
memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
|
2020-10-15 20:37:59 +00:00
|
|
|
sbi->s_fc_replay_state.fc_regions = NULL;
|
|
|
|
sbi->s_fc_replay_state.fc_regions_size = 0;
|
|
|
|
sbi->s_fc_replay_state.fc_regions_used = 0;
|
|
|
|
sbi->s_fc_replay_state.fc_regions_valid = 0;
|
|
|
|
sbi->s_fc_replay_state.fc_modified_inodes = NULL;
|
|
|
|
sbi->s_fc_replay_state.fc_modified_inodes_size = 0;
|
|
|
|
sbi->s_fc_replay_state.fc_modified_inodes_used = 0;
|
2020-10-15 20:37:57 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
sb->s_root = NULL;
|
|
|
|
|
|
|
|
needs_recovery = (es->s_last_orphan != 0 ||
|
2015-10-17 20:18:43 +00:00
|
|
|
ext4_has_feature_journal_needs_recovery(sb));
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2017-07-17 07:45:34 +00:00
|
|
|
if (ext4_has_feature_mmp(sb) && !sb_rdonly(sb))
|
2011-05-24 22:31:25 +00:00
|
|
|
if (ext4_multi_mount_protect(sb, le64_to_cpu(es->s_mmp_block)))
|
2014-10-30 14:53:16 +00:00
|
|
|
goto failed_mount3a;
|
2011-05-24 22:31:25 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/*
|
|
|
|
* The first inode we look at is the journal inode. Don't try
|
|
|
|
* root first: it may be modified in the journal!
|
|
|
|
*/
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
|
2021-04-01 17:21:24 +00:00
|
|
|
err = ext4_load_journal(sb, es, parsed_opts.journal_devnum);
|
2017-02-05 06:26:48 +00:00
|
|
|
if (err)
|
2014-10-30 14:53:16 +00:00
|
|
|
goto failed_mount3a;
|
2017-07-17 07:45:34 +00:00
|
|
|
} else if (test_opt(sb, NOLOAD) && !sb_rdonly(sb) &&
|
2015-10-17 20:18:43 +00:00
|
|
|
ext4_has_feature_journal_needs_recovery(sb)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "required journal recovery "
|
|
|
|
"suppressed and not mounted read-only");
|
2010-03-04 21:14:02 +00:00
|
|
|
goto failed_mount_wq;
|
2006-10-11 08:20:50 +00:00
|
|
|
} else {
|
ext4: do not allow journal_opts for fs w/o journal
It appears that we can pass journal-related mount options and that such options
are shown in /proc/mounts
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered" have
nothing to do with reality because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
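
One way to detect an explicitly changed data= mode is to XOR the effective options with the defaults and mask the data bits. The sketch below assumes illustrative flag values (the `OPT_*` names and numbers are hypothetical, not the kernel's definitions) and a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag layout for illustration only. */
#define OPT_JOURNAL_DATA   0x0400u
#define OPT_ORDERED_DATA   0x0800u
#define OPT_WRITEBACK_DATA 0x0C00u
#define OPT_DATA_FLAGS     0x0C00u	/* mask covering all data= modes */

/*
 * True when the effective mount options differ from the defaults in the
 * data= bits, i.e. the user asked for a data mode other than the default.
 * Differences in unrelated bits are ignored by the mask.
 */
static bool data_mode_overridden(uint32_t mount_opt, uint32_t def_mount_opt)
{
	return (OPT_DATA_FLAGS & (mount_opt ^ def_mount_opt)) != 0;
}
```

Note that this only notices a data= option that changes the default; passing the default mode explicitly leaves the XOR at zero.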
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-19 03:50:26 +00:00
|
|
|
/* Nojournal mode, all journal mount options are illegal */
|
|
|
|
if (test_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"journal_checksum, fs mounted w/o journal");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"journal_async_commit, fs mounted w/o journal");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
if (sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"commit=%lu, fs mounted w/o journal",
|
|
|
|
sbi->s_commit_interval / HZ);
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
if (EXT4_MOUNT_DATA_FLAGS &
|
|
|
|
(sbi->s_mount_opt ^ sbi->s_def_mount_opt)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"data=, fs mounted w/o journal");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
2019-05-01 03:08:15 +00:00
|
|
|
sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
|
2015-10-19 03:50:26 +00:00
|
|
|
clear_opt(sb, JOURNAL_CHECKSUM);
|
2010-12-16 01:26:48 +00:00
|
|
|
clear_opt(sb, DATA_FLAGS);
|
2020-10-15 20:37:54 +00:00
|
|
|
clear_opt2(sb, JOURNAL_FAST_COMMIT);
|
2009-01-07 05:06:22 +00:00
|
|
|
sbi->s_journal = NULL;
|
|
|
|
needs_recovery = 0;
|
|
|
|
goto no_journal;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_64bit(sb) &&
|
2007-07-18 12:37:25 +00:00
|
|
|
!jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
|
|
|
|
JBD2_FEATURE_INCOMPAT_64BIT)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "Failed to set 64-bit journal feature");
|
2010-03-04 21:14:02 +00:00
|
|
|
goto failed_mount_wq;
|
2007-07-18 12:37:25 +00:00
|
|
|
}
|
|
|
|
|
2012-05-27 11:48:56 +00:00
|
|
|
if (!set_journal_csum_feature_set(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Failed to set journal checksum "
|
|
|
|
"feature set");
|
|
|
|
goto failed_mount_wq;
|
2009-11-02 18:15:27 +00:00
|
|
|
}
|
2008-01-29 04:58:27 +00:00
|
|
|
|
2020-11-06 03:58:55 +00:00
|
|
|
if (test_opt2(sb, JOURNAL_FAST_COMMIT) &&
|
|
|
|
!jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
|
|
|
|
JBD2_FEATURE_INCOMPAT_FAST_COMMIT)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Failed to set fast commit journal feature");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/* We have now updated the journal if required, so we can
|
|
|
|
* validate the data journaling mode. */
|
|
|
|
switch (test_opt(sb, DATA_FLAGS)) {
|
|
|
|
case 0:
|
|
|
|
/* No mode set, assume a default based on the journal
|
2006-10-11 08:21:24 +00:00
|
|
|
* capabilities: ORDERED_DATA if the journal can
|
|
|
|
* cope, else JOURNAL_DATA
|
|
|
|
*/
|
2006-10-11 08:21:01 +00:00
|
|
|
if (jbd2_journal_check_available_features
|
2018-03-30 04:56:10 +00:00
|
|
|
(sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE)) {
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, ORDERED_DATA);
|
2018-03-30 04:56:10 +00:00
|
|
|
sbi->s_def_mount_opt |= EXT4_MOUNT_ORDERED_DATA;
|
|
|
|
} else {
|
2010-12-16 01:26:48 +00:00
|
|
|
set_opt(sb, JOURNAL_DATA);
|
2018-03-30 04:56:10 +00:00
|
|
|
sbi->s_def_mount_opt |= EXT4_MOUNT_JOURNAL_DATA;
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
break;
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
case EXT4_MOUNT_ORDERED_DATA:
|
|
|
|
case EXT4_MOUNT_WRITEBACK_DATA:
|
2006-10-11 08:21:01 +00:00
|
|
|
if (!jbd2_journal_check_available_features
|
|
|
|
(sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "Journal does not support "
|
|
|
|
"requested data journaling mode");
|
2010-03-04 21:14:02 +00:00
|
|
|
goto failed_mount_wq;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2020-11-20 18:28:32 +00:00
|
|
|
break;
|
2006-10-11 08:20:50 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
2016-12-03 21:20:53 +00:00
|
|
|
|
|
|
|
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA &&
|
|
|
|
test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "can't mount with "
|
|
|
|
"journal_async_commit in data=ordered mode");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
|
2021-04-01 17:21:24 +00:00
|
|
|
set_task_ioprio(sbi->s_journal->j_task, parsed_opts.journal_ioprio);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2020-10-06 00:48:39 +00:00
|
|
|
sbi->s_journal->j_submit_inode_data_buffers =
|
2020-10-06 00:48:41 +00:00
|
|
|
ext4_journal_submit_inode_data_buffers;
|
2020-10-06 00:48:39 +00:00
|
|
|
sbi->s_journal->j_finish_inode_data_buffers =
|
2020-10-06 00:48:41 +00:00
|
|
|
ext4_journal_finish_inode_data_buffers;
|
2012-02-20 22:53:02 +00:00
|
|
|
|
2010-11-03 16:03:21 +00:00
|
|
|
no_journal:
|
2017-06-22 15:55:14 +00:00
|
|
|
if (!test_opt(sb, NO_MBCACHE)) {
|
|
|
|
sbi->s_ea_block_cache = ext4_xattr_create_cache();
|
|
|
|
if (!sbi->s_ea_block_cache) {
|
2017-06-22 15:44:55 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
2017-06-22 15:55:14 +00:00
|
|
|
"Failed to create ea_block_cache");
|
2017-06-22 15:44:55 +00:00
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
2017-06-22 15:55:14 +00:00
|
|
|
|
|
|
|
if (ext4_has_feature_ea_inode(sb)) {
|
|
|
|
sbi->s_ea_inode_cache = ext4_xattr_create_cache();
|
|
|
|
if (!sbi->s_ea_inode_cache) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"Failed to create ea_inode_cache");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
}
|
2014-03-18 23:24:49 +00:00
|
|
|
}
|
|
|
|
|
2019-07-22 16:26:24 +00:00
|
|
|
if (ext4_has_feature_verity(sb) && blocksize != PAGE_SIZE) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "Unsupported blocksize for fs-verity");
|
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
|
2017-07-17 07:45:34 +00:00
|
|
|
if (DUMMY_ENCRYPTION_ENABLED(sbi) && !sb_rdonly(sb) &&
|
2015-10-17 20:18:43 +00:00
|
|
|
!ext4_has_feature_encrypt(sb)) {
|
|
|
|
ext4_set_feature_encrypt(sb);
|
2020-12-16 10:18:38 +00:00
|
|
|
ext4_commit_super(sb);
|
2015-04-16 05:56:00 +00:00
|
|
|
}
|
|
|
|
|
2012-07-09 20:27:05 +00:00
|
|
|
/*
|
|
|
|
* Get the # of file system overhead blocks from the
|
|
|
|
* superblock if present.
|
|
|
|
*/
|
|
|
|
if (es->s_overhead_clusters)
|
|
|
|
sbi->s_overhead = le32_to_cpu(es->s_overhead_clusters);
|
|
|
|
else {
|
2012-11-08 20:16:54 +00:00
|
|
|
err = ext4_calculate_overhead(sb);
|
|
|
|
if (err)
|
2012-07-09 20:27:05 +00:00
|
|
|
goto failed_mount_wq;
|
|
|
|
}
|
|
|
|
|
2011-02-01 10:42:42 +00:00
|
|
|
/*
|
|
|
|
* The maximum number of concurrent works can be high and
|
|
|
|
* concurrency isn't really necessary. Limit it to 1.
|
|
|
|
*/
|
2013-06-04 18:21:02 +00:00
|
|
|
EXT4_SB(sb)->rsv_conversion_wq =
|
|
|
|
alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
|
|
|
|
if (!EXT4_SB(sb)->rsv_conversion_wq) {
|
|
|
|
printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
|
2012-11-08 20:16:54 +00:00
|
|
|
ret = -ENOMEM;
|
2013-06-04 18:21:02 +00:00
|
|
|
goto failed_mount4;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/*
|
2006-10-11 08:21:01 +00:00
|
|
|
* The jbd2_journal_load will have done any necessary log recovery,
|
2006-10-11 08:20:50 +00:00
|
|
|
* so we can safely mount the rest of the filesystem now.
|
|
|
|
*/
|
|
|
|
|
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
	#define _GNU_SOURCE
	#include <fcntl.h>

	struct handle {
		struct file_handle fh;
		unsigned char fid[MAX_HANDLE_SZ];
	};

	int main(int argc, char **argv)
	{
		struct handle h = {{8, 1 }, { 12, }};

		open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
		return 0;
	}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 17:29:13 +00:00
|
|
|
root = ext4_iget(sb, EXT4_ROOT_INO, EXT4_IGET_SPECIAL);
|
2008-02-07 08:15:37 +00:00
|
|
|
if (IS_ERR(root)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "get root inode failed");
|
2008-02-07 08:15:37 +00:00
|
|
|
ret = PTR_ERR(root);
|
2011-02-28 01:42:06 +00:00
|
|
|
root = NULL;
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount4;
|
|
|
|
}
|
|
|
|
if (!S_ISDIR(root->i_mode) || !root->i_blocks || !root->i_size) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "corrupt root inode, run e2fsck");
|
2012-01-09 20:53:24 +00:00
|
|
|
iput(root);
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount4;
|
|
|
|
}
|
ext4: Support case-insensitive file name lookups
This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.
A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e., match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.
The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.
* dcache handling:
For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache regardless of which equivalent string was used in
a previous lookup, without having to resort to ->lookup().
d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.
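
The idea of hashing the casefolded string can be sketched as follows. This is a simplification under loud assumptions: it folds ASCII only via `tolower()` rather than performing Unicode casefolding, it uses FNV-1a in place of the kernel's dentry hash, and `casefold_hash()` and `casefold_equal()` are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hash the folded form of the name so that every case variant of the
 * same name lands in the same bucket. ASCII-only folding (assumption).
 */
static uint32_t casefold_hash(const char *name, size_t len)
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)tolower((unsigned char)name[i]);
		h *= 16777619u;		/* FNV-1a prime */
	}
	return h;
}

/* Compare two names after folding; returns 1 when they are equivalent. */
static int casefold_equal(const char *a, size_t alen,
			  const char *b, size_t blen)
{
	if (alen != blen)
		return 0;
	for (size_t i = 0; i < alen; i++)
		if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
			return 0;
	return 1;
}
```

Because the hash is computed over the folded bytes, a lookup for "README" and one for "readme" probe the same bucket, and the comparison function then confirms the match.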
For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.
* on-disk data:
Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.
DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.
* Dealing with invalid sequences:
By default, when an invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.
* Normalization algorithm:
The UTF-8 algorithm used to compare strings in ext4
lives in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.
NFD seems to be the best normalization method for EXT4 because:
- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.
Although:
- This implementation is not completely linguistically accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 18:12:08 +00:00
|
|
|
|
2012-01-09 03:15:13 +00:00
|
|
|
sb->s_root = d_make_root(root);
|
2008-02-07 08:15:37 +00:00
|
|
|
if (!sb->s_root) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "get root dentry failed");
|
2008-02-07 08:15:37 +00:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto failed_mount4;
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2018-05-14 03:02:19 +00:00
|
|
|
ret = ext4_setup_super(sb, es, sb_rdonly(sb));
|
|
|
|
if (ret == -EROFS) {
|
2017-11-27 21:05:09 +00:00
|
|
|
sb->s_flags |= SB_RDONLY;
|
2018-05-14 03:02:19 +00:00
|
|
|
ret = 0;
|
|
|
|
} else if (ret)
|
|
|
|
goto failed_mount4a;
|
2007-07-18 13:15:20 +00:00
|
|
|
|
2015-09-23 16:44:17 +00:00
|
|
|
ext4_set_resv_clusters(sb);
|
2013-04-10 02:11:22 +00:00
|
|
|
|
2020-07-28 13:04:37 +00:00
|
|
|
if (test_opt(sb, BLOCK_VALIDITY)) {
|
|
|
|
err = ext4_setup_system_zone(sb);
|
|
|
|
if (err) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "failed to initialize system "
|
|
|
|
"zone (%d)", err);
|
|
|
|
goto failed_mount4a;
|
|
|
|
}
|
2014-07-11 17:55:40 +00:00
|
|
|
}
|
2020-10-15 20:37:59 +00:00
|
|
|
ext4_fc_replay_cleanup(sb);
|
2014-07-11 17:55:40 +00:00
|
|
|
|
|
|
|
ext4_ext_init(sb);
|
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fall back to cr 1
immediately.
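
A minimal sketch of the cr 0 lookup, assuming one list head per order reduced to a single group number for brevity. `MAX_ORDER` and `pick_group_cr0()` are hypothetical names, and the bound of 14 is illustrative rather than taken from the patch.

```c
#include <stddef.h>

#define MAX_ORDER 14	/* hypothetical bound on buddy orders */

/*
 * head[o] holds a group whose largest free order is o, or -1 when the
 * list for that order is empty. Scan upward from the requested order;
 * the loop is bounded by MAX_ORDER, so it does not depend on the number
 * of groups.
 */
static int pick_group_cr0(const int head[MAX_ORDER], int req_order)
{
	if (req_order < 0)
		req_order = 0;
	for (int o = req_order; o < MAX_ORDER; o++)
		if (head[o] >= 0)
			return head[o];
	return -1;	/* nothing suitable: caller would fall back to cr 1 */
}
```

The return value of -1 models the "nothing can be found" case in which allocation moves on to cr 1.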
At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
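
The logarithmic cr 1 search can be illustrated with a binary search over an array kept sorted by average fragment size. This swaps in a sorted array for the rb tree the patch describes, and `struct group_frag` and `first_fit_cr1()` are hypothetical names; treat it as a sketch of the ordering idea, not of the kernel data structure.

```c
#include <stddef.h>

struct group_frag {
	unsigned int group;		/* block group number */
	unsigned int avg_frag_len;	/* average free fragment size */
};

/*
 * Return the index of the first entry whose average fragment size is at
 * least `want`, or n when no entry qualifies. `sorted` must be ordered
 * by increasing avg_frag_len. A caller would walk forward from the
 * returned index, visiting groups in increasing fragment size.
 */
static size_t first_fit_cr1(const struct group_frag *sorted, size_t n,
			    unsigned int want)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (sorted[mid].avg_frag_len < want)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```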
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fall back to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, the following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. The following command was run 3 times
consecutively:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-01 17:21:27 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Enable optimize_scan if number of groups is > threshold. This can be
|
|
|
|
* turned off by passing "mb_optimize_scan=0". This can also be
|
|
|
|
* turned on forcefully by passing "mb_optimize_scan=1".
|
|
|
|
*/
|
|
|
|
if (parsed_opts.mb_optimize_scan == 1)
|
|
|
|
set_opt2(sb, MB_OPTIMIZE_SCAN);
|
|
|
|
else if (parsed_opts.mb_optimize_scan == 0)
|
|
|
|
clear_opt2(sb, MB_OPTIMIZE_SCAN);
|
|
|
|
else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
|
|
|
|
set_opt2(sb, MB_OPTIMIZE_SCAN);
|
|
|
|
|
2014-07-11 17:55:40 +00:00
|
|
|
err = ext4_mb_init(sb);
|
|
|
|
if (err) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
|
|
|
|
err);
|
2011-10-06 16:10:11 +00:00
|
|
|
goto failed_mount5;
|
2008-10-11 00:07:20 +00:00
|
|
|
}
|
|
|
|
|
2021-01-21 17:33:20 +00:00
|
|
|
/*
|
|
|
|
* We can only set up the journal commit callback once
|
|
|
|
* mballoc is initialized
|
|
|
|
*/
|
|
|
|
if (sbi->s_journal)
|
|
|
|
sbi->s_journal->j_commit_callback =
|
|
|
|
ext4_journal_commit_callback;
|
|
|
|
|
2014-07-15 10:01:38 +00:00
|
|
|
block = ext4_count_free_clusters(sb);
|
2021-04-09 04:20:35 +00:00
|
|
|
ext4_free_blocks_count_set(sbi->s_es,
|
2014-07-15 10:01:38 +00:00
|
|
|
EXT4_C2B(sbi, block));
|
2014-09-08 00:51:29 +00:00
|
|
|
err = percpu_counter_init(&sbi->s_freeclusters_counter, block,
|
|
|
|
GFP_KERNEL);
|
2014-07-15 10:01:38 +00:00
|
|
|
if (!err) {
|
|
|
|
unsigned long freei = ext4_count_free_inodes(sb);
|
|
|
|
sbi->s_es->s_free_inodes_count = cpu_to_le32(freei);
|
2014-09-08 00:51:29 +00:00
|
|
|
err = percpu_counter_init(&sbi->s_freeinodes_counter, freei,
|
|
|
|
GFP_KERNEL);
|
2014-07-15 10:01:38 +00:00
|
|
|
}
|
2021-08-12 12:47:37 +00:00
|
|
|
/*
|
|
|
|
* Update the checksum after updating free space/inode
|
|
|
|
* counters. Otherwise the superblock can have an incorrect
|
|
|
|
* checksum in the buffer cache until it is written out and
|
|
|
|
* e2fsprogs programs trying to open a file system immediately
|
|
|
|
* after it is mounted can fail.
|
|
|
|
*/
|
|
|
|
ext4_superblock_csum_set(sb);
|
2014-07-15 10:01:38 +00:00
|
|
|
if (!err)
|
|
|
|
err = percpu_counter_init(&sbi->s_dirs_counter,
|
2014-09-08 00:51:29 +00:00
|
|
|
ext4_count_dirs(sb), GFP_KERNEL);
|
2014-07-15 10:01:38 +00:00
|
|
|
if (!err)
|
2014-09-08 00:51:29 +00:00
|
|
|
err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
|
|
|
|
GFP_KERNEL);
|
2021-02-18 15:11:32 +00:00
|
|
|
if (!err)
|
|
|
|
err = percpu_counter_init(&sbi->s_sra_exceeded_retry_limit, 0,
|
|
|
|
GFP_KERNEL);
|
2016-04-26 03:22:35 +00:00
|
|
|
if (!err)
|
2020-02-19 18:30:46 +00:00
|
|
|
err = percpu_init_rwsem(&sbi->s_writepages_rwsem);
|
2016-04-26 03:22:35 +00:00
|
|
|
|
2014-07-15 10:01:38 +00:00
|
|
|
if (err) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "insufficient memory");
|
|
|
|
goto failed_mount6;
|
|
|
|
}
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_flex_bg(sb))
|
2014-07-15 10:01:38 +00:00
|
|
|
if (!ext4_fill_flex_info(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"unable to initialize "
|
|
|
|
"flex_bg meta info!");
|
2021-05-10 11:10:51 +00:00
|
|
|
ret = -ENOMEM;
|
2014-07-15 10:01:38 +00:00
|
|
|
goto failed_mount6;
|
|
|
|
}
|
|
|
|
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
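
The throttling rule above (next delay equals the time the last zeroing took, multiplied by the wait multiplier) can be sketched like this. `next_sched_delay()` and `DEFAULT_WAIT_MULT` are hypothetical names; the default of 10 is the value stated in this message, and the saturating multiply is an added safety assumption.

```c
#include <stdint.h>

#define DEFAULT_WAIT_MULT 10u	/* default multiplier per the commit text */

/*
 * Delay before the next inode table is zeroed, in the same time unit as
 * `elapsed` (how long the previous zeroing took). Saturates at
 * UINT64_MAX instead of wrapping on overflow.
 */
static uint64_t next_sched_delay(uint64_t elapsed, unsigned int wait_mult)
{
	if (wait_mult != 0 && elapsed > UINT64_MAX / wait_mult)
		return UINT64_MAX;
	return elapsed * wait_mult;
}
```

With the default multiplier, a zeroing pass that took 5 time units leads to a wait of 50 units, so the thread spends roughly one part in eleven of its time issuing I/O.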
We do not disturb regular inode allocations in any way; the allocator does not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. Also, we should prevent new inode
allocations from the group while zeroing is under way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 01:30:05 +00:00
|
|
|
err = ext4_register_li_request(sb, first_not_zeroed);
|
|
|
|
if (err)
|
2011-10-06 16:10:11 +00:00
|
|
|
goto failed_mount6;
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2015-09-23 16:44:17 +00:00
|
|
|
err = ext4_register_sysfs(sb);
|
2011-10-06 16:10:11 +00:00
|
|
|
if (err)
|
|
|
|
goto failed_mount7;
|
2009-03-31 13:10:09 +00:00
|
|
|
|
2013-03-02 23:22:38 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
/* Enable quota usage during mount. */
|
2017-07-17 07:45:34 +00:00
|
|
|
if (ext4_has_feature_quota(sb) && !sb_rdonly(sb)) {
|
2013-03-02 23:22:38 +00:00
|
|
|
err = ext4_enable_quotas(sb);
|
|
|
|
if (err)
|
|
|
|
goto failed_mount8;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_QUOTA */
|
|
|
|
|
2020-06-20 02:54:23 +00:00
|
|
|
/*
|
|
|
|
* Save the original bdev mapping's wb_err value which could be
|
|
|
|
* used to detect the metadata async write error.
|
|
|
|
*/
|
|
|
|
spin_lock_init(&sbi->s_bdev_wb_lock);
|
2020-09-28 02:05:56 +00:00
|
|
|
errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
|
|
|
|
&sbi->s_bdev_wb_err);
|
2020-06-20 02:54:23 +00:00
|
|
|
sb->s_bdev->bd_super = sb;
|
2006-10-11 08:20:53 +00:00
|
|
|
EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
|
|
|
|
ext4_orphan_cleanup(sb, es);
|
|
|
|
EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
|
2009-01-07 05:06:22 +00:00
|
|
|
if (needs_recovery) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "recovery complete");
|
2020-07-10 14:07:59 +00:00
|
|
|
err = ext4_mark_recovery_complete(sb, es);
|
|
|
|
if (err)
|
|
|
|
goto failed_mount8;
|
2009-01-07 05:06:22 +00:00
|
|
|
}
|
|
|
|
if (EXT4_SB(sb)->s_journal) {
|
|
|
|
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
|
|
|
|
descr = " journalled data mode";
|
|
|
|
else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
|
|
|
|
descr = " ordered data mode";
|
|
|
|
else
|
|
|
|
descr = " writeback data mode";
|
|
|
|
} else
|
|
|
|
descr = "out journal";
|
|
|
|
|
2012-11-08 18:28:29 +00:00
|
|
|
if (test_opt(sb, DISCARD)) {
|
|
|
|
struct request_queue *q = bdev_get_queue(sb->s_bdev);
|
|
|
|
if (!blk_queue_discard(q))
|
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"mounting with \"discard\" option, but "
|
|
|
|
"the device does not support discard");
|
|
|
|
}
|
|
|
|
|
2015-08-15 18:59:44 +00:00
|
|
|
if (___ratelimit(&ext4_mount_msg_ratelimit, "EXT4-fs mount"))
|
|
|
|
ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
|
2020-10-22 03:21:00 +00:00
|
|
|
"Opts: %.*s%s%s. Quota mode: %s.", descr,
|
2016-11-18 18:24:26 +00:00
|
|
|
(int) sizeof(sbi->s_es->s_mount_opts),
|
|
|
|
sbi->s_es->s_mount_opts,
|
2020-10-22 03:21:00 +00:00
|
|
|
*sbi->s_es->s_mount_opts ? "; " : "", orig_data,
|
|
|
|
ext4_quota_mode(sb));
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2010-07-27 15:56:04 +00:00
|
|
|
if (es->s_error_count)
|
|
|
|
mod_timer(&sbi->s_err_report, jiffies + 300*HZ); /* 5 minutes */
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2013-10-18 01:11:01 +00:00
|
|
|
/* Enable message ratelimiting. Default is 10 messages per 5 secs. */
|
|
|
|
ratelimit_state_init(&sbi->s_err_ratelimit_state, 5 * HZ, 10);
|
|
|
|
ratelimit_state_init(&sbi->s_warning_ratelimit_state, 5 * HZ, 10);
|
|
|
|
ratelimit_state_init(&sbi->s_msg_ratelimit_state, 5 * HZ, 10);
|
2020-07-25 12:33:13 +00:00
|
|
|
atomic_set(&sbi->s_warning_count, 0);
|
|
|
|
atomic_set(&sbi->s_msg_count, 0);
|
2013-10-18 01:11:01 +00:00
|
|
|
|
2010-05-16 16:00:00 +00:00
|
|
|
kfree(orig_data);
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
cantfind_ext4:
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!silent)
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "VFS: Can't find ext4 filesystem");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto failed_mount;
|
|
|
|
|
2013-01-25 04:24:54 +00:00
|
|
|
failed_mount8:
|
2015-09-23 16:46:17 +00:00
|
|
|
ext4_unregister_sysfs(sb);
|
2020-09-22 16:24:56 +00:00
|
|
|
kobject_put(&sbi->s_kobj);
|
2011-10-06 16:10:11 +00:00
|
|
|
failed_mount7:
|
|
|
|
ext4_unregister_li_request(sb);
|
|
|
|
failed_mount6:
|
2014-07-11 17:55:40 +00:00
|
|
|
ext4_mb_release(sb);
|
2020-02-19 03:08:51 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
flex_groups = rcu_dereference(sbi->s_flex_groups);
|
|
|
|
if (flex_groups) {
|
|
|
|
for (i = 0; i < sbi->s_flex_groups_allocated; i++)
|
|
|
|
kvfree(flex_groups[i]);
|
|
|
|
kvfree(flex_groups);
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2014-07-15 10:01:38 +00:00
|
|
|
percpu_counter_destroy(&sbi->s_freeclusters_counter);
|
|
|
|
percpu_counter_destroy(&sbi->s_freeinodes_counter);
|
|
|
|
percpu_counter_destroy(&sbi->s_dirs_counter);
|
|
|
|
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
|
2021-02-18 15:11:32 +00:00
|
|
|
percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
|
2020-02-19 18:30:46 +00:00
|
|
|
percpu_free_rwsem(&sbi->s_writepages_rwsem);
|
ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG, ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize the multi-block allocator first.
The following dependencies must be resolved to allow this:
- the multi-block allocator needs the group descriptors
- s_op must be installed before initializing the multi-block allocator,
because a new inode is created in ext4_mb_init_backend()
- the number of group descriptor blocks (s_gdb_count) must be initialized,
otherwise the number of clusters returned by
ext4_free_clusters_after_init() is not correct
(see ext4_bg_num_gdb_nometa())
Here is the stack backtrace:
(gdb) bt
#0 ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
#1 ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
#2 0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
#3 0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0) at balloc.c:489
#4 0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
#5 0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
sb=0xffff880079a10000) at super.c:2143
#6 ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
...
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 14:54:20 +00:00
|
|
|
failed_mount5:
|
2014-07-11 17:55:40 +00:00
|
|
|
ext4_ext_release(sb);
|
|
|
|
ext4_release_system_zone(sb);
|
|
|
|
failed_mount4a:
|
2012-01-09 20:53:24 +00:00
|
|
|
dput(sb->s_root);
|
2011-02-28 01:42:06 +00:00
|
|
|
sb->s_root = NULL;
|
2012-01-09 20:53:24 +00:00
|
|
|
failed_mount4:
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "mount failed");
|
2013-06-04 18:21:02 +00:00
|
|
|
if (EXT4_SB(sb)->rsv_conversion_wq)
|
|
|
|
destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
|
2009-09-28 19:48:41 +00:00
|
|
|
failed_mount_wq:
|
2018-12-04 05:24:42 +00:00
|
|
|
ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
|
|
|
|
sbi->s_ea_inode_cache = NULL;
|
|
|
|
|
|
|
|
ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
|
|
|
|
sbi->s_ea_block_cache = NULL;
|
|
|
|
|
2009-01-07 05:06:22 +00:00
|
|
|
if (sbi->s_journal) {
|
|
|
|
jbd2_journal_destroy(sbi->s_journal);
|
|
|
|
sbi->s_journal = NULL;
|
|
|
|
}
|
2014-10-30 14:53:16 +00:00
|
|
|
failed_mount3a:
|
2013-07-01 12:12:37 +00:00
|
|
|
ext4_es_unregister_shrinker(sbi);
|
2014-09-02 02:26:49 +00:00
|
|
|
failed_mount3:
|
2020-11-27 11:34:00 +00:00
|
|
|
flush_work(&sbi->s_error_work);
|
2021-03-15 16:59:06 +00:00
|
|
|
del_timer_sync(&sbi->s_err_report);
|
2021-04-30 18:50:46 +00:00
|
|
|
ext4_stop_mmpd(sbi);
|
2006-10-11 08:20:50 +00:00
|
|
|
failed_mount2:
|
2020-02-15 21:40:37 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
group_desc = rcu_dereference(sbi->s_group_desc);
|
2006-10-11 08:20:50 +00:00
|
|
|
for (i = 0; i < db_count; i++)
|
2020-02-15 21:40:37 +00:00
|
|
|
brelse(group_desc[i]);
|
|
|
|
kvfree(group_desc);
|
|
|
|
rcu_read_unlock();
|
2006-10-11 08:20:50 +00:00
|
|
|
failed_mount:
|
2012-04-29 22:27:10 +00:00
|
|
|
if (sbi->s_chksum_driver)
|
|
|
|
crypto_free_shash(sbi->s_chksum_driver);
|
2019-04-25 18:05:42 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_UNICODE
|
2020-10-28 05:08:20 +00:00
|
|
|
utf8_unload(sb->s_encoding);
|
2019-04-25 18:05:42 +00:00
|
|
|
#endif
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
2014-09-11 15:15:15 +00:00
|
|
|
for (i = 0; i < EXT4_MAXQUOTAS; i++)
|
2019-05-12 08:49:47 +00:00
|
|
|
kfree(get_qf_name(sb, sbi, i));
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
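The three outcomes listed for fscrypt_policy_to_inherit() (a real policy, a dummy policy, or no policy) can be illustrated with a small userspace model. Everything in this sketch is hypothetical: the struct names, fields, and the function policy_to_inherit() are stand-ins for illustration and do not reflect the kernel's actual types or signature.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an encryption policy. */
struct policy_model {
	int version;
};

/* Hypothetical stand-in for a directory inode. */
struct dir_model {
	bool encrypted;                        /* is the directory itself encrypted? */
	const struct policy_model *own_policy; /* only meaningful when encrypted */
};

/*
 * Model of the decision described in the commit message:
 *  - encrypted directory            -> inherit its real policy
 *  - unencrypted, dummy policy set  -> inherit the dummy policy
 *  - unencrypted, no dummy policy   -> no policy (NULL)
 * No key is ever set up for the unencrypted directory.
 */
const struct policy_model *policy_to_inherit(const struct dir_model *dir,
					     const struct policy_model *dummy)
{
	if (dir->encrypted)
		return dir->own_policy;
	return dummy; /* NULL when test_dummy_encryption is not in effect */
}
```

The point of the model is that the unencrypted directory never carries key state of its own; the dummy policy is supplied from outside, so the edge case of an unencrypted inode with key information does not arise.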
2020-09-17 04:11:35 +00:00
|
|
|
fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
|
2021-05-21 07:55:33 +00:00
|
|
|
/* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */
|
2006-10-11 08:20:50 +00:00
|
|
|
brelse(bh);
|
2021-05-21 07:55:33 +00:00
|
|
|
ext4_blkdev_remove(sbi);
|
2006-10-11 08:20:50 +00:00
|
|
|
out_fail:
|
|
|
|
sb->s_fs_info = NULL;
|
2009-05-18 03:52:44 +00:00
|
|
|
kfree(sbi->s_blockgroup_lock);
|
2016-11-18 18:24:26 +00:00
|
|
|
out_free_base:
|
2006-10-11 08:20:50 +00:00
|
|
|
kfree(sbi);
|
2010-05-16 16:00:00 +00:00
|
|
|
kfree(orig_data);
|
2017-08-24 23:42:48 +00:00
|
|
|
fs_put_dax(dax_dev);
|
2012-11-08 20:16:54 +00:00
|
|
|
return err ? err : ret;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
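The failed_mount8 ... failed_mount labels above follow the usual C idiom of a cleanup ladder: each label undoes one setup step and falls through to the labels for the earlier steps. A minimal userspace sketch of that idiom is shown below. The stage names, acquire(), release(), and setup() are hypothetical and exist only to make the unwinding order observable; they are not ext4 functions.

```c
/* Hypothetical setup stages, one bit each. */
enum { STAGE_A = 1, STAGE_B = 2, STAGE_C = 4 };

/* Bitmask of stages currently held, so a caller can see what was undone. */
unsigned int live_stages;

/* Acquire one stage, failing if its bit is set in fail_mask. */
static int acquire(unsigned int stage, unsigned int fail_mask)
{
	if (fail_mask & stage)
		return -1;
	live_stages |= stage;
	return 0;
}

static void release(unsigned int stage)
{
	live_stages &= ~stage;
}

/*
 * Returns 0 with all stages held, or -1 after releasing, in reverse
 * order, every stage that had been acquired before the failure.
 */
int setup(unsigned int fail_mask)
{
	int err;

	live_stages = 0;
	err = acquire(STAGE_A, fail_mask);
	if (err)
		goto fail;
	err = acquire(STAGE_B, fail_mask);
	if (err)
		goto fail_a;
	err = acquire(STAGE_C, fail_mask);
	if (err)
		goto fail_b;
	return 0;

fail_b:
	release(STAGE_B);
fail_a:
	release(STAGE_A);
fail:
	return err;
}
```

Because the labels are ordered from the latest step to the earliest, adding a new setup step only requires one new label placed above the existing ones, which is why the ladder in ext4_fill_super() has grown labels such as failed_mount3a and failed_mount4a over time.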
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set up any per-fs journal parameters now. We'll do this both on
|
|
|
|
* initial mount, once the journal has been initialised but before we've
|
|
|
|
* done any recovery; and again on any subsequent remount.
|
|
|
|
*/
|
2006-10-11 08:20:53 +00:00
|
|
|
static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2009-01-04 01:27:38 +00:00
|
|
|
journal->j_commit_interval = sbi->s_commit_interval;
|
|
|
|
journal->j_min_batch_time = sbi->s_min_batch_time;
|
|
|
|
journal->j_max_batch_time = sbi->s_max_batch_time;
|
2020-10-15 20:37:55 +00:00
|
|
|
ext4_fc_init(sb, journal);
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2010-08-04 01:35:12 +00:00
|
|
|
write_lock(&journal->j_state_lock);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (test_opt(sb, BARRIER))
|
2006-10-11 08:21:01 +00:00
|
|
|
journal->j_flags |= JBD2_BARRIER;
|
2006-10-11 08:20:50 +00:00
|
|
|
else
|
2006-10-11 08:21:01 +00:00
|
|
|
journal->j_flags &= ~JBD2_BARRIER;
|
2008-10-11 02:12:43 +00:00
|
|
|
if (test_opt(sb, DATA_ERR_ABORT))
|
|
|
|
journal->j_flags |= JBD2_ABORT_ON_SYNCDATA_ERR;
|
|
|
|
else
|
|
|
|
journal->j_flags &= ~JBD2_ABORT_ON_SYNCDATA_ERR;
|
2010-08-04 01:35:12 +00:00
|
|
|
write_unlock(&journal->j_state_lock);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2016-09-30 06:05:09 +00:00
|
|
|
static struct inode *ext4_get_journal_inode(struct super_block *sb,
|
|
|
|
unsigned int journal_inum)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
struct inode *journal_inode;
|
|
|
|
|
2016-09-30 06:05:09 +00:00
|
|
|
/*
|
|
|
|
* Test for the existence of a valid inode on disk. Bad things
|
|
|
|
* happen if we iget() an unused inode, as the subsequent iput()
|
|
|
|
* will try to delete it.
|
|
|
|
*/
|
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
    #define _GNU_SOURCE
    #include <fcntl.h>

    struct handle {
        struct file_handle fh;
        unsigned char fid[MAX_HANDLE_SZ];
    };

    int main(int argc, char **argv)
    {
        struct handle h = {{8, 1 }, { 12, }};

        open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
        return 0;
    }
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 17:29:13 +00:00
|
|
|
journal_inode = ext4_iget(sb, journal_inum, EXT4_IGET_SPECIAL);
|
2008-02-07 08:15:37 +00:00
|
|
|
if (IS_ERR(journal_inode)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "no journal found");
|
2006-10-11 08:20:50 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
if (!journal_inode->i_nlink) {
|
|
|
|
make_bad_inode(journal_inode);
|
|
|
|
iput(journal_inode);
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "journal inode is deleted");
|
2006-10-11 08:20:50 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2008-09-09 02:25:04 +00:00
|
|
|
jbd_debug(2, "Journal inode found at %p: %lld bytes\n",
|
2006-10-11 08:20:50 +00:00
|
|
|
journal_inode, journal_inode->i_size);
|
2008-02-07 08:15:37 +00:00
|
|
|
if (!S_ISREG(journal_inode->i_mode)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "invalid journal inode");
|
2006-10-11 08:20:50 +00:00
|
|
|
iput(journal_inode);
|
|
|
|
return NULL;
|
|
|
|
}
|
2016-09-30 06:05:09 +00:00
|
|
|
return journal_inode;
|
|
|
|
}
|
|
|
|
|
|
|
|
static journal_t *ext4_get_journal(struct super_block *sb,
|
|
|
|
unsigned int journal_inum)
|
|
|
|
{
|
|
|
|
struct inode *journal_inode;
|
|
|
|
journal_t *journal;
|
|
|
|
|
2020-07-10 14:07:59 +00:00
|
|
|
if (WARN_ON_ONCE(!ext4_has_feature_journal(sb)))
|
|
|
|
return NULL;
|
2016-09-30 06:05:09 +00:00
|
|
|
|
|
|
|
journal_inode = ext4_get_journal_inode(sb, journal_inum);
|
|
|
|
if (!journal_inode)
|
|
|
|
return NULL;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-10-11 08:21:01 +00:00
|
|
|
journal = jbd2_journal_init_inode(journal_inode);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!journal) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "Could not load journal inode");
|
2006-10-11 08:20:50 +00:00
|
|
|
iput(journal_inode);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
journal->j_private = sb;
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_init_journal_params(sb, journal);
|
2006-10-11 08:20:50 +00:00
|
|
|
return journal;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static journal_t *ext4_get_dev_journal(struct super_block *sb,
|
2006-10-11 08:20:50 +00:00
|
|
|
dev_t j_dev)
|
|
|
|
{
|
2008-07-26 20:15:44 +00:00
|
|
|
struct buffer_head *bh;
|
2006-10-11 08:20:50 +00:00
|
|
|
journal_t *journal;
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_fsblk_t start;
|
|
|
|
ext4_fsblk_t len;
|
2006-10-11 08:20:50 +00:00
|
|
|
int hblock, blocksize;
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_fsblk_t sb_block;
|
2006-10-11 08:20:50 +00:00
|
|
|
unsigned long offset;
|
2008-07-26 20:15:44 +00:00
|
|
|
struct ext4_super_block *es;
|
2006-10-11 08:20:50 +00:00
|
|
|
struct block_device *bdev;
|
|
|
|
|
2020-07-10 14:07:59 +00:00
|
|
|
if (WARN_ON_ONCE(!ext4_has_feature_journal(sb)))
|
|
|
|
return NULL;
|
2009-01-07 05:06:22 +00:00
|
|
|
|
2009-06-04 21:36:36 +00:00
|
|
|
bdev = ext4_blkdev_get(j_dev, sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (bdev == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
blocksize = sb->s_blocksize;
|
2009-05-22 21:17:49 +00:00
|
|
|
hblock = bdev_logical_block_size(bdev);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (blocksize < hblock) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"blocksize too small for journal device");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto out_bdev;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
sb_block = EXT4_MIN_BLOCK_SIZE / blocksize;
|
|
|
|
offset = EXT4_MIN_BLOCK_SIZE % blocksize;
|
2006-10-11 08:20:50 +00:00
|
|
|
set_blocksize(bdev, blocksize);
|
|
|
|
if (!(bh = __bread(bdev, sb_block, blocksize))) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "couldn't read superblock of "
|
|
|
|
"external journal");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto out_bdev;
|
|
|
|
}
|
|
|
|
|
2012-05-28 21:47:52 +00:00
|
|
|
es = (struct ext4_super_block *) (bh->b_data + offset);
|
2006-10-11 08:20:53 +00:00
|
|
|
if ((le16_to_cpu(es->s_magic) != EXT4_SUPER_MAGIC) ||
|
2006-10-11 08:20:50 +00:00
|
|
|
!(le32_to_cpu(es->s_feature_incompat) &
|
2006-10-11 08:20:53 +00:00
|
|
|
EXT4_FEATURE_INCOMPAT_JOURNAL_DEV)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "external journal has "
|
|
|
|
"bad superblock");
|
2006-10-11 08:20:50 +00:00
|
|
|
brelse(bh);
|
|
|
|
goto out_bdev;
|
|
|
|
}
|
|
|
|
|
2014-09-11 15:44:36 +00:00
|
|
|
if ((le32_to_cpu(es->s_feature_ro_compat) &
|
|
|
|
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
|
|
|
|
es->s_checksum != ext4_superblock_csum(sb, es)) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "external journal has "
|
|
|
|
"corrupt superblock");
|
|
|
|
brelse(bh);
|
|
|
|
goto out_bdev;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
if (memcmp(EXT4_SB(sb)->s_es->s_journal_uuid, es->s_uuid, 16)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "journal UUID does not match");
|
2006-10-11 08:20:50 +00:00
|
|
|
brelse(bh);
|
|
|
|
goto out_bdev;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:21:10 +00:00
|
|
|
len = ext4_blocks_count(es);
|
2006-10-11 08:20:50 +00:00
|
|
|
start = sb_block + 1;
|
|
|
|
brelse(bh); /* we're done with the superblock */
|
|
|
|
|
2006-10-11 08:21:01 +00:00
|
|
|
journal = jbd2_journal_init_dev(bdev, sb->s_bdev,
|
2006-10-11 08:20:50 +00:00
|
|
|
start, len, blocksize);
|
|
|
|
if (!journal) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "failed to create device journal");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto out_bdev;
|
|
|
|
}
|
|
|
|
journal->j_private = sb;
|
2020-09-24 07:33:33 +00:00
|
|
|
if (ext4_read_bh_lock(journal->j_sb_buffer, REQ_META | REQ_PRIO, true)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "I/O error on journal device");
|
2006-10-11 08:20:50 +00:00
|
|
|
goto out_journal;
|
|
|
|
}
|
|
|
|
if (be32_to_cpu(journal->j_superblock->s_nr_users) != 1) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "External journal has more than one "
|
|
|
|
"user (unsupported) - %d",
|
2006-10-11 08:20:50 +00:00
|
|
|
be32_to_cpu(journal->j_superblock->s_nr_users));
|
|
|
|
goto out_journal;
|
|
|
|
}
|
2020-09-24 03:03:42 +00:00
|
|
|
EXT4_SB(sb)->s_journal_bdev = bdev;
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_init_journal_params(sb, journal);
|
2006-10-11 08:20:50 +00:00
|
|
|
return journal;
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
out_journal:
|
2006-10-11 08:21:01 +00:00
|
|
|
jbd2_journal_destroy(journal);
|
2006-10-11 08:20:50 +00:00
|
|
|
out_bdev:
|
2006-10-11 08:20:53 +00:00
|
|
|
ext4_blkdev_put(bdev);
|
2006-10-11 08:20:50 +00:00
|
|
|
return NULL;
|
|
|
|
}
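The sb_block and offset computation in ext4_get_dev_journal() locates the superblock, which always starts 1024 bytes into the device, in terms of the filesystem block size. The worked example below is a userspace sketch: the struct and function names are hypothetical, and MIN_BLOCK_SIZE simply mirrors the value of EXT4_MIN_BLOCK_SIZE (1024).

```c
#define MIN_BLOCK_SIZE 1024u /* mirrors EXT4_MIN_BLOCK_SIZE */

/* Hypothetical result type for the example. */
struct sb_location {
	unsigned long block;         /* block that contains the superblock */
	unsigned long offset;        /* byte offset of the superblock in that block */
	unsigned long journal_start; /* first journal block: block + 1 */
};

/* Same arithmetic as the kernel code: divide and take the remainder. */
struct sb_location locate_superblock(unsigned int blocksize)
{
	struct sb_location loc;

	loc.block = MIN_BLOCK_SIZE / blocksize;
	loc.offset = MIN_BLOCK_SIZE % blocksize;
	loc.journal_start = loc.block + 1;
	return loc;
}
```

For a 1024-byte block size the superblock fills block 1 at offset 0, so the journal starts at block 2. For any larger block size (2048, 4096, ...) the superblock sits inside block 0 at offset 1024, and the journal starts at block 1.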
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_load_journal(struct super_block *sb,
|
|
|
|
struct ext4_super_block *es,
|
2006-10-11 08:20:50 +00:00
|
|
|
unsigned long journal_devnum)
|
|
|
|
{
|
|
|
|
journal_t *journal;
|
|
|
|
unsigned int journal_inum = le32_to_cpu(es->s_journal_inum);
|
|
|
|
dev_t journal_dev;
|
|
|
|
int err = 0;
|
|
|
|
int really_read_only;
|
2020-07-17 09:06:05 +00:00
|
|
|
int journal_dev_ro;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2020-07-10 14:07:59 +00:00
|
|
|
if (WARN_ON_ONCE(!ext4_has_feature_journal(sb)))
|
|
|
|
return -EFSCORRUPTED;
|
2009-01-07 05:06:22 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
if (journal_devnum &&
|
|
|
|
journal_devnum != le32_to_cpu(es->s_journal_dev)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "external journal device major/minor "
|
|
|
|
"numbers have changed");
|
2006-10-11 08:20:50 +00:00
|
|
|
journal_dev = new_decode_dev(journal_devnum);
|
|
|
|
} else
|
|
|
|
journal_dev = new_decode_dev(le32_to_cpu(es->s_journal_dev));
|
|
|
|
|
2020-07-17 09:06:05 +00:00
|
|
|
if (journal_inum && journal_dev) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"filesystem has both journal inode and journal device!");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (journal_inum) {
|
|
|
|
journal = ext4_get_journal(sb, journal_inum);
|
|
|
|
if (!journal)
|
|
|
|
return -EINVAL;
|
|
|
|
} else {
|
|
|
|
journal = ext4_get_dev_journal(sb, journal_dev);
|
|
|
|
if (!journal)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
journal_dev_ro = bdev_read_only(journal->j_dev);
|
|
|
|
really_read_only = bdev_read_only(sb->s_bdev) | journal_dev_ro;
|
|
|
|
|
|
|
|
if (journal_dev_ro && !sb_rdonly(sb)) {
|
|
|
|
ext4_msg(sb, KERN_ERR,
|
|
|
|
"journal device read-only, try mounting with '-o ro'");
|
|
|
|
err = -EROFS;
|
|
|
|
goto err_out;
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Are we loading a blank journal or performing recovery after a
|
|
|
|
* crash? For recovery, we need to check in advance whether we
|
|
|
|
* can get read-write access to the device.
|
|
|
|
*/
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_feature_journal_needs_recovery(sb)) {
|
2017-07-17 07:45:34 +00:00
|
|
|
if (sb_rdonly(sb)) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "INFO: recovery "
|
|
|
|
"required on readonly filesystem");
|
2006-10-11 08:20:50 +00:00
|
|
|
if (really_read_only) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "write access "
|
2017-10-18 17:06:37 +00:00
|
|
|
"unavailable, cannot proceed "
|
|
|
|
"(try mounting with noload)");
|
2020-07-17 09:06:05 +00:00
|
|
|
err = -EROFS;
|
|
|
|
goto err_out;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "write access will "
|
|
|
|
"be enabled during recovery");
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-09-29 19:51:30 +00:00
|
|
|
if (!(journal->j_flags & JBD2_BARRIER))
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "barriers disabled");
|
2008-09-09 03:00:52 +00:00
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_journal_needs_recovery(sb))
|
2006-10-11 08:21:01 +00:00
|
|
|
err = jbd2_journal_wipe(journal, !really_read_only);
|
2010-07-27 15:56:03 +00:00
|
|
|
if (!err) {
|
|
|
|
char *save = kmalloc(EXT4_S_ERR_LEN, GFP_KERNEL);
|
|
|
|
if (save)
|
|
|
|
memcpy(save, ((char *) es) +
|
|
|
|
EXT4_S_ERR_START, EXT4_S_ERR_LEN);
|
2006-10-11 08:21:01 +00:00
|
|
|
err = jbd2_journal_load(journal);
|
2010-07-27 15:56:03 +00:00
|
|
|
if (save)
|
|
|
|
memcpy(((char *) es) + EXT4_S_ERR_START,
|
|
|
|
save, EXT4_S_ERR_LEN);
|
|
|
|
kfree(save);
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
|
|
|
|
if (err) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_ERR, "error loading journal");
|
2020-07-17 09:06:05 +00:00
|
|
|
goto err_out;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
EXT4_SB(sb)->s_journal = journal;
|
2020-07-10 14:07:59 +00:00
|
|
|
err = ext4_clear_journal_err(sb, es);
|
|
|
|
if (err) {
|
|
|
|
EXT4_SB(sb)->s_journal = NULL;
|
|
|
|
jbd2_journal_destroy(journal);
|
|
|
|
return err;
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2010-10-28 01:30:06 +00:00
|
|
|
if (!really_read_only && journal_devnum &&
|
2006-10-11 08:20:50 +00:00
|
|
|
journal_devnum != le32_to_cpu(es->s_journal_dev)) {
|
|
|
|
es->s_journal_dev = cpu_to_le32(journal_devnum);
|
|
|
|
|
|
|
|
/* Make sure we flush the recovery flag to disk. */
|
2020-12-16 10:18:38 +00:00
|
|
|
ext4_commit_super(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2020-07-17 09:06:05 +00:00
|
|
|
|
|
|
|
err_out:
|
|
|
|
jbd2_journal_destroy(journal);
|
|
|
|
return err;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
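The read-only checks at the start of ext4_load_journal() amount to a small decision table. The sketch below restates it as a pure function in userspace C. The function name and parameters are hypothetical; only the conditions and the -EROFS result are taken from the code above, and the later stages (wipe, load, error clearing) are deliberately left out.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Returns 0 if journal loading may proceed, or -EROFS if the mount must
 * be refused because the required write access is unavailable.
 */
int check_journal_access(bool fs_dev_ro, bool journal_dev_ro,
			 bool mounted_ro, bool needs_recovery)
{
	bool really_read_only = fs_dev_ro || journal_dev_ro;

	/* A read-write mount cannot use a read-only journal device. */
	if (journal_dev_ro && !mounted_ro)
		return -EROFS;

	/*
	 * Recovery on a read-only mount still needs to write to the
	 * devices; if either one is physically read-only, give up.
	 */
	if (needs_recovery && mounted_ro && really_read_only)
		return -EROFS;

	return 0;
}
```

Note the asymmetry this makes visible: a read-only mount on writable devices is still allowed to replay the journal, which is the case the "write access will be enabled during recovery" message reports.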
|
|
|
|
|
2020-12-16 10:18:40 +00:00
|
|
|
/* Copy state of EXT4_SB(sb) into buffer for on-disk superblock */
|
|
|
|
static void ext4_update_super(struct super_block *sb)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2020-11-27 11:34:00 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2020-12-16 10:18:41 +00:00
|
|
|
struct ext4_super_block *es = sbi->s_es;
|
|
|
|
struct buffer_head *sbh = sbi->s_sbh;
|
2018-07-02 22:45:18 +00:00
|
|
|
|
2020-12-16 10:18:39 +00:00
|
|
|
lock_buffer(sbh);
|
2009-09-10 21:31:04 +00:00
|
|
|
/*
|
|
|
|
* If the file system is mounted read-only, don't update the
|
|
|
|
* superblock write time. This avoids updating the superblock
|
|
|
|
* write time when we are mounting the root file system
|
|
|
|
* read/only but we need to replay the journal; at that point,
|
|
|
|
* for people who are east of GMT and who make their clock
|
|
|
|
* tick in localtime for Windows bug-for-bug compatibility,
|
|
|
|
* the clock is set in the future, and this will cause e2fsck
|
|
|
|
* to complain and force a full file system check.
|
|
|
|
*/
|
2017-11-27 21:05:09 +00:00
|
|
|
if (!(sb->s_flags & SB_RDONLY))
|
2018-07-29 19:51:48 +00:00
|
|
|
ext4_update_tstamp(es, s_wtime);
|
2020-11-24 08:36:54 +00:00
|
|
|
es->s_kbytes_written =
|
2021-01-15 22:54:24 +00:00
|
|
|
cpu_to_le64(sbi->s_kbytes_written +
|
2020-11-24 08:36:54 +00:00
|
|
|
((part_stat_read(sb->s_bdev, sectors[STAT_WRITE]) -
|
2021-01-15 22:54:24 +00:00
|
|
|
sbi->s_sectors_written_start) >> 1));
|
2020-12-16 10:18:41 +00:00
|
|
|
if (percpu_counter_initialized(&sbi->s_freeclusters_counter))
|
2014-07-15 10:01:38 +00:00
|
|
|
ext4_free_blocks_count_set(es,
|
2020-12-16 10:18:41 +00:00
|
|
|
EXT4_C2B(sbi, percpu_counter_sum_positive(
|
|
|
|
&sbi->s_freeclusters_counter)));
|
|
|
|
if (percpu_counter_initialized(&sbi->s_freeinodes_counter))
|
2014-07-15 10:01:38 +00:00
|
|
|
es->s_free_inodes_count =
|
|
|
|
cpu_to_le32(percpu_counter_sum_positive(
|
2020-12-16 10:18:41 +00:00
|
|
|
&sbi->s_freeinodes_counter));
|
2020-11-27 11:34:00 +00:00
|
|
|
/* Copy error information to the on-disk superblock */
|
|
|
|
spin_lock(&sbi->s_error_lock);
|
|
|
|
if (sbi->s_add_error_count > 0) {
|
|
|
|
es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
|
|
|
|
if (!es->s_first_error_time && !es->s_first_error_time_hi) {
|
|
|
|
__ext4_update_tstamp(&es->s_first_error_time,
|
|
|
|
&es->s_first_error_time_hi,
|
|
|
|
sbi->s_first_error_time);
|
|
|
|
strncpy(es->s_first_error_func, sbi->s_first_error_func,
|
|
|
|
sizeof(es->s_first_error_func));
|
|
|
|
es->s_first_error_line =
|
|
|
|
cpu_to_le32(sbi->s_first_error_line);
|
|
|
|
es->s_first_error_ino =
|
|
|
|
cpu_to_le32(sbi->s_first_error_ino);
|
|
|
|
es->s_first_error_block =
|
|
|
|
cpu_to_le64(sbi->s_first_error_block);
|
|
|
|
es->s_first_error_errcode =
|
|
|
|
ext4_errno_to_code(sbi->s_first_error_code);
|
|
|
|
}
|
|
|
|
__ext4_update_tstamp(&es->s_last_error_time,
|
|
|
|
&es->s_last_error_time_hi,
|
|
|
|
sbi->s_last_error_time);
|
|
|
|
strncpy(es->s_last_error_func, sbi->s_last_error_func,
|
|
|
|
sizeof(es->s_last_error_func));
|
|
|
|
es->s_last_error_line = cpu_to_le32(sbi->s_last_error_line);
|
|
|
|
es->s_last_error_ino = cpu_to_le32(sbi->s_last_error_ino);
|
|
|
|
es->s_last_error_block = cpu_to_le64(sbi->s_last_error_block);
|
|
|
|
es->s_last_error_errcode =
|
|
|
|
ext4_errno_to_code(sbi->s_last_error_code);
|
|
|
|
/*
|
|
|
|
* Start the daily error reporting function if it hasn't been
|
|
|
|
* started already
|
|
|
|
*/
|
|
|
|
if (!es->s_error_count)
|
|
|
|
mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);
|
|
|
|
le32_add_cpu(&es->s_error_count, sbi->s_add_error_count);
|
|
|
|
sbi->s_add_error_count = 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&sbi->s_error_lock);
|
|
|
|
|
2012-10-10 05:06:58 +00:00
|
|
|
ext4_superblock_csum_set(sb);
|
2020-12-16 10:18:40 +00:00
|
|
|
unlock_buffer(sbh);
|
|
|
|
}
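The s_kbytes_written update in ext4_update_super() converts a sector count into kilobytes with a right shift. Block layer statistics count 512-byte sectors, so two sectors make one kilobyte. The helper below is a userspace sketch of that arithmetic; its name and parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * lifetime_kb_at_mount: kilobytes written before this mount
 * sectors_now:          current count of 512-byte sectors written
 * sectors_at_mount:     the same counter sampled at mount time
 *
 * Returns the lifetime kilobytes written, as stored in the superblock.
 */
uint64_t kbytes_written(uint64_t lifetime_kb_at_mount,
			uint64_t sectors_now, uint64_t sectors_at_mount)
{
	/* sectors >> 1 == sectors * 512 / 1024 */
	return lifetime_kb_at_mount + ((sectors_now - sectors_at_mount) >> 1);
}
```

For example, if 4000 sectors were written since mount and the superblock recorded 100 KB beforehand, the new value is 100 + 2000 = 2100 KB.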
|
|
|
|
|
|
|
|
static int ext4_commit_super(struct super_block *sb)
{
	struct buffer_head *sbh = EXT4_SB(sb)->s_sbh;
	int error = 0;

	if (!sbh)
		return -EINVAL;
	if (block_device_ejected(sb))
		return -ENODEV;

	ext4_update_super(sb);

	if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) {
		/*
		 * Oh, dear. A previous attempt to write the
		 * superblock failed. This could happen because the
		 * USB device was yanked out. Or it could happen to
		 * be a transient write error and maybe the block will
		 * be remapped. Nothing we can do but to retry the
		 * write and hope for the best.
		 */
		ext4_msg(sb, KERN_ERR, "previous I/O error to "
			 "superblock detected");
		clear_buffer_write_io_error(sbh);
		set_buffer_uptodate(sbh);
	}
	BUFFER_TRACE(sbh, "marking dirty");
	mark_buffer_dirty(sbh);
	error = __sync_dirty_buffer(sbh,
		REQ_SYNC | (test_opt(sb, BARRIER) ? REQ_FUA : 0));
	if (buffer_write_io_error(sbh)) {
		ext4_msg(sb, KERN_ERR, "I/O error while writing "
			 "superblock");
		clear_buffer_write_io_error(sbh);
		set_buffer_uptodate(sbh);
	}
	return error;
}

/*
 * Have we just finished recovery? If so, and if we are mounting (or
 * remounting) the filesystem readonly, then we will end up with a
 * consistent fs on disk. Record that fact.
 */
static int ext4_mark_recovery_complete(struct super_block *sb,
				       struct ext4_super_block *es)
{
	int err;
	journal_t *journal = EXT4_SB(sb)->s_journal;

	if (!ext4_has_feature_journal(sb)) {
		if (journal != NULL) {
			ext4_error(sb, "Journal got removed while the fs was "
				   "mounted!");
			return -EFSCORRUPTED;
		}
		return 0;
	}
	jbd2_journal_lock_updates(journal);
	err = jbd2_journal_flush(journal, 0);
	if (err < 0)
		goto out;

	if (ext4_has_feature_journal_needs_recovery(sb) && sb_rdonly(sb)) {
		ext4_clear_feature_journal_needs_recovery(sb);
		ext4_commit_super(sb);
	}
out:
	jbd2_journal_unlock_updates(journal);
	return err;
}

/*
 * If we are mounting (or read-write remounting) a filesystem whose journal
 * has recorded an error from a previous lifetime, move that error to the
 * main filesystem now.
 */
static int ext4_clear_journal_err(struct super_block *sb,
				   struct ext4_super_block *es)
{
	journal_t *journal;
	int j_errno;
	const char *errstr;

	if (!ext4_has_feature_journal(sb)) {
		ext4_error(sb, "Journal got removed while the fs was mounted!");
		return -EFSCORRUPTED;
	}

	journal = EXT4_SB(sb)->s_journal;

	/*
	 * Now check for any error status which may have been recorded in the
	 * journal by a prior ext4_error() or ext4_abort()
	 */

	j_errno = jbd2_journal_errno(journal);
	if (j_errno) {
		char nbuf[16];

		errstr = ext4_decode_error(sb, j_errno, nbuf);
		ext4_warning(sb, "Filesystem error recorded "
			     "from previous mount: %s", errstr);
		ext4_warning(sb, "Marking fs in need of filesystem check.");

		EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
		ext4_commit_super(sb);

		jbd2_journal_clear_err(journal);
		jbd2_journal_update_sb_errno(journal);
	}
	return 0;
}

/*
 * Force the running and committing transactions to commit,
 * and wait on the commit.
 */
int ext4_force_commit(struct super_block *sb)
{
	journal_t *journal;

	if (sb_rdonly(sb))
		return 0;

	journal = EXT4_SB(sb)->s_journal;
	return ext4_journal_force_commit(journal);
}

static int ext4_sync_fs(struct super_block *sb, int wait)
{
	int ret = 0;
	tid_t target;
	bool needs_barrier = false;
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	if (unlikely(ext4_forced_shutdown(sbi)))
		return 0;

	trace_ext4_sync_fs(sb, wait);
	flush_workqueue(sbi->rsv_conversion_wq);
	/*
	 * Writeback quota in non-journalled quota case - journalled quota has
	 * no dirty dquots
	 */
	dquot_writeback_dquots(sb, -1);
	/*
	 * Data writeback is possible w/o a journal transaction, so a barrier
	 * must be sent at the end of the function. But we can skip it if
	 * transaction_commit will do it for us.
	 */
	if (sbi->s_journal) {
		target = jbd2_get_latest_transaction(sbi->s_journal);
		if (wait && sbi->s_journal->j_flags & JBD2_BARRIER &&
		    !jbd2_trans_will_send_data_barrier(sbi->s_journal, target))
			needs_barrier = true;

		if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
			if (wait)
				ret = jbd2_log_wait_commit(sbi->s_journal,
							   target);
		}
	} else if (wait && test_opt(sb, BARRIER))
		needs_barrier = true;
	if (needs_barrier) {
		int err;
		err = blkdev_issue_flush(sb->s_bdev);
		if (!ret)
			ret = err;
	}

	return ret;
}

/*
 * LVM calls this function before a (read-only) snapshot is created. This
 * gives us a chance to flush the journal completely and mark the fs clean.
 *
 * Note that this function alone cannot bring a filesystem into a clean
 * state. It relies on the upper layer to stop all data & metadata
 * modifications.
 */
static int ext4_freeze(struct super_block *sb)
{
	int error = 0;
	journal_t *journal;

	if (sb_rdonly(sb))
		return 0;

	journal = EXT4_SB(sb)->s_journal;

	if (journal) {
		/* Now we set up the journal barrier. */
		jbd2_journal_lock_updates(journal);

		/*
		 * Don't clear the needs_recovery flag if we failed to
		 * flush the journal.
		 */
		error = jbd2_journal_flush(journal, 0);
		if (error < 0)
			goto out;

		/* Journal blocked and flushed, clear needs_recovery flag. */
		ext4_clear_feature_journal_needs_recovery(sb);
	}

	error = ext4_commit_super(sb);
out:
	if (journal)
		/* we rely on upper layer to stop further updates */
		jbd2_journal_unlock_updates(journal);
	return error;
}

/*
 * Called by LVM after the snapshot is done. We need to reset the RECOVER
 * flag here, even though the filesystem is not technically dirty yet.
 */
static int ext4_unfreeze(struct super_block *sb)
{
	if (sb_rdonly(sb) || ext4_forced_shutdown(EXT4_SB(sb)))
		return 0;

	if (EXT4_SB(sb)->s_journal) {
		/* Reset the needs_recovery flag before the fs is unlocked. */
		ext4_set_feature_journal_needs_recovery(sb);
	}

	ext4_commit_super(sb);
	return 0;
}

/*
 * Structure to save mount options for ext4_remount's benefit
 */
struct ext4_mount_options {
	unsigned long s_mount_opt;
	unsigned long s_mount_opt2;
	kuid_t s_resuid;
	kgid_t s_resgid;
	unsigned long s_commit_interval;
	u32 s_min_batch_time, s_max_batch_time;
#ifdef CONFIG_QUOTA
	int s_jquota_fmt;
	char *s_qf_names[EXT4_MAXQUOTAS];
#endif
};

static int ext4_remount(struct super_block *sb, int *flags, char *data)
{
	struct ext4_super_block *es;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4: handle option set by mount flags correctly
Currently there is a problem with mount options that can be set both by
the VFS using mount flags and by string parsing in ext4.
The i_version/iversion option gets lost after remount, for example
$ mount -o i_version /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
$ mount -o remount,ro /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
nolazytime gets ignored by ext4 on remount, for example
$ mount -o lazytime /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
$ mount -o remount,nolazytime /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to
s_flags before we parse the options, and use the resulting state of the
same flags in *flags at the end of a successful remount.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-07-23 15:05:26 +00:00
	unsigned long old_sb_flags, vfs_flags;
	struct ext4_mount_options old_opts;
	int enable_quota = 0;
	ext4_group_t g;
	int err = 0;
#ifdef CONFIG_QUOTA
	int i, j;
	char *to_free[EXT4_MAXQUOTAS];
#endif
	char *orig_data = kstrdup(data, GFP_KERNEL);
	struct ext4_parsed_options parsed_opts;

	parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
	parsed_opts.journal_devnum = 0;

	if (data && !orig_data)
		return -ENOMEM;

	/* Store the original options */
	old_sb_flags = sb->s_flags;
	old_opts.s_mount_opt = sbi->s_mount_opt;
	old_opts.s_mount_opt2 = sbi->s_mount_opt2;
	old_opts.s_resuid = sbi->s_resuid;
	old_opts.s_resgid = sbi->s_resgid;
	old_opts.s_commit_interval = sbi->s_commit_interval;
	old_opts.s_min_batch_time = sbi->s_min_batch_time;
	old_opts.s_max_batch_time = sbi->s_max_batch_time;
#ifdef CONFIG_QUOTA
	old_opts.s_jquota_fmt = sbi->s_jquota_fmt;
	for (i = 0; i < EXT4_MAXQUOTAS; i++)
		if (sbi->s_qf_names[i]) {
			char *qf_name = get_qf_name(sb, sbi, i);

			old_opts.s_qf_names[i] = kstrdup(qf_name, GFP_KERNEL);
			if (!old_opts.s_qf_names[i]) {
				for (j = 0; j < i; j++)
					kfree(old_opts.s_qf_names[j]);
				kfree(orig_data);
				return -ENOMEM;
			}
		} else
			old_opts.s_qf_names[i] = NULL;
#endif
	if (sbi->s_journal && sbi->s_journal->j_task->io_context)
		parsed_opts.journal_ioprio =
			sbi->s_journal->j_task->io_context->ioprio;

	/*
	 * Some options can be enabled by ext4 and/or by a VFS mount flag.
	 * Either way, we need to make sure they match in both *flags and
	 * s_flags. Copy the selected flags from *flags to s_flags.
	 */
	vfs_flags = SB_LAZYTIME | SB_I_VERSION;
	sb->s_flags = (sb->s_flags & ~vfs_flags) | (*flags & vfs_flags);

	if (!parse_options(data, sb, &parsed_opts, 1)) {
		err = -EINVAL;
		goto restore_opts;
	}

	if ((old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) ^
	    test_opt(sb, JOURNAL_CHECKSUM)) {
		ext4_msg(sb, KERN_ERR, "changing journal_checksum "
			 "during remount not supported; ignoring");
		sbi->s_mount_opt ^= EXT4_MOUNT_JOURNAL_CHECKSUM;
	}

	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
		if (test_opt2(sb, EXPLICIT_DELALLOC)) {
			ext4_msg(sb, KERN_ERR, "can't mount with "
				 "both data=journal and delalloc");
			err = -EINVAL;
			goto restore_opts;
		}
		if (test_opt(sb, DIOREAD_NOLOCK)) {
			ext4_msg(sb, KERN_ERR, "can't mount with "
				 "both data=journal and dioread_nolock");
			err = -EINVAL;
			goto restore_opts;
		}
	} else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA) {
		if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
			ext4_msg(sb, KERN_ERR, "can't mount with "
				 "journal_async_commit in data=ordered mode");
			err = -EINVAL;
			goto restore_opts;
		}
	}

	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_NO_MBCACHE) {
		ext4_msg(sb, KERN_ERR, "can't enable nombcache during remount");
		err = -EINVAL;
		goto restore_opts;
	}

	if (ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED))
		ext4_abort(sb, EXT4_ERR_ESHUTDOWN, "Abort forced by user");

	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);

	es = sbi->s_es;

	if (sbi->s_journal) {
		ext4_init_journal_params(sb, sbi->s_journal);
		set_task_ioprio(sbi->s_journal->j_task, parsed_opts.journal_ioprio);
	}

	/* Flush outstanding errors before changing fs state */
	flush_work(&sbi->s_error_work);

	if ((bool)(*flags & SB_RDONLY) != sb_rdonly(sb)) {
		if (ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED)) {
			err = -EROFS;
			goto restore_opts;
		}

		if (*flags & SB_RDONLY) {
			err = sync_filesystem(sb);
			if (err < 0)
				goto restore_opts;
			err = dquot_suspend(sb, -1);
			if (err < 0)
				goto restore_opts;

			/*
			 * First of all, the unconditional stuff we have to do
			 * to disable replay of the journal when we next remount
			 */
			sb->s_flags |= SB_RDONLY;

			/*
			 * OK, test if we are remounting a valid rw partition
			 * readonly, and if so set the rdonly flag and then
			 * mark the partition as valid again.
			 */
			if (!(es->s_state & cpu_to_le16(EXT4_VALID_FS)) &&
			    (sbi->s_mount_state & EXT4_VALID_FS))
				es->s_state = cpu_to_le16(sbi->s_mount_state);

			if (sbi->s_journal) {
				/*
				 * We let remount-ro finish even if marking fs
				 * as clean failed...
				 */
				ext4_mark_recovery_complete(sb, es);
			}
		} else {
			/* Make sure we can mount this feature set readwrite */
			if (ext4_has_feature_readonly(sb) ||
			    !ext4_feature_set_ok(sb, 0)) {
				err = -EROFS;
				goto restore_opts;
			}
			/*
			 * Make sure the group descriptor checksums
			 * are sane. If they aren't, refuse to remount r/w.
			 */
			for (g = 0; g < sbi->s_groups_count; g++) {
				struct ext4_group_desc *gdp =
					ext4_get_group_desc(sb, g, NULL);

				if (!ext4_group_desc_csum_verify(sb, g, gdp)) {
					ext4_msg(sb, KERN_ERR,
						 "ext4_remount: Checksum for group %u failed (%u!=%u)",
						 g, le16_to_cpu(ext4_group_desc_csum(sb, g, gdp)),
						 le16_to_cpu(gdp->bg_checksum));
					err = -EFSBADCRC;
					goto restore_opts;
				}
			}

			/*
			 * If we have an unprocessed orphan list hanging
			 * around from a previously readonly bdev mount,
			 * require a full umount/remount for now.
			 */
			if (es->s_last_orphan) {
				ext4_msg(sb, KERN_WARNING, "Couldn't "
					 "remount RDWR because of unprocessed "
					 "orphan inode list. Please "
					 "umount/remount instead");
				err = -EINVAL;
				goto restore_opts;
			}

			/*
			 * Mounting a RDONLY partition read-write, so reread
			 * and store the current valid flag. (It may have
			 * been changed by e2fsck since we originally mounted
			 * the partition.)
			 */
			if (sbi->s_journal) {
				err = ext4_clear_journal_err(sb, es);
				if (err)
					goto restore_opts;
			}
			sbi->s_mount_state = le16_to_cpu(es->s_state);

			err = ext4_setup_super(sb, es, 0);
			if (err)
				goto restore_opts;

			sb->s_flags &= ~SB_RDONLY;
			if (ext4_has_feature_mmp(sb))
				if (ext4_multi_mount_protect(sb,
						le64_to_cpu(es->s_mmp_block))) {
					err = -EROFS;
					goto restore_opts;
				}
			enable_quota = 1;
		}
	}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by a wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they do not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 01:30:05 +00:00

	/*
	 * Reinitialize lazy itable initialization thread based on
	 * current settings
	 */
	if (sb_rdonly(sb) || !test_opt(sb, INIT_INODE_TABLE))
		ext4_unregister_li_request(sb);
	else {
		ext4_group_t first_not_zeroed;
		first_not_zeroed = ext4_has_uninit_itable(sb);
		ext4_register_li_request(sb, first_not_zeroed);
	}

	/*
	 * Handle creation of system zone data early because it can fail.
	 * Releasing of existing data is done when we are sure remount will
	 * succeed.
	 */
	if (test_opt(sb, BLOCK_VALIDITY) && !sbi->s_system_blks) {
		err = ext4_setup_system_zone(sb);
		if (err)
			goto restore_opts;
	}

	if (sbi->s_journal == NULL && !(old_sb_flags & SB_RDONLY)) {
		err = ext4_commit_super(sb);
		if (err)
			goto restore_opts;
	}

#ifdef CONFIG_QUOTA
	/* Release old quota file names */
	for (i = 0; i < EXT4_MAXQUOTAS; i++)
		kfree(old_opts.s_qf_names[i]);
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 00:21:31 +00:00
	if (enable_quota) {
		if (sb_any_quota_suspended(sb))
			dquot_resume(sb, -1);
		else if (ext4_has_feature_quota(sb)) {
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 00:21:31 +00:00
|
|
|
err = ext4_enable_quotas(sb);
|
2012-08-17 23:08:42 +00:00
|
|
|
if (err)
|
2012-07-23 00:21:31 +00:00
|
|
|
goto restore_opts;
|
|
|
|
}
|
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
2020-09-24 03:03:43 +00:00
|
|
|
if (!test_opt(sb, BLOCK_VALIDITY) && sbi->s_system_blks)
|
2020-07-28 13:04:37 +00:00
|
|
|
ext4_release_system_zone(sb);
|
2010-05-16 16:00:00 +00:00
|
|
|
|
2021-07-02 16:45:02 +00:00
|
|
|
if (!ext4_has_feature_mmp(sb) || sb_rdonly(sb))
|
|
|
|
ext4_stop_mmpd(sbi);
|
|
|
|
|
ext4: handle option set by mount flags correctly
Currently there is a problem with mount options that can be set both by
the VFS using mount flags and by string parsing in ext4.
i_version/iversion options get lost after remount, for example
$ mount -o i_version /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
$ mount -o remount,ro /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
nolazytime gets ignored by ext4 on remount, for example
$ mount -o lazytime /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
$ mount -o remount,nolazytime /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to
s_flags before we parse the option and use the resulting state of the
same flags in *flags at the end of successful remount.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-07-23 15:05:26 +00:00
|
|
|
/*
|
|
|
|
 * Some options can be enabled by ext4 and/or by VFS mount flag;
|
|
|
|
* either way we need to make sure it matches in both *flags and
|
|
|
|
* s_flags. Copy those selected flags from s_flags to *flags
|
|
|
|
*/
|
|
|
|
*flags = (*flags & ~vfs_flags) | (sb->s_flags & vfs_flags);
|
2010-05-16 16:00:00 +00:00
|
|
|
|
2020-10-22 03:21:00 +00:00
|
|
|
ext4_msg(sb, KERN_INFO, "re-mounted. Opts: %s. Quota mode: %s.",
|
|
|
|
orig_data, ext4_quota_mode(sb));
|
2010-05-16 16:00:00 +00:00
|
|
|
kfree(orig_data);
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
restore_opts:
|
|
|
|
sb->s_flags = old_sb_flags;
|
|
|
|
sbi->s_mount_opt = old_opts.s_mount_opt;
|
2010-12-16 01:30:48 +00:00
|
|
|
sbi->s_mount_opt2 = old_opts.s_mount_opt2;
|
2006-10-11 08:20:50 +00:00
|
|
|
sbi->s_resuid = old_opts.s_resuid;
|
|
|
|
sbi->s_resgid = old_opts.s_resgid;
|
|
|
|
sbi->s_commit_interval = old_opts.s_commit_interval;
|
2009-01-04 01:27:38 +00:00
|
|
|
sbi->s_min_batch_time = old_opts.s_min_batch_time;
|
|
|
|
sbi->s_max_batch_time = old_opts.s_max_batch_time;
|
2020-09-24 03:03:43 +00:00
|
|
|
if (!test_opt(sb, BLOCK_VALIDITY) && sbi->s_system_blks)
|
2020-07-28 13:04:37 +00:00
|
|
|
ext4_release_system_zone(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
sbi->s_jquota_fmt = old_opts.s_jquota_fmt;
|
2014-09-11 15:15:15 +00:00
|
|
|
for (i = 0; i < EXT4_MAXQUOTAS; i++) {
|
2018-10-12 13:28:09 +00:00
|
|
|
to_free[i] = get_qf_name(sb, sbi, i);
|
|
|
|
rcu_assign_pointer(sbi->s_qf_names[i], old_opts.s_qf_names[i]);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2018-10-12 13:28:09 +00:00
|
|
|
synchronize_rcu();
|
|
|
|
for (i = 0; i < EXT4_MAXQUOTAS; i++)
|
|
|
|
kfree(to_free[i]);
|
2006-10-11 08:20:50 +00:00
|
|
|
#endif
|
2021-07-02 16:45:02 +00:00
|
|
|
if (!ext4_has_feature_mmp(sb) || sb_rdonly(sb))
|
|
|
|
ext4_stop_mmpd(sbi);
|
2010-05-16 16:00:00 +00:00
|
|
|
kfree(orig_data);
|
2006-10-11 08:20:50 +00:00
|
|
|
return err;
|
|
|
|
}
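
The flag merge at the end of a successful remount (`*flags = (*flags & ~vfs_flags) | (sb->s_flags & vfs_flags);`) can be tried out in user space. The following is a minimal sketch, not kernel code: the `DEMO_*` bit values and the helper names are made up for illustration and do not match the real `SB_*` constants.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; the kernel's SB_* values differ. */
#define DEMO_LAZYTIME  (1UL << 0)
#define DEMO_I_VERSION (1UL << 1)
#define DEMO_RDONLY    (1UL << 2)

/*
 * For every bit set in mask, take the value from s_flags; for every other
 * bit, keep the caller's value from flags.
 */
static unsigned long merge_selected_flags(unsigned long flags,
					  unsigned long s_flags,
					  unsigned long mask)
{
	return (flags & ~mask) | (s_flags & mask);
}

/* Small self-check: lazytime is dropped, i_version is picked up. */
static void merge_selected_flags_demo(void)
{
	unsigned long mask = DEMO_LAZYTIME | DEMO_I_VERSION;
	unsigned long out;

	out = merge_selected_flags(DEMO_RDONLY | DEMO_LAZYTIME,
				   DEMO_I_VERSION, mask);
	assert(out == (DEMO_RDONLY | DEMO_I_VERSION));
}
```

Bits outside the mask (here `DEMO_RDONLY`) pass through untouched, which is why only the selected options are synchronised between the two flag words.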
|
|
|
|
|
2016-01-08 21:01:22 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
static int ext4_statfs_project(struct super_block *sb,
|
|
|
|
kprojid_t projid, struct kstatfs *buf)
|
|
|
|
{
|
|
|
|
struct kqid qid;
|
|
|
|
struct dquot *dquot;
|
|
|
|
u64 limit;
|
|
|
|
u64 curblock;
|
|
|
|
|
|
|
|
qid = make_kqid_projid(projid);
|
|
|
|
dquot = dqget(sb, qid);
|
|
|
|
if (IS_ERR(dquot))
|
|
|
|
return PTR_ERR(dquot);
|
2017-08-07 11:19:50 +00:00
|
|
|
spin_lock(&dquot->dq_dqb_lock);
|
2016-01-08 21:01:22 +00:00
|
|
|
|
2020-02-10 08:24:45 +00:00
|
|
|
limit = min_not_zero(dquot->dq_dqb.dqb_bsoftlimit,
|
|
|
|
dquot->dq_dqb.dqb_bhardlimit);
|
2019-10-16 02:25:01 +00:00
|
|
|
limit >>= sb->s_blocksize_bits;
|
|
|
|
|
2016-01-08 21:01:22 +00:00
|
|
|
if (limit && buf->f_blocks > limit) {
|
2018-05-21 02:49:54 +00:00
|
|
|
curblock = (dquot->dq_dqb.dqb_curspace +
|
|
|
|
dquot->dq_dqb.dqb_rsvspace) >> sb->s_blocksize_bits;
|
2016-01-08 21:01:22 +00:00
|
|
|
buf->f_blocks = limit;
|
|
|
|
buf->f_bfree = buf->f_bavail =
|
|
|
|
(buf->f_blocks > curblock) ?
|
|
|
|
(buf->f_blocks - curblock) : 0;
|
|
|
|
}
|
|
|
|
|
2020-02-10 08:24:45 +00:00
|
|
|
limit = min_not_zero(dquot->dq_dqb.dqb_isoftlimit,
|
|
|
|
dquot->dq_dqb.dqb_ihardlimit);
|
2016-01-08 21:01:22 +00:00
|
|
|
if (limit && buf->f_files > limit) {
|
|
|
|
buf->f_files = limit;
|
|
|
|
buf->f_ffree =
|
|
|
|
(buf->f_files > dquot->dq_dqb.dqb_curinodes) ?
|
|
|
|
(buf->f_files - dquot->dq_dqb.dqb_curinodes) : 0;
|
|
|
|
}
|
|
|
|
|
2017-08-07 11:19:50 +00:00
|
|
|
spin_unlock(&dquot->dq_dqb_lock);
|
2016-01-08 21:01:22 +00:00
|
|
|
dqput(dquot);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
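
The block-limit clamping done by ext4_statfs_project() can be illustrated outside the kernel. This is a rough user-space sketch under stated assumptions: `struct demo_statfs`, `min_not_zero_u64()` and `clamp_to_project_limit()` are hypothetical names, limits and usage are passed in as plain byte counts, and no locking is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Smaller of two values, ignoring zeros (same idea as the kernel's min_not_zero()). */
static uint64_t min_not_zero_u64(uint64_t a, uint64_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Hypothetical stand-in for the two struct kstatfs fields used here. */
struct demo_statfs {
	uint64_t blocks;	/* total blocks reported */
	uint64_t bfree;		/* free blocks reported */
};

/*
 * If a non-zero limit is set and it is smaller than the filesystem size,
 * report the limit as the size and (limit - usage) as the free space.
 */
static void clamp_to_project_limit(struct demo_statfs *st,
				   uint64_t soft_bytes, uint64_t hard_bytes,
				   uint64_t used_bytes,
				   unsigned int blocksize_bits)
{
	uint64_t limit, used;

	assert(blocksize_bits < 64);
	limit = min_not_zero_u64(soft_bytes, hard_bytes) >> blocksize_bits;
	if (limit && st->blocks > limit) {
		used = used_bytes >> blocksize_bits;
		st->blocks = limit;
		st->bfree = st->blocks > used ? st->blocks - used : 0;
	}
}
```

A limit of zero means "no limit", so the reported values are left alone in that case, as they are when the limit exceeds the real filesystem size.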
|
|
|
|
|
2008-07-26 20:15:44 +00:00
|
|
|
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
struct super_block *sb = dentry->d_sb;
|
2006-10-11 08:20:53 +00:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
struct ext4_super_block *es = sbi->s_es;
|
2013-04-10 02:11:22 +00:00
|
|
|
ext4_fsblk_t overhead = 0, resv_blocks;
|
2011-05-24 22:30:07 +00:00
|
|
|
s64 bfree;
|
2013-04-10 02:11:22 +00:00
|
|
|
resv_blocks = EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters));
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2012-07-09 20:27:05 +00:00
|
|
|
if (!test_opt(sb, MINIX_DF))
|
|
|
|
overhead = sbi->s_overhead;
|
2006-10-11 08:20:50 +00:00
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
buf->f_type = EXT4_SUPER_MAGIC;
|
2006-10-11 08:20:50 +00:00
|
|
|
buf->f_bsize = sb->s_blocksize;
|
2012-11-08 15:33:36 +00:00
|
|
|
buf->f_blocks = ext4_blocks_count(es) - EXT4_C2B(sbi, overhead);
|
2011-09-09 22:56:51 +00:00
|
|
|
bfree = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) -
|
|
|
|
percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter);
|
2011-05-24 22:30:07 +00:00
|
|
|
/* prevent underflow in case little free space is available */
|
2011-09-09 22:56:51 +00:00
|
|
|
buf->f_bfree = EXT4_C2B(sbi, max_t(s64, bfree, 0));
|
2013-04-10 02:11:22 +00:00
|
|
|
buf->f_bavail = buf->f_bfree -
|
|
|
|
(ext4_r_blocks_count(es) + resv_blocks);
|
|
|
|
if (buf->f_bfree < (ext4_r_blocks_count(es) + resv_blocks))
|
2006-10-11 08:20:50 +00:00
|
|
|
buf->f_bavail = 0;
|
|
|
|
buf->f_files = le32_to_cpu(es->s_inodes_count);
|
2007-10-17 06:25:44 +00:00
|
|
|
buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
|
2006-10-11 08:20:53 +00:00
|
|
|
buf->f_namelen = EXT4_NAME_LEN;
|
2021-03-22 17:39:43 +00:00
|
|
|
buf->f_fsid = uuid_to_fsid(es->s_uuid);
|
2009-06-03 21:59:28 +00:00
|
|
|
|
2016-01-08 21:01:22 +00:00
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
if (ext4_test_inode_flag(dentry->d_inode, EXT4_INODE_PROJINHERIT) &&
|
|
|
|
sb_has_quota_limits_enabled(sb, PRJQUOTA))
|
|
|
|
ext4_statfs_project(sb, EXT4_I(dentry->d_inode)->i_projid, buf);
|
|
|
|
#endif
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
}
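
The free and available block arithmetic in ext4_statfs() guards against two underflows: the free-minus-dirty cluster count going negative, and the reserved blocks exceeding the free blocks. A simplified sketch follows, assuming a hypothetical `demo_bavail()` helper that takes plain integers instead of per-CPU counters and superblock fields, and that models the cluster-to-block conversion as a left shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blocks available to unprivileged users: free clusters minus dirty
 * clusters (never below zero), converted to blocks, minus the reserved
 * blocks (again never below zero).
 */
static uint64_t demo_bavail(int64_t free_clusters, int64_t dirty_clusters,
			    unsigned int cluster_bits,
			    uint64_t reserved_blocks)
{
	int64_t bfree = free_clusters - dirty_clusters;
	uint64_t free_blocks;

	assert(cluster_bits < 32);
	if (bfree < 0)
		bfree = 0;	/* counters are approximate; clamp at zero */
	free_blocks = (uint64_t)bfree << cluster_bits;
	if (free_blocks < reserved_blocks)
		return 0;
	return free_blocks - reserved_blocks;
}
```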
|
|
|
|
|
|
|
|
|
|
|
|
#ifdef CONFIG_QUOTA
|
|
|
|
|
2017-06-08 12:39:48 +00:00
|
|
|
/*
|
|
|
|
* Helper functions so that transaction is started before we acquire dqio_sem
|
|
|
|
* to keep correct lock ordering of transaction > dqio_sem
|
|
|
|
*/
|
2006-10-11 08:20:50 +00:00
|
|
|
static inline struct inode *dquot_to_inode(struct dquot *dquot)
|
|
|
|
{
|
2012-09-16 10:56:19 +00:00
|
|
|
return sb_dqopt(dquot->dq_sb)->files[dquot->dq_id.type];
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_write_dquot(struct dquot *dquot)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
int ret, err;
|
|
|
|
handle_t *handle;
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
inode = dquot_to_inode(dquot);
|
2013-02-09 02:59:22 +00:00
|
|
|
handle = ext4_journal_start(inode, EXT4_HT_QUOTA,
|
2009-06-03 21:59:28 +00:00
|
|
|
EXT4_QUOTA_TRANS_BLOCKS(dquot->dq_sb));
|
2006-10-11 08:20:50 +00:00
|
|
|
if (IS_ERR(handle))
|
|
|
|
return PTR_ERR(handle);
|
|
|
|
ret = dquot_commit(dquot);
|
2006-10-11 08:20:53 +00:00
|
|
|
err = ext4_journal_stop(handle);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
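
ext4_write_dquot() and the helpers after it share one error-handling pattern: the journal handle is always stopped, and the first error wins. The sketch below reproduces only that pattern in user space; `struct demo_txn`, `demo_txn_stop()` and `demo_commit_in_txn()` are hypothetical stand-ins and not part of the jbd2 or quota APIs.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a journal handle. */
struct demo_txn {
	int stop_result;	/* value the stop operation will report */
	int stopped;		/* set once the transaction has been stopped */
};

static int demo_txn_stop(struct demo_txn *txn)
{
	txn->stopped = 1;
	return txn->stop_result;
}

/*
 * Run op inside the transaction. The transaction is stopped even when op
 * fails; the error from op takes precedence over the error from stopping.
 */
static int demo_commit_in_txn(struct demo_txn *txn,
			      int (*op)(void *), void *arg)
{
	int ret, err;

	assert(txn && op);
	ret = op(arg);
	err = demo_txn_stop(txn);
	if (!ret)
		ret = err;
	return ret;
}

static int demo_op_ok(void *arg)
{
	(void)arg;
	return 0;
}

static int demo_op_fail(void *arg)
{
	(void)arg;
	return -EIO;
}
```

Returning the operation's error rather than the stop error keeps the more specific failure visible to the caller, while still never leaking an open handle.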
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_acquire_dquot(struct dquot *dquot)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
int ret, err;
|
|
|
|
handle_t *handle;
|
|
|
|
|
2013-02-09 02:59:22 +00:00
|
|
|
handle = ext4_journal_start(dquot_to_inode(dquot), EXT4_HT_QUOTA,
|
2009-06-03 21:59:28 +00:00
|
|
|
EXT4_QUOTA_INIT_BLOCKS(dquot->dq_sb));
|
2006-10-11 08:20:50 +00:00
|
|
|
if (IS_ERR(handle))
|
|
|
|
return PTR_ERR(handle);
|
|
|
|
ret = dquot_acquire(dquot);
|
2006-10-11 08:20:53 +00:00
|
|
|
err = ext4_journal_stop(handle);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_release_dquot(struct dquot *dquot)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
int ret, err;
|
|
|
|
handle_t *handle;
|
|
|
|
|
2013-02-09 02:59:22 +00:00
|
|
|
handle = ext4_journal_start(dquot_to_inode(dquot), EXT4_HT_QUOTA,
|
2009-06-03 21:59:28 +00:00
|
|
|
EXT4_QUOTA_DEL_BLOCKS(dquot->dq_sb));
|
2007-09-11 22:23:29 +00:00
|
|
|
if (IS_ERR(handle)) {
|
|
|
|
/* Release dquot anyway to avoid endless cycle in dqput() */
|
|
|
|
dquot_release(dquot);
|
2006-10-11 08:20:50 +00:00
|
|
|
return PTR_ERR(handle);
|
2007-09-11 22:23:29 +00:00
|
|
|
}
|
2006-10-11 08:20:50 +00:00
|
|
|
ret = dquot_release(dquot);
|
2006-10-11 08:20:53 +00:00
|
|
|
err = ext4_journal_stop(handle);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_mark_dquot_dirty(struct dquot *dquot)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2013-03-02 22:57:08 +00:00
|
|
|
struct super_block *sb = dquot->dq_sb;
|
|
|
|
|
2020-10-22 03:20:59 +00:00
|
|
|
if (ext4_is_quota_journalled(sb)) {
|
2006-10-11 08:20:50 +00:00
|
|
|
dquot_mark_dquot_dirty(dquot);
|
2006-10-11 08:20:53 +00:00
|
|
|
return ext4_write_dquot(dquot);
|
2006-10-11 08:20:50 +00:00
|
|
|
} else {
|
|
|
|
return dquot_mark_dquot_dirty(dquot);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_write_info(struct super_block *sb, int type)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
int ret, err;
|
|
|
|
handle_t *handle;
|
|
|
|
|
|
|
|
/* Data block + inode block */
|
2015-03-17 22:25:59 +00:00
|
|
|
handle = ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (IS_ERR(handle))
|
|
|
|
return PTR_ERR(handle);
|
|
|
|
ret = dquot_commit_info(sb, type);
|
2006-10-11 08:20:53 +00:00
|
|
|
err = ext4_journal_stop(handle);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Turn on quotas during mount time - we need to find
|
|
|
|
* the quota file and such...
|
|
|
|
*/
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_quota_on_mount(struct super_block *sb, int type)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2018-10-12 13:28:09 +00:00
|
|
|
return dquot_quota_on_mount(sb, get_qf_name(sb, EXT4_SB(sb), type),
|
2010-05-19 11:16:45 +00:00
|
|
|
EXT4_SB(sb)->s_jquota_fmt, type);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2016-04-01 05:31:28 +00:00
|
|
|
static void lockdep_set_quota_inode(struct inode *inode, int subclass)
|
|
|
|
{
|
|
|
|
struct ext4_inode_info *ei = EXT4_I(inode);
|
|
|
|
|
|
|
|
/* The first argument of lockdep_set_subclass has to be
|
|
|
|
* *exactly* the same as the argument to init_rwsem() --- in
|
|
|
|
* this case, in init_once() --- or lockdep gets unhappy
|
|
|
|
* because the name of the lock is set using the
|
|
|
|
* stringification of the argument to init_rwsem().
|
|
|
|
*/
|
|
|
|
(void) ei; /* shut up clang warning if !CONFIG_LOCKDEP */
|
|
|
|
lockdep_set_subclass(&ei->i_data_sem, subclass);
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/*
|
|
|
|
* Standard function to be called on quota_on
|
|
|
|
*/
|
2006-10-11 08:20:53 +00:00
|
|
|
static int ext4_quota_on(struct super_block *sb, int type, int format_id,
|
2016-11-21 00:49:34 +00:00
|
|
|
const struct path *path)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (!test_opt(sb, QUOTA))
|
|
|
|
return -EINVAL;
|
2008-05-13 23:11:51 +00:00
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/* Quotafile not on the same filesystem? */
|
2011-12-07 23:16:57 +00:00
|
|
|
if (path->dentry->d_sb != sb)
|
2006-10-11 08:20:50 +00:00
|
|
|
return -EXDEV;
|
2020-10-15 11:03:30 +00:00
|
|
|
|
|
|
|
/* Quota already enabled for this file? */
|
|
|
|
if (IS_NOQUOTA(d_inode(path->dentry)))
|
|
|
|
return -EBUSY;
|
|
|
|
|
2008-05-13 23:11:51 +00:00
|
|
|
/* Journaling quota? */
|
|
|
|
if (EXT4_SB(sb)->s_qf_names[type]) {
|
2008-07-26 20:15:44 +00:00
|
|
|
/* Quotafile not in fs root? */
|
2010-09-15 15:38:58 +00:00
|
|
|
if (path->dentry->d_parent != sb->s_root)
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING,
|
|
|
|
"Quota file not on filesystem root. "
|
|
|
|
"Journaled quota will not work");
|
2017-08-03 09:25:55 +00:00
|
|
|
sb_dqopt(sb)->flags |= DQUOT_NOLIST_DIRTY;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Clear the flag just in case mount options changed since
|
|
|
|
* last time.
|
|
|
|
*/
|
|
|
|
sb_dqopt(sb)->flags &= ~DQUOT_NOLIST_DIRTY;
|
2008-07-26 20:15:44 +00:00
|
|
|
}
|
2008-05-13 23:11:51 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When we journal data on quota file, we have to flush journal to see
|
|
|
|
* all updates to the file when we bypass pagecache...
|
|
|
|
*/
|
2009-01-07 05:06:22 +00:00
|
|
|
if (EXT4_SB(sb)->s_journal &&
|
2015-03-17 22:25:59 +00:00
|
|
|
ext4_should_journal_data(d_inode(path->dentry))) {
|
2008-05-13 23:11:51 +00:00
|
|
|
/*
|
|
|
|
* We don't need to lock updates but journal_flush() could
|
|
|
|
* otherwise be livelocked...
|
|
|
|
*/
|
|
|
|
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
|
2021-05-18 15:13:25 +00:00
|
|
|
err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
|
2008-05-13 23:11:51 +00:00
|
|
|
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
|
2010-09-15 15:38:58 +00:00
|
|
|
if (err)
|
2008-10-11 00:29:21 +00:00
|
|
|
return err;
|
2008-05-13 23:11:51 +00:00
|
|
|
}
|
2017-04-06 13:40:06 +00:00
|
|
|
|
2016-04-01 05:31:28 +00:00
|
|
|
lockdep_set_quota_inode(path->dentry->d_inode, I_DATA_SEM_QUOTA);
|
|
|
|
err = dquot_quota_on(sb, type, format_id, path);
|
2017-04-06 13:40:06 +00:00
|
|
|
if (err) {
|
2016-04-01 05:31:28 +00:00
|
|
|
lockdep_set_quota_inode(path->dentry->d_inode,
|
|
|
|
I_DATA_SEM_NORMAL);
|
2017-04-06 13:40:06 +00:00
|
|
|
} else {
|
|
|
|
struct inode *inode = d_inode(path->dentry);
|
|
|
|
handle_t *handle;
|
|
|
|
|
2017-04-24 14:49:16 +00:00
|
|
|
/*
|
|
|
|
* Set inode flags to prevent userspace from messing with quota
|
|
|
|
* files. If this fails, we return success anyway since quotas
|
|
|
|
* are already enabled and this is not a hard failure.
|
|
|
|
*/
|
2017-04-06 13:40:06 +00:00
|
|
|
inode_lock(inode);
|
|
|
|
handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
|
|
|
|
if (IS_ERR(handle))
|
|
|
|
goto unlock_inode;
|
|
|
|
EXT4_I(inode)->i_flags |= EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL;
|
|
|
|
inode_set_flags(inode, S_NOATIME | S_IMMUTABLE,
|
|
|
|
S_NOATIME | S_IMMUTABLE);
|
2020-04-27 01:34:37 +00:00
|
|
|
err = ext4_mark_inode_dirty(handle, inode);
|
2017-04-06 13:40:06 +00:00
|
|
|
ext4_journal_stop(handle);
|
|
|
|
unlock_inode:
|
|
|
|
inode_unlock(inode);
|
|
|
|
}
|
2016-04-01 05:31:28 +00:00
|
|
|
return err;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2012-07-23 00:21:31 +00:00
|
|
|
static int ext4_quota_enable(struct super_block *sb, int type, int format_id,
|
|
|
|
unsigned int flags)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct inode *qf_inode;
|
2014-09-11 15:15:15 +00:00
|
|
|
unsigned long qf_inums[EXT4_MAXQUOTAS] = {
|
2012-07-23 00:21:31 +00:00
|
|
|
le32_to_cpu(EXT4_SB(sb)->s_es->s_usr_quota_inum),
|
2016-01-08 21:01:22 +00:00
|
|
|
le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum),
|
|
|
|
le32_to_cpu(EXT4_SB(sb)->s_es->s_prj_quota_inum)
|
2012-07-23 00:21:31 +00:00
|
|
|
};
|
|
|
|
|
2015-10-17 20:18:43 +00:00
|
|
|
BUG_ON(!ext4_has_feature_quota(sb));
|
2012-07-23 00:21:31 +00:00
|
|
|
|
|
|
|
if (!qf_inums[type])
|
|
|
|
return -EPERM;
|
|
|
|
|
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
	#define _GNU_SOURCE
	#include <fcntl.h>
	struct handle {
		struct file_handle fh;
		unsigned char fid[MAX_HANDLE_SZ];
	};
	int main(int argc, char **argv)
	{
		struct handle h = {{8, 1 }, { 12, }};
		open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
		return 0;
	}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 17:29:13 +00:00
|
|
|
qf_inode = ext4_iget(sb, qf_inums[type], EXT4_IGET_SPECIAL);
|
2012-07-23 00:21:31 +00:00
|
|
|
if (IS_ERR(qf_inode)) {
|
|
|
|
ext4_error(sb, "Bad quota inode # %lu", qf_inums[type]);
|
|
|
|
return PTR_ERR(qf_inode);
|
|
|
|
}
|
|
|
|
|
2013-04-09 13:21:41 +00:00
|
|
|
/* Don't account quota for quota files to avoid recursion */
|
|
|
|
qf_inode->i_flags |= S_NOQUOTA;
|
2016-04-01 05:31:28 +00:00
|
|
|
lockdep_set_quota_inode(qf_inode, I_DATA_SEM_QUOTA);
|
2019-11-01 17:55:38 +00:00
|
|
|
err = dquot_load_quota_inode(qf_inode, type, format_id, flags);
|
2016-04-01 05:31:28 +00:00
|
|
|
if (err)
|
|
|
|
lockdep_set_quota_inode(qf_inode, I_DATA_SEM_NORMAL);
|
2018-12-04 04:28:02 +00:00
|
|
|
iput(qf_inode);
|
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 00:21:31 +00:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Enable usage tracking for all quota types. */
|
|
|
|
static int ext4_enable_quotas(struct super_block *sb)
|
|
|
|
{
|
|
|
|
int type, err = 0;
|
2014-09-11 15:15:15 +00:00
|
|
|
unsigned long qf_inums[EXT4_MAXQUOTAS] = {
|
2012-07-23 00:21:31 +00:00
|
|
|
le32_to_cpu(EXT4_SB(sb)->s_es->s_usr_quota_inum),
|
2016-01-08 21:01:22 +00:00
|
|
|
le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum),
|
|
|
|
le32_to_cpu(EXT4_SB(sb)->s_es->s_prj_quota_inum)
|
2012-07-23 00:21:31 +00:00
|
|
|
};
|
2016-09-06 03:08:16 +00:00
|
|
|
bool quota_mopt[EXT4_MAXQUOTAS] = {
|
|
|
|
test_opt(sb, USRQUOTA),
|
|
|
|
test_opt(sb, GRPQUOTA),
|
|
|
|
test_opt(sb, PRJQUOTA),
|
|
|
|
};
|
2012-07-23 00:21:31 +00:00
|
|
|
|
2017-08-03 09:25:55 +00:00
|
|
|
sb_dqopt(sb)->flags |= DQUOT_QUOTA_SYS_FILE | DQUOT_NOLIST_DIRTY;
|
2014-09-11 15:15:15 +00:00
|
|
|
for (type = 0; type < EXT4_MAXQUOTAS; type++) {
|
2012-07-23 00:21:31 +00:00
|
|
|
if (qf_inums[type]) {
|
|
|
|
err = ext4_quota_enable(sb, type, QFMT_VFS_V1,
|
2016-09-06 03:08:16 +00:00
|
|
|
DQUOT_USAGE_ENABLED |
|
|
|
|
(quota_mopt[type] ? DQUOT_LIMITS_ENABLED : 0));
|
2012-07-23 00:21:31 +00:00
|
|
|
if (err) {
|
|
|
|
ext4_warning(sb,
|
2013-01-25 04:24:54 +00:00
|
|
|
"Failed to enable quota tracking "
|
|
|
|
"(type=%d, err=%d). Please run "
|
|
|
|
"e2fsck to fix.", type, err);
|
2018-07-29 19:51:52 +00:00
|
|
|
for (type--; type >= 0; type--)
|
|
|
|
dquot_quota_off(sb, type);
|
|
|
|
|
2012-07-23 00:21:31 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
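The error path in ext4_enable_quotas() above disables, in reverse order, every quota type it visited before the failure. A minimal userspace sketch of that enable-then-unwind pattern follows. All `demo_*` names, the `DEMO_MAXTYPES` constant, and the `fail_type` injection parameter are hypothetical and exist only for illustration; none of this is kernel API.

```c
#include <assert.h>

#define DEMO_MAXTYPES 3	/* hypothetical stand-in for EXT4_MAXQUOTAS */

/* Hypothetical per-type state: 1 while a type is enabled. */
struct demo_state {
	int enabled[DEMO_MAXTYPES];
};

/* Enable one type; fails with -5 when type == fail_type (test hook). */
static int demo_enable(struct demo_state *s, int type, int fail_type)
{
	if (type == fail_type)
		return -5;
	s->enabled[type] = 1;
	return 0;
}

static void demo_disable(struct demo_state *s, int type)
{
	s->enabled[type] = 0;
}

/* Enable every present type; on failure, unwind the earlier ones. */
int demo_enable_all(struct demo_state *s, const int present[DEMO_MAXTYPES],
		    int fail_type)
{
	int type, err;

	for (type = 0; type < DEMO_MAXTYPES; type++) {
		if (!present[type])
			continue;
		err = demo_enable(s, type, fail_type);
		if (err) {
			/* Walk back over the types visited before this one. */
			for (type--; type >= 0; type--)
				demo_disable(s, type);
			return err;
		}
	}
	return 0;
}
```

As in the original loop, the unwind also visits types that were skipped because they were not present; disabling an already-disabled type is harmless in this sketch.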
|
|
|
|
|
2010-08-01 21:48:36 +00:00
|
|
|
static int ext4_quota_off(struct super_block *sb, int type)
|
|
|
|
{
|
2011-04-04 19:33:39 +00:00
|
|
|
struct inode *inode = sb_dqopt(sb)->files[type];
|
|
|
|
handle_t *handle;
|
2017-04-06 13:40:06 +00:00
|
|
|
int err;
|
2011-04-04 19:33:39 +00:00
|
|
|
|
2010-11-08 18:47:33 +00:00
|
|
|
/* Force all delayed allocation blocks to be allocated.
|
|
|
|
* Caller already holds s_umount sem */
|
|
|
|
if (test_opt(sb, DELALLOC))
|
2010-08-01 21:48:36 +00:00
|
|
|
sync_filesystem(sb);
|
|
|
|
|
2017-04-06 13:40:06 +00:00
|
|
|
if (!inode || !igrab(inode))
|
2011-05-16 13:59:13 +00:00
|
|
|
goto out;
|
|
|
|
|
2017-04-06 13:40:06 +00:00
|
|
|
err = dquot_quota_off(sb, type);
|
2017-05-22 02:31:23 +00:00
|
|
|
if (err || ext4_has_feature_quota(sb))
|
2017-04-06 13:40:06 +00:00
|
|
|
goto out_put;
|
|
|
|
|
|
|
|
inode_lock(inode);
|
2017-04-24 14:49:16 +00:00
|
|
|
/*
|
|
|
|
* Update modification times of quota files when userspace can
|
|
|
|
* start looking at them. If we fail, we return success anyway since
|
|
|
|
* this is not a hard failure and quotas are already disabled.
|
|
|
|
*/
|
2013-02-09 02:59:22 +00:00
|
|
|
handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
|
2020-04-27 01:34:37 +00:00
|
|
|
if (IS_ERR(handle)) {
|
|
|
|
err = PTR_ERR(handle);
|
2017-04-06 13:40:06 +00:00
|
|
|
goto out_unlock;
|
2020-04-27 01:34:37 +00:00
|
|
|
}
|
2017-04-06 13:40:06 +00:00
|
|
|
EXT4_I(inode)->i_flags &= ~(EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL);
|
|
|
|
inode_set_flags(inode, 0, S_NOATIME | S_IMMUTABLE);
|
2016-11-15 02:40:10 +00:00
|
|
|
inode->i_mtime = inode->i_ctime = current_time(inode);
|
2020-04-27 01:34:37 +00:00
|
|
|
err = ext4_mark_inode_dirty(handle, inode);
|
2011-04-04 19:33:39 +00:00
|
|
|
ext4_journal_stop(handle);
|
2017-04-06 13:40:06 +00:00
|
|
|
out_unlock:
|
|
|
|
inode_unlock(inode);
|
|
|
|
out_put:
|
2017-05-22 02:31:23 +00:00
|
|
|
lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
|
2017-04-06 13:40:06 +00:00
|
|
|
iput(inode);
|
|
|
|
return err;
|
2011-04-04 19:33:39 +00:00
|
|
|
out:
|
2010-08-01 21:48:36 +00:00
|
|
|
return dquot_quota_off(sb, type);
|
|
|
|
}
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
/* Read data from quotafile - avoid pagecache and such because we cannot afford
|
|
|
|
* acquiring the locks... As quota files are never truncated and quota code
|
2011-03-31 01:57:33 +00:00
|
|
|
* itself serializes the operations (and no one else should touch the files)
|
2006-10-11 08:20:50 +00:00
|
|
|
* we don't have to be afraid of races */
|
2006-10-11 08:20:53 +00:00
|
|
|
static ssize_t ext4_quota_read(struct super_block *sb, int type, char *data,
|
2006-10-11 08:20:50 +00:00
|
|
|
size_t len, loff_t off)
|
|
|
|
{
|
|
|
|
struct inode *inode = sb_dqopt(sb)->files[type];
|
2008-01-29 04:58:27 +00:00
|
|
|
ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
|
2006-10-11 08:20:50 +00:00
|
|
|
int offset = off & (sb->s_blocksize - 1);
|
|
|
|
int tocopy;
|
|
|
|
size_t toread;
|
|
|
|
struct buffer_head *bh;
|
|
|
|
loff_t i_size = i_size_read(inode);
|
|
|
|
|
|
|
|
if (off > i_size)
|
|
|
|
return 0;
|
|
|
|
if (off+len > i_size)
|
|
|
|
len = i_size-off;
|
|
|
|
toread = len;
|
|
|
|
while (toread > 0) {
|
|
|
|
tocopy = sb->s_blocksize - offset < toread ?
|
|
|
|
sb->s_blocksize - offset : toread;
|
2014-08-30 00:52:15 +00:00
|
|
|
bh = ext4_bread(NULL, inode, blk, 0);
|
|
|
|
if (IS_ERR(bh))
|
|
|
|
return PTR_ERR(bh);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (!bh) /* A hole? */
|
|
|
|
memset(data, 0, tocopy);
|
|
|
|
else
|
|
|
|
memcpy(data, bh->b_data+offset, tocopy);
|
|
|
|
brelse(bh);
|
|
|
|
offset = 0;
|
|
|
|
toread -= tocopy;
|
|
|
|
data += tocopy;
|
|
|
|
blk++;
|
|
|
|
}
|
|
|
|
return len;
|
|
|
|
}
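The loop in ext4_quota_read() above splits a byte range into per-block pieces: a block index, an offset inside the first block, and a copy length capped at the block boundary, with holes read back as zeros. The sketch below reproduces that arithmetic in userspace under stated assumptions: the "file" is a hypothetical in-memory array of blocks (`struct demo_file`), a NULL entry stands in for a hole, and the block size is a power of two. It does not use any kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory file: fixed-size blocks, NULL entry == hole. */
struct demo_file {
	const unsigned char **blocks;
	size_t nblocks;
	size_t block_size;	/* assumed to be a power of two */
	size_t size;		/* logical file size in bytes */
};

/* Copy up to len bytes starting at off into data; returns bytes read. */
size_t demo_read(const struct demo_file *f, unsigned char *data,
		 size_t len, size_t off)
{
	size_t blk, offset, toread, tocopy;

	assert(f->block_size && (f->block_size & (f->block_size - 1)) == 0);

	if (off > f->size)
		return 0;
	if (off + len > f->size)
		len = f->size - off;	/* clamp the read at end of file */

	blk = off / f->block_size;
	offset = off & (f->block_size - 1);
	toread = len;
	while (toread > 0) {
		/* Never copy past the end of the current block. */
		tocopy = f->block_size - offset < toread ?
			 f->block_size - offset : toread;
		if (blk >= f->nblocks || !f->blocks[blk])
			memset(data, 0, tocopy);	/* a hole reads as zeros */
		else
			memcpy(data, f->blocks[blk] + offset, tocopy);
		offset = 0;	/* later blocks are read from their start */
		toread -= tocopy;
		data += tocopy;
		blk++;
	}
	return len;
}
```

Only the first block can start at a non-zero offset, which is why `offset` is reset to 0 after the first iteration.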
|
|
|
|
|
|
|
|
/* Write to quotafile (we know the transaction is already started and has
|
|
|
|
* enough credits) */
|
2006-10-11 08:20:53 +00:00
|
|
|
static ssize_t ext4_quota_write(struct super_block *sb, int type,
|
2006-10-11 08:20:50 +00:00
|
|
|
const char *data, size_t len, loff_t off)
|
|
|
|
{
|
|
|
|
struct inode *inode = sb_dqopt(sb)->files[type];
|
2008-01-29 04:58:27 +00:00
|
|
|
ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
|
2020-04-27 01:34:37 +00:00
|
|
|
int err = 0, err2 = 0, offset = off & (sb->s_blocksize - 1);
|
2015-06-21 05:25:29 +00:00
|
|
|
int retries = 0;
|
2006-10-11 08:20:50 +00:00
|
|
|
struct buffer_head *bh;
|
|
|
|
handle_t *handle = journal_current_handle();
|
|
|
|
|
2009-01-07 05:06:22 +00:00
|
|
|
if (EXT4_SB(sb)->s_journal && !handle) {
|
2009-06-04 21:36:36 +00:00
|
|
|
ext4_msg(sb, KERN_WARNING, "Quota write (off=%llu, len=%llu)"
|
|
|
|
" cancelled because transaction is not started",
|
2007-09-11 22:23:29 +00:00
|
|
|
(unsigned long long)off, (unsigned long long)len);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2010-03-02 13:08:51 +00:00
|
|
|
/*
|
|
|
|
* Since we account only one data block in transaction credits,
|
|
|
|
* then it is impossible to cross a block boundary.
|
|
|
|
*/
|
|
|
|
if (sb->s_blocksize - offset < len) {
|
|
|
|
ext4_msg(sb, KERN_WARNING, "Quota write (off=%llu, len=%llu)"
|
|
|
|
" cancelled because not block aligned",
|
|
|
|
(unsigned long long)off, (unsigned long long)len);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
2015-06-21 05:25:29 +00:00
|
|
|
do {
|
|
|
|
bh = ext4_bread(handle, inode, blk,
|
|
|
|
EXT4_GET_BLOCKS_CREATE |
|
|
|
|
EXT4_GET_BLOCKS_METADATA_NOFAIL);
|
2020-02-04 01:37:45 +00:00
|
|
|
} while (PTR_ERR(bh) == -ENOSPC &&
|
2015-06-21 05:25:29 +00:00
|
|
|
ext4_should_retry_alloc(inode->i_sb, &retries));
|
2014-08-30 00:52:15 +00:00
|
|
|
if (IS_ERR(bh))
|
|
|
|
return PTR_ERR(bh);
|
2010-03-02 13:08:51 +00:00
|
|
|
if (!bh)
|
|
|
|
goto out;
|
2014-05-13 02:06:43 +00:00
|
|
|
BUFFER_TRACE(bh, "get write access");
|
2021-08-16 09:57:04 +00:00
|
|
|
err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE);
|
2010-07-27 15:56:07 +00:00
|
|
|
if (err) {
|
|
|
|
brelse(bh);
|
2014-08-30 00:52:15 +00:00
|
|
|
return err;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2010-03-02 13:08:51 +00:00
|
|
|
lock_buffer(bh);
|
|
|
|
memcpy(bh->b_data+offset, data, len);
|
|
|
|
flush_dcache_page(bh->b_page);
|
|
|
|
unlock_buffer(bh);
|
2010-07-27 15:56:07 +00:00
|
|
|
err = ext4_handle_dirty_metadata(handle, NULL, bh);
|
2010-03-02 13:08:51 +00:00
|
|
|
brelse(bh);
|
2006-10-11 08:20:50 +00:00
|
|
|
out:
|
2010-03-02 13:08:51 +00:00
|
|
|
if (inode->i_size < off + len) {
|
|
|
|
i_size_write(inode, off + len);
|
2006-10-11 08:20:53 +00:00
|
|
|
EXT4_I(inode)->i_disksize = inode->i_size;
|
2020-04-27 01:34:37 +00:00
|
|
|
err2 = ext4_mark_inode_dirty(handle, inode);
|
|
|
|
if (unlikely(err2 && !err))
|
|
|
|
err = err2;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
2020-04-27 01:34:37 +00:00
|
|
|
return err ? err : len;
|
2006-10-11 08:20:50 +00:00
|
|
|
}
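ext4_quota_write() above rejects any write that would cross a block boundary, because only one data block is accounted in the transaction credits. A minimal sketch of that check, assuming a power-of-two block size, is shown below; `demo_fits_in_block` is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when [off, off + len) stays inside a single block.
 * This is the inverse of the kernel's rejection test
 * "sb->s_blocksize - offset < len".
 */
bool demo_fits_in_block(size_t block_size, size_t off, size_t len)
{
	size_t offset;

	/* The mask below is only valid for power-of-two block sizes. */
	assert(block_size && (block_size & (block_size - 1)) == 0);

	offset = off & (block_size - 1);	/* position inside the block */
	return block_size - offset >= len;
}
```

For example, with a 4096-byte block a 96-byte write at offset 4000 fits exactly, while a 97-byte write at the same offset would spill into the next block and be refused.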
|
|
|
|
#endif
|
|
|
|
|
2010-07-24 20:46:55 +00:00
|
|
|
static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
|
|
|
|
const char *dev_name, void *data)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2010-07-24 20:46:55 +00:00
|
|
|
return mount_bdev(fs_type, flags, dev_name, data, ext4_fill_super);
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
2015-06-18 14:52:29 +00:00
|
|
|
#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
|
2009-12-07 19:08:51 +00:00
|
|
|
static inline void register_as_ext2(void)
|
|
|
|
{
|
|
|
|
int err = register_filesystem(&ext2_fs_type);
|
|
|
|
if (err)
|
|
|
|
printk(KERN_WARNING
|
|
|
|
"EXT4-fs: Unable to register as ext2 (%d)\n", err);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void unregister_as_ext2(void)
|
|
|
|
{
|
|
|
|
unregister_filesystem(&ext2_fs_type);
|
|
|
|
}
|
2011-04-18 21:29:14 +00:00
|
|
|
|
|
|
|
static inline int ext2_feature_set_ok(struct super_block *sb)
|
|
|
|
{
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_unknown_ext2_incompat_features(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 0;
|
2017-07-17 07:45:34 +00:00
|
|
|
if (sb_rdonly(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 1;
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_unknown_ext2_ro_compat_features(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
|
2009-12-07 19:08:51 +00:00
|
|
|
#else
|
|
|
|
static inline void register_as_ext2(void) { }
|
|
|
|
static inline void unregister_as_ext2(void) { }
|
2011-04-18 21:29:14 +00:00
|
|
|
static inline int ext2_feature_set_ok(struct super_block *sb) { return 0; }
|
2009-12-07 19:08:51 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline void register_as_ext3(void)
|
|
|
|
{
|
|
|
|
int err = register_filesystem(&ext3_fs_type);
|
|
|
|
if (err)
|
|
|
|
printk(KERN_WARNING
|
|
|
|
"EXT4-fs: Unable to register as ext3 (%d)\n", err);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void unregister_as_ext3(void)
|
|
|
|
{
|
|
|
|
unregister_filesystem(&ext3_fs_type);
|
|
|
|
}
|
2011-04-18 21:29:14 +00:00
|
|
|
|
|
|
|
static inline int ext3_feature_set_ok(struct super_block *sb)
|
|
|
|
{
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_unknown_ext3_incompat_features(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 0;
|
2015-10-17 20:18:43 +00:00
|
|
|
if (!ext4_has_feature_journal(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 0;
|
2017-07-17 07:45:34 +00:00
|
|
|
if (sb_rdonly(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 1;
|
2015-10-17 20:18:43 +00:00
|
|
|
if (ext4_has_unknown_ext3_ro_compat_features(sb))
|
2011-04-18 21:29:14 +00:00
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
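The ext2/ext3 feature checks above share one ordering: an unknown incompat feature always refuses the mount, a read-only mount is then accepted, and only a read-write mount is additionally refused for unknown ro_compat features. The sketch below models that ordering with plain bitmasks. The struct, the function name, and the bit values are hypothetical; the real helpers are generated per feature set in the ext4 headers, and the ext3 variant additionally requires the journal feature, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature words, loosely modelled on the superblock fields. */
struct demo_features {
	uint32_t incompat;
	uint32_t ro_compat;
};

bool demo_feature_set_ok(struct demo_features on_disk,
			 struct demo_features supported, bool read_only)
{
	/* Any unknown incompat bit makes the filesystem unmountable. */
	if (on_disk.incompat & ~supported.incompat)
		return false;
	/* Unknown ro_compat bits are tolerated on a read-only mount. */
	if (read_only)
		return true;
	/* A read-write mount needs every ro_compat bit to be understood. */
	if (on_disk.ro_compat & ~supported.ro_compat)
		return false;
	return true;
}
```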
|
2009-12-07 19:08:51 +00:00
|
|
|
|
2008-10-11 00:02:48 +00:00
|
|
|
static struct file_system_type ext4_fs_type = {
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.name = "ext4",
|
2010-07-24 20:46:55 +00:00
|
|
|
.mount = ext4_mount,
|
2008-10-11 00:02:48 +00:00
|
|
|
.kill_sb = kill_block_super,
|
2021-01-21 13:19:57 +00:00
|
|
|
.fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
|
2008-10-11 00:02:48 +00:00
|
|
|
};
|
2013-03-03 03:39:14 +00:00
|
|
|
MODULE_ALIAS_FS("ext4");
|
2008-10-11 00:02:48 +00:00
|
|
|
|
ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.
The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.
I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.
This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.
Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.
[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]
[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 13:17:34 +00:00
|
|
|
/* Shared across all ext4 file systems */
|
|
|
|
wait_queue_head_t ext4__ioend_wq[EXT4_WQ_HASH_SZ];
|
|
|
|
|
2010-10-28 01:30:14 +00:00
|
|
|
static int __init ext4_init_fs(void)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2011-02-12 13:17:34 +00:00
|
|
|
int i, err;
|
2008-01-29 05:19:52 +00:00
|
|
|
|
2015-08-15 18:59:44 +00:00
|
|
|
ratelimit_state_init(&ext4_mount_msg_ratelimit, 30 * HZ, 64);
|
2012-03-21 02:05:02 +00:00
|
|
|
ext4_li_info = NULL;
|
|
|
|
|
ext4: ensure Inode flags consistency are checked at build time
The flags used by atomic operations on inode flags (e.g.
ext4_test_inode_flag()) should be consistent with those actually stored
in inodes, i.e. EXT4_XXX_FL.
This patch ensures that this consistency is checked at build time, not at
run time.
Currently, flag consistency is checked at run time, but there is no
real reason not to do a build-time check instead. The code compares
macro-defined values with enum values; both are constants, so there is
no problem in comparing them at build time.
Enum values are treated as constants by the C compiler, according
to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
sec. 6.2.5, item 16), so there is no real problem in comparing an
enumeration type at build time.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 21:30:45 +00:00
|
|
|
/* Build-time check for flags consistency */
|
2010-05-17 02:00:00 +00:00
|
|
|
ext4_check_flag_values();
|
2011-02-12 13:17:34 +00:00
|
|
|
|
2016-03-09 03:44:50 +00:00
|
|
|
for (i = 0; i < EXT4_WQ_HASH_SZ; i++)
|
2011-02-12 13:17:34 +00:00
|
|
|
init_waitqueue_head(&ext4__ioend_wq[i]);
|
|
|
|
|
2012-11-09 02:57:32 +00:00
|
|
|
err = ext4_init_es();
|
2009-05-17 19:38:01 +00:00
|
|
|
if (err)
|
|
|
|
return err;
|
2012-11-09 02:57:32 +00:00
|
|
|
|
2018-10-01 18:17:41 +00:00
|
|
|
err = ext4_init_pending();
|
2019-07-22 16:26:24 +00:00
|
|
|
if (err)
|
|
|
|
goto out7;
|
|
|
|
|
|
|
|
err = ext4_init_post_read_processing();
|
2018-10-01 18:17:41 +00:00
|
|
|
if (err)
|
|
|
|
goto out6;
|
|
|
|
|
2012-11-09 02:57:32 +00:00
|
|
|
err = ext4_init_pageio();
|
|
|
|
if (err)
|
2015-09-23 16:44:17 +00:00
|
|
|
goto out5;
|
2012-11-09 02:57:32 +00:00
|
|
|
|
2010-10-28 01:30:14 +00:00
|
|
|
err = ext4_init_system_zone();
|
2010-10-28 01:30:10 +00:00
|
|
|
if (err)
|
2015-09-23 16:44:17 +00:00
|
|
|
goto out4;
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2015-09-23 16:44:17 +00:00
|
|
|
err = ext4_init_sysfs();
|
2011-02-03 19:33:49 +00:00
|
|
|
if (err)
|
2015-09-23 16:44:17 +00:00
|
|
|
goto out3;
|
2010-10-28 01:30:05 +00:00
|
|
|
|
2010-10-28 01:30:14 +00:00
|
|
|
err = ext4_init_mballoc();
|
2008-01-29 05:19:52 +00:00
|
|
|
if (err)
|
|
|
|
goto out2;
|
2006-10-11 08:20:50 +00:00
|
|
|
err = init_inodecache();
|
|
|
|
if (err)
|
|
|
|
goto out1;
|
2020-10-15 20:37:57 +00:00
|
|
|
|
|
|
|
err = ext4_fc_init_dentry_cache();
|
|
|
|
if (err)
|
|
|
|
goto out05;
|
|
|
|
|
2009-12-07 19:08:51 +00:00
|
|
|
register_as_ext3();
|
2011-04-18 21:29:14 +00:00
|
|
|
register_as_ext2();
|
2008-10-11 00:02:48 +00:00
|
|
|
err = register_filesystem(&ext4_fs_type);
|
2006-10-11 08:20:50 +00:00
|
|
|
if (err)
|
|
|
|
goto out;
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by the wait multiplier, which can be
set with the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is
no longer necessary (the request list is empty), it frees the
appropriate structures and exits (and can be created again later by
another filesystem).
We do not disturb regular inode allocations in any way; they simply do
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the alloc_sem lock for writing in ext4_init_inode_table() and for
reading in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 01:30:05 +00:00
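The throttling rule described in the commit message above (next schedule time = time taken to zero the last inode table, multiplied by the wait multiplier) can be sketched as plain arithmetic. The names `DEFAULT_WAIT_MULT` and `next_delay()` are hypothetical and the time unit is left abstract; this is not the kernel's actual code, and overflow handling is omitted.

```c
#include <assert.h>

/* Default wait multiplier, as stated in the commit message (init_itable=n). */
#define DEFAULT_WAIT_MULT 10

/*
 * Hypothetical helper: given how long the last zeroing pass took, return
 * how long to wait before scheduling the next one. A slow pass (busy disk)
 * therefore pushes the next pass further into the future.
 */
static unsigned long next_delay(unsigned long elapsed, unsigned long mult)
{
	assert(mult > 0);
	return elapsed * mult;
}
```

With the default multiplier, a pass that took 3 time units would be followed by a 30-unit wait, so the background thread spends roughly one part in eleven of wall-clock time doing I/O.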
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
return 0;
|
|
|
|
out:
|
2009-12-07 19:08:51 +00:00
|
|
|
unregister_as_ext2();
|
|
|
|
unregister_as_ext3();
|
2020-10-15 20:37:57 +00:00
|
|
|
out05:
|
2006-10-11 08:20:50 +00:00
|
|
|
destroy_inodecache();
|
|
|
|
out1:
|
2010-10-28 01:30:14 +00:00
|
|
|
ext4_exit_mballoc();
|
2014-03-18 23:24:49 +00:00
|
|
|
out2:
|
2015-09-23 16:44:17 +00:00
|
|
|
ext4_exit_sysfs();
|
|
|
|
out3:
|
2010-10-28 01:30:14 +00:00
|
|
|
ext4_exit_system_zone();
|
2015-09-23 16:44:17 +00:00
|
|
|
out4:
|
2010-10-28 01:30:14 +00:00
|
|
|
ext4_exit_pageio();
|
2015-09-23 16:44:17 +00:00
|
|
|
out5:
|
2019-07-22 16:26:24 +00:00
|
|
|
ext4_exit_post_read_processing();
|
2018-10-01 18:17:41 +00:00
|
|
|
out6:
|
2019-07-22 16:26:24 +00:00
|
|
|
ext4_exit_pending();
|
|
|
|
out7:
|
2012-11-09 02:57:32 +00:00
|
|
|
ext4_exit_es();
|
|
|
|
|
2006-10-11 08:20:50 +00:00
|
|
|
return err;
|
|
|
|
}
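The error handling in ext4_init_fs() above follows the common kernel pattern of unwinding through a ladder of labels: each failed step jumps to the label that tears down everything initialized before it, in reverse order. A minimal user-space sketch of that pattern, using hypothetical `step_init()`/`step_exit()` helpers and a `fail_at` knob that exist only for illustration:

```c
#include <assert.h>

static int live[3];		/* which hypothetical subsystems are up */
static int fail_at = -1;	/* step forced to fail, or -1 for none */

/* Bring up step n, or fail if the test knob says so. */
static int step_init(int n)
{
	if (n == fail_at)
		return -1;
	live[n] = 1;
	return 0;
}

/* Tear down step n. */
static void step_exit(int n)
{
	live[n] = 0;
}

static int init_all(void)
{
	int err;

	err = step_init(0);
	if (err)
		return err;	/* nothing to undo yet */
	err = step_init(1);
	if (err)
		goto out1;
	err = step_init(2);
	if (err)
		goto out2;
	return 0;

out2:
	step_exit(1);		/* undo in reverse order of setup */
out1:
	step_exit(0);
	return err;
}
```

Because the labels fall through, adding a new initialization step means adding one `goto` target and one teardown call, without duplicating the cleanup of earlier steps at every failure site.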
|
|
|
|
|
2010-10-28 01:30:14 +00:00
|
|
|
static void __exit ext4_exit_fs(void)
|
2006-10-11 08:20:50 +00:00
|
|
|
{
|
2010-10-28 01:30:05 +00:00
|
|
|
ext4_destroy_lazyinit_thread();
|
2009-12-07 19:08:51 +00:00
|
|
|
unregister_as_ext2();
|
|
|
|
unregister_as_ext3();
|
2008-10-11 00:02:48 +00:00
|
|
|
unregister_filesystem(&ext4_fs_type);
|
2006-10-11 08:20:50 +00:00
|
|
|
destroy_inodecache();
|
2010-10-28 01:30:14 +00:00
|
|
|
ext4_exit_mballoc();
|
2015-09-23 16:44:17 +00:00
|
|
|
ext4_exit_sysfs();
|
2010-10-28 01:30:14 +00:00
|
|
|
ext4_exit_system_zone();
|
|
|
|
ext4_exit_pageio();
|
2019-07-22 16:26:24 +00:00
|
|
|
ext4_exit_post_read_processing();
|
2013-07-26 19:21:11 +00:00
|
|
|
ext4_exit_es();
|
2018-10-01 18:17:41 +00:00
|
|
|
ext4_exit_pending();
|
2006-10-11 08:20:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
MODULE_AUTHOR("Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others");
|
2009-01-06 19:53:16 +00:00
|
|
|
MODULE_DESCRIPTION("Fourth Extended Filesystem");
|
2006-10-11 08:20:50 +00:00
|
|
|
MODULE_LICENSE("GPL");
|
2018-04-26 04:44:46 +00:00
|
|
|
MODULE_SOFTDEP("pre: crc32c");
|
2010-10-28 01:30:14 +00:00
|
|
|
module_init(ext4_init_fs)
|
|
|
|
module_exit(ext4_exit_fs)
|