Commit Graph

4059 Commits

Author SHA1 Message Date
Al Viro
80e5d1ff5d useful constants: struct qstr for ".."
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-15 22:36:45 -04:00
Theodore Ts'o
4811d9929c ext4: allow the dax flag to be set and cleared on inline directories
This is needed to allow generic/607 to pass for file systems with the
inline data_feature enabled, and it allows the use of file systems
where the directories use inline_data, while the files are accessed
via DAX.

Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-12 23:33:01 -04:00
Miklos Szeredi
4db5c2e623 ext4: convert to fileattr
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
2021-04-12 15:04:29 +02:00
Arnd Bergmann
fcdf3c34b7 ext4: fix debug format string warning
Using no_printk() for jbd_debug() revealed two warnings:

fs/jbd2/recovery.c: In function 'fc_do_one_pass':
fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int' argument [-Werror=format=]
  256 |                 jbd_debug(3, "Processing fast commit blk with seq %d");
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range':
fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type 'int', but argument 2 has type 'long unsigned int' [-Werror=format=]
 1732 |                 jbd_debug(1, "Converting from %d to %d %lld",

The first one was added incorrectly, and was also missing a few newlines
in debug output, and the second one happened when the type of an
argument changed.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: d556435156 ("jbd2: avoid -Wempty-body warnings")
Fixes: 6db0746189 ("ext4: use BIT() macro for BH_** state bits")
Fixes: 5b849b5f96 ("jbd2: fast commit recovery path")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20210409201211.1866633-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 23:32:16 -04:00
Jack Qiu
666245d9a4 ext4: fix trailing whitespace
Made suggested modifications from checkpatch in reference to ERROR:
 trailing whitespace

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 23:24:52 -04:00
Bhaskar Chowdhury
3088e5a515 ext4: fix various seppling typos
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Link: https://lore.kernel.org/r/cover.1616840203.git.unixbhaskar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 23:14:59 -04:00
Xu Yihang
e1262cd2e6 ext4: fix error return code in ext4_fc_perform_commit()
In case of if not ext4_fc_add_tlv branch, an error return code is missing.

Cc: stable@kernel.org
Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xu Yihang <xuyihang@huawei.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210408070033.123047-1-xuyihang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 21:28:30 -04:00
Ye Bin
6810fad956 ext4: fix ext4_error_err save negative errno into superblock
Fix As write_mmp_block() so that it returns -EIO instead of 1, so that
the correct error gets saved into the superblock.

Cc: stable@kernel.org
Fixes: 54d3adbc29 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
Reported-by: Liu Zhi Qiang <liuzhiqiang26@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210406025331.148343-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 20:58:30 -04:00
Fengnan Chang
f88f1466e2 ext4: fix error code in ext4_commit_super
We should set the error code when ext4_commit_super check argument failed.
Found in code review.
Fixes: c4be0c1dc4 ("filesystem freeze: add error handling of write_super_lockfs/unlockfs").

Cc: stable@kernel.org
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210402101631.561-1-changfengnan@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 20:52:51 -04:00
Ye Bin
ac2f7ca51b ext4: always panic when errors=panic is specified
Before commit 014c9caa29 ("ext4: make ext4_abort() use
__ext4_error()"), the following series of commands would trigger a
panic:

1. mount /dev/sda -o ro,errors=panic test
2. mount /dev/sda -o remount,abort test

After commit 014c9caa29, remounting a file system using the test
mount option "abort" will no longer trigger a panic.  This commit will
restore the behaviour immediately before commit 014c9caa29.
(However, note that the Linux kernel's behavior has not been
consistent; some previous kernel versions, including 5.4 and 4.19
similarly did not panic after using the mount option "abort".)

This also makes a change to long-standing behaviour; namely, the
following series commands will now cause a panic, when previously it
did not:

1. mount /dev/sda -o ro,errors=panic test
2. echo test > /sys/fs/ext4/sda/trigger_fs_error

However, this makes ext4's behaviour much more consistent, so this is
a good thing.

Cc: stable@kernel.org
Fixes: 014c9caa29 ("ext4: make ext4_abort() use __ext4_error()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210401081903.3421208-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 20:52:37 -04:00
Yang Guo
3cd461712c ext4: delete redundant uptodate check for buffer
The buffer uptodate state has been checked in function set_buffer_uptodate,
there is no need use buffer_uptodate before calling set_buffer_uptodate and
delete it.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Yang Guo <guoyang2@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/1617260610-29770-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 12:55:28 -04:00
Zhang Yi
72ffb49a7b ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
When CONFIG_QUOTA is enabled, if we failed to mount the filesystem due
to some error happens behind ext4_orphan_cleanup(), it will end up
triggering a after free issue of super_block. The problem is that
ext4_orphan_cleanup() will set SB_ACTIVE flag if CONFIG_QUOTA is
enabled, after we cleanup the truncated inodes, the last iput() will put
them into the lru list, and these inodes' pages may probably dirty and
will be write back by the writeback thread, so it could be raced by
freeing super_block in the error path of mount_bdev().

After check the setting of SB_ACTIVE flag in ext4_orphan_cleanup(), it
was used to ensure updating the quota file properly, but evict inode and
trash data immediately in the last iput does not affect the quotafile,
so setting the SB_ACTIVE flag seems not required[1]. Fix this issue by
just remove the SB_ACTIVE setting.

[1] https://lore.kernel.org/linux-ext4/99cce8ca-e4a0-7301-840f-2ace67c551f3@huawei.com/T/#m04990cfbc4f44592421736b504afcc346b2a7c00

Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210331033138.918975-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 12:44:08 -04:00
Harshad Shirwadkar
21175ca434 ext4: make prefetch_block_bitmaps default
Block bitmap prefetching is needed for these allocator optimization
data structures to get populated and provide better group scanning
order. So, turn it on bu default. prefetch_block_bitmaps mount option
is now marked as removed and a new option no_prefetch_block_bitmaps is
added to disable block bitmap prefetching.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
f68f406385 ext4: add proc files to monitor new structures
This patch adds a new file "mb_structs_summary" which allows us to see
the summary of the new allocator structures added in this
series. Here's the sample output of file:

optimize_scan: 1
max_free_order_lists:
        list_order_0_groups: 0
        list_order_1_groups: 0
        list_order_2_groups: 0
        list_order_3_groups: 0
        list_order_4_groups: 0
        list_order_5_groups: 0
        list_order_6_groups: 0
        list_order_7_groups: 0
        list_order_8_groups: 0
        list_order_9_groups: 0
        list_order_10_groups: 0
        list_order_11_groups: 0
        list_order_12_groups: 0
        list_order_13_groups: 40
fragment_size_tree:
        tree_min: 16384
        tree_max: 32768
        tree_nodes: 40

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
196e402adf ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.

At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).

For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.

There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.

All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".

With this patchset, following experiment was performed:

Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:

time dd if=/dev/urandom of=file bs=2M count=10

Here are the results with and without cr 0/1 optimizations introduced
in this patch:

|---------+------------------------------+---------------------------|
|         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s                     | 2m47.642s                 |
| 2nd run | 2m28.390s                    | 0m0.611s                  |
| 3rd run | 2m26.530s                    | 0m1.255s                  |
|---------+------------------------------+---------------------------|

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
4b68f6df10 ext4: add MB_NUM_ORDERS macro
A few arrays in mballoc.c use the total number of valid orders as
their size. Currently, this value is set as "sb->s_blocksize_bits +
2". This makes code harder to read. So, instead add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
a6c75eaf11 ext4: add mballoc stats proc file
Add new stats for measuring the performance of mballoc. This patch is
forked from Artem Blagodarenko's work that can be found here:

https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch

This patch reorganizes the stats by cr level. This is how the output
looks like:

mballoc:
	reqs: 0
	success: 0
	groups_scanned: 0
	cr0_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
		bad_suggestions: 0
	cr1_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
		bad_suggestions: 0
	cr2_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
	cr3_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
	extents_scanned: 0
		goal_hits: 0
		2^n_hits: 0
		breaks: 0
		lost: 0
	buddies_generated: 0/40
	buddies_time_used: 0
	preallocated: 0
	discarded: 0

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
b237e30444 ext4: add ability to return parsed options from parse_options
Before this patch, the function parse_options() was returning
journal_devnum and journal_ioprio variables to the caller. This patch
generalizes that interface to allow parse_options to return any parsed
options to return back to the caller. In this patch series, it gets
used to capture the value of "mb_optimize_scan=%u" mount option.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:58 -04:00
Harshad Shirwadkar
67d2518604 ext4: drop s_mb_bal_lock and convert protected fields to atomic
s_mb_buddies_generated gets used later in this patch series to
determine if the cr 0 and cr 1 optimziations should be performed or
not. Currently, s_mb_buddies_generated is protected under a
spin_lock. In the allocation path, it is better if we don't depend on
the lock and instead read the value atomically. In order to do that,
we drop s_bal_lock altogether and we convert the only two protected
fields by it s_mb_buddies_generated and s_mb_generation_time to atomic
type.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:58 -04:00
Zhang Yi
a149d2a5ca ext4: fix check to prevent false positive report of incorrect used inodes
Commit <50122847007> ("ext4: fix check to prevent initializing reserved
inodes") check the block group zero and prevent initializing reserved
inodes. But in some special cases, the reserved inode may not all belong
to the group zero, it may exist into the second group if we format
filesystem below.

  mkfs.ext4 -b 4096 -g 8192 -N 1024 -I 4096 /dev/sda

So, it will end up triggering a false positive report of a corrupted
file system. This patch fix it by avoid check reserved inodes if no free
inode blocks will be zeroed.

Cc: stable@kernel.org
Fixes: 5012284700 ("ext4: fix check to prevent initializing reserved inodes")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210331121516.2243099-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:58 -04:00
Sami Tolvanen
4f0f586bf0 treewide: Change list_sort to use const pointers
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
2021-04-08 16:04:22 -07:00
Daniel Rosenberg
1ae98e295f ext4: optimize match for casefolded encrypted dirs
Matching names with casefolded encrypting directories requires
decrypting entries to confirm case since we are case preserving. We can
avoid needing to decrypt if our hash values don't match.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-05 22:18:36 -04:00
Daniel Rosenberg
471fbbea7f ext4: handle casefolding with encryption
This adds support for encryption with casefolding.

Since the name on disk is case preserving, and also encrypted, we can no
longer just recompute the hash on the fly. Additionally, to avoid
leaking extra information from the hash of the unencrypted name, we use
siphash via an fscrypt v2 policy.

The hash is stored at the end of the directory entry for all entries
inside of an encrypted and casefolded directory apart from those that
deal with '.' and '..'. This way, the change is backwards compatible
with existing ext4 filesystems.

[ Changed to advertise this feature via the file:
  /sys/fs/ext4/features/encrypted_casefold -- TYT ]

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-05 22:04:20 -04:00
Milan Djurovic
400086d7c1 ext4: remove unnecessary braces in fs/ext4/dir.c
Removes braces to follow the coding style.

Signed-off-by: Milan Djurovic <mdjurovic@zohomail.com>
Link: https://lore.kernel.org/r/20210316052953.67616-1-mdjurovic@zohomail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-02 17:22:14 -04:00
Chaitanya Kulkarni
bd256fda92 ext4: use memcpy_to_page() in pagecache_write()
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210207190425.38107-7-chaitanya.kulkarni@wdc.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-25 10:19:48 -04:00
Chaitanya Kulkarni
4d93874b9e ext4: use memcpy_from_page() in pagecache_read()
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210207190425.38107-6-chaitanya.kulkarni@wdc.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-25 10:19:48 -04:00
Christian Brauner
db998553cf
fs: introduce two inode i_{u,g}id initialization helpers
Give filesystem two little helpers that do the right thing when
initializing the i_uid and i_gid fields on idmapped and non-idmapped
mounts. Filesystems shouldn't have to be concerned with too many
details.

Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:15:26 +01:00
Christian Brauner
a65e58e791
fs: document and rename fsid helpers
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_users.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.

Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:13:32 +01:00
Linus Torvalds
d7f5f1bd3c Miscellaneous ext4 bug fixes for v5.12.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmBXj1oACgkQ8vlZVpUN
 gaNnAwgAqZJ0S/Hctexs+v+DNvuyMxsA84pB/9KYlK2zgbBOyK5Iftxjqxb9Sb6j
 6XKQOIaP2EXYJ0MDWW/fDMUHatlJvXUp+A9kLTiOLMDaRXbobQzb5jlGg9ZB/pBj
 TzISrR4widiqJbVT2RFpO9O7B75BQqlpqFNfkF/yJ9CU/ozAw9x+voPcZK7q8/Sh
 +DeQCARvgfx1ZipHGTYKjJdujA0qGcDfboYJpgId/gA5Zi76tx4NlbeXAM2QmRfh
 zAd1NzFhqf7JmKDAWDdUeRnrDHcje9FLcAxo7Quq7YWxRKFsOCz9LTxazL2UIoa2
 HvGpMD23qmISCLUyyrfnrpGPj/mD2w==
 =xcuH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Miscellaneous ext4 bug fixes for v5.12"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: initialize ret to suppress smatch warning
  ext4: stop inode update before return
  ext4: fix rename whiteout with fast commit
  ext4: fix timer use-after-free on failed mount
  ext4: fix potential error in ext4_do_update_inode
  ext4: do not try to set xattr into ea_inode if value is empty
  ext4: do not iput inode under running transaction in ext4_rename()
  ext4: find old entry again if failed to rename whiteout
  ext4: fix error handling in ext4_end_enable_verity()
  ext4: fix bh ref count on error paths
  fs/ext4: fix integer overflow in s_log_groups_per_flex
  ext4: add reclaim checks to xattr code
  ext4: shrink race window in ext4_should_retry_alloc()
2021-03-21 14:06:10 -07:00
Theodore Ts'o
64395d950b ext4: initialize ret to suppress smatch warning
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:45:37 -04:00
Pan Bian
512c15ef05 ext4: stop inode update before return
The inode update should be stopped before returing the error code.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Link: https://lore.kernel.org/r/20210117085732.93788-1-bianpan2016@163.com
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:42:12 -04:00
Harshad Shirwadkar
8210bb29c1 ext4: fix rename whiteout with fast commit
This patch adds rename whiteout support in fast commits. Note that the
whiteout object that gets created is actually char device. Which
imples, the function ext4_inode_journal_mode(struct inode *inode)
would return "JOURNAL_DATA" for this inode. This has a consequence in
fast commit code that it will make creation of the whiteout object a
fast-commit ineligible behavior and thus will fall back to full
commits. With this patch, this can be observed by running fast commits
with rename whiteout and seeing the stats generated by ext4_fc_stats
tracepoint as follows:

ext4_fc_stats: dev 254:32 fc ineligible reasons:
XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
num_commits:6, ineligible: 6, numblks: 3

So in short, this patch guarantees that in case of rename whiteout, we
fall back to full commits.

Amir mentioned that instead of creating a new whiteout object for
every rename, we can create a static whiteout object with irrelevant
nlink. That will make fast commits to not fall back to full
commit. But until this happens, this patch will ensure correctness by
falling back to full commits.

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:38:18 -04:00
Jan Kara
2a4ae3bcdf ext4: fix timer use-after-free on failed mount
When filesystem mount fails because of corrupted filesystem we first
cancel the s_err_report timer reminding fs errors every day and only
then we flush s_error_work. However s_error_work may report another fs
error and re-arm timer thus resulting in timer use-after-free. Fix the
problem by first flushing the work and only after that canceling the
s_err_report timer.

Reported-by: syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com
Fixes: 2d01ddc866 ("ext4: save error info to sb through journal if available")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:27:49 -04:00
Shijie Luo
7d8bd3c76d ext4: fix potential error in ext4_do_update_inode
If set_large_file = 1 and errors occur in ext4_handle_dirty_metadata(),
the error code will be overridden, go to out_brelse to avoid this
situation.

Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:14:08 -04:00
zhangyi (F)
6b22489911 ext4: do not try to set xattr into ea_inode if value is empty
Syzbot report a warning that ext4 may create an empty ea_inode if set
an empty extent attribute to a file on the file system which is no free
blocks left.

  WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
  ...
  Call trace:
   ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
   ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942
   ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390
   ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491
   ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37
   __vfs_setxattr+0x208/0x23c fs/xattr.c:177
  ...

Now, ext4 try to store extent attribute into an external inode if
ext4_xattr_block_set() return -ENOSPC, but for the case of store an
empty extent attribute, store the extent entry into the extent
attribute block is enough. A simple reproduce below.

  fallocate test.img -l 1M
  mkfs.ext4 -F -b 2048 -O ea_inode test.img
  mount test.img /mnt
  dd if=/dev/zero of=/mnt/foo bs=2048 count=500
  setfattr -n "user.test" /mnt/foo

Reported-by: syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com
Fixes: 9c6e7853c5 ("ext4: reserve space for xattr entries/names")
Cc: stable@kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:09:17 -04:00
zhangyi (F)
5dccdc5a19 ext4: do not iput inode under running transaction in ext4_rename()
In ext4_rename(), when RENAME_WHITEOUT failed to add new entry into
directory, it ends up dropping new created whiteout inode under the
running transaction. After commit <9b88f9fb0d2> ("ext4: Do not iput inode
under running transaction"), we follow the assumptions that evict() does
not get called from a transaction context but in ext4_rename() it breaks
this suggestion. Although it's not a real problem, better to obey it, so
this patch add inode to orphan list and stop transaction before final
iput().

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210303131703.330415-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:09:14 -04:00
zhangyi (F)
b7ff91fd03 ext4: find old entry again if failed to rename whiteout
If we failed to add new entry on rename whiteout, we cannot reset the
old->de entry directly, because the old->de could have moved from under
us during make indexed dir. So find the old entry again before reset is
needed, otherwise it may corrupt the filesystem as below.

  /dev/sda: Entry '00000001' in ??? (12) has deleted/unused inode 15. CLEARED.
  /dev/sda: Unattached inode 75
  /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

Fixes: 6b4b8e6b4a ("ext4: fix bug for rename with RENAME_WHITEOUT")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210303131703.330415-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:03:39 -04:00
Eric Biggers
f053cf7aa6 ext4: fix error handling in ext4_end_enable_verity()
ext4 didn't properly clean up if verity failed to be enabled on a file:

- It left verity metadata (pages past EOF) in the page cache, which
  would be exposed to userspace if the file was later extended.

- It didn't truncate the verity metadata at all (either from cache or
  from disk) if an error occurred while setting the verity bit.

Fix these bugs by adding a call to truncate_inode_pages() and ensuring
that we truncate the verity metadata (both from cache and from disk) in
all error paths.  Also rework the code to cleanly separate the success
path from the error paths, which makes it much easier to understand.

Reported-by: Yunlei He <heyunlei@hihonor.com>
Fixes: c93d8f8858 ("ext4: add basic fs-verity support")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210302200420.137977-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-11 10:38:50 -05:00
Christoph Hellwig
a8affc03a9 block: rename BIO_MAX_PAGES to BIO_MAX_VECS
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed.  Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11 07:47:48 -07:00
Zhaolong Zhang
c915fb80ea ext4: fix bh ref count on error paths
__ext4_journalled_writepage should drop bhs' ref count on error paths

Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
Link: https://lore.kernel.org/r/1614678151-70481-1-git-send-email-zhangzl2013@126.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06 11:56:11 -05:00
Sabyrzhan Tasbolatov
f91436d55a fs/ext4: fix integer overflow in s_log_groups_per_flex
syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when
1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX,
where sbi->s_mb_prefetch is unsigned integer type.

32 is the maximum allowed power of s_log_groups_per_flex. Following if
check will also trigger UBSAN shift-out-of-bound:

if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {

So I'm checking it against the raw number, perhaps there is another way
to calculate UINT_MAX max power. Also use min_t as to make sure it's
uint type.

[1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
shift exponent 60 is too large for 32-bit type 'int'
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x137/0x1be lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
 ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
 ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983

Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06 11:56:11 -05:00
Jan Kara
163f0ec1df ext4: add reclaim checks to xattr code
Syzbot is reporting that ext4 can enter fs reclaim from kvmalloc() while
the transaction is started like:

  fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
  might_alloc include/linux/sched/mm.h:193 [inline]
  slab_pre_alloc_hook mm/slab.h:493 [inline]
  slab_alloc_node mm/slub.c:2817 [inline]
  __kmalloc_node+0x5f/0x430 mm/slub.c:4015
  kmalloc_node include/linux/slab.h:575 [inline]
  kvmalloc_node+0x61/0xf0 mm/util.c:587
  kvmalloc include/linux/mm.h:781 [inline]
  ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
  ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
  ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
  ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
  ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
  ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493

This should be impossible since transaction start sets PF_MEMALLOC_NOFS.
Add some assertions to the code to catch if something isn't working as
expected early.

Link: https://lore.kernel.org/linux-ext4/000000000000563a0205bafb7970@google.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210222171626.21884-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06 11:56:10 -05:00
Eric Whitney
efc6134527 ext4: shrink race window in ext4_should_retry_alloc()
When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
fails at significant rates on the two test scenarios that disable
delayed allocation (ext3conv and data_journal) and force actual block
allocation for the fallocate and pwrite functions in the test.  The
failure rate on 5.10 for both ext3conv and data_journal on one test
system typically runs about 85%.  On 5.11, the failure rate on ext3conv
sometimes drops to as low as 1% while the rate on data_journal
increases to nearly 100%.

The observed failures are largely due to ext4_should_retry_alloc()
cutting off block allocation retries when s_mb_free_pending (used to
indicate that a transaction in progress will free blocks) is 0.
However, free space is usually available when this occurs during runs
of generic/371.  It appears that a thread attempting to allocate
blocks is just missing transaction commits in other threads that
increase the free cluster count and reset s_mb_free_pending while
the allocating thread isn't running.  Explicitly testing for free space
availability avoids this race.

The current code uses a post-increment operator in the conditional
expression that determines whether the retry limit has been exceeded.
This means that the conditional expression uses the value of the
retry counter before it's increased, resulting in an extra retry cycle.
The current code actually retries twice before hitting its retry limit
rather than once.

Increasing the retry limit to 3 from the current actual maximum retry
count of 2 in combination with the change described above reduces the
observed failure rate to less that 0.1% on both ext3conv and
data_journal with what should be limited impact on users sensitive to
the overhead caused by retries.

A per filesystem percpu counter exported via sysfs is added to allow
users or developers to track the number of times the retry limit is
exceeded without resorting to debugging methods.  This should provide
some insight into worst case retry behavior.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06 11:56:10 -05:00
Linus Torvalds
3ab6608e66 block-5.12-2021-02-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA6njIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprolD/9zWti9LsZvA7yE+PhVwrwF3CsNzLfQlClw
 99HaA7HxtAc/VLJrnD/SubhCAPdBC5B2xPv6faajdwF2iUR3Rr1Uc93CQ3uP2KKq
 kvm6ALTpzPTMI6YSABhY74sg9BkkoDbMo54JQYVQPleiE+5eDLbuFZck6ObfUHyY
 a4aaImlndWp/t14GzrClL4hucF+5KJy846P+QCVclkh0yl8xSsqZ5LIFU7tu3iQb
 HpZ5HKLT/2ma/EOr3wknnsIe97AUZQU0q5aMparhYlm+qR511eop3QXx850FL/oC
 tEGceKLij6qazmkiocKVzML8Fs+Y9/a4vCMjLCScWJmzDlmKdlH2uudeahN6b9Hm
 15qRQHOjl1Hc2bdr5ZVn87nq9RWhSm18C+SRMwOKHCOnEhwxqM3RjRfAgj4BJ6QB
 PFbFqdY+8Y1YLPFmn9hph72ePaEcN4L2IXW6TI/WX8mot8ODAnkq9Hr38dKwzO+i
 0mon6DVyJKKho6XwvVu5IYurkR2beQprjeVUxwZjjT6DxUgsc+J6itK5LDHFSkeZ
 qZlXn5Di8MkiXg0DFJYDQiFXnO0Z5GlRWOGPVfBaOr3x+1dqzDdHGw4oz1oGqvnr
 GNNYCsYIpDGm7eauX5lqL5MUFpjqRCceXy5JSHPhnWWw617nYkr4H9jdsV9HiTX1
 tQFx05QW3w==
 =ccMs
 -----END PGP SIGNATURE-----

Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A few stragglers (and one due to me missing it originally), and fixes
  for changes in this merge window mostly. In particular:

   - blktrace cleanups (Chaitanya, Greg)

   - Kill dead blk_pm_* functions (Bart)

   - Fixes for the bio alloc changes (Christoph)

   - Fix for the partition changes (Christoph, Ming)

   - Fix for turning off iopoll with polled IO inflight (Jeffle)

   - nbd disconnect fix (Josef)

   - loop fsync error fix (Mauricio)

   - kyber update depth fix (Yang)

   - max_sectors alignment fix (Mikulas)

   - Add bio_max_segs helper (Matthew)"

* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
  block: Add bio_max_segs
  blktrace: fix documentation for blk_fill_rw()
  block: memory allocations in bounce_clone_bio must not fail
  block: remove the gfp_mask argument to bounce_clone_bio
  block: fix bounce_clone_bio for passthrough bios
  block-crypto-fallback: use a bio_set for splitting bios
  block: fix logging on capacity change
  blk-settings: align max_sectors on "logical_block_size" boundary
  block: reopen the device in blkdev_reread_part
  block: don't skip empty device in in disk_uevent
  blktrace: remove debugfs file dentries from struct blk_trace
  nbd: handle device refs for DESTROY_ON_DISCONNECT properly
  kyber: introduce kyber_depth_updated()
  loop: fix I/O error on fsync() in detached loop devices
  block: fix potential IO hang when turning off io_poll
  block: get rid of the trace rq insert wrapper
  blktrace: fix blk_rq_merge documentation
  blktrace: fix blk_rq_issue documentation
  blktrace: add blk_fill_rwbs documentation comment
  block: remove superfluous param in blk_fill_rwbs()
  ...
2021-02-28 11:23:38 -08:00
Matthew Wilcox (Oracle)
5f7136db82 block: Add bio_max_segs
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same.  Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-26 15:49:51 -07:00
Linus Torvalds
6f9972bbfe Miscellaneous ext4 cleanups and bug fixes. Pretty boring this
cycle...
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmA3qgEACgkQ8vlZVpUN
 gaMHqwf+IGiMeyB3BnIbQZjcwgyxRYMMmod0tcGv4yK4bf85+rTZZMcPZm5Ioqwm
 +XLnuRW+x3Do6lnzNZj9p+qnPNwX2YGupwGeAoT5puMDKl0J7HhvdqPcNi4WgucW
 K+LCAxbPjVFUAkpD2gwgNgXej4Us1QP/93GGG8YhTd+jiYS7axrRbAlE+4SmCsye
 83dEWpXSFDE0qXHoaPwLo26LMPySPjWM/UuiWD8ozQ+mcL0Doecw1tWJottSo+Ds
 +cJtxiOPVPTFwZUJOt/qoxfPnALrqAeazS+3lGmqK5c3xbvsc9CiYqP15FUbQF08
 eCDMo9bpqQdRs4WmaRSvBzV8v+/TXg==
 =+amQ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Miscellaneous ext4 cleanups and bug fixes. Pretty boring this cycle..."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add .kunitconfig fragment to enable ext4-specific tests
  ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
  ext4: reset retry counter when ext4_alloc_file_blocks() makes progress
  ext4: fix potential htree index checksum corruption
  ext4: factor out htree rep invariant check
  ext4: Change list_for_each* to list_for_each_entry*
  ext4: don't try to processed freed blocks until mballoc is initialized
  ext4: use DEFINE_MUTEX() for mutex lock
2021-02-25 10:06:55 -08:00
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Linus Torvalds
d61c6a58ae \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAzoWUACgkQnJ2qBz9k
 QNnFgQgAlng0JOzeCQvLpwweqFl0FCxYbOsZXC1xDyvfX3TiA6A6oiOR4tx3uhQN
 cOQmJXaiMn4oCXjD1j6WZwGfy23yx0XchaoFK9jy2IqodaB/zUjkiWYYqt0G3XIX
 ud35mxjLAGS12BCD0c+vHy2RMsUFl5ep+5aBHRHZJJhCcYbl7e5ctXZ3xB1Q0mgI
 r639gD8JhH3ICdu9W0NaMvqOrVhJFNmhSGATKL/N96+oKub2x2ycYE4L2OXegxy3
 mnFf26LjA8jt7K+KfHloTvkC6D4HVnnvKFvKiIbGKafiWhAE7q57ZO6BPCMajGue
 3UHIhWGmwKXRU72+nW6N+089GbcO/g==
 =1e+z
 -----END PGP SIGNATURE-----

Merge tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull lazytime updates from Jan Kara:
 "Cleanups of the lazytime handling in the writeback code making rules
  for calling ->dirty_inode() filesystem handlers saner"

* tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: simplify i_state checks in __ext4_update_other_inode_time()
  gfs2: don't worry about I_DIRTY_TIME in gfs2_fsync()
  fs: improve comments for writeback_single_inode()
  fs: drop redundant check from __writeback_single_inode()
  fs: clean up __mark_inode_dirty() a bit
  fs: pass only I_DIRTY_INODE flags to ->dirty_inode
  fs: don't call ->dirty_inode for lazytime timestamp updates
  fat: only specify I_DIRTY_TIME when needed in fat_update_time()
  fs: only specify I_DIRTY_TIME when needed in generic_update_time()
  fs: correctly document the inode dirty flags
2021-02-22 13:17:39 -08:00
Linus Torvalds
582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Linus Torvalds
4f016a316f New code for 5.12:
- Adjust the final parameter of iomap_dio_rw.
 - Add a new flag to request that iomap directio writes return EAGAIN if
   the write is not a pure overwrite within EOF; this will be used to
   reduce lock contention with unaligned direct writes on XFS.
 - Amend XFS' directio code to eliminate exclusive locking for unaligned
   direct writes if the circumstances permit
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
 tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
 H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
 5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
 f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
 yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
 sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
 qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
 xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
 jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
 59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
 NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
 =AR07
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "The big change in this cycle is some new code to make it possible for
  XFS to try unaligned directio overwrites without taking locks. If the
  block is fully written and within EOF (i.e. doesn't require any
  further fs intervention) then we can let the unlocked write proceed.
  If not, we fall back to synchronizing direct writes.

  Summary:

   - Adjust the final parameter of iomap_dio_rw.

   - Add a new flag to request that iomap directio writes return EAGAIN
     if the write is not a pure overwrite within EOF; this will be used
     to reduce lock contention with unaligned direct writes on XFS.

   - Amend XFS' directio code to eliminate exclusive locking for
     unaligned direct writes if the circumstances permit"

* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: reduce exclusive locking on unaligned dio
  xfs: split the unaligned DIO write code out
  xfs: improve the reflink_bounce_dio_write tracepoint
  xfs: simplify the read/write tracepoints
  xfs: remove the buffered I/O fallback assert
  xfs: cleanup the read/write helper naming
  xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
  xfs: factor out a xfs_ilock_iocb helper
  iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
  iomap: pass a flags argument to iomap_dio_rw
  iomap: rename the flags variable in __iomap_dio_rw
2021-02-21 10:29:20 -08:00
Daniel Latypov
0a76945fd1 ext4: add .kunitconfig fragment to enable ext4-specific tests
As of [1], we no longer want EXT4_KUNIT_TESTS and others to `select`
their deps. This means it can get harder to get all the right things
selected as we gain more tests w/ more deps over time.

This patch (and [2]) proposes we store kunitconfig fragments in-tree to
represent sets of tests. (N.B. right now we only have one ext4 test).

There's still a discussion to be had about how to have a hierarchy of
these files (e.g. if one wanted to test all of fs/, not just fs/ext4).

But this fragment would likely be a leaf node and isn't blocked on
deciding if we want `import` statements and the like.

Usage
=====

Before [2] (on its way to being merged):
  $ cp fs/ext4/.kunitconfig .kunit/
  $ ./tools/testing/kunit/kunit.py run

After [2]:
  $ ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig

".kunitconfig" vs "kunitconfig"
===============================

See also: commit 14ee5cfd45 ("kunit: Rename 'kunitconfig' to '.kunitconfig'").
* The bit about .gitignore exluding it by default is now a con, however.
* But there are a lot of directories with files that begin with "k" and
  so this could cause some annoyance w/ tab completion*
* This is the name kunit.py expects right now, so some people are used
  to .kunitconfig over "kunitconfig"

[1] https://lore.kernel.org/linux-ext4/20210122110234.2825685-1-geert@linux-m68k.org/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=kunit&id=243180f5924ed27ea417db39feb7f9691777688e

* 372/5556 directories isn't too much, but still not a small number:
$ find -type f -name 'k*' | xargs dirname | sort -u | wc -l
372

Signed-off-by: Daniel Latypov <dlatypov@google.com>
Link: https://lore.kernel.org/r/20210210013206.136227-1-dlatypov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-11 23:16:30 -05:00
Geert Uytterhoeven
302fdadeaf ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
user may not want to enable.  Fix this by making the test depend on
EXT4_FS instead.

Fixes: 1cbeab1b24 ("ext4: add kunit test for decoding extended timestamps")
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20210122110234.2825685-1-geert@linux-m68k.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-11 23:12:59 -05:00
Eric Whitney
3258386aba ext4: reset retry counter when ext4_alloc_file_blocks() makes progress
Change the retry policy in ext4_alloc_file_blocks() to allow for a full
retry cycle whenever a portion of an allocation request has been
fulfilled.  A large allocation request often results in multiple calls
to ext4_map_blocks(), each of which is potentially subject to a
temporary ENOSPC condition and retry cycle.  The current code only
allows for a single retry cycle.

This patch does not address a known bug or reported complaint.
However, it should make block allocation for fallocate and zero range
more robust.

In addition, simplify the conditional controlling the allocation while
loop, where testing len alone is sufficient.  Remove the assignment to
ret2 in the error path after the call to ext4_map_blocks() since its
value isn't subsequently used.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210113221403.18258-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-08 18:03:56 -05:00
Eric Biggers
e17fe6579d fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl
Add an ioctl FS_IOC_READ_VERITY_METADATA which will allow reading verity
metadata from a file that has fs-verity enabled, including:

- The Merkle tree
- The fsverity_descriptor (not including the signature if present)
- The built-in signature, if present

This ioctl has similar semantics to pread().  It is passed the type of
metadata to read (one of the above three), and a buffer, offset, and
size.  It returns the number of bytes read or an error.

Separate patches will add support for each of the above metadata types.
This patch just adds the ioctl itself.

This ioctl doesn't make any assumption about where the metadata is
stored on-disk.  It does assume the metadata is in a stable format, but
that's basically already the case:

- The Merkle tree and fsverity_descriptor are defined by how fs-verity
  file digests are computed; see the "File digest computation" section
  of Documentation/filesystems/fsverity.rst.  Technically, the way in
  which the levels of the tree are ordered relative to each other wasn't
  previously specified, but it's logical to put the root level first.

- The built-in signature is the value passed to FS_IOC_ENABLE_VERITY.

This ioctl is useful because it allows writing a server program that
takes a verity file and serves it to a client program, such that the
client can do its own fs-verity compatible verification of the file.
This only makes sense if the client doesn't trust the server and if the
server needs to provide the storage for the client.

More concretely, there is interest in using this ability in Android to
export APK files (which are protected by fs-verity) to "protected VMs".
This would use Protected KVM (https://lwn.net/Articles/836693), which
provides an isolated execution environment without having to trust the
traditional "host".  A "guest" VM can boot from a signed image and
perform specific tasks in a minimum trusted environment using files that
have fs-verity enabled on the host, without trusting the host or
requiring that the guest has its own trusted storage.

Technically, it would be possible to duplicate the metadata and store it
in separate files for serving.  However, that would be less efficient
and would require extra care in userspace to maintain file consistency.

In addition to the above, the ability to read the built-in signatures is
useful because it allows a system that is using the in-kernel signature
verification to migrate to userspace signature verification.

Link: https://lore.kernel.org/r/20210115181819.34732-4-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:11 -08:00
Theodore Ts'o
b5776e7524 ext4: fix potential htree index checksum corruption
In the case where we need to do an interior node split, and
immediately afterwards, we are unable to allocate a new directory leaf
block due to ENOSPC, the directory index checksum's will not be filled
in correctly (and indeed, will not be correctly journalled).

This looks like a bug that was introduced when we added largedir
support.  The original code doesn't make any sense (and should have
been caught in code review), but it was hidden because most of the
time, the index node checksum will be set by do_split().  But if
do_split bails out due to ENOSPC, then ext4_handle_dirty_dx_node()
won't get called, and so the directory index checksum field will not
get set, leading to:

EXT4-fs error (device sdb): dx_probe:858: inode #6635543: block 4022: comm nfsd: Directory index failed checksum

Google-Bug-Id: 176345532
Fixes: e08ac99fa2 ("ext4: add largedir feature")
Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-04 00:05:20 -05:00
Vinicius Tinti
c6c818e50d ext4: factor out htree rep invariant check
This patch moves some debugging code which is used to validate the
hash tree node when doing a binary search of an htree node into a
separate function, which is disabled by default (since it is only used
by developers when they are modifying the htree code paths).

In addition to cleaning up the code to make it more maintainable, it
silences a Clang compiler warning when -Wunreachable-code-aggressive
is enabled.  (There is no plan to enable this warning by default, since
there it has far too many false positives; nevertheless, this commit
reduces one of the many false positives by one.)

Signed-off-by: Vinicius Tinti <viniciustinti@gmail.com>
Link: https://lore.kernel.org/r/20210202162837.129631-1-viniciustinti@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-03 00:34:02 -05:00
Daejun Park
96e7c02d0b ext4: Change list_for_each* to list_for_each_entry*
In the fast_commit.c, list_for_each* + list_entry can be changed to
list_for_each_entry*. It reduces number of variables and lines.

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Link: https://lore.kernel.org/r/20210111013726epcms2p4579ae56040d7043db785bf0d0a785dc7@epcms2p4
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-03 00:05:01 -05:00
Theodore Ts'o
027f14f535 ext4: don't try to processed freed blocks until mballoc is initialized
If we try to make any changes via the journal between when the journal
is initialized, but before the multi-block allocated is initialized,
we will end up deferencing a NULL pointer when the journal commit
callback function calls ext4_process_freed_data().

The proximate cause of this failure was commit 2d01ddc866 ("ext4:
save error info to sb through journal if available") since file system
corruption problems detected before the call to ext4_mb_init() would
result in a journal commit before we aborted the mount of the file
system.... and we would then trigger the NULL pointer deref.

Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-02 23:16:01 -05:00
Zheng Yongjun
59ebc7fd74 ext4: use DEFINE_MUTEX() for mutex lock
mutex lock can be initialized automatically with DEFINE_MUTEX()
rather than explicitly calling mutex_init().

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Link: https://lore.kernel.org/r/20201224132244.30907-1-zhengyongjun3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-02 23:16:01 -05:00
Christoph Hellwig
c6bf3f0e25 block: use an on-stack bio in blkdev_issue_flush
There is no point in allocating memory for a synchronous flush.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27 09:51:48 -07:00
Christian Brauner
14f3db5542
ext4: support idmapped mounts
Enable idmapped mounts for ext4. All dedicated helpers we need for this
exist. So this basically just means we're passing down the
user_namespace argument from the VFS methods to the relevant helpers.

Let's create simple example where we idmap an ext4 filesystem:

 root@f2-vm:~# truncate -s 5G ext4.img

 root@f2-vm:~# mkfs.ext4 ./ext4.img
 mke2fs 1.45.5 (07-Jan-2020)
 Discarding device blocks: done
 Creating filesystem with 1310720 4k blocks and 327680 inodes
 Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
 Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (16384 blocks): done
 Writing superblocks and filesystem accounting information: done

 root@f2-vm:~# losetup -f --show ./ext4.img
 /dev/loop0

 root@f2-vm:~# mount /dev/loop0 /mnt

 root@f2-vm:~# ls -al /mnt/
 total 24
 drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
 drwx------  2 root root 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped1 where we map uid and gid
 # 0 to uid and gid 1000
 root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/

 root@f2-vm:/# ls -al /idmapped1/
 total 24
 drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
 drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped2 where we map uid and gid
 # 0 to uid and gid 2000
 root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/

 root@f2-vm:/# ls -al /idmapped2/
 total 24
 drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
 drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
 drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found

Let's create another example where we idmap the rootfs filesystem
without a mapping for uid 0 and gid 0:

 # Create an idmapped mount of for a full POSIX range of rootfs under
 # /mnt but without a mapping for uid 0 to reduce attack surface

 root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/

 # Since we don't have a mapping for uid and gid 0 all files owned by
 # uid and gid 0 should show up as uid and gid 65534:
 root@f2-vm:/# ls -al /mnt/
 total 664
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
 drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
 drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
 lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
 drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
 drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
 lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
 drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
 drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
 drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var

 # Since we do have a mapping for uid and gid 1000 all files owned by
 # uid and gid 1000 should simply show up as uid and gid 1000:
 root@f2-vm:/# ls -al /mnt/home/ubuntu/
 total 40
 drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
 drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
 -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
 -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
 -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
 -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
 -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
 -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo

Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:46 +01:00
Christian Brauner
549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
0d56a4518d
stat: handle idmapped mounts
The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
e65ce2a50c
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
2f221d6f7b
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
21cb47be6f
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christoph Hellwig
2f63296578 iomap: pass a flags argument to iomap_dio_rw
Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument.  The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23 10:06:09 -08:00
Linus Torvalds
0bc9bc1d8b A number of bug fixes for ext4:
* For the new fast_commit feature
    * Fix some error handling codepaths in whiteout handling and
      mountpoint sampling
    * Fix how we write ext4_error information so it goes through the journal
      when journalling is active, to avoid races that can lead to lost
      error information, superblock checksum failures, or DIF/DIX features.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmAB8eMACgkQ8vlZVpUN
 gaMUxAf+MW22dceTto2RO0ox9OEBNoZDFiVnlEuUaIOxkqOlovIWaqX7wwuF/121
 +FaNeDVzqNSS/QjQSB5lHF5OfHCD2u1Ef/bGzCm9cQyeN2/n0sCsStfPCcyLHy/0
 4R8PsjF0xhhbCETLcAc0U/YBFEoqSn1i7DG5nnpx63Wt1S/SSMmTAXzafWbzisEZ
 XNsz3CEPCDDSmSzOt3qMMHxkSoOZhYcLe7fCoKkhZ2pvTyrQsHrne6NNLtxc+sDL
 AcKkaI0EWFiFRhebowQO/5ouq6nnGKLCsukuZN9//Br8ht5gNcFpuKNVFl+LOiM6
 ud4H3qcRokcdPPAn3uwI0AJKFXqLvg==
 =Dgdj
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A number of bug fixes for ext4:

   - Fix for the new fast_commit feature

   - Fix some error handling codepaths in whiteout handling and
     mountpoint sampling

   - Fix how we write ext4_error information so it goes through the
     journal when journalling is active, to avoid races that can lead to
     lost error information, superblock checksum failures, or DIF/DIX
     features"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove expensive flush on fast commit
  ext4: fix bug for rename with RENAME_WHITEOUT
  ext4: fix wrong list_splice in ext4_fc_cleanup
  ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
  ext4: don't leak old mountpoint samples
  ext4: drop ext4_handle_dirty_super()
  ext4: fix superblock checksum failure when setting password salt
  ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()
  ext4: save error info to sb through journal if available
  ext4: protect superblock modifications with a buffer lock
  ext4: drop sync argument of ext4_commit_super()
  ext4: combine ext4_handle_error() and save_error_info()
2021-01-15 14:54:24 -08:00
Daejun Park
e9f53353e1 ext4: remove expensive flush on fast commit
In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast
commit block when barrier is enabled.  However, in recovery phase,
ext4 compares CRC value in the tail.  So it is sufficient to add
REQ_FUA and REQ_PREFLUSH on the block that has tail.

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210106013242epcms2p5b6b4ed8ca86f29456fdf56aa580e74b4@epcms2p5
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-15 14:41:31 -05:00
yangerkun
6b4b8e6b4a ext4: fix bug for rename with RENAME_WHITEOUT
We got a "deleted inode referenced" warning cross our fsstress test. The
bug can be reproduced easily with following steps:

  cd /dev/shm
  mkdir test/
  fallocate -l 128M img
  mkfs.ext4 -b 1024 img
  mount img test/
  dd if=/dev/zero of=test/foo bs=1M count=128
  mkdir test/dir/ && cd test/dir/
  for ((i=0;i<1000;i++)); do touch file$i; done # consume all block
  cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD,
    /dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in
    ext4_rename will return ENOSPC!!
  cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1
  We will get the output:
  "ls: cannot access 'test/dir/file1': Structure needs cleaning"
  and the dmesg show:
  "EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls:
  deleted inode referenced: 139"

ext4_rename will create a special inode for whiteout and use this 'ino'
to replace the source file's dir entry 'ino'. Once error happens
latter(the error above was the ENOSPC return from ext4_add_entry in
ext4_rename since all space has been consumed), the cleanup do drop the
nlink for whiteout, but forget to restore 'ino' with source file. This
will trigger the bug describle as above.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: cd808deced ("ext4: support RENAME_WHITEOUT")
Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-15 14:41:31 -05:00
Daejun Park
31e203e09f ext4: fix wrong list_splice in ext4_fc_cleanup
After full/fast commit, entries in staging queue are promoted to main
queue. In ext4_fs_cleanup function, it splice to staging queue to
staging queue.

Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2021-01-15 14:40:12 -05:00
Yi Li
23dd561ad9 ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR
instead of IS_ERR_OR_NULL to fix this.

2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR.
and go to call iput properly.

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Yi Li <yili@winhong.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2021-01-15 14:39:14 -05:00
Eric Biggers
ed296c6c05 ext4: simplify i_state checks in __ext4_update_other_inode_time()
Since I_DIRTY_TIME and I_DIRTY_INODE are mutually exclusive in i_state,
there's no need to check for I_DIRTY_TIME && !I_DIRTY_INODE.  Just check
for I_DIRTY_TIME.

Also introduce a helper function in include/linux/fs.h to do this check.

Link: https://lore.kernel.org/r/20210112190253.64307-12-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:27:13 +01:00
Eric Biggers
e2728c5621 fs: don't call ->dirty_inode for lazytime timestamp updates
There is no need to call ->dirty_inode for lazytime timestamp updates
(i.e. for __mark_inode_dirty(I_DIRTY_TIME)), since by the definition of
lazytime, filesystems must ignore these updates.  Filesystems only need
to care about the updated timestamps when they expire.

Therefore, only call ->dirty_inode when I_DIRTY_INODE is set.

Based on a patch from Christoph Hellwig:
https://lore.kernel.org/r/20200325122825.1086872-4-hch@lst.de

Link: https://lore.kernel.org/r/20210112190253.64307-6-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:33 +01:00
Linus Torvalds
555a6e8c11 Various bug fixes and cleanups for ext4; no new features this cycle.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEOrBXt+eNlFyMVZH70292m8EYBPAFAl/iRkQACgkQ0292m8EY
 BPDxnw/+Nyp5/bscuKflyAs8m8BUDE8MSP9TTaPcflDgyV1Z3QiSx5ITOXi4JSy6
 V5W1RXScJL3mG8QrtukY7zP7bSTS+UDQlbZ1BlpiyZDjo3l15gKwWVOmJteGusfI
 U+xRz1RoiXcGTq6vSDYZIKquNVNezAE0UbYVRECm+lEjfka1vrlQz20gEgvh/mvh
 ZQV5wLqr61gGsGDhHSw+HD16KYJQWtwcRirNuzw7VfFtUJNfthXkPy8mIuJcKx2a
 sHuGEnLxhwapl27Os2o8gMTjt05nQZSZ6Fx3p7i3c4NBknR3F8ZI1LVSvDDZU6Rz
 v0v/VfMEP/CEg5Vq8gLL1dgWCwF0GWKYLdhyge/oCQU9r3RbheRKCRmOqyrYw5sA
 e4AqfWw39OmKV8L54XKUV4lXaBd7rCHW0l0uJdm1cdGxg7kewjS6TajhJ30tRqz/
 vvCo5HZ/TgF71AhONu+0WSzbuOOZzRxJC4k+WLRHspOzXVg7f8PzAxkfHMJaEk3t
 TsuM0prorm4r1PAb7TblXJpV7LlI2sjhYqsC/ngAfmZqw59wou3n+hytg8BkKlOz
 Gq9LUdFw4MJqK0OqaRzwWCcXFdc5EEFS8Kq4aVNLbNMmF1gfYy/RzTRc1iZ5RkN+
 stk7OYhs657Kv925nhvZsLCoOwGzp+8dbKmW2TlgQMwYHx30+Jw=
 =a5OE
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Various bug fixes and cleanups for ext4; no new features this cycle"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (29 commits)
  ext4: remove unnecessary wbc parameter from ext4_bio_write_page
  ext4: avoid s_mb_prefetch to be zero in individual scenarios
  ext4: defer saving error info from atomic context
  ext4: simplify ext4 error translation
  ext4: move functions in super.c
  ext4: make ext4_abort() use __ext4_error()
  ext4: standardize error message in ext4_protect_reserved_inode()
  ext4: remove redundant sb checksum recomputation
  ext4: don't remount read-only with errors=continue on reboot
  ext4: fix deadlock with fs freezing and EA inodes
  jbd2: add a helper to find out number of fast commit blocks
  ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h
  ext4: fix fall-through warnings for Clang
  ext4: add docs about fast commit idempotence
  ext4: remove the unused EXT4_CURRENT_REV macro
  ext4: fix an IS_ERR() vs NULL check
  ext4: check for invalid block size early when mounting a file system
  ext4: fix a memory leak of ext4_free_data
  ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set()
  ext4: update ext4_data_block_valid related comments
  ...
2020-12-24 14:16:02 -08:00
Theodore Ts'o
5a3b590d4b ext4: don't leak old mountpoint samples
When the first file is opened, ext4 samples the mountpoint of the
filesystem in 64 bytes of the super block.  It does so using
strlcpy(), this means that the remaining bytes in the super block
string buffer are untouched.  If the mount point before had a longer
path than the current one, it can be reconstructed.

Consider the case where the fs was mounted to "/media/johnjdeveloper"
and later to "/".  The super block buffer then contains
"/\x00edia/johnjdeveloper".

This case was seen in the wild and caused confusion how the name
of a developer ands up on the super block of a filesystem used
in production...

Fix this by using strncpy() instead of strlcpy().  The superblock
field is defined to be a fixed-size char array, and it is already
marked using __nonstring in fs/ext4/ext4.h.  The consumer of the field
in e2fsprogs already assumes that in the case of a 64+ byte mount
path, that s_last_mounted will not be NUL terminated.

Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu
Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-12-22 13:08:46 -05:00
Jan Kara
a3f5cf14ff ext4: drop ext4_handle_dirty_super()
The wrapper is now useless since it does what
ext4_handle_dirty_metadata() does. Just remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:46 -05:00
Jan Kara
dfd56c2c0c ext4: fix superblock checksum failure when setting password salt
When setting password salt in the superblock, we forget to recompute the
superblock checksum so it will not match until the next superblock
modification which recomputes the checksum. Fix it.

CC: Michael Halcrow <mhalcrow@google.com>
Reported-by: Andreas Dilger <adilger@dilger.ca>
Fixes: 9bd8212f98 ("ext4 crypto: add encryption policy and password salt support")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:46 -05:00
Jan Kara
e92ad03fa5 ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()
No behavioral change.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:46 -05:00
Jan Kara
2d01ddc866 ext4: save error info to sb through journal if available
If journalling is still working at the moment we get to writing error
information to the superblock we cannot write directly to the superblock
as such write could race with journalled update of the superblock and
cause journal checksum failures, writing inconsistent information to the
journal or other problems. We cannot journal the superblock directly
from the error handling functions as we are running in uncertain context
and could deadlock so just punt journalled superblock update to a
workqueue.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:46 -05:00
Jan Kara
05c2c00f37 ext4: protect superblock modifications with a buffer lock
Protect all superblock modifications (including checksum computation)
with a superblock buffer lock. That way we are sure computed checksum
matches current superblock contents (a mismatch could cause checksum
failures in nojournal mode or if an unjournalled superblock update races
with a journalled one). Also we avoid modifying superblock contents
while it is being written out (which can cause DIF/DIX failures if we
are running in nojournal mode).

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:46 -05:00
Jan Kara
4392fbc4ba ext4: drop sync argument of ext4_commit_super()
Everybody passes 1 as sync argument of ext4_commit_super(). Just drop
it.

Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:46 -05:00
Lei Chen
be993933d2 ext4: remove unnecessary wbc parameter from ext4_bio_write_page
ext4_bio_write_page does not need wbc parameter, since its parameter
io contains the io_wbc field. The io::io_wbc is initialized by
ext4_io_submit_init which is called in ext4_writepages and
ext4_writepage functions prior to ext4_bio_write_page.
Therefor, when ext4_bio_write_page is called, wbc info
has already been included in io parameter.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:45 -05:00
Jan Kara
e789ca0cc1 ext4: combine ext4_handle_error() and save_error_info()
save_error_info() is always called together with ext4_handle_error().
Combine them into a single call and move unconditional bits out of
save_error_info() into ext4_handle_error().

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:45 -05:00
Chunguang Xu
82ef1370b0 ext4: avoid s_mb_prefetch to be zero in individual scenarios
Commit cfd7323772 ("ext4: add prefetching for block allocation
bitmaps") introduced block bitmap prefetch, and expects to read block
bitmaps of flex_bg through an IO.  However, it seems to ignore the
value range of s_log_groups_per_flex.  In the scenario where the value
of s_log_groups_per_flex is greater than 27, s_mb_prefetch or
s_mb_prefetch_limit will overflow, cause a divide zero exception.

In addition, the logic of calculating nr is also flawed, because the
size of flexbg is fixed during a single mount, but s_mb_prefetch can
be modified, which causes nr to fail to meet the value condition of
[1, flexbg_size].

To solve this problem, we need to set the upper limit of
s_mb_prefetch.  Since we expect to load block bitmaps of a flex_bg
through an IO, we can consider determining a reasonable upper limit
among the IO limit parameters.  After consideration, we chose
BLK_MAX_SEGMENT_SIZE.  This is a good choice to solve divide zero
problem and avoiding performance degradation.

[ Some minor code simplifications to make the changes easy to follow -- TYT ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:45 -05:00
Jan Kara
c92dc85684 ext4: defer saving error info from atomic context
When filesystem inconsistency is detected with group locked, we
currently try to modify superblock to store error there without
blocking. However this can cause superblock checksum failures (or
DIF/DIX failure) when the superblock is just being written out.

Make error handling code just store error information in ext4_sb_info
structure and copy it to on-disk superblock only in ext4_commit_super().
In case of error happening with group locked, we just postpone the
superblock flushing to a workqueue.

[ Added fixup so that s_first_error_* does not get updated after
  the file system is remounted.
  Also added fix for syzbot failure.  - Ted ]

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
2020-12-22 13:07:49 -05:00
Linus Torvalds
ff49c86f27 f2fs-for-5.11-rc1
In this round, we've made more work into per-file compression support. For
 example, F2FS_IOC_GET|SET_COMPRESS_OPTION provides a way to change the
 algorithm or cluster size per file. F2FS_IOC_COMPRESS|DECOMPRESS_FILE provides
 a way to compress and decompress the existing normal files manually along with
 a new mount option, compress_mode=fs|user, which can control who compresses the
 data. Chao also added a checksum feature with a mount option so that we are able
 to detect any corrupted cluster. In addition, Daniel contributed casefolding
 with encryption patch, which will be used for Android devices.
 
 Enhancement:
  - add ioctls and mount option to manage per-file compression feature
  - support casefolding with encryption
  - support checksum for compressed cluster
  - avoid IO starvation by replacing mutex with rwsem
  - add sysfs, max_io_bytes, to control max bio size
 
 Bug fix:
  - fix use-after-free issue when compression and fsverity are enabled
  - fix consistency corruption during fault injection test
  - fix data offset for lseek
  - get rid of buffer_head which has 32bits limit in fiemap
  - fix some bugs in multi-partitions support
  - fix nat entry count calculation in shrinker
  - fix some stat information
 
 And, we've refactored some logics and fix minor bugs as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl/a8ywACgkQQBSofoJI
 UNLa2RAAjK+6tOs+NuYx2w9SegghKxwCg4Mb362BMdaAGx6GzMqAkCiVdujuoz/r
 +wy8sdqO9QE7723ZDNsebNMLRnkNPHnpneSL2p6OsSLJrD3ORTELVRrzNlkemvnK
 rRHZyYnNJvQQnD4uU7ABvROKsIDw/nCfcFvzHmLIgEw8EHO0W4n6fTtBdTwXv1qi
 N3qXhGuQldonR9XICuGjzj7wh17n9ua6Mr12XX3Ok38giMcZb9KFBwgvlhl35cxt
 htEmUpxWD3NTSw6zJmV4VAiajpiIkW6QRQuVA1nzdLZK644gaJMhM1EUsOnZhfDl
 wX0ZtKoNkXxb0glD34O3aYqeHJ3tHWgPmmpVm9TECJP9A/X7kmEHgQYpH/eJ9I7d
 tk51Uz28Mz1RShXU4i5RyKZeeoNTLiVlqiC95E2cnq4C1tLOJyI00N9AinrLzvR+
 fqUrAwCrBpiYX63mWKYwq7GWxWwp4+PY09kyIZxxJiWhTE/St0bRx2bQL8zA8C6J
 Rtxl+QWyQhkFbNu8fAukLFAhC6mqX/FKpXvUqRehBnHRvMWBiVZG0//eOPQLk71u
 qsdCgYuEVcg3itDQrZvmsjxi4Pb5E9mNr0s5oC4I2WvBPMheD4esSyG7cKDN0qfS
 3FFHlRYLOvnjPMLnKTmZXjFvFyHR8mwsD4Z83MeSrqYnWC14tFY=
 =KneU
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've made more work into per-file compression support.

  For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to
  change the algorithm or cluster size per file. F2FS_IOC_COMPRESS |
  DECOMPRESS_FILE provides a way to compress and decompress the existing
  normal files manually.

  There is also a new mount option, compress_mode=fs|user, which can
  control who compresses the data.

  Chao also added a checksum feature with a mount option so that
  we are able to detect any corrupted cluster.

  In addition, Daniel contributed casefolding with encryption patch,
  which will be used for Android devices.

  Summary:

  Enhancements:
   - add ioctls and mount option to manage per-file compression feature
   - support casefolding with encryption
   - support checksum for compressed cluster
   - avoid IO starvation by replacing mutex with rwsem
   - add sysfs, max_io_bytes, to control max bio size

  Bug fixes:
   - fix use-after-free issue when compression and fsverity are enabled
   - fix consistency corruption during fault injection test
   - fix data offset for lseek
   - get rid of buffer_head which has 32bits limit in fiemap
   - fix some bugs in multi-partitions support
   - fix nat entry count calculation in shrinker
   - fix some stat information

  And, we've refactored some logics and fix minor bugs as well"

* tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
  f2fs: compress: fix compression chksum
  f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
  f2fs: fix race of pending_pages in decompression
  f2fs: fix to account inline xattr correctly during recovery
  f2fs: inline: fix wrong inline inode stat
  f2fs: inline: correct comment in f2fs_recover_inline_data
  f2fs: don't check PAGE_SIZE again in sanity_check_raw_super()
  f2fs: convert to F2FS_*_INO macro
  f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size
  f2fs: don't allow any writes on readonly mount
  f2fs: avoid race condition for shrinker count
  f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
  f2fs: add compress_mode mount option
  f2fs: Remove unnecessary unlikely()
  f2fs: init dirty_secmap incorrectly
  f2fs: remove buffer_head which has 32bits limit
  f2fs: fix wrong block count instead of bytes
  f2fs: use new conversion functions between blks and bytes
  f2fs: rename logical_to_blk and blk_to_logical
  f2fs: fix kbytes written stat for multi-device case
  ...
2020-12-17 11:18:00 -08:00
Jan Kara
02a7780e4d ext4: simplify ext4 error translation
We convert errno's to ext4 on-disk format error codes in
save_error_info(). Add a function and a bit of macro magic to make this
simpler.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Jan Kara
4067662388 ext4: move functions in super.c
Just move error info related functions in super.c close to
ext4_handle_error(). We'll want to combine save_error_info() with
ext4_handle_error() and this makes change more obvious and saves a
forward declaration as well. No functional change.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Jan Kara
014c9caa29 ext4: make ext4_abort() use __ext4_error()
The only difference between __ext4_abort() and __ext4_error() is that
the former one ignores errors=continue mount option. Unify the code to
reduce duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Jan Kara
93c20bc3ea ext4: standardize error message in ext4_protect_reserved_inode()
We use __ext4_error() when ext4_protect_reserved_inode() finds
filesystem corruption. However EXT4_ERROR_INODE_ERR() is perfectly
capable of reporting all the needed information. So just use that.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Jan Kara
81414b4dd4 ext4: remove redundant sb checksum recomputation
Superblock is written out either through ext4_commit_super() or through
ext4_handle_dirty_super(). In both cases we recompute the checksum so it
is not necessary to recompute it after updating superblock free inodes &
blocks counters.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Jan Kara
b08070eca9 ext4: don't remount read-only with errors=continue on reboot
ext4_handle_error() with errors=continue mount option can accidentally
remount the filesystem read-only when the system is rebooting. Fix that.

Fixes: 1dc1097ff6 ("ext4: avoid panic during forced reboot")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Jan Kara
46e294efc3 ext4: fix deadlock with fs freezing and EA inodes
Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:

CPU1					CPU2

removexattr()				freeze_super()
  vfs_removexattr()
    ext4_xattr_set()
      handle = ext4_journal_start()
      ...
      ext4_xattr_set_entry()
        iput(old_ea_inode)
          ext4_evict_inode(old_ea_inode)
					  sb->s_writers.frozen = SB_FREEZE_FS;
					  sb_wait_write(sb, SB_FREEZE_FS);
					  ext4_freeze()
					    jbd2_journal_lock_updates()
					      -> blocks waiting for all
					         handles to stop
            sb_start_intwrite()
	      -> blocks as sb is already in SB_FREEZE_FS state

Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.

Cc: stable@vger.kernel.org
Fixes: dec214d00e ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:45 -05:00
Harshad Shirwadkar
941ba122ca ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h
This patch makes fast_commit.h byte by byte identical with
e2fsprogs/fast_commit.h. This will help us ensure that there are no
on-disk format inconsistencies between e2fsck and kernel ext4.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201120202232.2240293-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:45 -05:00
Gustavo A. R. Silva
5a150bdec7 ext4: fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/03497331f088a938d7a728e7a689bd7953139429.1605896059.git.gustavoars@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:45 -05:00
Harshad Shirwadkar
b1b7dce3f0 ext4: add docs about fast commit idempotence
Fast commit on-disk format is designed such that the replay of these
tags can be idempotent. This patch adds documentation in the code in
form of comments and in form kernel docs that describes these
characteristics. This patch also adds a TODO item needed to ensure
kernel fast commit replay idempotence.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201119232822.1860882-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:44 -05:00
Kaixu Xia
03505c58b8 ext4: remove the unused EXT4_CURRENT_REV macro
There are no callers of the EXT4_CURRENT_REV macro, so remove it.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1605164202-31120-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:44 -05:00
Dan Carpenter
bc18546bf6 ext4: fix an IS_ERR() vs NULL check
The ext4_find_extent() function never returns NULL, it returns error
pointers.

Fixes: 44059e503b03 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201023112232.GB282278@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-12-17 13:30:32 -05:00
Theodore Ts'o
c9200760da ext4: check for invalid block size early when mounting a file system
Check for valid block size directly by validating s_log_block_size; we
were doing this in two places.  First, by calculating blocksize via
BLOCK_SIZE << s_log_block_size, and then checking that the blocksize
was valid.  And then secondly, by checking s_log_block_size directly.

The first check is not reliable, and can trigger an UBSAN warning if
s_log_block_size on a maliciously corrupted superblock is greater than
22.  This is harmless, since the second test will correctly reject the
maliciously fuzzed file system, but to make syzbot shut up, and
because the two checks are duplicative in any case, delete the
blocksize check, and move the s_log_block_size earlier in
ext4_fill_super().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: syzbot+345b75652b1d24227443@syzkaller.appspotmail.com
2020-12-17 13:30:32 -05:00